jeudi 13 juillet 2017

Performance of resizing std::vector

The general conception seems to be that std::unique_ptr has no time overhead compared to properly used owning raw pointers, given sufficient optimization.

But what about using std::unique_ptr in compound data structures, in particular std::vector<std::unique_ptr<T>>? For instance, resizing the underlying data of a vector, which can happen during push_back. To isolate the performance, I loop around pop_back, shrink_to_fit, emplace_back:

#include <chrono>
#include <vector>
#include <memory>
#include <iostream>

constexpr size_t size = 1000000;
constexpr size_t repeat = 1000;
using my_clock = std::chrono::high_resolution_clock;

template<class T>
auto test(std::vector<T>& v) {
    v.reserve(size);
    for (size_t i = 0; i < size; i++) {
        v.emplace_back(new int());
    }
    auto t0 = my_clock::now();
    for (int i = 0; i < repeat; i++) {
        auto back = std::move(v.back());
        v.pop_back();
        v.shrink_to_fit();
        if (back == nullptr) throw "don't optimize me away";
        v.emplace_back(std::move(back));
    }
    return my_clock::now() - t0;
}

int main() {
    std::vector<std::unique_ptr<int>> v_u;
    std::vector<int*> v_p;

    auto millis_p = std::chrono::duration_cast<std::chrono::milliseconds>(test(v_p));
    auto millis_u = std::chrono::duration_cast<std::chrono::milliseconds>(test(v_u));
    std::cout << "raw pointer: " << millis_p.count() << " ms, unique_ptr: " << millis_u.count() << " ms\n";
    for (auto p : v_p) delete p; // I don't like memory leaks ;-)
}

Compiling the code with -O3 -o -march=native -std=c++14 -g with gcc 7.1.0, clang 3.8.0, and 17.0.4 on Linux on a Intel Xeon E5-2690 v3 @ 2.6 GHz (no turbo):

raw pointer: 2746 ms, unique_ptr: 5140 ms  (gcc)
raw pointer: 2667 ms, unique_ptr: 5529 ms  (clang)
raw pointer: 1448 ms, unique_ptr: 5374 ms  (intel)

The raw pointer version spends all it's time in an optimized memmove (intel seems to have a much better one than clang and gcc). The unique_ptr code seems to first copy over the vector data from one memory block to the other and assign the original one with zero - all in a horribly un-optimized loop. And then it loops over the original block of data again to see if any of those that were just zero'd are nonzero and need to be deleted.

Trying to understand how the compilers reason about handling std::unique_ptr, I was looking a bit more at isolated code. For instance:

a.release();
a = std::move(b);

or the similar

a.release();
a.reset(b.release());

none of the x86 compilers seem to be able to optimize away the senseless if (ptr) delete ptr;. The Intel compiler even gives the delete a 28 % chance. Surprisingly, the delete check is consistently omitted for:

auto tmp = b.release();
a.release();
a.reset(tmp);

These bits are not the main aspect of this question, but all of this makes me feel that I am missing something.

Why do various compilers fail to optimize reallocation within std::vector<std::unique_ptr<int>>? Is there anything in the standard that prevents generating code as efficient as with raw pointers? Is this an issue with the standard library implementation? Or are the compilers just not sufficiently clever enough (yet)?

What can one do to avoid performance impact compared to using raw pointers?

Note: Assume that T is polymorphic and expensive to move, so std::vector<T> is not an option.

Aucun commentaire:

Enregistrer un commentaire