Thursday, January 28, 2016

Are std::get<> and std::tuple<> slower than raw pointers?

I have a C++11 application where I commonly iterate over several different structures of arrays for various algorithms. Raw CPU performance is important for this app.

The array elements are fundamental types (int, double, ...) or simple structs. The arrays are typically tens of thousands of elements long. I often need to iterate over several arrays at once in a given loop, so typically I need one pointer for each array, whatever its type. Sometimes I need to increment five individual pointers, which is verbose.
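
For concreteness, the kind of loop I mean looks roughly like this (the names and element types are made up for illustration):

// Hypothetical example: walking three arrays in lock-step, one raw pointer
// per array. Forgetting one of the increments is an easy bug to write.
void scaleAdd(const double* a, const double* b, double* out, int n)
{
    const double* pa = a;
    const double* pb = b;
    double* po = out;
    for (int i = 0; i < n; ++i)
    {
        *po = 2.0 * *pa + *pb;
        ++pa; ++pb; ++po; // every pointer has to be stepped by hand
    }
}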

Based on these answers about tuples, "Why is std::pair faster than std::tuple" and "C++11 tuple performance", I hoped there was no overhead to using tuples to pack the pointers together into a single object.

I thought it might be nice to implement a cursor-like object to assist in iterating, since missing the increment on a particular pointer would be an annoying bug.

auto pts = std::make_tuple(p1, p2, p3, ...); lets you bundle a bunch of variables together in a type-safe way. Then you can implement a variadic template function to increment each pointer in the tuple, again in a type-safe way.
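
Something like the following sketch is what I have in mind (advance_all is just an illustrative name, not the code I measured; the recursive-template form keeps it within C++11):

#include <cstddef>
#include <tuple>
#include <type_traits>

// Sketch: increment every element of a tuple of pointers.
template <std::size_t I = 0, typename... Ts>
typename std::enable_if<I == sizeof...(Ts), void>::type
advance_all(std::tuple<Ts...>&) {} // end of recursion

template <std::size_t I = 0, typename... Ts>
typename std::enable_if<I < sizeof...(Ts), void>::type
advance_all(std::tuple<Ts...>& t)
{
    ++std::get<I>(t);      // step the I-th pointer
    advance_all<I + 1>(t); // then recurse over the rest
}

// usage:
//   auto pts = std::make_tuple(p1, p2, p3);
//   advance_all(pts); // increments p1, p2 and p3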

However... when I measure performance, the tuple version is slower than using raw pointers, and when I look at the generated assembly I see additional mov instructions in the tuple loop's increment. Maybe that is because std::get<> returns a reference? I had hoped that would be compiled away...
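
As a sanity check (just a sketch, not part of the timed harness below), the reference can be avoided by copying the pointer out of the tuple before the loop and writing it back afterwards:

// Sketch: avoid going through std::get<> inside the loop body.
// front and end are the same pointers as in the harness below.
int sum = 0;
auto cursor = std::make_tuple(front);
int* p = std::get<0>(cursor); // copy the pointer out once
while (++p != end)
{
    sum += *p;
}
std::get<0>(cursor) = p;      // write the advanced pointer back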

Am I missing something, or are raw pointers just going to beat tuples when used like this? Here is a simple test harness; I threw away the fancy cursor code and just use a std::tuple<> for this test.

On my machine, the tuple loop is consistently twice as slow as the raw pointer version for various data sizes.

My system configuration is Visual C++ 2013 x64 on Windows 8 with a release build. I did try turning on various optimizations in Visual Studio, such as Inline Function Expansion: Any Suitable (/Ob2), but it did not seem to change the timing results in my case.

I did need to do two extra things to avoid aggressive optimization by VS:
1) I forced the test data array to be allocated on the heap, not the stack. That made a big difference when I timed things, possibly due to memory cache effects.
2) I forced a side effect by writing to a static variable at the end so the compiler would not just skip my loop.

// Includes needed to compile the snippet; BEGIN_TIMING_BLOCK and
// END_TIMING_BLOCK are macros from my own timing harness and are not shown.
#include <cstdio>
#include <tuple>

struct forceHeap
{
    // noinline so the optimizer cannot see (and elide) the allocation
    __declspec(noinline) int* newData(int M)
    {
        int* data = new int[M];
        return data;
    }
};

void timeSumCursor()
{
    static int gIntStore; 
    int maxCount = 20;
    int M = 10000000; 
    // compiler might place array on stack which changes the timing
    // int* data = new int[M];
    forceHeap fh;
    int* data = fh.newData(M);
    int *front = data; 
    int *end = data + M;

    int j = 0;
    for (int* p = front; p < end; ++p)
    {
        *p = (++j) % 1000;
    }

    {
        BEGIN_TIMING_BLOCK("raw pointer loop", maxCount);
        int sum = 0;
        int* cursor = front;
        while (++cursor != end)
        {
            sum += *cursor;
        }
        gIntStore = sum;// force a side effect
        END_TIMING_BLOCK(); 
    }
    printf("%d\n", gIntStore);

    {
        // just use a simple tuple to show the issue 
        // rather full blown cursor object
        BEGIN_TIMING_BLOCK("tuple loop", maxCount);
        int sum = 0;
        auto cursor = std::make_tuple(front);
        while (++std::get<0>(cursor) != end)
        {
            sum += *std::get<0>(cursor);
        }
        gIntStore = sum; // force a side effect
        END_TIMING_BLOCK();
    }
    printf("%d\n", gIntStore);

    delete[] data;
}
