mardi 5 septembre 2017

GCC, std::array and for vs ranged-based for

I'm trying to use (large-ish) arrays of structs containing only a single small std::array each. Especially the K == 1 case should be well supported (see code). However, GCC seems to be unable to properly optimize these arrays in most cases, especially when using ranged-based for. Clang produces code that I don't fully understand but seems to be well optimized (and uses SSE, which GCC doesn't).

#include <vector>
#include <array>

template<int K>
struct Foo {
    std::array<int, K> vals;

    int bar() const {
        int sum = 0;
#if 0 // Foo Version A
        for (auto f : vals)
            sum += f;
#else // Foo Version B
        for (auto i = 0; i < K; ++i)
            sum += vals[K];
#endif
        return sum;
    }
};

int test1(std::vector<Foo<1>> const& foos)
{
    int sum = 0;
    for (auto const& f : foos)
        sum += f.bar();
    return sum;
}

// Version C
int test2(std::vector<std::array<int, 1>> const& foos)
{
    int sum = 0;
    for (auto const& f : foos)
        for (auto const& v : f)
            sum += v;
    return sum;
}

// Version D
int test3(std::vector<std::array<int, 1>> const& foos)
{
    int sum = 0;
    for (auto const& f : foos)
        for (auto i = 0; i < f.size(); ++i)
            sum += f[i];
    return sum;
}

Godbolt Code, gcc 7.2, flags -O2 -std=c++11 -march=native. Older gcc versions behave similarly.

If I'm not mistaken, then all four versions should have the same semantic.

Furthermore, I would expect all versions to compile to about the same assembly. The assembly should only have one frequently used conditional jump (for iterating over the vector).

However, the following happens:

  • Version A (ranged-based for, array-in-struct): 3 conditional jumps, one at the beginning for handling zero-length vectors. Then one for the vector (this is ok). But then another for iterating over the array? Why? It has constant size 1.

  • Version B (manual for, array-in-struct): Here, GCC actually recognizes that the array of length 1 can be optimized, the assembly looks good.

  • Version C (ranged-based for, direct array): The loop over the array is not optimized away, so again two conditional jumps for the looping. Also: this version contains more memory access that I thought would be required.

  • Version D (manual for, direct array): This one is the only version that looks sane to me. 11 instructions.

Clang creates way more assembly code (same flags) but it's pretty much the same for all versions and contains a lot of loop unrolling and SSE.

Is this a GCC-related problem? Should I file a bug? Is this something in my C++ code that I should/can fix?

Aucun commentaire:

Enregistrer un commentaire