jeudi 3 août 2017

GCC optimizes fixed range-based for loop as if it had longer, variable length

I have an array of POD structs and am trying to sum across one field. Here's a minimal example:

struct Item
{
    int x = 0;
    int y = 0;
};

typedef Item Items[2];

struct ItemArray
{
    Items items;

    int sum_x1() const;
    int sum_x2() const;
};

int ItemArray::sum_x1() const
{
    int total = 0;
    for (unsigned ii = 0; ii < 2; ++ii)
    {
        total += items[ii].x;
    }
    return total;
}

int ItemArray::sum_x2() const
{
    int total = 0;
    for (const Item& item : items)
    {
        total += item.x;
    }
    return total;
}

The two sum functions do the same thing. Clang compiles them identically. But GCC 6 with -O3 on x86_64 does not. Here's sum_x1(), looking good:

  mov eax, DWORD PTR [rdi+8]
  add eax, DWORD PTR [rdi]
  ret

Now look at sum_x2():

  lea rdx, [rdi+16]
  lea rcx, [rdi+8]
  xor eax, eax
  add eax, DWORD PTR [rdi]
  cmp rdx, rcx
  je .L12
  lea rcx, [rdi+16]
  add eax, DWORD PTR [rdi+8]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+24]
  add eax, DWORD PTR [rdi+16]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+32]
  add eax, DWORD PTR [rdi+24]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+40]
  add eax, DWORD PTR [rdi+32]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+48]
  add eax, DWORD PTR [rdi+40]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+56]
  add eax, DWORD PTR [rdi+48]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+64]
  add eax, DWORD PTR [rdi+56]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+72]
  add eax, DWORD PTR [rdi+64]
  cmp rdx, rcx
  je .L2
  add eax, DWORD PTR [rdi+72]
  ret
.L2:
  rep ret
.L12:
  rep ret

Why does GCC emit an unrolled loop of variable length up to 10 when there are the loop length is fixed at 2? It only does this in a member function--making sum_x2 a free function fixes it.

ICC also optimizes sum_x2() very strangely, though the generated code is totally different. Unlike GCC, it doesn't matter whether sum_x2() is a member function or a free function--both are bad.

I'm using GCC 6, but all versions of GCC seem to have problems with this code. Adding -march=haswell makes it even worse, adding iterations for up to 15 elements in the array of size 2. GCC 5 and 7 generate even more complex code, adding SIMD instructions.

Try it: http://ift.tt/2vmVmnw

Aucun commentaire:

Enregistrer un commentaire