Thursday, November 16, 2023

Code runs slower on Intel Xeon than on Intel i7?

I have some code in which the following function is responsible for 95% of the computation:

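#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// DB is the global database array; its declaration is not shown here
// (presumably a uint64_t * given the pointer arithmetic below).

void processServerData(uint32_t partIndex, uint32_t dataOffset, uint64_t *outputData, uint32_t dataSize, uint32_t partSize) {
    // Number of 64-bit words per entry (despite the name, not a byte count).
    auto bytesInEntry = dataSize / sizeof(uint64_t);

    __m256i a, b, r;
    size_t outputIndex, dataIndex;
    uint32_t rotationOffset;

    for (uint32_t i = 0; i < partSize; i = i + 1)
    {
        // The mask is a true modulo only when partSize is a power of two.
        rotationOffset = (i + dataOffset) & (partSize - 1);

        outputIndex = rotationOffset * bytesInEntry;
        dataIndex = partIndex + i * bytesInEntry;

        // XOR 256 bits of DB into outputData in place.
        a = _mm256_loadu_si256((__m256i *)(outputData + outputIndex));
        b = _mm256_loadu_si256((__m256i *)(DB + dataIndex));
        r = _mm256_xor_si256(a, b);
        _mm256_storeu_si256((__m256i *)(outputData + outputIndex), r);
    }
}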

Here outputData points to exactly the same array on every function call. As you can see, at a high level the code just XORs two arrays at some offset and stores the result into outputData.
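To make the description above concrete, here is a scalar version of the same index arithmetic that could serve as a reference when checking the vectorized loop. It is only a sketch under some assumptions: the name processServerDataScalar is hypothetical, the database is passed in as a parameter db instead of using the global DB, the entry size is given directly in 64-bit words (wordsPerEntry), and partSize is taken to be a power of two, since the mask only acts as a modulo in that case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference for the rotated XOR (hypothetical helper, not from the post).
// db:            database array, passed explicitly instead of the global DB
// wordsPerEntry: dataSize / sizeof(uint64_t), called bytesInEntry in the post
// Assumes partSize is a power of two, as the mask in the original implies.
void processServerDataScalar(const std::uint64_t *db, std::uint32_t partIndex,
                             std::uint32_t dataOffset, std::uint64_t *outputData,
                             std::uint32_t wordsPerEntry, std::uint32_t partSize) {
    assert(partSize != 0 && (partSize & (partSize - 1)) == 0);
    for (std::uint32_t i = 0; i < partSize; ++i) {
        // Entry i of the part lands in output entry (i + dataOffset) mod partSize.
        std::uint32_t rotationOffset = (i + dataOffset) & (partSize - 1);
        std::size_t outputIndex = std::size_t(rotationOffset) * wordsPerEntry;
        std::size_t dataIndex = partIndex + std::size_t(i) * wordsPerEntry;
        // One 256-bit AVX2 register covers four 64-bit words.
        for (int w = 0; w < 4; ++w) {
            outputData[outputIndex + w] ^= db[dataIndex + w];
        }
    }
}
```

With 32-byte entries (wordsPerEntry of 4), a part of 4 entries and dataOffset of 1, entry 0 of the part is XORed into output entry 1, and entry 3 wraps around into output entry 0. Calling the function twice with the same arguments restores the original output, which gives a cheap sanity check.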

I am using the following flags:

-mavx2 -march=native -O3 

Now here is the question.

I ran this code on two machines:

  • Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz running macOS 13.6, 16 GB DDR4 RAM, compiler Apple clang version 15.0.0 (clang-1500.0.40.1)
  • AWS r7i.2xlarge Intel(R) Xeon(R) Platinum 8488C with Ubuntu, 64 GB DDR5 RAM, compiler g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

I have observed the following behavior:

  • For a smaller database of 32 MB, AWS is at least 2x faster than the Mac
  • For a larger database of 8 GB, however, the Mac is 2x faster than AWS
  • On AWS I tried AVX-512 and the code got another 2x slower
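To compare the two machines on equal footing, the raw timings behind these observations could be converted into memory throughput. The helpers below are a rough sketch, not part of the original code: the names bytesMovedPerCall, gibPerSecond and secondsFor are made up for illustration, and the traffic estimate simply counts one 32-byte load from DB, one 32-byte load from outputData and one 32-byte store per iteration, ignoring whatever the caches absorb.

```cpp
#include <chrono>
#include <cstdint>

// Logical bytes touched by one call of the loop (assumption: each iteration
// reads 32 bytes of DB, reads 32 bytes of outputData and writes 32 bytes back).
std::uint64_t bytesMovedPerCall(std::uint32_t partSize) {
    const std::uint64_t vectorBytes = 32;  // one 256-bit register
    return std::uint64_t(partSize) * vectorBytes * 3;
}

// Convert a byte count and an elapsed time in seconds to GiB/s.
double gibPerSecond(std::uint64_t bytes, double seconds) {
    return double(bytes) / (1024.0 * 1024.0 * 1024.0) / seconds;
}

// Run any callable once and return the elapsed wall-clock time in seconds.
template <typename F>
double secondsFor(F &&work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping a batch of processServerData calls in secondsFor and feeding the result to gibPerSecond gives a number that can be compared against each machine's memory bandwidth, which may help tell a memory-bound run from a compute-bound one.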

I assume that this is because the Mac has a faster processor?

But since this is memory-heavy code, is AWS's large cache not helpful?

Is there any optimization I could do that would help on AWS?

I am very new to such optimizations, so any guidance will be highly appreciated.
