Sunday, 26 February 2017

Dedicated SSE4.2 C++ stream: will it be faster?

I'm busy testing SSE4.2 string instructions using C++11 (MS VS 2015). Because MSVC doesn't support inline assembly for x64 targets, I use the intrinsic functions.

The test case is simple: counting the lines in a huge text file (12.5+ million lines). I do this by counting the number of '\n' (LF) characters.
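As a point of reference for any timing, the same count can be written without intrinsics. The following is a minimal sketch, not code from the original post; the helper name `scalarChrCount` is made up for illustration and it relies only on `std::count` from the standard library.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not from the original post): counts occurrences of
// chr in the first len bytes of buf using the standard library only.
static inline std::size_t scalarChrCount(const char* buf, std::size_t len, char chr)
{
    return static_cast<std::size_t>(std::count(buf, buf + len, chr));
}
```

Calling it on each chunk read from the file, in place of the SIMD routine, gives a baseline to compare the SSE4.2 timing against.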

My code so far:

#include "nmmintrin.h"
#include <iostream>
#include <fstream>
#include <string>
#include <chrono>

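// Counts the set bits in a 128-bit value.
// Note: the m128i_u64 member is specific to MSVC's __m128i definition.
static inline long long popcnt128(__m128i n)
{
    return _mm_popcnt_u64(n.m128i_u64[0])
         + _mm_popcnt_u64(n.m128i_u64[1]);
}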

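static inline size_t sse4_strChrCount(const char* pcStr, size_t iStrLen, const char chr)
{
    const __m128i   mSet = _mm_set1_epi8(chr);
    const int       iMode = _SIDD_CMP_EQUAL_EACH;
    size_t          iResult = 0;
    size_t          i = 0;

    // Full 16-byte blocks only, so no bytes past iStrLen are read or counted.
    // Note: _mm_cmpistrm treats a NUL byte as end of string within a block.
    for (; i + 16 <= iStrLen; i += 16)
    {
        const __m128i   data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pcStr + i));
        const __m128i   ret = _mm_cmpistrm(data, mSet, iMode);
        iResult += popcnt128(ret);
    }

    // Scalar loop for the remaining tail (fewer than 16 bytes).
    for (; i < iStrLen; ++i)
    {
        iResult += (pcStr[i] == chr);
    }

    return iResult;
}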



int main(int argc, char** argv)
{
    // NOTE: NO CHECKS FOR SSE4.2 SUPPORT! So be careful!
    const int bufSize = 4096 * 128; // 512 KiB on the heap
    char* buf = new char[bufSize];

    if (argc <= 1)
    {
        std::cerr << "Provide filename to count newlines on!" << std::endl;
        delete[] buf;
        return 1; // signal failure to the caller
    }

    std::string fileName(argv[1]);
    std::cout << "C++ LineCounter for " << fileName << " with bufSize: " << bufSize << std::endl;

    std::chrono::steady_clock::time_point begin =  std::chrono::steady_clock::now();

    size_t lineCount = 0;
    std::ifstream inFile;
    inFile.open(fileName, std::ios_base::in | std::ios_base::binary);

    while (inFile.good())
    {
        inFile.read(buf, bufSize);

        if (inFile || inFile.gcount() > 0)
        {
            lineCount += sse4_strChrCount(buf, inFile.gcount(), '\n');
        }
    }
    inFile.close();

    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    std::cout << "Find newline char using SSE4.2 intrinsic functions: Counted " << lineCount << " lines in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << "(ms) " << std::endl;

    delete[] buf;
    return 0;
}
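The listing above notes that it does not check for SSE4.2 support. A minimal sketch of such a check follows, assuming an x86/x64 target. The helper name `hasSse42AndPopcnt` is hypothetical and not part of the original post. It reads CPUID leaf 1 and tests ECX bit 20 (SSE4.2) and ECX bit 23 (POPCNT), using `__cpuid` from `<intrin.h>` on MSVC and `__get_cpuid` from `<cpuid.h>` on GCC or Clang.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#endif

// Hypothetical helper (not from the original post): returns true when
// CPUID leaf 1 reports SSE4.2 (ECX bit 20) and POPCNT (ECX bit 23).
static bool hasSse42AndPopcnt()
{
    unsigned int ecx = 0;
#if defined(_MSC_VER)
    int regs[4] = { 0, 0, 0, 0 };   // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    ecx = static_cast<unsigned int>(regs[2]);
#else
    unsigned int eax = 0, ebx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;               // leaf 1 not available
#endif
    const unsigned int kSse42  = 1u << 20;
    const unsigned int kPopcnt = 1u << 23;
    return (ecx & kSse42) != 0 && (ecx & kPopcnt) != 0;
}
```

A program could call this once at startup and fall back to a scalar count when it returns false.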

The result of running the line counter is:

C++ LineCounter for ..\HugeLogfile.txt with bufSize: 524288
Find newline char using SSE4.2 intrinsic functions: Counted 12867995 lines in 11568(ms)

My questions:

  • Is it possible to write a C++ basic_streambuf<__m128i> with a matching char_traits<__m128i>, etc.?
  • Would it be faster, e.g. by omitting the high-level buffer?
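One way to probe the second question without writing a custom stream buffer is to read chunks directly through `std::filebuf::sgetn`, skipping the `std::ifstream` layer. The following is only a sketch under that assumption, not an answer from the original post. The helper name `countViaFilebuf` and its `countChunk` parameter are hypothetical. Whether it is measurably faster depends on the standard library implementation and would have to be measured.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): reads a file in
// fixed-size chunks straight from std::filebuf and passes each chunk
// to a caller-supplied counting function.
template <typename CountFn>
static std::size_t countViaFilebuf(const std::string& fileName, std::size_t bufSize, CountFn countChunk)
{
    std::filebuf fb;
    if (!fb.open(fileName.c_str(), std::ios_base::in | std::ios_base::binary))
        return 0;   // could not open the file

    std::vector<char> buf(bufSize);
    std::size_t total = 0;
    for (;;)
    {
        const std::streamsize got = fb.sgetn(buf.data(), static_cast<std::streamsize>(buf.size()));
        if (got <= 0)
            break;  // end of file
        total += countChunk(buf.data(), static_cast<std::size_t>(got));
    }
    fb.close();
    return total;
}
```

The counting function receives a pointer and a length, so the SSE4.2 routine from the listing or any scalar counter could be passed in unchanged.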

I'm testing on my MacBook Pro, running Windows 10 under Parallels.

CPU Specification: Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz.

Thanks for any input and feedback!
