I'm testing the SSE4.2 string instructions from C++11 (MS VS 2015). Because MS VC++ doesn't support inline assembly for x64 targets, I use the intrinsic functions.
The test case is simple: counting the lines in a huge text file (12.5+ million lines). I do this by counting the number of '\n' (LF) characters.
My code so far:
#include "nmmintrin.h"
#include <iostream>
#include <fstream>
#include <string>
#include <chrono>
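// Counts the set bits in both 64-bit halves of a 128-bit value.
// Note: the m128i_u64 member is specific to the MSVC definition of __m128i.
static inline long long popcnt128(__m128i n)
{
    return _mm_popcnt_u64(n.m128i_u64[0])
         + _mm_popcnt_u64(n.m128i_u64[1]);
}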
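static inline size_t sse4_strChrCount(const char* pcStr, size_t iStrLen, const char chr)
{
    const __m128i mSet = _mm_set1_epi8(chr);
    const int iMode = _SIDD_CMP_EQUAL_EACH; // byte-wise equality, result returned as a bit mask
    size_t iResult = 0;
    size_t i = 0;
    // Process full 16-byte blocks only, so stale bytes beyond iStrLen are never counted.
    // Note: the implicit-length compare treats a '\0' byte in data as the end of the block.
    for (; i + 16 <= iStrLen; i += 16)
    {
        const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pcStr + i));
        const __m128i ret = _mm_cmpistrm(data, mSet, iMode);
        iResult += popcnt128(ret);
    }
    // Count the remaining (fewer than 16) bytes one at a time.
    for (; i < iStrLen; ++i)
    {
        iResult += (pcStr[i] == chr) ? 1 : 0;
    }
    return iResult;
}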
int main(int argc, char** argv)
{
    // NOTE: NO CHECKS FOR SSE4.2 SUPPORT! So be careful!
    if (argc <= 1)
    {
        std::cerr << "Provide filename to count newlines on!" << std::endl;
        return 1; // signal the usage error to the caller
    }
    const int bufSize = 4096 * 128; // 512 KiB on the heap
    char* buf = new char[bufSize];
    std::string fileName(argv[1]);
    std::cout << "C++ LineCounter for " << fileName << " with bufSize: " << bufSize << std::endl;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    size_t lineCount = 0;
    std::ifstream inFile;
    inFile.open(fileName, std::ios_base::in | std::ios_base::binary);
    while (inFile.good())
    {
        inFile.read(buf, bufSize);
        // The last read may be short; gcount() gives the number of bytes actually read.
        if (inFile || inFile.gcount() > 0)
        {
            lineCount += sse4_strChrCount(buf, static_cast<size_t>(inFile.gcount()), '\n');
        }
    }
    inFile.close();
    delete[] buf;
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Find newline char using SSE4.2 intrinsic functions: Counted " << lineCount
              << " lines in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()
              << "(ms) " << std::endl;
    return 0;
}
Result is:
C++ LineCounter for ..\HugeLogfile.txt with bufSize: 524288
Find newline char using SSE4.2 intrinsic functions: Counted 12867995 lines in 11568(ms)
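One way to sanity-check a count like the one above is to run a plain scalar count over the same buffers and compare the totals. The following is only a minimal sketch: `scalar_strChrCount` is a hypothetical helper name that is not part of my program, and it simply wraps `std::count` with the same parameter list as `sse4_strChrCount`.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical reference helper (not in the original program):
// counts occurrences of chr in the first iStrLen bytes of pcStr, one byte at a time.
static inline std::size_t scalar_strChrCount(const char* pcStr, std::size_t iStrLen, char chr)
{
    return static_cast<std::size_t>(std::count(pcStr, pcStr + iStrLen, chr));
}
```

Calling it next to `sse4_strChrCount` on every chunk and comparing the two sums should reveal any difference between the SIMD and scalar counts, for example in a final chunk whose length is not a multiple of 16.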
My questions:
- Is it possible to write a C++ basic_streambuf<__m128i> with char_traits<__m128i> etc., so that the stream hands out __m128i values directly?
- Would it be faster, e.g. by omitting the high-level buffer?
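Before changing the buffering, it may help to know how much of the run time is spent in `ifstream::read` and how much in the counting itself. The sketch below is an assumption-laden illustration, not a measured answer: `CountTiming` and `timeReadVersusCount` are hypothetical names, and the counter is passed in so that any function with the parameter list of `sse4_strChrCount` can be plugged in.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical result type: line count plus where the time went.
struct CountTiming
{
    std::size_t lines = 0;
    std::chrono::steady_clock::duration readTime{};
    std::chrono::steady_clock::duration countTime{};
};

// Reads the file in chunks of bufSize bytes and times the read calls
// separately from the counting calls.
template <typename Counter>
CountTiming timeReadVersusCount(const std::string& fileName, std::size_t bufSize, Counter counter)
{
    using clock = std::chrono::steady_clock;
    CountTiming t;
    if (bufSize == 0)
    {
        return t; // a zero-sized buffer would never reach end of file
    }
    std::vector<char> buf(bufSize);
    std::ifstream inFile(fileName, std::ios_base::in | std::ios_base::binary);
    while (inFile.good())
    {
        const clock::time_point t0 = clock::now();
        inFile.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::size_t got = static_cast<std::size_t>(inFile.gcount());
        const clock::time_point t1 = clock::now();
        if (got > 0)
        {
            t.lines += counter(buf.data(), got, '\n');
        }
        const clock::time_point t2 = clock::now();
        t.readTime += t1 - t0;
        t.countTime += t2 - t1;
    }
    return t;
}
```

With the program above this could be called as `timeReadVersusCount(fileName, 4096 * 128, sse4_strChrCount)`. If `readTime` dominates, a custom streambuf over __m128i is unlikely to change much, because the counting is not the bottleneck; if `countTime` dominates, the comparison loop is the place to look.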
I'm testing on my MacBook Pro, running Windows 10 under Parallels.
CPU Specification: Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz.
Thanks for any input and feedback!