I'm trying to parallelise a long-running function in C++ using std::async, but it only ever uses one core.
It's not that the running time of the function is too small: I'm currently using test data that takes about 10 minutes to run.
My logic is to create NThreads futures (each taking a proportion of the loop rather than an individual cell, so that each one is a nicely long-running task), each of which dispatches an async task. Once they have all been created, the program waits for them to complete. However, it always uses one core?!
This isn't just me looking at top and saying it looks roughly like one CPU: my ZSH config outputs the CPU % of the last command, and it is always exactly 100%, never above.
auto NThreads = 12;
auto BlockSize = (int)std::ceil((int)(NThreads / PathCountLength));
std::vector<std::future<std::vector<unsigned __int128>>> Futures;
for (auto I = 0; I < NThreads; ++I) {
    std::cout << "HERE" << std::endl;
    unsigned __int128 Min = I * BlockSize;
    unsigned __int128 Max = I * BlockSize + BlockSize;
    if (I == NThreads - 1)
        Max = PathCountLength;
    Futures.push_back(std::async(
        [](unsigned __int128 WMin, unsigned __int128 Min, unsigned __int128 Max,
           std::vector<unsigned __int128> ZeroChildren,
           std::vector<unsigned __int128> OneChildren,
           unsigned __int128 PathCountLength)
            -> std::vector<unsigned __int128> {
            std::vector<unsigned __int128> LocalCount;
            for (unsigned __int128 I = Min; I < Max; ++I)
                LocalCount.push_back(KneeParallel::pathCountOrStatic(
                    WMin, I, ZeroChildren, OneChildren, PathCountLength));
            return LocalCount;
        },
        WMin, Min, Max, ZeroChildInit, OneChildInit, PathCountLength));
}
for (auto &Future : Futures) {
    Future.get();
}
Does anyone have any insight?
I'm compiling with clang and LLVM on Arch Linux. Are there any compile flags I need? From what I can tell, C++11 standardised the thread library.
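For reference, two things in the snippet above look worth checking. First, std::async called without a launch policy is allowed to defer the task, in which case it runs lazily on the thread that calls get(). Second, assuming PathCountLength is larger than NThreads, the integer division NThreads / PathCountLength yields 0, so every block except the last would be empty and the last one would cover the whole range. The sketch below shows one possible way to avoid both. It is not the original code: work() is a hypothetical stand-in for KneeParallel::pathCountOrStatic, runBlocks() is an illustrative name, and unsigned long long replaces unsigned __int128 to keep the example portable.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Hypothetical stand-in for KneeParallel::pathCountOrStatic.
unsigned long long work(unsigned long long i) { return i * i; }

// Splits [0, total) into nTasks contiguous blocks and computes each block
// in its own asynchronous task, then concatenates the results in order.
std::vector<unsigned long long> runBlocks(unsigned long long total,
                                          unsigned nTasks) {
    assert(nTasks > 0 && "need at least one task");

    // Ceiling division: total / nTasks, rounded up.
    const unsigned long long blockSize = (total + nTasks - 1) / nTasks;

    std::vector<std::future<std::vector<unsigned long long>>> futures;
    for (unsigned t = 0; t < nTasks; ++t) {
        // Clamp both ends so trailing blocks may be empty but never overrun.
        const unsigned long long lo = std::min(total, t * blockSize);
        const unsigned long long hi = std::min(total, lo + blockSize);

        // std::launch::async requests a separate thread for the task
        // instead of leaving the choice (possibly deferred) to the library.
        futures.push_back(std::async(std::launch::async, [lo, hi] {
            std::vector<unsigned long long> local;
            local.reserve(hi - lo);
            for (unsigned long long i = lo; i < hi; ++i)
                local.push_back(work(i));
            return local;
        }));
    }

    // get() blocks until each task finishes; blocks are collected in order.
    std::vector<unsigned long long> result;
    result.reserve(total);
    for (auto &f : futures) {
        auto part = f.get();
        result.insert(result.end(), part.begin(), part.end());
    }
    return result;
}
```

Calling runBlocks(PathCountLength, 12) would then correspond to the loop above. On Linux, building with something like clang++ -std=c++11 -pthread is commonly needed for the threading support to link properly, though the exact flags depend on the toolchain and standard library in use.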