Saturday, July 2, 2016

C++11 multithreading - read multiple files into one place

I've got 10000+ text files on disk. My task is to find the most common three-word sequences that appear in those files. I read each file word by word and increment an entry in a global std::map<std::string, int> for each sequence in each file. Finally, I sort the entries by count and pick the most common ones. I managed to write code for it, but I read that I can increase the speed by reading each file in a separate thread.
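For reference, the counting step described above can be sketched roughly as follows. This is a minimal sketch, not my actual code: the name CountTrigrams is made up for illustration, and it assumes that "word" means a whitespace-separated token and that a sequence is keyed by joining its three words with single spaces.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: counts every run of three consecutive words read
// from `in`, adding the results to `counts`. Words are whitespace-separated
// tokens, exactly as operator>> extracts them.
void CountTrigrams(std::istream& in, std::map<std::string, int>& counts)
{
    std::string prev2, prev1, word;
    bool havePrev1 = false;  // at least one word has been read
    bool havePrev2 = false;  // at least two words have been read
    while (in >> word)
    {
        if (havePrev2)
            ++counts[prev2 + ' ' + prev1 + ' ' + word];
        // Slide the three-word window forward by one word.
        prev2 = prev1;
        havePrev2 = havePrev1;
        prev1 = word;
        havePrev1 = true;
    }
}

// Convenience overload for in-memory text, handy for quick checks.
std::map<std::string, int> CountTrigrams(const std::string& text)
{
    std::istringstream in(text);
    std::map<std::string, int> counts;
    CountTrigrams(in, counts);
    return counts;
}
```

For a real file, the first overload can be called with a std::ifstream. Because it writes into whatever map it is given, it works with a per-thread map as well as with a single global one.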

My first thought was to run as many threads as there are files, but that slowed my program down from 6 s to 80 s.

My second thought was to make some kind of thread pool that starts, for example, 20 threads, waits for all of them to finish (.join()), and then starts the next 20 threads, over and over again. That improved my program's run time from 6 s to 4 s.

But that's not the fastest way: the main thread waits until all 20 threads have finished, so the slots freed by the first 19 threads to finish stay unused until the slowest one is done.

My question is: how do I implement a thread pool in which each thread starts working on the next file as soon as it finishes the previous one?

Code for my second thought:

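std::vector<std::thread> threads;

const int nThreadPool = 20;    // threads started per batch
const int nFileCount = 10495;

// Note: FetchFile updates the shared std::map, so it must guard that
// access (for example with a std::mutex) to avoid a data race.
for (int i = 0; i < nFileCount; i += nThreadPool)
{
    // Start one thread per file in this batch.
    for (int j = i; j < i + nThreadPool && j < nFileCount; j++)
    {
        std::string fileName = path + std::to_string(j) + PAGE_EXTENSION;
        threads.emplace_back(&CWordParserFileSystem::FetchFile, this, fileName);
    }

    // Wait for the whole batch before starting the next one.
    for (std::thread& t : threads)
        t.join();

    threads.clear();
}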
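
One possible shape for the pool I'm asking about is a fixed set of worker threads that each claim the next unprocessed file index from a shared std::atomic counter, so no thread waits for a batch to finish. The following is only a sketch under assumptions: ProcessFiles, Counts, and the processFile callback are hypothetical names, and it assumes the per-file routine can be changed to write into a map passed by reference instead of the global one. Each worker fills its own local map and merges it into the total once, under a mutex.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

using Counts = std::map<std::string, int>;  // hypothetical alias

// Hypothetical helper: calls processFile(name, localCounts) once for every
// name in fileNames, spread over numWorkers threads, and returns the merged
// counts. processFile must be safe to call from several threads at once.
Counts ProcessFiles(const std::vector<std::string>& fileNames,
                    unsigned numWorkers,
                    const std::function<void(const std::string&, Counts&)>& processFile)
{
    assert(numWorkers > 0);

    std::atomic<std::size_t> next{0};  // index of the next unclaimed file
    Counts total;
    std::mutex totalMutex;             // guards `total` during the merge

    auto worker = [&]() {
        Counts local;                  // private to this thread, no locking
        for (;;)
        {
            // Atomically claim one index; each index goes to one thread only.
            const std::size_t i = next.fetch_add(1);
            if (i >= fileNames.size())
                break;
            processFile(fileNames[i], local);
        }
        // Merge once per thread instead of locking on every increment.
        std::lock_guard<std::mutex> lock(totalMutex);
        for (const auto& entry : local)
            total[entry.first] += entry.second;
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numWorkers; ++t)
        threads.emplace_back(worker);
    for (std::thread& t : threads)
        t.join();
    return total;
}
```

With my code, fileNames would hold the path + std::to_string(j) + PAGE_EXTENSION strings and processFile would wrap the per-file parsing. std::thread::hardware_concurrency() might be a reasonable starting point for numWorkers, though the best value likely depends on how disk-bound the reading is. The sketch does not handle exceptions thrown by processFile; an uncaught exception in a worker would terminate the program.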
