vendredi 26 février 2016

Why will for-loop with multithreading not have as great performance as with single-thread?

I believed it was better to process simple and heavy works (ex. matrix-calculation) with multi-threading than with single-thread, so I tested the following code :

int main()
{
    constexpr int N = 100000;

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> ini(0.0, 10.0);

    // single-thread
    {
        std::vector<int> vec(N);
        for(int i = 0; i < N; ++i)
        {
            vec[i] = ini(mt);
        }

        auto start = std::chrono::system_clock::now();

        for(int i = 0; i < N; ++i)
        {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }

        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "single : " << dur << " ms."<< std::endl;
    }

    // multi-threading (Th is the number of threads)
    for(int Th : {1, 2, 4, 8, 16})
    {
        std::vector<int> vec(N);
        for(int i = 0; i < N; ++i)
        {
            vec[i] = ini(mt);
        }

        auto start = std::chrono::system_clock::now();

        std::vector<std::future<void>> fut(Th);
        for(int t = 0; t < Th; ++t)
        {
            fut[t] = std::async(std::launch::async, [t, &vec, &N, &Th]{
                for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
                {
                    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
                }
            });
        }
        for(int t = 0; t < Th; ++t)
        {
            fut[t].get();
        }

        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Th = " << Th << " : " << dur << " ms." << std::endl;
    }

    return 0;
}

The execution environment :

OS : Windows 10 64-bit
Build-system : Visual Studio Community 2015
CPU : Core i5 4210U

When building this program in the Debug mode, the result was as I expected :

single : 146 ms.
Th = 1 : 140 ms.
Th = 2 : 71 ms.
Th = 4 : 64 ms.
Th = 8 : 61 ms.
Th = 16 : 68 ms.

This says that the code not using std::async justly has same performance as one using one-thread and when using 4 or 8 threads I can get great performance.

However, when in the Release mode, I got a different result (N : 100000 -> 100000000) :

single : 54 ms.
Th = 1 : 443 ms.
Th = 2 : 285 ms.
Th = 4 : 205 ms.
Th = 8 : 206 ms.
Th = 16 : 221 ms.

I'm wondering this result. Just for the latter half codes, multi-threading just has better performance than single. But the fastest one is the first half codes, which do not use std::async. I know the fact that optimization and overhead around multithreading has much effect on the performance. However,

  • The process is just calculation of the vector, so what can be optimized not in the multi-thread codes but in the single-thread codes?
  • This program contains nothing about mutex or atomic etc, and data conflict might not occur. I think overheads around multithreading would be relatively small.
  • CPU utilization in the codes not using std::async is smaller than in the multi-threading codes. Is it efficient to use the large part of CPU?

Aucun commentaire:

Enregistrer un commentaire