c++11: Why will for-loop with multithreading not have as great performance as with single-thread?

vendredi 26 février 2016

Why will for-loop with multithreading not have as great performance as with single-thread?

I believed it was better to process simple and heavy works (ex. matrix-calculation) with multi-threading than with single-thread, so I tested the following code :

int main()
{
    constexpr int N = 100000;

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> ini(0.0, 10.0);

    // single-thread
    {
        std::vector<int> vec(N);
        for(int i = 0; i < N; ++i)
        {
            vec[i] = ini(mt);
        }

        auto start = std::chrono::system_clock::now();

        for(int i = 0; i < N; ++i)
        {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }

        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "single : " << dur << " ms."<< std::endl;
    }

    // multi-threading (Th is the number of threads)
    for(int Th : {1, 2, 4, 8, 16})
    {
        std::vector<int> vec(N);
        for(int i = 0; i < N; ++i)
        {
            vec[i] = ini(mt);
        }

        auto start = std::chrono::system_clock::now();

        std::vector<std::future<void>> fut(Th);
        for(int t = 0; t < Th; ++t)
        {
            fut[t] = std::async(std::launch::async, [t, &vec, &N, &Th]{
                for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
                {
                    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
                }
            });
        }
        for(int t = 0; t < Th; ++t)
        {
            fut[t].get();
        }

        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Th = " << Th << " : " << dur << " ms." << std::endl;
    }

    return 0;
}

The execution environment :

OS : Windows 10 64-bit
Build-system : Visual Studio Community 2015
CPU : Core i5 4210U

When building this program in the Debug mode, the result was as I expected :

single : 146 ms.
Th = 1 : 140 ms.
Th = 2 : 71 ms.
Th = 4 : 64 ms.
Th = 8 : 61 ms.
Th = 16 : 68 ms.

This says that the code not using std::async justly has same performance as one using one-thread and when using 4 or 8 threads I can get great performance.

However, when in the Release mode, I got a different result (N : 100000 -> 100000000) :

single : 54 ms.
Th = 1 : 443 ms.
Th = 2 : 285 ms.
Th = 4 : 205 ms.
Th = 8 : 206 ms.
Th = 16 : 221 ms.

I'm wondering this result. Just for the latter half codes, multi-threading just has better performance than single. But the fastest one is the first half codes, which do not use std::async. I know the fact that optimization and overhead around multithreading has much effect on the performance. However,

The process is just calculation of the vector, so what can be optimized not in the multi-thread codes but in the single-thread codes?
This program contains nothing about mutex or atomic etc, and data conflict might not occur. I think overheads around multithreading would be relatively small.
CPU utilization in the codes not using std::async is smaller than in the multi-threading codes. Is it efficient to use the large part of CPU?

c++11

vendredi 26 février 2016

Why will for-loop with multithreading not have as great performance as with single-thread?

Aucun commentaire:

Enregistrer un commentaire