Sunday, February 13, 2022

Why does isolating tasks to task arenas linked to NUMA nodes for memory locality slow down my embarrassingly parallel TBB application?

I have a self-contained example of a TBB application, run on a CPU with 2 NUMA nodes, that repeatedly performs a simple vector addition on dynamic arrays; it reproduces an issue I am having with a somewhat more complicated program. I am trying to divide the work cleanly between the available NUMA nodes by initializing the data in parallel with 2 task_arenas that are bound to separate NUMA nodes through TBB's NUMA API. A control version uses a plain parallel_for with a static_partitioner to perform the computation, while my intended version invokes, per task_arena, a task that runs a parallel_for over the designated region, i.e. the half of the dynamic arrays that was initialized earlier on the corresponding NUMA node. The intended version always takes about twice as long to perform the vector addition as the control. It cannot just be the overhead of creating the tasks for the task_arenas that invoke the parallel_for algorithms, because the performance degradation only occurs when the tbb::task_arena::constraints are applied. Could anyone explain what happens and why the performance penalty is so harsh? A pointer to resources would also be helpful, as I am doing this for a university project.
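In case it is relevant, here is a small standalone check (not part of the measured program, and assuming oneTBB 2021 or later, which provides tbb::info::default_concurrency) that prints which NUMA nodes TBB detects and how many threads an arena constrained to each node would get under the taskset mask:

#include <iostream>
#include <tbb/tbb.h>

int main(){
    //print the NUMA nodes oneTBB detects and the default concurrency
    //of an arena constrained to each of them
    for(int id : tbb::info::numa_nodes())
        std::cout << "node " << id << ": "
                  << tbb::info::default_concurrency(id) << " threads" << std::endl;
    return 0;
}

The full self-contained example follows.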

#include <algorithm>   //std::min
#include <cmath>       //std::ceil
#include <cstdlib>     //malloc, free
#include <iostream>
#include <tbb/tbb.h>
#include <vector>



int main(){
    
    std::vector<int> numa_indexes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_indexes.size());
    std::size_t numa_nodes = numa_indexes.size();
    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
    }

    std::size_t size = 10000000;
    std::size_t part_size = std::ceil((float)size/numa_nodes);
    double * A = (double *) malloc(sizeof(double)*size);
    double * B = (double *) malloc(sizeof(double)*size);
    double * C = (double *) malloc(sizeof(double)*size);
    double * D = (double *) malloc(sizeof(double)*size);


    //DATA INITIALIZATION
    for(unsigned k = 0; k < numa_indexes.size(); k++)
        arenas[k].execute(
        [&](){
            std::size_t local_start = k*part_size;
            std::size_t local_end = std::min(local_start + part_size, size);
            tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
                [&](std::size_t i)
            { 
                C[i] = D[i] = 0;
                A[i] = B[i] = 1;
            }, tbb::static_partitioner());
        });

    //PARALLEL ALGORITHM
    tbb::tick_count t0 = tbb::tick_count::now();
    for(int i = 0; i<100; i++)
        tbb::parallel_for(static_cast<std::size_t>(0), size,
            [&](std::size_t i)
            { 
                C[i] = A[i] + B[i];
            }, tbb::static_partitioner());
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time 1: " << (t1-t0).seconds() << std::endl;
        

    //TASK ARENA & PARALLEL ALGORITHM
    t0 = tbb::tick_count::now();
    for(int i = 0; i<100; i++)
    for(unsigned k = 0; k < numa_indexes.size(); k++)
        arenas[k].execute(
        [&](){
            std::size_t local_start = k*part_size;
            std::size_t local_end = std::min(local_start + part_size, size);
            tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
                [&](std::size_t i)
            { 
                D[i] = A[i] + B[i];
            }, tbb::static_partitioner());
        });
    t1 = tbb::tick_count::now();
    std::cout << "Time 2: " << (t1-t0).seconds() << std::endl;

    double sum1 = 0;
    double sum2 = 0;
    for(std::size_t i = 0; i < size; i++){
        sum1 += C[i];
        sum2 += D[i];
    }

    std::cout << sum1 << std::endl;
    std::cout << sum2 << std::endl;

    
    //release the dynamically allocated arrays
    free(A); free(B); free(C); free(D);

    return 0;
}

Performance with constraints:

for(unsigned j = 0; j < numa_indexes.size(); j++){
    arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}

$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.896496
Time 2: 1.60392
2e+07
2e+07

Performance without constraints:

$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.652501
Time 2: 0.638362
2e+07
2e+07
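
For completeness: what I describe above as "invoking a task per arena" could also be written with the arena-plus-task_group pattern that the oneTBB documentation shows for NUMA-constrained arenas, where each arena receives its parallel_for asynchronously and is waited on afterwards, so the two halves can overlap instead of one arena's blocking execute finishing before the next one starts. The sketch below only replaces the timed loop and assumes the same arenas, arrays, size and part_size as in my example; I have not measured whether it changes the timings.

//SKETCH: asynchronous per-arena submission (same arenas, arrays, size, part_size as above)
std::vector<tbb::task_group> groups(arenas.size());

for(int rep = 0; rep < 100; rep++){
    //submit one parallel_for per NUMA-constrained arena without blocking on it
    for(unsigned k = 0; k < arenas.size(); k++)
        arenas[k].execute([&, k](){
            groups[k].run([&, k](){
                std::size_t local_start = k*part_size;
                std::size_t local_end = std::min(local_start + part_size, size);
                tbb::parallel_for(local_start, local_end,
                    [&](std::size_t i){ D[i] = A[i] + B[i]; },
                    tbb::static_partitioner());
            });
        });
    //wait for both halves, each inside its own arena
    for(unsigned k = 0; k < arenas.size(); k++)
        arenas[k].execute([&, k](){ groups[k].wait(); });
}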
