Let's consider a parallelized for loop using OpenMP; call this scenario A:
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do blocking CUDA/Thrust API call 1
// do blocking CUDA/Thrust API call 2
// do blocking CUDA/Thrust API call 3
// do blocking CUDA/Thrust API call 4
}
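To make "blocking" concrete: each placeholder comment stands for a synchronous call such as the hypothetical Thrust reduction below, which does not return until the GPU has finished the work. This is not my actual code, just an illustration of the call pattern.
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Hypothetical stand-in for scenario A: each "blocking call" is a
// thrust::reduce, which launches a kernel and waits for its result.
void scenario_a(std::vector<thrust::device_vector<float>>& chunks) {
    #pragma omp parallel for
    for (size_t i = 0; i < 4; ++i) {
        // The OpenMP thread blocks here until the GPU work (and the copy
        // of the scalar result back to the host) has completed.
        float sum = thrust::reduce(chunks[i].begin(), chunks[i].end(), 0.0f);
        (void)sum; // the result would be used in the real code
    }
}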
Now let's compare it with scenario B, where the parallelized for loop is split up:
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do blocking CUDA/Thrust API call 1
}
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do blocking CUDA/Thrust API call 2
}
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do blocking CUDA/Thrust API call 3
}
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do blocking CUDA/Thrust API call 4
}
I run this code on an IBM Power System AC922 with 2 NUMA sockets, 16 cores per socket, and 4 threads per core.
OpenMP is configured as follows:
omp_set_schedule(omp_sched_static, 1); // static schedule with chunk size 1
omp_set_nested(1);                     // allow nested parallel regions
omp_set_num_threads(cpus);             // use `cpus` threads per parallel region
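(`cpus` is not shown above; for this question, assume it is the total number of hardware threads, e.g. along the lines of this sketch.)
#include <omp.h>

// Assumed definition of `cpus` (not part of the snippet above): all hardware
// threads visible to the runtime, i.e. 2 sockets * 16 cores * 4 SMT = 128 here.
const int cpus = omp_get_num_procs();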
I would have assumed that scenario A performs faster, since there is no additional thread-spawning and synchronization overhead between call 1 and call 4. However, it turns out that scenario B is consistently faster in a benchmark with 100 repetitions.
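For reference, the benchmark is structured roughly like the sketch below; blocking_call is a placeholder for the real CUDA/Thrust calls and the omp_get_wtime timing is simplified, so treat it as an outline rather than my exact benchmark code.
#include <cstdio>
#include <omp.h>

// Placeholder for one of the real blocking CUDA/Thrust API calls
// (call number `step` operating on work item `i`).
static void blocking_call(int step, size_t i) {
    (void)step;
    (void)i;
}

// Scenario A: one parallel region; each thread runs all four calls in sequence.
static double time_scenario_a(int reps) {
    const double t0 = omp_get_wtime();
    for (int rep = 0; rep < reps; ++rep) {
        #pragma omp parallel for
        for (size_t i = 0; i < 4; ++i) {
            for (int step = 1; step <= 4; ++step)
                blocking_call(step, i);
        }
    }
    return omp_get_wtime() - t0;
}

// Scenario B: four separate parallel regions, one per call.
static double time_scenario_b(int reps) {
    const double t0 = omp_get_wtime();
    for (int rep = 0; rep < reps; ++rep) {
        for (int step = 1; step <= 4; ++step) {
            #pragma omp parallel for
            for (size_t i = 0; i < 4; ++i)
                blocking_call(step, i);
        }
    }
    return omp_get_wtime() - t0;
}

int main() {
    std::printf("scenario A: %f s\n", time_scenario_a(100));
    std::printf("scenario B: %f s\n", time_scenario_b(100));
    return 0;
}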
My question is therefore: Why do multiple smaller omp parallel for loops perform better than one big loop?