Sunday, September 6, 2020

Why are multiple parallel for loops using openMP faster than one?

Let's consider a for loop parallelized with OpenMP, called scenario A.

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do blocking CUDA/Thrust API call 1
  // do blocking CUDA/Thrust API call 2
  // do blocking CUDA/Thrust API call 3
  // do blocking CUDA/Thrust API call 4
}

Now, let's compare it with scenario B, where the parallelized for loop is split into four separate loops.

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do blocking CUDA/Thrust API call 1
}

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do blocking CUDA/Thrust API call 2
}

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do blocking CUDA/Thrust API call 3
}

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do blocking CUDA/Thrust API call 4
}

I run this code on an IBM Power System AC922 with 2 NUMA sockets, 16 cores per socket, and 4 hardware threads per core (128 logical CPUs in total).

OpenMP is configured as follows.

omp_set_schedule(omp_sched_static, 1);  // static schedule with chunk size 1
omp_set_nested(1);                      // allow nested parallel regions
omp_set_num_threads(cpus);              // request `cpus` threads per parallel region
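
As a side note, `cpus` is not defined in the snippet above, so the small diagnostic sketch below uses omp_get_num_procs() purely as an illustrative stand-in; it only prints what the runtime reports after applying this configuration. Note that omp_set_nested/omp_get_nested are deprecated since OpenMP 5.0 in favor of omp_set_max_active_levels.

#include <cstdio>
#include <omp.h>

int main() {
  // Illustrative stand-in for `cpus`, which is not shown in the question.
  const int cpus = omp_get_num_procs();

  omp_set_schedule(omp_sched_static, 1);  // static schedule, chunk size 1
  omp_set_nested(1);                      // deprecated since OpenMP 5.0
  omp_set_num_threads(cpus);

  omp_sched_t kind;
  int chunk;
  omp_get_schedule(&kind, &chunk);

  // Report what the runtime actually sees after configuration.
  std::printf("procs=%d max_threads=%d nested=%d schedule=%d chunk=%d\n",
              omp_get_num_procs(), omp_get_max_threads(),
              omp_get_nested(), static_cast<int>(kind), chunk);
  return 0;
}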

I would have assumed that scenario A would be faster, since it avoids the additional overhead of spawning threads and synchronizing at the implicit barriers between the tasks // do blocking CUDA/Thrust API call 1 through // do blocking CUDA/Thrust API call 4. However, it turns out that scenario B is consistently faster in a benchmark with 100 repetitions.
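
For reference, a minimal, self-contained sketch of how such a benchmark could be structured is shown below. It is only a skeleton: the actual CUDA/Thrust kernels are not shown in the question, so a fixed-length sleep stands in for each blocking call, and the thread count is hard-coded to 4. Timings of this placeholder will of course not reproduce the GPU behaviour, but the loop structure mirrors scenarios A and B.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <omp.h>

// Placeholder for a blocking CUDA/Thrust API call (assumption: the real
// work is not shown in the question, so a fixed sleep stands in for it).
static void blocking_call(int /*call*/, std::size_t /*i*/) {
  std::this_thread::sleep_for(std::chrono::milliseconds(2));
}

int main() {
  omp_set_schedule(omp_sched_static, 1);
  omp_set_num_threads(4);

  constexpr int repetitions = 100;

  // Scenario A: one parallel region; each iteration issues all four calls.
  double t0 = omp_get_wtime();
  for (int r = 0; r < repetitions; ++r) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < 4; ++i) {
      for (int call = 1; call <= 4; ++call) blocking_call(call, i);
    }
  }
  double a = omp_get_wtime() - t0;

  // Scenario B: four separate parallel regions, one call per region.
  double t1 = omp_get_wtime();
  for (int r = 0; r < repetitions; ++r) {
    for (int call = 1; call <= 4; ++call) {
      #pragma omp parallel for
      for (std::size_t i = 0; i < 4; ++i) blocking_call(call, i);
    }
  }
  double b = omp_get_wtime() - t1;

  std::printf("scenario A: %.3f s\nscenario B: %.3f s\n", a, b);
  return 0;
}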

My question is therefore: Why do multiple smaller omp parallel for loops perform better than one big loop?
