Sunday, September 6, 2020

Why do more OpenMP threads increase performance for a fixed number of loop iterations?

Let's consider a simple for loop parallelized with OpenMP.

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do single-threaded CUDA/Thrust API call
}

I run this code on an IBM Power System AC922 with 2 NUMA sockets, 16 cores per socket, and 4 threads per core.

OpenMP is configured as follows.

omp_set_schedule(omp_sched_static, 1);  // static schedule, chunk size 1
omp_set_nested(1);                      // allow nested parallel regions
omp_set_num_threads(cpus);              // thread count under test

Given that the for loop has exactly 4 iterations, I would assume that setting omp_set_num_threads(4) (i.e., cpus == 4) yields the best possible performance. However, it turns out that omp_set_num_threads(16) (i.e., cpus == 16) reproducibly performs better in terms of runtime (measured over a benchmark with 100 repetitions); not by a large margin, but consistently.
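For reference, here is a minimal, self-contained sketch of the kind of benchmark I mean; busy_work is a hypothetical CPU-only stand-in for the real single-threaded CUDA/Thrust call, and the timing is averaged over the 100 repetitions mentioned above (compile with, e.g., g++ -fopenmp).

#include <cstdio>
#include <cmath>
#include <omp.h>

// Hypothetical placeholder for the per-iteration CUDA/Thrust API call.
static void busy_work(size_t i) {
  volatile double x = 0.0;
  for (int k = 0; k < 1000000; ++k)
    x += std::sin((double)(i + k));
}

// Runs the 4-iteration parallel loop 100 times with the given thread
// count and returns the average wall-clock time per repetition.
static double run_benchmark(int cpus) {
  omp_set_schedule(omp_sched_static, 1);
  omp_set_nested(1);
  omp_set_num_threads(cpus);

  const int repetitions = 100;
  double start = omp_get_wtime();
  for (int r = 0; r < repetitions; ++r) {
    #pragma omp parallel for
    for (size_t i = 0; i < 4; ++i) {
      busy_work(i);
    }
  }
  return (omp_get_wtime() - start) / repetitions;
}

int main() {
  const int counts[] = {4, 16};
  for (int cpus : counts)
    printf("cpus = %2d: %.6f s per repetition\n", cpus, run_benchmark(cpus));
  return 0;
}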

Therefore, I'd like to ask: how can this happen?
