Let's consider a simple parallelized for loop using OpenMP:
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do single-threaded CUDA/Thrust API call
}
I run this code on an IBM Power System AC922 with 2 NUMA sockets, 16 cores per socket, and 4 threads per core.
OpenMP is configured as follows:
omp_set_schedule(omp_sched_static, 1);
omp_set_nested(1);
omp_set_num_threads(cpus);
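For reference, a minimal self-contained version of the setup looks roughly like this. The thrust::reduce on a per-iteration device_vector is only a placeholder workload standing in for the actual single-threaded CUDA/Thrust call, which I omit here; it can be compiled with, e.g., nvcc -Xcompiler -fopenmp.

// Minimal sketch of the setup; the Thrust workload is a placeholder.
#include <omp.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

void run_iterations() {
    #pragma omp parallel for
    for (size_t i = 0; i < 4; ++i) {
        // Placeholder single-threaded Thrust workload per iteration.
        thrust::device_vector<double> v(1 << 20, 1.0);
        double sum = thrust::reduce(v.begin(), v.end());
        std::printf("i=%zu thread=%d sum=%f\n", i, omp_get_thread_num(), sum);
    }
}

int main() {
    const int cpus = 4;  // 4 vs. 16 is the comparison in question
    omp_set_schedule(omp_sched_static, 1);
    omp_set_nested(1);
    omp_set_num_threads(cpus);
    run_iterations();
    return 0;
}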
Given that the for loop has exactly 4 iterations, I would assume that setting omp_set_num_threads(4) (i.e., cpus == 4) yields the best possible performance. However, it turns out that omp_set_num_threads(16) (i.e., cpus == 16) reproducibly performs better in terms of runtime (measured over a benchmark with 100 repetitions); not by a large margin, but consistently.
Therefore, I'd like to ask: how can this happen?
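For completeness, the runtimes I compare come from a harness along these lines (a sketch, not my actual benchmark code; run_iterations() refers to the placeholder loop sketched above, and omp_get_wtime() is used here just for illustration):

// Hedged sketch of the 100-repetition benchmark for a given thread count.
double benchmark(int cpus) {
    omp_set_schedule(omp_sched_static, 1);
    omp_set_nested(1);
    omp_set_num_threads(cpus);

    run_iterations();                      // warm-up (e.g., CUDA context creation)
    const double start = omp_get_wtime();
    for (int rep = 0; rep < 100; ++rep)
        run_iterations();
    return (omp_get_wtime() - start) / 100.0;  // mean runtime per repetition
}

// Usage: compare the two configurations.
// std::printf("4 threads:  %f s\n", benchmark(4));
// std::printf("16 threads: %f s\n", benchmark(16));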