Let's consider a parallelized for loop using omp, called scenario A.
const auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do single-threaded operation 1
// do single-threaded operation 2
// do single-threaded operation 3
// do single-threaded operation 4
}
const auto end = std::chrono::high_resolution_clock::now();
Now, let's compare it with scenario B, where the parallelized for loop is split up.
const auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do single-threaded operation 1
}
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do single-threaded operation 2
}
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do single-threaded operation 3
}
#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
// do single-threaded operation 4
}
const auto end = std::chrono::high_resolution_clock::now();
I run this code on an IBM Power System AC922 with 2 NUMA sockets, 16 cores per socket, and 4 threads per core.
omp is configured as follows.
omp_set_schedule(omp_sched_static, 1);
omp_set_nested(1);
omp_set_num_threads(cpus);
I use gcc (GCC) 9.3.1 and compile with -O3.
I would assume that scenario A would perform faster, as there is no additional thread-spawning and synchronization overhead involved between the tasks // do single-threaded operation 1 and // do single-threaded operation 4. However, it turns out that scenario B runs consistently faster (measured as end - start) in a benchmark where I run each scenario 100 times.
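For reference, a minimal self-contained sketch of the benchmark looks like this. The do_operation body is a hypothetical stand-in for the real single-threaded operations, and run_scenario_a / run_scenario_b are illustrative names I introduce here; compile with g++ -O3 -fopenmp (without -fopenmp the pragmas are simply ignored and the code runs serially).

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for one "single-threaded operation": a fixed chunk
// of CPU work whose result is accumulated so the compiler cannot remove it.
static double do_operation(std::size_t op, std::size_t i) {
    double s = 0.0;
    for (int k = 0; k < 100000; ++k)
        s += std::sqrt(static_cast<double>(op * 1000 + i * 100 + k));
    return s;
}

// Scenario A: one fused parallel loop; returns elapsed microseconds.
long long run_scenario_a() {
    double sink = 0.0;
    const auto start = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for reduction(+:sink)
    for (std::size_t i = 0; i < 4; ++i) {
        sink += do_operation(0, i);
        sink += do_operation(1, i);
        sink += do_operation(2, i);
        sink += do_operation(3, i);
    }
    const auto end = std::chrono::high_resolution_clock::now();
    // Fold sink into the result (always adds 0 for finite values) so the
    // work above stays observable to the optimizer.
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
         + static_cast<long long>(sink * 0.0);
}

// Scenario B: four separate parallel loops; returns elapsed microseconds.
long long run_scenario_b() {
    double sink = 0.0;
    const auto start = std::chrono::high_resolution_clock::now();
    for (std::size_t op = 0; op < 4; ++op) {
        #pragma omp parallel for reduction(+:sink)
        for (std::size_t i = 0; i < 4; ++i) {
            sink += do_operation(op, i);
        }
    }
    const auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
         + static_cast<long long>(sink * 0.0);
}
```

A driver would call each function 100 times in alternation and compare the accumulated times; absolute numbers will of course depend on the machine and thread configuration.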
Since the // do single-threaded operation <N>
statements do not benefit from additional parallelism, I can't explain this behavior.
My question is therefore: Why do multiple smaller omp parallel for
loops perform better than one big loop?