Monday, September 7, 2020

Why are multiple small OpenMP for loops faster than one big loop?

Let's consider a for loop parallelized with OpenMP, called scenario A.

const auto start = std::chrono::high_resolution_clock::now();

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do single-threaded operation 1
  // do single-threaded operation 2
  // do single-threaded operation 3
  // do single-threaded operation 4
}

const auto end = std::chrono::high_resolution_clock::now();

Now, let's compare it with scenario B, where the parallelized for loop is split into four separate loops.

const auto start = std::chrono::high_resolution_clock::now();

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do single-threaded operation 1
}

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do single-threaded operation 2
}

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do single-threaded operation 3
}

#pragma omp parallel for
for (size_t i = 0; i < 4; ++i) {
  // do single-threaded operation 4
}

const auto end = std::chrono::high_resolution_clock::now();

I run this code on an IBM Power System AC922 with 2 NUMA sockets, 16 cores per socket, and 4 threads per core.

OpenMP is configured as follows.

omp_set_schedule(omp_sched_static, 1);
omp_set_nested(1);
omp_set_num_threads(cpus);

I use gcc (GCC) 9.3.1 and compile with -O3.
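For reference, the build step would look roughly like the following (bench.cpp and the output name are placeholder assumptions, not taken from the question):

```shell
# -fopenmp enables the OpenMP pragmas and links the runtime; -O3 matches the question's setting
g++ -O3 -fopenmp bench.cpp -o bench
./bench
```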

I would have assumed scenario A to be faster, since it avoids the additional thread-management and synchronization overhead between the tasks // do single-threaded operation 1 and // do single-threaded operation 4. However, scenario B turns out to be consistently faster in runtime (end - start) in a benchmark where I run each scenario 100 times.

Since the // do single-threaded operation <N> statements do not benefit from additional parallelism, I can't explain this behavior.

My question is therefore: Why do multiple smaller omp parallel for loops perform better than one big loop?
