Friday, September 13, 2019

Nested Parallelism: OpenMP parallelization of loop over vector elements of type

Assuming that different callable objects (tasks) are stored as elements of a std::vector<> of the type std::function<>, each of these tasks can be called within a for loop. If the order of execution of the tasks is unimportant, the loop that calls these tasks can be parallelized trivially with OpenMP. However, when the tasks are called in parallel, any OpenMP parallelization in the individual tasks seems to be lost or ineffective.

Here is code illustrating the situation:

#include <iostream>
#include <functional>
#include <vector>
#include <cmath>

void task_1() {
  std::cout << "starting task 1"<< std::endl;
  int n = 1e8;
  double sum = 0;
  #ifdef _OPENMP
  #pragma omp parallel for reduction (+:sum)
  #endif
  for (int i = 0; i < n; ++i) {
    sum += std::sin((double)i);
  }
  std::cout << "result of task 1: sum = " << sum << std::endl;
}

void task_2() {
  std::cout << "starting task 2" << std::endl;
  int n = 1e8;
  double sum = 0;
  #ifdef _OPENMP
  #pragma omp parallel  for reduction (+:sum)
  #endif  
  for (int i = 0; i < n; ++i) {
    sum += std::cos((double)i);
  }
  std::cout << "result of task 2: sum = " << sum << std::endl;
}

void task_3() {
  std::cout << "starting task 3" << std::endl;
  int n = 1e8;
  double sum = 0;
  #ifdef _OPENMP
  #pragma omp parallel  for reduction (+:sum)
  #endif  
  for (int i = 0; i < n; ++i) {
    sum += std::sin((double)(i + 0.5) );
  }
  std::cout << "result of task 3: sum = " << sum << std::endl;
}

int main() {
  std::vector <std::function <void(void)> > myTasks;
  myTasks.push_back(&task_1);
  myTasks.push_back(&task_2);
  myTasks.push_back(&task_3);
  std::cout << "total tasks: " << myTasks.size()  << std::endl;
// #ifdef _OPENMP
// #pragma omp parallel for // this executes tasks in parallel, but kills parallelism within the individual tasks
// #endif
  for (int t = 0; t < static_cast<int>(myTasks.size()); t++) {  // cast avoids a signed/unsigned comparison
     myTasks[t]();
  }
  std::cout << "finished all tasks. " << std::endl;
}

The difference in computation times can be large depending on the choice of parallelization. On my not-so-powerful machine, the serial code requires about 18s:

$ time ./taskTest 
total tasks: 3
starting task 1
result of task 1: sum = 0.78201
starting task 2
result of task 2: sum = 1.53437
starting task 3
result of task 3: sum = 1.42189
finished all tasks. 

real    0m18.346s
user    0m18.324s
sys     0m0.004s

The version with an OpenMP parallelization of the individual tasks, each of which is called in a serial sequence, is the fastest:

real    0m1.843s
user    0m18.480s
sys     0m0.000s

while an additional parallelization of the loop over the tasks in main() slows down the code significantly:

real    0m6.221s
user    0m18.136s
sys     0m0.004s

This latter version shows the same performance as when only the main loop is parallelized, without using OpenMP in the individual tasks. This means that only one thread is used within each task if the parallel version of the loop in main is used.

Is there any way to obtain a nested parallel execution such that the tasks are called in parallel in the main loop while preserving multi-threaded operations in the individual tasks?
