Assuming that different callable objects (tasks) are stored as elements of a std::vector<> of the type std::function<>, each of these tasks can be called within a for loop. If the order of execution of the tasks is unimportant, the loop that calls these tasks can be parallelized trivially with OpenMP. However, when the tasks are called in parallel, any OpenMP parallelization in the individual tasks seems to be lost or ineffective.
Here is code illustrating the situation:
#include <iostream>
#include <functional>
#include <vector>
#include <cmath>

void task_1() {
    std::cout << "starting task 1" << std::endl;
    int n = 1e8;
    double sum = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction (+:sum)
#endif
    for (int i = 0; i < n; ++i) {
        sum += std::sin((double)i);
    }
    std::cout << "result of task 1: sum = " << sum << std::endl;
}

void task_2() {
    std::cout << "starting task 2" << std::endl;
    int n = 1e8;
    double sum = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction (+:sum)
#endif
    for (int i = 0; i < n; ++i) {
        sum += std::cos((double)i);
    }
    std::cout << "result of task 2: sum = " << sum << std::endl;
}

void task_3() {
    std::cout << "starting task 3" << std::endl;
    int n = 1e8;
    double sum = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction (+:sum)
#endif
    for (int i = 0; i < n; ++i) {
        sum += std::sin(i + 0.5);
    }
    std::cout << "result of task 3: sum = " << sum << std::endl;
}

int main() {
    std::vector<std::function<void(void)> > myTasks;
    myTasks.push_back(&task_1);
    myTasks.push_back(&task_2);
    myTasks.push_back(&task_3);
    std::cout << "total tasks: " << myTasks.size() << std::endl;
    // #ifdef _OPENMP
    // #pragma omp parallel for // this executes tasks in parallel, but kills parallelism within the individual tasks
    // #endif
    // The cast keeps the loop index signed and avoids a signed/unsigned comparison.
    for (int t = 0; t < static_cast<int>(myTasks.size()); t++) {
        myTasks[t]();
    }
    std::cout << "finished all tasks. " << std::endl;
}
The difference in computation times can be large depending on the choice of parallelization. On my not-so-powerful machine, the serial code requires about 18s:
$ time ./taskTest
total tasks: 3
starting task 1
result of task 1: sum = 0.78201
starting task 2
result of task 2: sum = 1.53437
starting task 3
result of task 3: sum = 1.42189
finished all tasks.
real 0m18.346s
user 0m18.324s
sys 0m0.004s
The version with an OpenMP parallelization of the individual tasks, each of which is called in a serial sequence, is the fastest:
real 0m1.843s
user 0m18.480s
sys 0m0.000s
while an additional parallelization of the loop over the tasks in main() slows down the code significantly:
real 0m6.221s
user 0m18.136s
sys 0m0.004s
This latter version performs the same as when only the main loop is parallelized, without using OpenMP in the individual tasks. This suggests that only one thread is used within each task when the loop in main() is run in parallel.
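A small sketch of how this observation could be checked directly, rather than inferred from timings. The helper name inner_team_size is hypothetical and not part of the program above. It opens a parallel region inside another parallel region and reports the largest inner team size seen, with the number of active nesting levels capped through omp_set_max_active_levels (assumed available, i.e. an OpenMP 3.0 or later runtime). Without OpenMP the function simply returns 1.

```cpp
#include <algorithm>
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns the largest team size observed in a parallel
// region that is opened inside another parallel region, when at most
// `levels` nested parallel regions are allowed to be active.
int inner_team_size(int levels) {
    assert(levels >= 1);
    int largest = 1;  // a serial build never sees more than one thread
#ifdef _OPENMP
    omp_set_dynamic(0);                 // ask the runtime not to shrink teams
    omp_set_max_active_levels(levels);  // cap the number of active nesting levels
    #pragma omp parallel num_threads(2) shared(largest)
    {
        // Inner region: stands in for the "parallel for" inside each task.
        #pragma omp parallel num_threads(2) shared(largest)
        {
            const int team = omp_get_num_threads();
            #pragma omp critical
            largest = std::max(largest, team);
        }
    }
#endif
    return largest;
}
```

With one active level, inner_team_size(1) returns 1, because the inner region is serialized; this matches the behaviour seen in the timings above. With inner_team_size(2) the inner region is allowed to be active, so the result may be larger than 1, although the actual value depends on the runtime and on the thread limits in effect.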
Is there any way to obtain a nested parallel execution such that the tasks are called in parallel in the main loop while preserving multi-threaded operations in the individual tasks?