jeudi 31 mars 2016

Parallel optimization - multithreading performance hit

I want to run several runs of simulated annealing with different parameters many times to tune the parameters. It seemed like this would be an obvious use case for multiple threads since each run might take several seconds for a large number of iterations and can do its work independently of the others. A sketch of the relevant portion of the code is:

// I have some vector 'results' with 4 elements that will hold the
// optimization results (call this result_t) and a vector 'probs' of 
// 4 functors of type 'prob_t'. 'cost_t' is some integral type.

auto driver = [=] (result_t& r, prob_t p) 
{ r = genericSimulatedAnneal<result_t, cost_t, prob_t> 
        (guess, params, p); };

std::vector<std::thread> threads;
for (int t_num = 0; t_num < 4; ++t_num) {
    threads.push_back(std::thread(driver, std::ref(results[t_num]),
                                  probs[t_num]));
}

for (auto& t: threads) t.join();

The signature of genericSimulatedAnneal is

template<class solution_t, class cost_t, class acceptance_p>
genericSimulatedAnneal(const solution_t&, 
                       const GenSimAnnealParams<solution_t,cost_t>&,
                       const acceptance_p&);

The problem is that when I run the 4 cases in parallel this way, it actually ends up taking twice as much CPU time per iteration as running them sequentially. My question is, where is the performance hit coming from? The driver lambda copies the guess and parameters, so threads should each be getting their own copy, and each thread's work is done entirely internally and only returned when the annealing finishes. I can't see why this wouldn't be (almost) a 4 times speedup since the vast majority of the program's work should be private to each thread.

Aucun commentaire:

Enregistrer un commentaire