I am writing code that optimizes a function of a large amount of data (~1 TB) over a high-dimensional parameter space. To accomplish this, I have distributed my data amongst multiple workers, with a central Root process handling the gradient descent algorithm and distributing the information necessary for the calculations via the C++ MPI implementation.
The Root process needs to broadcast the position vector within this high-dimensional space (a std::vector<double> with N >> 100,000 elements) to the workers, as well as some basic information about how to execute the function (parameterised as two integers).
The code snippet looks like this:
int circuitBreaker = 10;
int effectiveBatches = 8;
std::vector<double> TransformedPosition = std::vector<double>(VeryLargeNumber,0.0);
MPI_Bcast(&circuitBreaker, 1, MPI_INT, RunningID, MPI_COMM_WORLD);
MPI_Bcast(&effectiveBatches, 1, MPI_INT, RunningID, MPI_COMM_WORLD);
MPI_Bcast(&TransformedPosition[0], VeryLargeNumber, MPI_DOUBLE, RunningID, MPI_COMM_WORLD);
The workers have corresponding MPI_Bcast statements, and everything works fine. We execute this code in a loop, calling the broadcasts once every ~3 seconds over the course of several days, on a computing node with 20 cores.
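For clarity, the matching worker-side calls look roughly like this (a sketch only; on every rank other than RunningID the same MPI_Bcast calls act as receives into locally allocated buffers of the same size):
int circuitBreaker = 0;
int effectiveBatches = 0;
std::vector<double> TransformedPosition = std::vector<double>(VeryLargeNumber,0.0);
// Same call signatures as on the root; because this rank is not RunningID, the calls fill these buffers.
MPI_Bcast(&circuitBreaker, 1, MPI_INT, RunningID, MPI_COMM_WORLD);
MPI_Bcast(&effectiveBatches, 1, MPI_INT, RunningID, MPI_COMM_WORLD);
MPI_Bcast(&TransformedPosition[0], VeryLargeNumber, MPI_DOUBLE, RunningID, MPI_COMM_WORLD);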
I recently tried to neaten this code up a bit, by simplifying it to the following:
int circuitBreaker = 10;
int effectiveBatches = 8;
std::vector<double> TransformedPosition = std::vector<double>(VeryLargeNumber,0.0);
std::vector<int> info = {circuitBreaker, effectiveBatches};
MPI_Bcast(&info[0], info.size(), MPI_INT, RunningID, MPI_COMM_WORLD);
MPI_Bcast(&TransformedPosition[0], VeryLargeNumber, MPI_DOUBLE, RunningID, MPI_COMM_WORLD);
The logic being that I could add more information to the broadcast without making the code harder to read.
To my shock, this resulted in a slowdown of up to a factor of 5 or 6 relative to the previous speed. My immediate intuition was that MPI_Bcast-ing a vector induces significant overhead compared to a single value, so my next test was to try broadcasting my honking big vector as individual values:
int circuitBreaker = 10;
int effectiveBatches = 8;
std::vector<double> TransformedPosition = std::vector<double>(VeryLargeNumber,0.0);
MPI_Bcast(&circuitBreaker, 1, MPI_INT, RunningID, MPI_COMM_WORLD);
MPI_Bcast(&effectiveBatches, 1, MPI_INT, RunningID, MPI_COMM_WORLD);
for (int i = 0; i < VeryLargeNumber; ++i)
{
MPI_Bcast(&TransformedPosition[i], 1, MPI_DOUBLE, RunningID, MPI_COMM_WORLD);
}
This ran slower than my original code, but only by a factor of around 2.
The only way I can wrap my head around this is that MPI_Bcast with a count value > 1 incurs significant overhead, such that for small counts it is much more efficient to hand the values over with individual MPI_Bcast calls, whereas for larger vectors this overhead is less significant.
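To put numbers on this, the two variants could be timed in isolation with something like the following (a minimal sketch, reusing the variables defined above; the repetition count and barrier placement are arbitrary choices):
// Micro-benchmark sketch: compare one combined two-int broadcast against two
// separate single-int broadcasts, averaged over many repetitions.
const int reps = 1000;
MPI_Barrier(MPI_COMM_WORLD); // line all ranks up before timing
double t0 = MPI_Wtime();
for (int r = 0; r < reps; ++r)
{
MPI_Bcast(&info[0], 2, MPI_INT, RunningID, MPI_COMM_WORLD);
}
double combined = (MPI_Wtime() - t0) / reps;

MPI_Barrier(MPI_COMM_WORLD);
t0 = MPI_Wtime();
for (int r = 0; r < reps; ++r)
{
MPI_Bcast(&circuitBreaker, 1, MPI_INT, RunningID, MPI_COMM_WORLD);
MPI_Bcast(&effectiveBatches, 1, MPI_INT, RunningID, MPI_COMM_WORLD);
}
double separate = (MPI_Wtime() - t0) / reps;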
To what extent is this true, and what would be the most efficient set of MPI operations to get this data from my root process to the workers?