Thursday, June 30, 2022

parallelizing C++ code with MPI_Send and MPI_Recv

I have a parallel code, but I'm not sure whether it actually works in parallel. I have two vectors A and B whose elements are matrices defined with a proper class. Since the matrices in the vectors are not of a primitive type, I can't send these vectors to the other ranks with MPI_Scatter, so I have to use MPI_Send and MPI_Recv. Also, rank 0 has only a coordination role: it sends the other ranks the blocks they should work with and collects the results at the end, but it does not participate in the computation.
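
For context, here is a minimal sketch of the kind of dense_matrix class the code below assumes (my reconstruction, not necessarily the class from the exercise): all the MPI calls rely on is that rows() returns the square dimension n and that data() exposes the n*n entries as a contiguous double buffer; an operator* computing the matrix product is assumed as well.

#include <vector>

// hypothetical dense_matrix with contiguous row-major storage, so that
// data() can be passed directly to MPI_Send / MPI_Recv as an MPI_DOUBLE buffer
class dense_matrix
{
public:
    dense_matrix() = default;
    dense_matrix(unsigned rows, unsigned cols)
        : n_rows(rows), n_cols(cols), storage(rows * cols, 0.0) {}

    unsigned rows() const { return n_rows; }
    unsigned cols() const { return n_cols; }

    double*       data()       { return storage.data(); }
    const double* data() const { return storage.data(); }

    double& operator()(unsigned i, unsigned j)       { return storage[i * n_cols + j]; }
    double  operator()(unsigned i, unsigned j) const { return storage[i * n_cols + j]; }

private:
    unsigned n_rows = 0, n_cols = 0;
    std::vector<double> storage;
};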

The solution to the exercise is the following:

// rank 0 sends the blocks to the other ranks, which compute the local
// block products; it then receives the partial results and prints the
// global vector
if (rank == 0)
{
    // send data
    for (unsigned j = 0; j < N_blocks; ++j) {
        int dest = j / local_N_blocks + 1;
        // send number of rows
        unsigned n = A[j].rows();
        MPI_Send(&n, 1, MPI_UNSIGNED, dest, 1, MPI_COMM_WORLD);
        // send blocks
        MPI_Send(A[j].data(), n*n, MPI_DOUBLE, dest, 2, MPI_COMM_WORLD);
        MPI_Send(B[j].data(), n*n, MPI_DOUBLE, dest, 3, MPI_COMM_WORLD);
    }

    // ... for loop for rank 0 to receive the results from the other ranks ...
}
// all the other ranks receive the blocks and compute the local block
// products, then send the results to rank 0
else
{
    // local vector
    std::vector<dense_matrix> local_C(local_N_blocks);
    // receive data and compute products
    for (unsigned j = 0; j < local_N_blocks; ++j) {
        // receive number of rows
        unsigned n;
        MPI_Recv(&n, 1, MPI_UNSIGNED, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // initialize blocks
        dense_matrix local_A(n, n);
        dense_matrix local_B(n, n);
        // receive blocks
        MPI_Recv(local_A.data(), n*n, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(local_B.data(), n*n, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // compute product
        local_C[j] = local_A * local_B;
    }

    // ... for loop for ranks != 0 to send the results to rank 0 ...
}
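
For reference, here is a minimal sketch of what the two elided loops could look like; the result tag (4) and the assumption that block j is returned by the same rank it was sent to are mine, not necessarily what the exercise intends:

// hypothetical gather loop on rank 0: collect the partial products into a global vector
std::vector<dense_matrix> C(N_blocks);
for (unsigned j = 0; j < N_blocks; ++j) {
    int src = j / local_N_blocks + 1;
    unsigned n = A[j].rows();
    C[j] = dense_matrix(n, n);
    MPI_Recv(C[j].data(), n*n, MPI_DOUBLE, src, 4, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

// hypothetical send-back loop on ranks != 0, in the same order the blocks were computed
for (unsigned j = 0; j < local_N_blocks; ++j) {
    unsigned n = local_C[j].rows();
    MPI_Send(local_C[j].data(), n*n, MPI_DOUBLE, 0, 4, MPI_COMM_WORLD);
}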

In my opinion, if local_N_blocks = N_blocks / (size - 1) is greater than 1, the variable dest keeps the same value for several consecutive loop iterations. So, after the first iteration of the "sending loop", the second time that rank 0 reaches

MPI_Send(A[j].data(), n*n, MPI_DOUBLE, dest, 2, MPI_COMM_WORLD);
MPI_Send(B[j].data(), n*n, MPI_DOUBLE, dest, 3, MPI_COMM_WORLD);

it has to wait until the operation local_C[j] = local_A * local_B for the previous j has completed on the destination rank (and the next MPI_Recv has been posted), so the code doesn't seem well parallelized to me. What do you think?
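
To make the concern concrete: MPI_Send is a standard-mode, blocking send. Small messages are often buffered eagerly and the call returns right away, but above the eager threshold it typically does not return until the destination has posted the matching MPI_Recv, which is exactly the situation described above when dest is still busy multiplying its previous pair of blocks. One way to decouple rank 0 from the workers (a sketch of mine, not the exercise's intended solution) is to post all sends as non-blocking MPI_Isend and wait on them at the end:

// hypothetical non-blocking variant of the sending loop on rank 0
std::vector<MPI_Request> requests;
std::vector<unsigned> sizes(N_blocks);   // counts must stay alive until MPI_Waitall
requests.reserve(3 * N_blocks);
for (unsigned j = 0; j < N_blocks; ++j) {
    int dest = j / local_N_blocks + 1;
    sizes[j] = A[j].rows();
    unsigned n = sizes[j];
    MPI_Request r;
    MPI_Isend(&sizes[j], 1, MPI_UNSIGNED, dest, 1, MPI_COMM_WORLD, &r);   requests.push_back(r);
    MPI_Isend(A[j].data(), n*n, MPI_DOUBLE, dest, 2, MPI_COMM_WORLD, &r); requests.push_back(r);
    MPI_Isend(B[j].data(), n*n, MPI_DOUBLE, dest, 3, MPI_COMM_WORLD, &r); requests.push_back(r);
}
// A, B and sizes must not be modified before all sends complete
MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);

Another option that keeps the blocking sends would be to interleave the destinations, e.g. dest = j % (size - 1) + 1, so that consecutive sends go to different ranks instead of queueing behind the same busy worker.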
