dimanche 30 décembre 2018

Strange benchmark results for open mp methods

My benchmark results are very strange. On one hand I have the serial function for calculating the quadratic form. On the other hand I wrote two parallel versions. For one thread all functions should need more or less the same running time. But one parallel function just needs half of the time. Is there a "hidden" optimization?

Serial version:

double quadratic_form_serial(const std::vector<double> & A,const std::vector<double> & v, const std::vector<double> & w){
    int N= v.size();
    volatile double q=0.0;

    for(int i=0; i<N; ++i)
        for(int j=0; j<N; ++j)
            q +=v[i]*A[i*N+j]*w[j];

return q;
}

Parallel version 1:

double quadratic_form_parallel(const std::vector<double> & A,const std::vector<double> & v, const std::vector<double> & w, const int threadnum){
int N= v.size();

omp_set_num_threads(threadnum);
volatile double q[threadnum];
volatile double val = 0.0; 

#pragma omp parallel
{
    int me = omp_get_thread_num(); 
    q[me] = 0.0;
    #pragma omp for collapse(2)
    for(int i=0; i<N; ++i)
        for(int j=0; j<N; ++j)
            q[me]+=v[i]*A[i*N+j]*w[j];
    #pragma omp atomic
    val+=q[me];
}
return val;
}

Parallel version 2:

double quadratic_form_parallel2(const std::vector<double> & A,const std::vector<double> & v, const std::vector<double> & w, const int threadnum){
    int N= v.size();
    volatile double result =0.0;
    omp_set_num_threads(threadnum);

    #pragma omp parallel for reduction(+: result)
    for (int i=0; i<N; ++i)
        for (int j=0; j<N; ++j)
            result += v[i] * A[i*N + j] * w[j];
    return result;
}

I run the code for N=10000 and I flushed the cache before I call the function. The function quadratic_form_parallel2 needs with one thread less than the half of the time the two other function needed:

threads    serial       Parallel1       Parallel2
1          0.0882503    0.0875649       0.0313441   

Aucun commentaire:

Enregistrer un commentaire