samedi 2 mai 2020

RcppParallel fast vector square algorithm

Following the guide I have written a function allowing to pow the NumericVector via RcppParallel package. The code is as follows:

#include <RcppArmadillo.h>
#include <RcppParallel.h>

using namespace Rcpp;
using namespace RcppArmadillo;
using namespace RcppParallel;

double sqr(double x)
{
  return(x * x);
}

// Parallel pow of vectors
struct ParallelVectorPowStruct : public Worker
{
  // source matrix
  const RVector<double> input;

  // powers matrix
  const RVector<double> input_powers;

  // destination matrix
  RVector<double> output;

  // type of power (0 - general, 1 - square, 2 - square root)
  int pow_type;

  // initialize with source and destination
  ParallelVectorPowStruct(const NumericVector input, NumericVector output, NumericVector input_powers, int pow_type) 
    : input(input), output(output), input_powers(input_powers), pow_type(pow_type) {}

  // take the powers of the range of elements requested
  void operator()(std::size_t begin, std::size_t end) {

    if (pow_type == 0)
    {
      // perform pow operation
      std::transform(input.begin() + begin, 
                     input.begin() + end, 
                     input_powers.begin(),
                     output.begin() + begin, 
                     ::pow);
    }

    if (pow_type == 1)
    {
      std::transform(input.begin() + begin, 
                     input.begin() + end,
                     output.begin() + begin, 
                     sqr);
    }

    if (pow_type == 2)
    {
      // perform pow operation
      std::transform(input.begin() + begin, 
                     input.begin() + end,
                     output.begin() + begin, 
                     ::sqrt);
    }

  }

};

// [[Rcpp::export]]
NumericVector ParallelVectorPow(NumericVector x, double value) {

  int pow_type = 0;

  if (value == 0.5)
  {
    pow_type = 2;
  }

  if (value == 2)
  {
    pow_type = 1;
  }

  // allocate the output matrix
  NumericVector output(x.size());

  //  allocate the powers matrix
  NumericVector input_powers(x.size());
  std::fill(input_powers.begin(), input_powers.end(), value);

  // ParallelVectorPowStruct functor
  ParallelVectorPowStruct parallelVectorPowStruct(x, output, input_powers, pow_type);

  // call parallelFor to do the work
  parallelFor(0, x.length(), parallelVectorPowStruct);

  // return the output matrix
  return output;
}

Note that I have added two additional distinct calculation options (when pow_type = 1 and when pow_type = 2) for the power of 0.5 (square root) and for the power of 2 (square). My motivation for making this stuff has been as follows.

First I have analyzed the code performance without introducing these two additional types of powers and have found that parallelized version works much faster (usually about 8 times on my 8-core 3800-mhz CPU)

For example the following code:

m <- runif(1000000, 1, 100)
library(rbenchmark)
res <- benchmark(m ^ 3,
                 ParallelVectorPow(m, 3),
                 order = "relative")
res[,1:4]

yields:

                     test replications elapsed relative
2 ParallelVectorPow(m, 3)          100    0.45    1.000
1                     m^3          100    4.22    9.378

However the things are different when it comes to power of 0.5 and 2. Parallel version of code used to work much slower then not parallelized for these powers. I have solved the problem for the power of 0.5 using ::sqrt function following the approach mentioned in guide referenced above (see the code for pow_type == 2). However I can't figure out how to improve the code for the power of 2.

So consider the following code:

m <- runif(1000000, 1, 100)
library(rbenchmark)
res <- benchmark(m ^ 2,
                 ParallelVectorPow(m, 2),
                 order = "relative")
res[,1:4]

It yields:

                     test replications elapsed relative
1                     m^2          100    0.14    1.000
2 ParallelVectorPow(m, 2)          100    0.36    2.571

Is it possible to speed it up via RcppParallel framework?

Will be very greatfull for help!

P.S.

  1. I think that m ^ 2 and m ^ 0.5 make some special calls to some fast c-language functions while m ^ other_power makes some general alternative routine.

  2. I also have tested these implementation against various RcppArmadillo based square approaches and have found that they are slower then both listed above.

Aucun commentaire:

Enregistrer un commentaire