Wednesday, 29 August 2018

Duplicate variadic template parameter


I am a junior software engineer, so please let me know if I am reinventing the wheel. I'd like to create a template function that wraps another function and calls it element-wise. For example:

// returns a*x + y
__device__ float saxpy(float a, float x, float y) {
  return a*x + y;
}

int main() {
  int A[4] = { 1,2,3,4 };
  int X[4] = { 1,2,3,4 };
  int Y[4] = { 1,1,1,1 };

  // A*X   = 1,4,9,16
  // A*X+Y = 2,5,10,17
  float *C = cudaReduce(saxpy, A, X, Y);

  for (int i = 0; i < 4; i++)
    printf("%g, ", C[i]); // should print "2, 5, 10, 17, "

  return 0;
}

The point of this wrapper is to hide the CUDA calls whenever I perform element-wise operations. My attempt at the function wrapper is below; it is still very incomplete.

I'd like to provide a minimal example, but I am unsure how to express certain things in C++, so please forgive the large amount of commented pseudocode:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <cstdio>   // printf
#include <iostream>

// returns a*x + y
__device__ float saxpy(float a, float x, float y) {
  return a*x + y;
}

// finds return type of function pointer
template<typename R, typename... A>
R ret(R(*)(A...));
template<typename C, typename R, typename... A>
R ret(R(C::*)(A...));

template<typename F, size_t N, typename... Args>
auto cudaReduce(F &f, Args(&...argsarray)[N]) {

  // ret is function f's return type
  typedef decltype(ret(f)) ret;
  ret d_out[N], h_out[N];
  // cudaMalloc((void**)&d_out, sizeof(d_out));
  sendToCuda(argsarray...); // allocates and copies all contents of argsarray to cuda

  // cudaReduceKernel<<<1, N>>>(f, d_out, dev_argsarray...);

  // cudaDeviceSynchronize();
  // cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  // cudaFree(d_out);

  // for d_args in d_argsarray
  //   cudaFree(d_args);

  return h_out; // h_out is a local array, so the returned pointer dangles
}

template<typename F, size_t N, typename Out, typename... Args>
__global__ void cudaReduceKernel(F &f, Out(&out)[N], Args(&...argsarray)[N]) {
  int tid = threadIdx.x;
  int i = tid + blockIdx.x * blockDim.x;

  // Below is invalid syntax; however, the 'pseudo-code' is what I'd like to achieve.
  // out[i] = f(argsarray[i]...);
}

// cuda malloc and memcpy
template<typename Arg, size_t N>
void sendToCuda(Arg(&args)[N]) {
  size_t buffer = sizeof(args);
  //cudaMalloc((void**)&dev_arg[ ??? ], buffer);
  //cudaMemcpy((void**)&dev_arg[ ??? ], args, buffer, cudaMemcpyHostToDevice);
}

template<typename Arg, size_t N, typename... Args>
void sendToCuda(Arg(&args)[N], Args(&...argsarray)[N]) {
}

int main() {
  int A[4] = { 1,2,3,4 };
  int X[4] = { 1,2,3,4 };
  int Y[4] = { 1,1,1,1 };

  // A*X   = 1,4,9,16
  // A*X+Y = 2,5,10,17
  float *C = cudaReduce(saxpy, A, X, Y);

  for (int i = 0; i < 4; i++)
    printf("%g, ", C[i]); // should print "2, 5, 10, 17, ", currently undefined behaviour

  return 0;
}

I realize not everyone has time to review the code completely, so I will boil the key problems down to two points:

1. Is it possible to duplicate variadic template inputs and, if so, how? Example (not real code):

template<typename... Args>
void foo(Args... args) {
  Args2... args;
}

I need this so that each input parameter gets a matching device-side parameter that I can pass to cudaMalloc() and cudaMemcpy().
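
For what it's worth, one way this kind of duplication might be expressed in standard C++ is to expand the pack a second time into a std::tuple, so that every input array gets its own mirror object. The following is only a host-side sketch: the helper name duplicateArrays is hypothetical, and std::vector stands in for the cudaMalloc()/cudaMemcpy() pair, which is an assumption made so the example runs without a GPU.

```cpp
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical helper: creates one independent copy per input array.
// std::vector is a stand-in for a device buffer; its range constructor
// plays the role of cudaMalloc + cudaMemcpy(..., cudaMemcpyHostToDevice).
template <std::size_t N, typename... Args>
std::tuple<std::vector<Args>...> duplicateArrays(const Args (&...arrays)[N]) {
  // The pattern std::vector<Args>(arrays, arrays + N) is expanded once
  // per element of the pack, so each input gets its own mirror.
  return std::make_tuple(std::vector<Args>(arrays, arrays + N)...);
}
```

Calling duplicateArrays(A, X) with int A[3] and float X[3] should yield a std::tuple<std::vector<int>, std::vector<float>>, with std::get<0> and std::get<1> giving access to the individual copies. In real CUDA code the tuple would presumably hold device pointers (for example std::tuple<Args*...>) instead of vectors.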

2. How would I access the i-th tuple of a variadic array parameter, like zipping in Python? Example (not real code):

template<typename... Args, size_t N>
void bar(Args(&...argsarray)[N]) {
  // (python) ithvariadic = zip(*argsarray)[i]
  auto ithvariadic = argsarray[i]...;
}
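
As far as I can tell, an expression of the form f(argsarray[i]...) is already a valid pack expansion in standard C++ when it appears as a function-call argument list: the pattern argsarray[i] is repeated once per array, which gives the "zip" behaviour. Below is a minimal host-only sketch, assuming C++14 or later; the names mapElementwise and saxpyHost are hypothetical and no CUDA is involved.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side counterpart of saxpy: returns a*x + y.
float saxpyHost(float a, float x, float y) { return a * x + y; }

// Hypothetical host-only stand-in for the element-wise wrapper.
template <typename F, std::size_t N, typename... Args>
auto mapElementwise(F f, const Args (&...arrays)[N]) {
  // Result type of calling f with one element taken from each array.
  using R = decltype(f(arrays[0]...));
  std::array<R, N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    // Pack expansion: becomes f(a0[i], a1[i], a2[i], ...).
    out[i] = f(arrays[i]...);
  }
  return out;  // returned by value, so nothing dangles
}
```

With the A, X and Y arrays from the example above, mapElementwise(saxpyHost, A, X, Y) should give 2, 5, 10, 17. In an actual kernel the array references would have to become device pointers, but the expansion would presumably be written the same way.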

