Sunday, September 3, 2017

NCCL - can we sum all the values of an array on a single GPU?

I have a single GPU (e.g. a GeForce GTX 980 Ti) and a single float array of length 128, allocated on that device with cudaMalloc, with every value set to 1.f. I want to use NCCL to sum the elements and obtain 128, i.e. 1 + 1 + ... + 1 = 128.

However, if I read the NCCL developer documentation correctly, a reduction only combines arrays across devices (ranks); it does not sum the elements within one array on a single device:

cf. http://ift.tt/2ewcCg5

From there (quoting),

"AllReduce starts with independent arrays Vk of N values on each of K ranks and ends with identical arrays S of N values, where S[i] = V0[i] + V1[i] + … + Vk-1[i], for each rank k."

I want to confirm that NCCL cannot reduce (sum) the elements of one array that lives on a single GPU.
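To make my reading of the quoted definition concrete, here is a host-only C++ sketch of what I understand AllReduce with a sum to compute. It makes no NCCL or CUDA calls; the names all_reduce_sum and sum_within are hypothetical and exist only for this illustration, and I am assuming every rank holds an array of the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side model of AllReduce with a sum (NOT an NCCL call).
// ranks[k] plays the role of V_k; the returned vector plays the role of S,
// the array that every rank would end up with.
std::vector<float> all_reduce_sum(const std::vector<std::vector<float>>& ranks) {
    if (ranks.empty()) return {};
    std::vector<float> s(ranks.front().size(), 0.f);
    for (const auto& v : ranks) {
        assert(v.size() == s.size());  // assumption: all ranks hold N values
        for (std::size_t i = 0; i < s.size(); ++i) {
            s[i] += v[i];  // S[i] accumulates V_k[i] over the ranks k
        }
    }
    return s;
}

// The reduction I actually want: sum the elements inside one array.
float sum_within(const std::vector<float>& v) {
    return std::accumulate(v.begin(), v.end(), 0.f);
}
```

With a single rank holding 128 values of 1.f, all_reduce_sum returns 128 values of 1.f (a copy of the input), while sum_within returns 128.f. That is the difference I am asking about.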

My full code (and how to compile) is here as reference/context:

http://ift.tt/2iTk49U

The "meat" of the code is below; the "prep" before it (declarations) should be correct:

// Query the number of ranks in the communicator (expected to be 1 with a single GPU).
ncclCommCount(*comm.get(), &count);

// Element-wise sum across ranks, written to d_out on every rank.
// size is 128, the number of floats in both d_in and d_out.
ncclAllReduce(d_in.get(), d_out.get(), size,
              ncclFloat, ncclSum, *comm.get(), *stream.get());

I wrapped my pointers in C++11 smart pointers, but I also tried the code with raw pointers and got the same result; I can post that version if you'd like.

Please confirm that I cannot use NCCL to do a parallel reduction over a single array on a single GPU, or show me how I can. Thanks!
