Thursday, December 6, 2018

CUDA: how to use barrier.sync

I have read https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar, which documents the PTX synchronization instructions.

  1. It says there are 16 "barrier logical resources", and that you can specify which barrier to use with the parameter "a". What is a barrier logical resource?

  2. I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside "asm" or what "memory" does. I assume "name" replaces "%0" and "numThreads" replaces "%1", but what is "memory", and what are the colons doing? (My current reading is sketched after the code below.)

    __device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
        asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
    }
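
Here is the same wrapper rewritten with comments showing how I currently read the syntax (this reading may be wrong, which is partly why I am asking; the comments are my assumptions, not something from the source of the code):

    // Extended-asm sections after the template string are  outputs : inputs : clobbers.
    // There are no outputs here, so "%0" and "%1" are the first and second inputs,
    // i.e. "r"(name) and "r"(numThreads); "r" requests a 32-bit register operand.
    // "memory" is a clobber: it tells the compiler the asm may read or write memory,
    // so loads and stores must not be reordered or cached across the barrier.
    __device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
        asm volatile("bar.sync %0, %1;"
                     :                             // outputs (none)
                     : "r"(name), "r"(numThreads)  // inputs: %0 = name, %1 = numThreads
                     : "memory");                  // clobbers
    }
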
  3. In a block of 256 threads, I want only threads 64 ~ 127 to synchronize. Is this possible with the bar.sync instruction? For example, say I have a grid of 1 block with 256 threads, and I split the block into 3 conditional branches so that threads 0 ~ 63 go into kernel1, threads 64 ~ 127 go into kernel2, and threads 128 ~ 255 go into kernel3. I want the threads in kernel2 to synchronize only among themselves. If I call the "namedBarrierSync" function defined above as "namedBarrierSync(1, 64)", does it synchronize only threads 64 ~ 127, or threads 0 ~ 63? (A sketch of this setup follows.)
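
Here is a minimal sketch of the split I am describing (the kernel name and the per-branch work are placeholders, not real code I have):

    __global__ void splitKernel() {
        int tid = threadIdx.x;
        if (tid < 64) {
            // "kernel1" work for threads 0 ~ 63
        } else if (tid < 128) {
            // "kernel2" work for threads 64 ~ 127
            namedBarrierSync(1, 64);   // barrier 1, expecting 64 arriving threads
            // ... work that should only start once all 64 threads have arrived
        } else {
            // "kernel3" work for threads 128 ~ 255
        }
    }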

  4. I have tested with the code below (assume that gpuAssert is an error-checking function defined somewhere in the file):

For some reason I can't format the code correctly here, so I took a screenshot.

[screenshot of the test code]
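
Since the screenshot does not reproduce here, the following is only an illustration of the kind of test I mean, not the actual code from the screenshot (the gpuAssert stand-in, the kernel name, and the buffer layout are all made up for this sketch):

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical stand-in for the gpuAssert error checker mentioned above.
    inline void gpuAssert(cudaError_t code) {
        if (code != cudaSuccess) {
            fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(code));
            exit(code);
        }
    }

    __device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
        asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
    }

    // Only threads 64 ~ 127 take part: each writes its id to shared memory, waits
    // on barrier 1 with a count of 64 threads, then reads its neighbour's slot.
    __global__ void namedBarrierTest(int *out) {
        __shared__ int buf[256];
        int tid = threadIdx.x;
        if (tid >= 64 && tid < 128) {
            buf[tid] = tid;
            namedBarrierSync(1, 64);
            int neighbour = 64 + (tid - 64 + 1) % 64;
            out[tid] = buf[neighbour];   // only valid if the 64 threads really synchronized
        }
    }

    int main() {
        int *d_out = nullptr;
        gpuAssert(cudaMalloc(&d_out, 256 * sizeof(int)));
        gpuAssert(cudaMemset(d_out, 0, 256 * sizeof(int)));
        namedBarrierTest<<<1, 256>>>(d_out);
        gpuAssert(cudaPeekAtLastError());
        gpuAssert(cudaDeviceSynchronize());
        gpuAssert(cudaFree(d_out));
        return 0;
    }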
