Monday, April 25, 2016

How to properly implement an execute-a-function-on-each-element pattern with CUDA?

I have a class representing one or several containers of objects. The class offers a function to run a callback for each of the elements. A simple implementation could look like:

#include <functional>

struct MyData {
    Foo* foo;   // contiguous storage for the elements
    size_t n;   // number of elements in foo
    void doForAllFoo(std::function<void(Foo)> fct) {
        for (size_t i = 0; i < n; ++i) {
            fct(foo[i]);
        }
    }
};

Driving code:

MyData d = MyData(...);
TypeX param1 = create_some_param();
TypeY param2 = create_some_more_param();
d.doForAllFoo([&](Foo f) { my_function(f, param1, param2); });

I think this is a good solution for flexible callbacks on a container.

Now I'd like to parallelize this with CUDA. I'm not quite sure what is allowed with lambdas in CUDA, and I'm also not sure how compilation for __device__ and __host__ code works.
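From what I can tell, recent nvcc versions (7.5 experimentally, 8.0 more fully) do accept __device__ lambdas when compiled with the --expt-extended-lambda flag, as long as they capture by value. Here is a minimal sketch of what I believe should work; forEachKernel, data, and scale are just illustrative names, and error checking is left out:

// A minimal sketch of a device lambda, assuming CUDA 7.5+ and compilation
// with: nvcc -std=c++11 --expt-extended-lambda example.cu
#include <cstdio>

// Generic kernel: runs any device-callable functor once per index.
template <typename F>
__global__ void forEachKernel(int n, F fct) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fct(i);
}

int main() {
    int n = 1024;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));  // accessible from host and device

    float scale = 2.0f;
    // Device lambdas must capture by value; capturing the pointer by value
    // is fine here because it points to managed memory.
    forEachKernel<<<(n + 255) / 256, 256>>>(n, [=] __device__ (int i) {
        data[i] = scale * i;
    });
    cudaDeviceSynchronize();

    printf("data[10] = %f\n", data[10]);
    cudaFree(data);
    return 0;
}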

I can (and probably will have to) change MyData, but I'd like the driving code to carry no trace of CUDA, except of course that I have to allocate memory in a CUDA-accessible way.
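Here is a sketch of what I have in mind, assuming Foo is trivially copyable and the storage comes from cudaMallocManaged so both host and device can access it (the launch configuration and error handling are simplified, and copying of MyData is ignored). Since std::function cannot be invoked in device code, the callback has to become a template parameter:

#include <cstddef>

struct Foo { float x; };

// Applies the functor to every element; F must be callable on the device.
template <typename F>
__global__ void forEachKernel(std::size_t n, Foo* foo, F fct) {
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fct(foo[i]);
}

struct MyData {
    Foo* foo;
    std::size_t n;

    explicit MyData(std::size_t count) : n(count) {
        cudaMallocManaged(&foo, n * sizeof(Foo));  // host- and device-accessible
    }
    ~MyData() { cudaFree(foo); }  // copy semantics omitted for brevity

    // std::function cannot run on the device, so the callback is taken as a
    // template parameter instead; any __device__ lambda or functor works.
    template <typename F>
    void doForAllFoo(F fct) {
        forEachKernel<<<(n + 255) / 256, 256>>>(n, foo, fct);
        cudaDeviceSynchronize();
    }
};

The driving code would then stay almost unchanged; the only visible trace of CUDA would be the __device__ annotation on the lambda and the by-value capture it requires:

MyData d(1024);
float param1 = 2.0f;
d.doForAllFoo([=] __device__ (Foo& f) { f.x += param1; });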

Is something like this the right approach? I think a minimal, verified example would be very helpful.
