I’m trying to learn OpenACC, but all the examples I’ve studied are quite basic: move a vector to the GPU, compute some sums or multiplications, and bring the results back. What happens if other data structures are involved? For example, in C++ I would like to create a class that performs some data analysis on the toys owned by the cats of the people in a survey. “Person” is a class with some details and a vector of all the cats owned (a lot):
class Person {
public:
    std::string name;
    std::string surname;
    std::vector<Cat> cats;
};
Each cat keeps track of all the toys it has gotten (a lot):
class Cat {
public:
    std::string name;
    std::vector<Toy> toys;
};
Now I want to iterate over all the toys to save data in the class this file is part of, for later analysis, but I’m interested only in the cats owned by the people I have (the only reference I have at this point):
for (const auto& cat : person.cats) {
    insert_cat(cat.name);
    for (const auto& toy : cat.toys) {
        insert_brand(toy.brand);
        insert_price(toy.price);
    }
}
How can I write this loop with OpenACC #pragma directives so that it runs on the GPU, where insert_cat, insert_brand, and insert_price are functions that take the given string and add it to some data structure belonging to this class? What happens if insert_brand is part of another file, say brands_analysis.hpp, where I want to collect all the brands, but it's not part of the current class? How do I need to modify the OpenACC code to save that data in an external class?
You can use the "routine" directive on the class method's prototype to instruct the compiler to generate device-callable functions. The function definitions do not need to be in the same source file, provided relocatable device code (RDC) is enabled (it is by default with nvc++), which allows device code to be linked across translation units.
Note that insert operations are generally not parallelizable: multiple threads can insert at the same time, causing a race condition if they insert into the same spot.
Also, presumably you're using the vector's push_back to do the insert. push_back can trigger a reallocation which, given that device memory is separate, leads to a mismatch between the data structure on the host and the one on the device. I'm also presuming you're using CUDA unified memory, in which case allocation can only be done from the host.
The lists should be built serially on the host; they can then be used for computation on the device.
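To make that two-phase pattern concrete, here is a sketch under the assumptions above (the Toy/Cat layout mirrors your question; collect_prices and total_price are invented names): the insert pass runs serially on the host to flatten the nested structures into a plain array, and only the trivially parallel part, a reduction over the flat prices array, is offloaded.

```cpp
#include <string>
#include <vector>

struct Toy { std::string brand; double price; };
struct Cat { std::string name; std::vector<Toy> toys; };

// Phase 1: serial host-side "insert" pass. push_back may reallocate,
// so it must stay on the host, outside any device region.
std::vector<double> collect_prices(const std::vector<Cat>& cats) {
    std::vector<double> prices;
    for (const auto& cat : cats)
        for (const auto& toy : cat.toys)
            prices.push_back(toy.price);
    return prices;
}

// Phase 2: parallel device-side computation on the flat, fixed-size array.
double total_price(const std::vector<double>& prices) {
    const double* p = prices.data();
    std::size_t n = prices.size();
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(p[0:n])
    for (std::size_t i = 0; i < n; ++i)
        total += p[i];
    return total;
}
```

The same split applies to the brands: gather the strings serially on the host, then run whatever per-brand computation you need on the device over a flat representation.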