Can NPP functions, more concretely the npps functions (https://docs.nvidia.com/cuda/npp/group__npps.html), be called from device code?
If I write a __global__ function, can I call npps functions such as nppsMaxIndx_32f inside it (to compute the max of a vector)?
Example: I have 100 vectors of 10000 floats each. If I do it in host code, I have to make 100 calls to the NPP function.
If I write a __global__ function launched with 100 threads, and each thread calls the NPP function for its own vector so that they all run simultaneously, will this work? Can nppsMaxIndx_32f be called as a device function?
This is not possible. NPP functions are host-only functions, and attempting to call one from a kernel will produce compilation errors.
However, if you issue the calls from host code without synchronizing the GPU between them, they are launched almost simultaneously, without waiting for the previous one to finish. This can only be done safely if there is no requirement on the ordering of the calls and the data for overlapping calls is fully independent.
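
If a single launch over all vectors is what you are after, one alternative to NPP is a small custom kernel. The following is only a sketch under stated assumptions, not an NPP API: the names maxIndexPerVector, maxIndexOnGpu and kThreads are hypothetical, the vectors are assumed to be stored back to back in one row-major array, and ties are resolved to the lowest index by choice of this sketch. Each thread block reduces one vector, so 100 vectors need 100 blocks.

```cuda
#include <cassert>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; must be a power of two

// Hypothetical kernel (not part of NPP): block b finds the maximum value and
// its index in vector b of a row-major [numVectors x length] array.
__global__ void maxIndexPerVector(const float* data, int length,
                                  float* maxVal, int* maxIdx)
{
    __shared__ float sVal[kThreads];
    __shared__ int   sIdx[kThreads];

    const float* v = data + static_cast<size_t>(blockIdx.x) * length;

    // Each thread scans a strided slice of the vector.
    float best = -FLT_MAX;
    int bestIdx = -1;  // -1 means "no element seen yet"
    for (int i = threadIdx.x; i < length; i += blockDim.x) {
        if (bestIdx < 0 || v[i] > best) { best = v[i]; bestIdx = i; }
    }
    sVal[threadIdx.x] = best;
    sIdx[threadIdx.x] = bestIdx;
    __syncthreads();

    // Tree reduction in shared memory; ties keep the lower index.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            float ov = sVal[threadIdx.x + s];
            int   oi = sIdx[threadIdx.x + s];
            float mv = sVal[threadIdx.x];
            int   mi = sIdx[threadIdx.x];
            if (oi >= 0 && (mi < 0 || ov > mv || (ov == mv && oi < mi))) {
                sVal[threadIdx.x] = ov;
                sIdx[threadIdx.x] = oi;
            }
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        maxVal[blockIdx.x] = sVal[0];
        maxIdx[blockIdx.x] = sIdx[0];
    }
}

// Hypothetical host helper: copies the input to the GPU, launches one block
// per vector, and copies the results back. Returns false on any CUDA error.
bool maxIndexOnGpu(const std::vector<float>& host, int numVectors, int length,
                   std::vector<float>& maxVal, std::vector<int>& maxIdx)
{
    assert(host.size() == static_cast<size_t>(numVectors) * length);
    maxVal.resize(numVectors);
    maxIdx.resize(numVectors);

    float* dData = nullptr;
    float* dVal = nullptr;
    int* dIdx = nullptr;
    const size_t inBytes = host.size() * sizeof(float);

    bool ok = cudaMalloc(&dData, inBytes) == cudaSuccess
           && cudaMalloc(&dVal, numVectors * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dIdx, numVectors * sizeof(int)) == cudaSuccess
           && cudaMemcpy(dData, host.data(), inBytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        maxIndexPerVector<<<numVectors, kThreads>>>(dData, length, dVal, dIdx);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(maxVal.data(), dVal, numVectors * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(maxIdx.data(), dIdx, numVectors * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dData);
    cudaFree(dVal);
    cudaFree(dIdx);
    return ok;
}
```

For the case in the question you would call maxIndexOnGpu(data, 100, 10000, maxVal, maxIdx), with data holding the 100 vectors contiguously. Using one block per vector rather than one thread per vector lets all threads of a block cooperate on the 10000 elements, which is usually a better fit for the GPU than 100 threads each scanning a whole vector serially.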