I wrote some toy code to test a few ideas:
```cpp
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <iterator>  // std::ostream_iterator
#include <array>
#include <vector>

#define N 20

struct func {
    __host__ __device__
    float operator()(float x) { return x * 2; }
};

template <typename S>
struct O {
    const std::array<float, 2> a;
    O(std::array<float, 2> a) : a(a) {}
    S f;
    __host__ __device__
    float operator()(float &v) {
        std::array<int, 3> b = {2, 3, 4};
        int tmp;
        for (int i = 0; i < 3; i++) {
            tmp = thrust::reduce(thrust::device, b.begin(), b.end(), 0);
            printf("%d", tmp);
        }
        return a[0] * v + a[1] + f(a[0]);
    }
};

int main(void) {
    thrust::host_vector<float> _v1(N);
    thrust::device_vector<float> v1 = _v1, v2;
    thrust::fill(v1.begin(), v1.end(), 12);
    v2.resize(N);
    std::array<float, 2> a{1, 2};
    auto c_itor = thrust::make_counting_iterator(0);
    thrust::transform(v1.begin(), v1.end(), v2.begin(), O<func>(a));
    thrust::copy(v2.begin(), v2.end(), std::ostream_iterator<float>(std::cout, " "));
}
```
This code compiles and runs perfectly with `nvcc --expt-relaxed-constexpr -std=c++17`. As you can see, `std` containers like `std::array` appear inside a `__host__ __device__` functor. What I want to know is:

- Is this usage legitimate (in terms of efficiency, not grammatical validity)?
- Since the code runs correctly, where are the `std` objects stored (device or host)?
The special case of using `std::array` with C++17 or higher and `--expt-relaxed-constexpr` works because `std::array` is a very thin wrapper around a C-style array, and with C++17 all the member functions you used are `constexpr`. I think all member functions except `std::array::fill` and `std::array::swap` are `constexpr` by C++17; those two got the `constexpr` treatment with C++20.

So for performance considerations your code should perform the same as when using
`float a[2]` and `int b[3]`. This means that the values are stored in registers if possible (this depends on loop unrolling for `b` and on register pressure in general). This is fine as long as you don't go overboard with the size of the arrays. See e.g. this answer for a deeper discussion of arrays, registers and local memory.

Other Containers / Alternatives:

For other STL containers using dynamic memory you probably won't be as lucky in terms of member functions being `constexpr`. The HPC `nvc++` compiler (formerly the PGI C++ compiler) does not need `__device__` markers, so in theory one can use a lot more STL functionality in device code, but in most cases that is a bad idea in terms of performance. STL functions must also still conform to CUDA's C++ Language Restrictions.

Nvidia is developing its own C++ standard library implementation with its own device extensions in libcu++. There are no containers yet, but they might come in the future. For hash tables there is the cuCollections library (WIP).