I want to store partial reduction results in an array.
Say I have data[8] = {10,20,30,40,50,60,70,80}.
If I divide the data into chunks with a chunk_size of 2, the chunks will be {10,20}, {30,40}, ... , {70,80}.
With summation as the operation, reducing the whole array gives 360, but I want an array partial_sums = {30,70,110,150} that stores the sum of each chunk.
So far, what I have in mind is to construct an iterator, strided_iterator, that accesses the 0th, 2nd, ... indices of data[8] = {10,20,30,40,50,60,70,80}, and then something like
thrust::reduce(strided_iterator, strided_iterator + 2,
               partial_sums.begin(),
               thrust::plus<int>());
would give the desired result, but I have no idea how this could be done efficiently.
For strided access, thrust/examples/strided_range.cu has a solution, but it does not seem applicable to storing segmented reductions.
Of course, I can do it by brute force with a loop like this:
for (int i = 0; i < 4; i++) {
    partial_sums[i] = thrust::reduce(data + 2*i, data + 2*i + 2, 0, thrust::plus<int>());
}
But this kind of per-chunk loop is what Thrust is trying to avoid as much as possible, right? Somehow I should be able to put it all in a single Thrust call.
Based on the useful answer to "Reduce multiple blocks of equal length that are arranged in a big vector Using CUDA", what I have come up with so far is as follows.
In fact, I wanted to get the min or max value in each chunk.
The results are shown below with the estimated runtime measured via sdkTimer. As can be seen, method 1 with reduce_by_key is much faster than the second one with the for loop.