Is there a way to do a reduce_by_key operation and a reduce (or ideally another reduce_by_key) operation in only one kernel call in Thrust? Besides gaining computational speed, let us say I want to do this because the number of output values from the first reduce_by_key operation is too large to be stored in memory.
I have been wondering if transform_output_iterator could help here but have not found a solution.
A simple demonstration, but not my real use case, could be to find the minimum of the maximums of each row in a matrix, where that matrix is flattened and stored in a device_vector.
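For reference, the straightforward two-pass version of this demonstration could look like the sketch below. It is only a baseline under stated assumptions: the names RowOfIndex and minOfRowMaxTwoPass are made up for illustration, the matrix is row-major with int elements, and int indexing is assumed to be sufficient. Note that it allocates one temporary value per row, which is exactly the storage the question wants to avoid.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <limits>

// Hypothetical helper: maps a flat element index to its row index,
// which serves as the segment key for reduce_by_key.
struct RowOfIndex {
    int cols;
    __host__ __device__ int operator()(int i) const { return i / cols; }
};

// Two-pass baseline: row maxima are materialized, then min-reduced.
int minOfRowMaxTwoPass(const thrust::device_vector<int>& matrix,
                       int rows, int cols) {
    // Keys are generated on the fly, so they need no storage.
    auto keys = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), RowOfIndex{cols});

    // One temporary value per row (the storage cost in question).
    thrust::device_vector<int> rowMax(rows);

    // Pass 1: maximum of each row; the output keys are discarded.
    thrust::reduce_by_key(keys, keys + rows * cols, matrix.begin(),
                          thrust::make_discard_iterator(), rowMax.begin(),
                          thrust::equal_to<int>(), thrust::maximum<int>());

    // Pass 2: minimum over all row maxima.
    return thrust::reduce(rowMax.begin(), rowMax.end(),
                          std::numeric_limits<int>::max(),
                          thrust::minimum<int>());
}
```

For a 3x4 matrix with rows {1,5,3,2}, {9,4,8,7} and {6,2,0,3}, the row maxima are 5, 9 and 6, so the function returns 5.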
The following code computes the minimum of all row maximums using a fixed amount of temporary storage that holds a limited number of intermediate minima. Afterwards, a min reduce over these intermediate minima yields the global minimum.
The idea is to update an intermediate minimum directly whenever a row maximum is written through the transform_output_iterator. This can be done via atomics (if the temporary minima are accessed through raw pointers) or via locks (if they are accessed through iterators; not shown in this answer).
To avoid atomic contention, the number of temporary minima should not be too small.
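A minimal sketch of the atomics variant described above might look as follows. It is not necessarily the exact code of this answer. The names AtomicMinIntoSlots, minOfRowMaxFused and numSlots are hypothetical, int elements and int indexing are assumed, and the sketch relies on the assumption that reduce_by_key writes only final per-segment results through the value output iterator.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/transform_output_iterator.h>
#include <thrust/reduce.h>
#include <limits>

// Hypothetical helper: flat element index -> row index (segment key).
struct RowOfIndex {
    int cols;
    __host__ __device__ int operator()(int i) const { return i / cols; }
};

// Hypothetical functor: every value that reduce_by_key writes is folded
// into one of numSlots running minima instead of being stored.
struct AtomicMinIntoSlots {
    int* slots;     // raw device pointer to the temporary minima
    int  numSlots;  // more slots means less atomic contention

    __device__ int operator()(int rowMax) const {
        // Pick a slot from the global thread id to spread the writers.
        const unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
        atomicMin(slots + tid % static_cast<unsigned>(numSlots), rowMax);
        return rowMax;  // forwarded to the discard iterator below
    }
};

int minOfRowMaxFused(const thrust::device_vector<int>& matrix,
                     int rows, int cols, int numSlots) {
    auto keys = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), RowOfIndex{cols});

    // Fixed-size temporary storage, initialized to the identity of min.
    thrust::device_vector<int> slots(numSlots,
                                     std::numeric_limits<int>::max());

    // Row maxima are never stored; they only update the slots.
    auto out = thrust::make_transform_output_iterator(
        thrust::make_discard_iterator(),
        AtomicMinIntoSlots{thrust::raw_pointer_cast(slots.data()), numSlots});

    thrust::reduce_by_key(keys, keys + rows * cols, matrix.begin(),
                          thrust::make_discard_iterator(), out,
                          thrust::equal_to<int>(), thrust::maximum<int>());

    // Final small min reduce over the temporary minima.
    return thrust::reduce(slots.begin(), slots.end(),
                          std::numeric_limits<int>::max(),
                          thrust::minimum<int>());
}
```

The temporary storage here depends only on numSlots and not on the number of rows. With numSlots equal to 1, every row maximum hits the same address, which is the contention case mentioned above.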
For 1G segments of size 1, i.e. one atomic operation per input element, I observe the following timings on an A100 GPU.