What is the standard approach for performing a reduction operation, such as computing the maximum, on an entire structured buffer in HLSL?
Context:
I have an HLSL RWStructuredBuffer that I want to normalize (corresponding to a ComputeBuffer in Unity). The values are in the range [0, M] where M is unknown. So my guess is that I need to find the maximum of the buffer and divide each element by it.
Now comes the tricky part. I'm thinking that I need some kind of divide-and-conquer approach to parallelize the (max) reduction, but I can't find a way to do it on the entire buffer at once.
I can do it on all the elements addressed by a single thread group as follows:
#pragma kernel MaxReduce

#define THREAD_GROUP_SIZE 64

StructuredBuffer<float> inputBuffer;
RWStructuredBuffer<float> outputBuffer;

groupshared float sharedBuffer[THREAD_GROUP_SIZE];

[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MaxReduce(uint3 id : SV_DispatchThreadID, uint3 groupThreadId : SV_GroupThreadID, uint3 groupId : SV_GroupID)
{
    // Load this thread group's slice of the input into groupshared memory.
    sharedBuffer[groupThreadId.x] = inputBuffer[id.x];
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction: repeatedly compare one half of the shared buffer against the other half.
    for (int stride = THREAD_GROUP_SIZE / 2; stride > 0; stride >>= 1) {
        if (groupThreadId.x < stride) {
            sharedBuffer[groupThreadId.x] = max(sharedBuffer[groupThreadId.x], sharedBuffer[groupThreadId.x + stride]);
        }
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 of each group writes that group's maximum to the output buffer.
    if (groupThreadId.x == 0) {
        outputBuffer[groupId.x] = sharedBuffer[0];
    }
}
At this stage I have computed the max of each block of 64 elements.
So I imagine I now need to run this in a loop on the output buffer, like so (this is C# in Unity):
float Normalize(ref ComputeBuffer toNormalize, int bufferSize) {
    const int THREAD_GROUP_SIZE = 64;

    myShader.SetBuffer(0, "inputBuffer", toNormalize);
    var outputBuffer = new ComputeBuffer(bufferSize / THREAD_GROUP_SIZE, sizeof(float));
    myShader.SetBuffer(0, "outputBuffer", outputBuffer);

    while (bufferSize / THREAD_GROUP_SIZE >= 1) {
        myShader.Dispatch(0, bufferSize / THREAD_GROUP_SIZE, 1, 1);
        bufferSize /= THREAD_GROUP_SIZE;
        myShader.SetBuffer(0, "inputBuffer", outputBuffer);
    }

    // At this point the max should be in the first element of outputBuffer.
    // We can call another kernel that divides each element of the input by the
    // first element of outputBuffer (see the sketch after this snippet).
    ...
}
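For completeness, here is a minimal sketch of what that second kernel could look like. The kernel name NormalizeByMax and the buffer bindings bufferToNormalize / maxBuffer are placeholders of my own; it assumes the original data is bound read-write, the reduction's output buffer is bound read-only, and the same group size of 64 is used:

#pragma kernel NormalizeByMax

#define THREAD_GROUP_SIZE 64

RWStructuredBuffer<float> bufferToNormalize; // the original data, bound read-write
StructuredBuffer<float> maxBuffer;           // the reduction's outputBuffer; element 0 holds the global max

[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void NormalizeByMax(uint3 id : SV_DispatchThreadID)
{
    // Divide every element by the global maximum produced by the reduction passes.
    bufferToNormalize[id.x] = bufferToNormalize[id.x] / maxBuffer[0];
}

It would be dispatched once, with bufferSize / THREAD_GROUP_SIZE groups, after the reduction loop finishes.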
But the C# driver code above looks odd and unnatural (especially the looped Dispatch call). I would imagine it would be faster if the GPU could decide by itself to keep going, without the host having to issue each pass.
Is this actually the right general approach (regardless of optimizations to the HLSL kernel, of course), or is there a better way, for example by doing the recursive reduction directly in HLSL?
Thanks for your help.