Firstly, please note that this applies just as much to writing graphics shaders as to GPGPU code, though my interest is GPGPU, which is why the example code is "compute-like".
We all know GPUs cannot really "do" branching code efficiently: because a SIMD processor runs a group of work items (a warp/wavefront) in lockstep, a divergent branch forces it to execute each taken path in turn with the non-participating lanes masked off.
A pattern I have used several times, and I am surely not alone in this, is to refactor the kernel into multiple kernels: the first determines which code branch each work item is on, and the remaining kernels each execute the (non-branching) code for one particular branch only.
In the past I have always used non-device code, run after the first kernel, to partition the work items into separate arrays (by branch id) before invoking the branchless kernels. In what I'm presenting here, the partitioning is instead done inside the first (branch-determining) kernel using atomically incremented indices. I'm calling this the "n + 1 kernels" pattern, but I imagine it probably already has another name.
The following is generic GPU compute pseudocode (though it's close to OpenCL, and some terminology is probably from OpenCL-land):
// n+1 kernels pattern
// Device code (generic GPGPU language) without the n+1 kernels pattern:
kernel void oneKernel(const int[] intArray, const float[] floatArray, float[] out) {
    int globalId = getId();
    int thisInt = intArray[globalId];
    float thisFloat = floatArray[globalId];
    if (someCondition(thisInt, thisFloat)) {
        if (someOtherCondition(thisInt)) {
            out[globalId] = expensiveCalc0(thisInt, thisFloat);
        } else {
            out[globalId] = expensiveCalc1(thisInt, thisFloat);
        }
    } else {
        out[globalId] += expensiveCalc2(thisInt, thisFloat);
    }
}
// Device code using n+1 kernels pattern: first call createBranchMappings():
kernel void createBranchMappings(const int[] intArray, const float[] floatArray,
                                 int[][] globalIdsByBranch, volatile int[] nextIndexByBranch) {
    int globalId = getId();
    int thisInt = intArray[globalId];
    float thisFloat = floatArray[globalId];
    int branchId;
    // someCondition is evaluated by all threads in lockstep...
    if (someCondition(thisInt, thisFloat)) {
        // ... but someOtherCondition is not. It would be possible (using even more kernels), but very
        // convoluted, to avoid losing lockstep here, and I'd guess it's unlikely to be beneficial unless
        // someOtherCondition were itself very expensive.
        // Also note that the use of atomics later on breaks true lockstep in any case.
        if (someOtherCondition(thisInt)) {
            branchId = 0;
        } else {
            branchId = 1;
        }
    } else {
        branchId = 2;
    }
    globalIdsByBranch[branchId][atomic_add(&nextIndexByBranch[branchId], 1)] = globalId;
}
// Then call branch0().
// Here getId() ranges from 0 to the number of elements for which branchId == 0
// (which we can get from nextIndexByBranch after calling createBranchMappings()).
// There is no branching at all while running this kernel; threads remain in lockstep.
kernel void branch0(const int[] intArray, const float[] floatArray, const int[][] globalIdsByBranch, float[] out) {
    int indexInThisBranch = getId();
    int globalIndex = globalIdsByBranch[0][indexInThisBranch];
    int thisInt = intArray[globalIndex];
    float thisFloat = floatArray[globalIndex];
    out[globalIndex] = expensiveCalc0(thisInt, thisFloat);
}
// Then call branch1() ... then branch2() and we are done. branch1() and branch2() look just like branch0()
// except they call expensiveCalc1 and expensiveCalc2 rather than expensiveCalc0 (and branch2() accumulates
// with += to match the original kernel).
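For concreteness, here is roughly how createBranchMappings() might look in actual OpenCL C. This is only an untested sketch of mine: OpenCL kernels can't take an int[][] argument, so the per-branch id lists are flattened into a single buffer with a stride of numItems (a name I've introduced, covering the worst case of every item landing in the same branch), and the atomic is the built-in atomic_add on a 32-bit global int. someCondition and someOtherCondition remain placeholders, and the host is assumed to zero nextIndexByBranch before each launch.

// Hedged, untested OpenCL C sketch of createBranchMappings().
// globalIdsByBranch is flattened: branch b's ids live at indices [b * numItems ...).
kernel void createBranchMappings(global const int* intArray,
                                 global const float* floatArray,
                                 global int* globalIdsByBranch,          // size 3 * numItems
                                 volatile global int* nextIndexByBranch, // size 3, zeroed by the host
                                 const int numItems) {
    int globalId = get_global_id(0);
    if (globalId >= numItems) return;  // in case the global size was rounded up
    int thisInt = intArray[globalId];
    float thisFloat = floatArray[globalId];
    int branchId;
    if (someCondition(thisInt, thisFloat)) {
        branchId = someOtherCondition(thisInt) ? 0 : 1;
    } else {
        branchId = 2;
    }
    // Reserve the next free slot in this branch's list and record our id there.
    int slot = atomic_add(&nextIndexByBranch[branchId], 1);
    globalIdsByBranch[branchId * numItems + slot] = globalId;
}

branch0() translates the same way, reading from globalIdsByBranch[0 * numItems + indexInThisBranch].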
Some questions:
- Assuming this approach is nothing new, does it have a name?
- Is there a better way than the one presented here to partition the work items before handing them to the branchN() kernels? In particular, is it possible to avoid the cost of the atomics? I suspect there's a better approach in which each workgroup does the partitioning locally and global atomics are only needed (after a barrier) at the workgroup level... but I'm not very au fait with doing workgroup-level stuff. (A rough sketch of what I mean follows this list.)
- Anecdotally, this approach has worked for me (improved execution times); is there some reason to assume I have been lucky and that this approach is not always useful?
- It seems to me that a GPGPU language compiler could actually do this code transformation for you (internally and transparently) for all branching kernels. But given that the advice always seems to be that GPUs cannot branch efficiently, it would seem that compilers aren't doing this. So why not?
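Regarding the atomics question above, this is a rough, untested sketch of the workgroup-level variant I have in mind, in OpenCL-style C with the same flattened layout and placeholder functions as before (NUM_BRANCHES, numItems and the variable names are mine). Each workgroup counts its items per branch in local memory, a single global atomic_add per branch per workgroup then reserves a contiguous block of the output, and finally each item scatters its id into that block. Local atomics still provide the within-group offsets; a work-group prefix sum (e.g. OpenCL 2.0's work_group_scan_exclusive_add) could replace those too if you wanted to avoid atomics inside the group entirely.

// Hedged, untested sketch: per-workgroup counting in local memory, then one
// global atomic_add per branch per workgroup instead of one per work item.
#define NUM_BRANCHES 3

kernel void createBranchMappingsGrouped(global const int* intArray,
                                        global const float* floatArray,
                                        global int* globalIdsByBranch,          // size 3 * numItems
                                        volatile global int* nextIndexByBranch, // size 3, zeroed by the host
                                        const int numItems) {
    local int localCount[NUM_BRANCHES]; // how many items this group put in each branch
    local int groupBase[NUM_BRANCHES];  // this group's reserved start index per branch

    int globalId = get_global_id(0);
    int localId = get_local_id(0);

    if (localId < NUM_BRANCHES) localCount[localId] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // 1. Classify, and grab an offset within this workgroup (local atomics only).
    int branchId = -1;
    int offsetInGroup = 0;
    if (globalId < numItems) {
        int thisInt = intArray[globalId];
        float thisFloat = floatArray[globalId];
        if (someCondition(thisInt, thisFloat)) {
            branchId = someOtherCondition(thisInt) ? 0 : 1;
        } else {
            branchId = 2;
        }
        offsetInGroup = atomic_inc(&localCount[branchId]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // 2. One work item per branch reserves a contiguous block in the global list.
    if (localId < NUM_BRANCHES) {
        groupBase[localId] = atomic_add(&nextIndexByBranch[localId], localCount[localId]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // 3. Scatter: each item writes its global id into its branch's block.
    if (branchId >= 0) {
        globalIdsByBranch[branchId * numItems + groupBase[branchId] + offsetInGroup] = globalId;
    }
}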
Thanks
You can get a similar (if not identical) effect by just sorting the global id values by their branch-id values and running everything in the same kernel: you'd lose performance only at the transition points, at most one divergent warp per branch id. This is just sorting, and processing a sorted array is faster. But since you are also separating the sorted chunks into their own containers, it's called binning.
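A rough, untested sketch of that sorted single-kernel variant, reusing the question's placeholder functions and assuming something (the host, or a device sort) has already produced sortedGlobalIds ordered by branch id together with the parallel branchIdOfSorted array (both names are made up here): only the warps that straddle a boundary between two branch-id runs diverge.

// Hedged sketch: one kernel over ids pre-sorted by branch id. Divergence is
// confined to warps spanning a boundary between two branch-id runs.
kernel void allBranchesSorted(global const int* sortedGlobalIds,  // ids ordered by branch id
                              global const int* branchIdOfSorted, // branch id of each sorted slot
                              global const int* intArray,
                              global const float* floatArray,
                              global float* out) {
    int i = get_global_id(0);
    int globalId = sortedGlobalIds[i];
    int thisInt = intArray[globalId];
    float thisFloat = floatArray[globalId];
    switch (branchIdOfSorted[i]) {
        case 0:  out[globalId] = expensiveCalc0(thisInt, thisFloat); break;
        case 1:  out[globalId] = expensiveCalc1(thisInt, thisFloat); break;
        default: out[globalId] += expensiveCalc2(thisInt, thisFloat); break;
    }
}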