I'm trying to use CUDA to write a basic fragment shader, and I have found that actually executing the kernel takes over a second, which is unacceptable for a shader I'm trying to run in real time. Using the synchronize method and commenting out parts of the kernel, I found that it is the memory accesses to the output array that are causing it to be so slow. I haven't really tried anything to solve the problem because I can't even fathom where to start. This is in PyCUDA, which I don't think really matters, but here's the kernel code:
```cuda
__global__ void fragment_shader(int palette_lim, float *palette, float *input, float *output) {
    int fragment_idx = (3 * gridDim.y * blockIdx.x) + (3 * blockIdx.y);
    float min_dist = sqrtf(3);
    float color_dist;
    int best_c = 0;
    for (int c = 0; c < palette_lim; c++) {
        color_dist = sqrtf(pow(input[fragment_idx] - palette[c*3], 2)
                         + pow(input[fragment_idx+1] - palette[c*3+1], 2)
                         + pow(input[fragment_idx+2] - palette[c*3+2], 2));
        if (color_dist < min_dist) {
            min_dist = color_dist;
            best_c = c;
        }
    }
    // These are the lines that make it slow. If they get commented out,
    // it runs in a time that would be acceptable.
    output[fragment_idx] = palette[best_c*3];
    output[fragment_idx+1] = palette[best_c*3+1];
    output[fragment_idx+2] = palette[best_c*3+2];
}
```
EDIT: After playing around with it a bit more, I found that it also has to do with what's being assigned to the output array: when I had it write constants rather than values from the palette, it also ran in acceptable time, it just didn't do anything useful then.
First some remarks on your actual computation:
- `sqrtf(x) < sqrtf(3)`: roots are expensive. Just compare `x < 3.f`.
- `sqrtf(pow(x, 2) + ...)`: for that matter, don't use `pow` just for squaring. Use `hypotf` for 2D or `norm3df` for 3D vectors; for a pure comparison, the squared distance is enough (see the sketch after this list).
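As a minimal sketch of that last point (`color_dist_sq` is a hypothetical helper, not part of the original kernel): comparing squared distances against `3.f` gives the same result as comparing true distances against `sqrtf(3)`, without any root:

```cuda
// Squared Euclidean distance between two RGB triples.
// Ordering is preserved under squaring, so "dist < sqrtf(3)"
// is equivalent to "dist_sq < 3.f".
__device__ float color_dist_sq(const float *a, const float *b) {
    float dr = a[0] - b[0];
    float dg = a[1] - b[1];
    float db = a[2] - b[2];
    return dr * dr + dg * dg + db * db;
}
```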
Now let's analyze your memory accesses:

### fragment index
Let's look at `fragment_idx = (3*gridDim.y*blockIdx.x) + (3*blockIdx.y)`: you're not taking `threadIdx.x` and `threadIdx.y` into account. This is your main problem: many threads act on the same input and output data. You likely want a global index that combines block and thread indices, along the lines of the sketch below.
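A sketch of such an index (`fragment_index` is a hypothetical helper; it assumes the grid times the block dimensions exactly covers the image):

```cuda
__device__ int fragment_index() {
    // Global 2D coordinates of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // 3 floats per fragment; one row is gridDim.x * blockDim.x fragments wide
    return 3 * (y * gridDim.x * blockDim.x + x);
}
```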
### input

So you load 3 floats. For starters, why do you reload them inside the loop when they don't depend on the loop iteration? I assume the compiler saves you from that access, but don't get in the habit of relying on it.
Second, your access pattern isn't properly coalesced, since a) these are 3 independent accesses and b) CUDA cannot coalesce accesses to `float3` vectors even if you did it properly. Please read section 9.2.1 Coalesced Access to Global Memory of the Best Practices Guide. For better performance you have two options:

- rearrange the data from an array of structures (RGBRGBRGB...) into a structure of arrays (RRR...GGG...BBB...), or
- pad each fragment to 4 floats and adjust `fragment_idx` so you can load the whole thing as a `float4` (a sketch follows this list).
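A sketch of the padding option (`load_fragment` is a hypothetical helper; it assumes the host pads the buffers to 4 floats per fragment and that `fragment_idx` counts fragments rather than floats):

```cuda
__device__ float4 load_fragment(const float *input, int fragment_idx) {
    // One coalesced 16-byte load instead of three separate 4-byte loads.
    // cudaMalloc'd buffers are sufficiently aligned for this cast.
    const float4 *input4 = reinterpret_cast<const float4 *>(input);
    return input4[fragment_idx];
}
```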
### palette

Same problem with the access of 3 floats. Plus, now every thread reads the same values, since `c` doesn't depend on the thread index. At the very least, the access should go through the `__ldg` function to use the L1 cache. Preferably you prefetch the palette into shared memory (both variants are sketched below).
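Both variants as sketches (`load_palette_entry` and `prefetch_palette` are hypothetical helpers; the shared-memory variant assumes `3 * palette_lim * sizeof(float)` bytes of dynamic shared memory are passed at launch, e.g. via PyCUDA's `shared=` argument):

```cuda
// Variant 1: route the broadcast reads through the read-only cache
__device__ float3 load_palette_entry(const float *palette, int c) {
    return make_float3(__ldg(&palette[3 * c]),
                       __ldg(&palette[3 * c + 1]),
                       __ldg(&palette[3 * c + 2]));
}

// Variant 2: prefetch the palette into shared memory once per block.
// s_palette points to dynamic shared memory declared in the kernel as
// "extern __shared__ float s_palette[];"
__device__ void prefetch_palette(const float *palette, int palette_lim,
                                 float *s_palette) {
    for (int i = threadIdx.y * blockDim.x + threadIdx.x;
         i < 3 * palette_lim; i += blockDim.x * blockDim.y) {
        s_palette[i] = palette[i];
    }
    __syncthreads();  // all threads must reach this before reading s_palette
}
```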
### output

The write access has the same issue with uncoalesced access. Plus, since `best_c` varies among threads, the read accesses to `palette` are effectively random. You had to load the `palette` values in your loop anyway: just save the best palette value in a local variable and reuse it to store the output at the end.

### Methodology
Two remarks:

- When the stores to `output` are commented out (or constants are stored, as in your edit), the result of the palette loop is never used, so the compiler is free to remove the loop as dead code. You weren't timing the kernel minus the writes; you were timing a kernel that does almost nothing. The writes aren't slow in themselves; they are what forces the rest of the kernel to actually execute.
- Because of the broken `fragment_idx`, all threads in a block compute the same fragment and store to the same addresses, so the kernel does many times the necessary work and the stores all contend for the same locations.

### Minimal fix
This is the simplest code that rectifies the issues. It doesn't solve the issues with loading vector3 variables and it doesn't use shared memory; that requires more involved changes.
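A sketch of what such a minimal fix can look like (an assumption on my part, not verbatim code; it assumes a 2D launch where grid times block dimensions exactly cover the image, so there is no bounds check, and `palette_lim >= 1`):

```cuda
__global__ void fragment_shader(int palette_lim, const float *palette,
                                const float *input, float *output) {
    // Combine block and thread indices so each thread handles its own fragment
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int fragment_idx = 3 * (y * gridDim.x * blockDim.x + x);

    // Load the fragment once, outside the loop
    float r = input[fragment_idx];
    float g = input[fragment_idx + 1];
    float b = input[fragment_idx + 2];

    float min_dist = 3.f;  // squared distances: 3.f is sqrtf(3) squared
    // Mirrors best_c = 0 in the original: start with the first palette entry
    float best_r = __ldg(&palette[0]);
    float best_g = __ldg(&palette[1]);
    float best_b = __ldg(&palette[2]);
    for (int c = 0; c < palette_lim; c++) {
        // Read-only cache hint for the palette broadcast
        float pr = __ldg(&palette[3 * c]);
        float pg = __ldg(&palette[3 * c + 1]);
        float pb = __ldg(&palette[3 * c + 2]);
        float dr = r - pr, dg = g - pg, db = b - pb;
        float color_dist = dr * dr + dg * dg + db * db;
        if (color_dist < min_dist) {
            min_dist = color_dist;
            // Keep the winning color in registers instead of re-reading palette
            best_r = pr; best_g = pg; best_b = pb;
        }
    }

    output[fragment_idx] = best_r;
    output[fragment_idx + 1] = best_g;
    output[fragment_idx + 2] = best_b;
}
```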
### Extensive fix
Here is a more extensive rewrite:
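Again only a sketch under stated assumptions: both buffers padded to `float4` (RGBA), the palette small enough for shared memory, and `palette_lim * sizeof(float4)` bytes of dynamic shared memory passed at launch (in PyCUDA via the `shared=` argument of the kernel call):

```cuda
__global__ void fragment_shader(int palette_lim, const float4 *palette,
                                const float4 *input, float4 *output) {
    extern __shared__ float4 s_palette[];

    // Cooperatively prefetch the palette into shared memory
    for (int i = threadIdx.y * blockDim.x + threadIdx.x;
         i < palette_lim; i += blockDim.x * blockDim.y) {
        s_palette[i] = palette[i];
    }
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int fragment_idx = y * gridDim.x * blockDim.x + x;  // counts fragments

    float4 pixel = input[fragment_idx];  // one coalesced 16-byte load

    float min_dist = 3.f;        // squared distance, alpha ignored
    float4 best = s_palette[0];  // mirrors best_c = 0, needs palette_lim >= 1
    for (int c = 0; c < palette_lim; c++) {
        float4 entry = s_palette[c];
        float dr = pixel.x - entry.x;
        float dg = pixel.y - entry.y;
        float db = pixel.z - entry.z;
        float color_dist = dr * dr + dg * dg + db * db;
        if (color_dist < min_dist) {
            min_dist = color_dist;
            best = entry;
        }
    }

    best.w = pixel.w;             // the extra channel is carried along
    output[fragment_idx] = best;  // one coalesced 16-byte store
}
```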
Note that `input` and `output` use `float4` (RGBA instead of RGB). The extra channel is ignored for the distance computation, but it is propagated. Typically one tries to use the value for something, e.g. you could store the distance value in there.

### Algorithmic improvements
For larger palettes, a brute force search like this is suboptimal. Spatial index algorithms can do the same thing but faster. The classic structure for this would be a KD tree. If you search for this, you will find some CUDA implementations that might be worth checking out.