I have an array like this:
data[16] = {10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2}
I want to compute the reduction of this array using shared memory on a G80 GPU.
The kernel, as given in the NVIDIA document, is as follows:
__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
    // here the reduction:
    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    // write result for this block to global memory
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
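To check my understanding of what the kernel computes, I simulated the loop on the host in Python (my own sketch, not the NVIDIA code, assuming a single block that covers the whole array):

```python
# Host-side simulation of the interleaved tree reduction in the kernel,
# for one block of blockDim.x == len(g_idata) threads.
def reduce1_sim(g_idata):
    block_dim = len(g_idata)      # one block covering the whole array
    sdata = list(g_idata)         # the shared-memory copy
    s = 1
    while s < block_dim:
        # all "threads" of one pass run before the next __syncthreads()
        for tid in range(block_dim):
            index = 2 * s * tid
            if index < block_dim:
                sdata[index] += sdata[index + s]
        s *= 2
    return sdata[0]               # what thread 0 would write to g_odata

data = [10, 1, 8, -1, 0, -2, 3, 5, -2, -3, 2, 7, 0, 11, 0, 2]
print(reduce1_sim(data))          # prints 41, which equals sum(data)
```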
The author of the document says there is a bank-conflict problem in this method. I tried to understand it, but I couldn't figure out why. I know the definitions of bank conflicts and broadcast access, but I still can't see the conflict here.
The G80 is a very old CUDA-capable GPU, from the first generation of CUDA GPUs, with compute capability 1.0. These devices are no longer supported by recent CUDA versions (6.5 was the last version to support them), so the current online documentation no longer contains the information needed to understand the bank structure of these devices.
Therefore I will excerpt the necessary info for cc 1.x devices from the CUDA 6.5 C programming guide here:
In these devices, shared memory has a 16-bank structure, where each bank is 32 bits (4 bytes) wide. Each bank therefore has the same width as an int or float quantity, for example. So let's envision the first 32 4-byte quantities that might be stored in this kind of shared memory, and their corresponding banks (using f instead of sdata for the name of the array):

    element: f[0] f[1] f[2] ... f[15] f[16] f[17] ... f[31]
    bank:      0    1    2  ...   15     0     1  ...   15

The first 16 int quantities in shared memory belong to banks 0 to 15, and the next 16 int quantities also belong to banks 0 to 15 (and so on, if we had more data in our int array).

Now let's look at the line of code that will trigger a bank conflict:

    sdata[index] += sdata[index + s];
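On a cc 1.x device the bank of a 4-byte word is simply its word index modulo 16, so the repeating pattern above can be reproduced with a couple of lines of Python (my own sketch of the addressing rule, not an excerpt from the guide):

```python
NUM_BANKS = 16   # cc 1.x devices: 16 banks, each 4 bytes wide

def bank_of(word_index):
    # bank = (byte_address / 4) % 16 for a 4-byte element f[word_index]
    return word_index % NUM_BANKS

print([bank_of(i) for i in range(32)])   # 0..15, then 0..15 again
```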
Let's consider the first pass through the above loop, where s is 1. That means index is 2*1*tid, so for each thread, index is simply double the value of threadIdx.x. So, for this read operation:

    sdata[index + s]

we have:

    threadIdx.x:  0  1  2  3  4 ...  7   8 ... 15
    index + s:    1  3  5  7  9 ... 15  17 ... 31
    bank:         1  3  5  7  9 ... 15   1 ... 15
So, within the first 16 threads, we have two threads that want to read from bank 1, two that want to read from bank 3, two that want to read from bank 5, and so on. This read cycle therefore encounters 2-way bank conflicts across the first 16-thread group. Note that the other read and write operations on the same line of code are similarly bank-conflicted:

    sdata[index] += ...

as this will read, and then write, to banks 0, 2, 4, etc., twice per group of 16 threads.
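This can be counted mechanically: map each thread of the first 16-thread group to the bank it reads, then count how many threads land on each bank. The maximum count is the degree of the conflict. A short Python simulation of this (my own model of the cc 1.x half-warp access, not vendor code):

```python
from collections import Counter

NUM_BANKS = 16
s = 1                                   # first pass of the loop

# bank touched by sdata[index + s] for each thread of the first 16-thread group
banks_read = [(2 * s * tid + s) % NUM_BANKS for tid in range(16)]
per_bank = Counter(banks_read)

print(sorted(per_bank.items()))         # only the odd banks are hit, each twice
print(max(per_bank.values()))           # 2 -> a 2-way bank conflict
```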
Note to others who may be reading this example: as written, it pertains to cc 1.x devices only. The methodology to demonstrate bank conflicts on cc 2.x and newer devices may be similar, but the specifics are different, due to warp execution differences and the fact that these newer devices have a 32-way bank structure, not a 16-way bank structure.
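For completeness, the same counting can be repeated for every pass of the loop. With the 16-element example block the effect is degenerate, so the sketch below assumes a more realistic 256-thread block under the 16-bank cc 1.x model (again my own model, not vendor documentation); it shows the conflict degree growing with s and then shrinking as threads drop out of the if-condition:

```python
from collections import Counter

def max_conflict(s, block_dim=256, num_banks=16):
    # banks read by the first 16-thread group for sdata[index + s];
    # only threads passing "index < blockDim.x" actually issue the read
    banks = [(2 * s * tid + s) % num_banks
             for tid in range(16)
             if 2 * s * tid < block_dim]
    return max(Counter(banks).values())

s = 1
while s < 256:
    print(s, max_conflict(s))
    s *= 2
# conflict degree per pass (s = 1, 2, 4, ..., 128): 2, 4, 8, 16, 8, 4, 2, 1
```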