I'm reading the book "Programming Massively Parallel Processors" (3rd edition), which presents an implementation of the Kogge-Stone parallel scan algorithm. The algorithm is meant to be run by a single block (this is just a preliminary simplification), and what follows is the implementation.
// X is the input array, Y is the output array, InputSize is the size of the input array
__global__ void Kogge_Stone_scan_kernel(float* X, float* Y, int InputSize) {
    __shared__ float XY[SECTION_SIZE]; // SECTION_SIZE is basically blockDim.x
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < InputSize)
        XY[threadIdx.x] = X[i];
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (threadIdx.x >= stride)
            XY[threadIdx.x] += XY[threadIdx.x - stride]; // Race condition here?
    }
    Y[i] = XY[threadIdx.x];
}
Regardless of how the algorithm works, I'm a bit puzzled by the line XY[threadIdx.x] += XY[threadIdx.x - stride]. Say stride = 1. The thread with threadIdx.x = 6 will perform the operation XY[6] += XY[5]. However, at the same time the thread with threadIdx.x = 5 will be performing XY[5] += XY[4]. The question is: is there any guarantee that thread 6 will read the original value of XY[5] rather than XY[5] + XY[4]? Note that my question is not limited to a single warp, where lockstep execution might prevent the race condition.
Thanks
No, CUDA provides no guarantee of thread execution order (lockstep or otherwise), and there is nothing in this code that prevents the race. Thread 6 may read XY[5] either before or after thread 5 has updated it.
By the way, cuda-memcheck and compute-sanitizer are pretty good at identifying shared memory race conditions.

As you have probably already surmised, you can sort this out by breaking up the read and write operations in the offending line, with a barrier in-between:
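A minimal sketch of that fix, keeping the book's kernel shape (SECTION_SIZE is still assumed to equal the block size; I've also guarded the final store for partial blocks):

```cuda
__global__ void Kogge_Stone_scan_kernel(float* X, float* Y, int InputSize) {
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < InputSize)
        XY[threadIdx.x] = X[i];
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();           // previous iteration's writes are visible
        float temp;
        if (threadIdx.x >= stride)
            temp = XY[threadIdx.x] + XY[threadIdx.x - stride]; // read phase
        __syncthreads();           // all reads complete before any write
        if (threadIdx.x >= stride)
            XY[threadIdx.x] = temp;                            // write phase
    }
    if (i < InputSize)
        Y[i] = XY[threadIdx.x];
}
```

Every thread now finishes reading its two operands into a register before any thread writes back to shared memory, so the value of XY[threadIdx.x - stride] observed is always the one from the previous iteration. You can confirm the original version races (and this one doesn't) with something like compute-sanitizer --tool racecheck ./your_app, where ./your_app is your compiled binary.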