I've been working on optimizing some code and ran into an issue with the shared memory bank conflict report from the CUDA Nsight performance analysis. I was able to reduce it to a very simple piece of code that Nsight reports as having a bank conflict, when it doesn't seem there should be one. Below is the kernel:
__global__ void conflict() {
    __shared__ double values[33];
    values[threadIdx.x] = threadIdx.x;
    values[threadIdx.x + 1] = threadIdx.x;
}
And the main function to call it:
int main() {
    conflict<<<1, 32>>>();
}
Note that I am using a single warp to really reduce this to the bare minimum. When I run the code, Nsight says there is 1 bank conflict, but according to everything I have read, there should not be any: for each access to the shared memory array, the threads access consecutive elements, which should each fall in a separate bank.
Has anyone else experienced issues with the reporting of Nsight or am I just missing something with the functioning of bank conflicts? I would appreciate any feedback!
Btw, I am running the following setup:
- Windows 8
- GTX 770
- Visual Studio Community 2013
- CUDA 7
- Nsight Visual Studio Edition Version 4.5
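For reference, the bank arithmetic can be checked by hand. The following is a small sketch (my own, not part of the original question) assuming the cc 3.x default of 32 banks that are 4 bytes wide; under that assumption each 8-byte `double` straddles two banks, so two threads 16 apart in a warp touch the same bank pair:

```python
BANKS = 32        # shared memory banks on cc 3.x
BANK_BYTES = 4    # default bank width (4-byte mode)
DOUBLE_BYTES = 8  # sizeof(double)

def banks_touched(index):
    """Set of banks covered by values[index] for a double array."""
    byte = index * DOUBLE_BYTES
    words = range(byte // BANK_BYTES, (byte + DOUBLE_BYTES) // BANK_BYTES)
    return {w % BANKS for w in words}

print(banks_touched(0))   # thread 0:  {0, 1}
print(banks_touched(16))  # thread 16: {0, 1} -> same bank pair as thread 0
print(banks_touched(1))   # thread 1:  {2, 3}
```

So even though each thread reads a distinct, consecutive element, in 4-byte bank mode the accesses from threads i and i+16 overlap on the same two banks.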
If the intent is to run the posted code as-is, with the `double` data type and no bank conflicts, I believe it's possible with appropriate use of `cudaDeviceSetSharedMemConfig` (on cc 3.x devices). With specification of `EightByteMode`, the shared memory replay overhead is zero.
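A minimal sketch of such a test case (my own reconstruction, assuming a cc 3.x device and the CUDA 7 runtime API; the error-check style is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void conflict() {
    __shared__ double values[33];
    values[threadIdx.x] = threadIdx.x;
    values[threadIdx.x + 1] = threadIdx.x;
}

int main() {
    // Request 8-byte shared memory banks (supported on cc 3.x).
    // With the default cudaSharedMemBankSizeFourByte setting, each
    // double straddles two 4-byte banks and the profiler reports replays.
    cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    if (err != cudaSuccess) {
        printf("config error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    conflict<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Profiling this version under Nsight should show zero shared memory replay overhead, since in 8-byte mode each `double` maps to exactly one bank.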