I am trying to transpose a matrix in place. It works as expected for some sizes but starts failing with bigger ones, and it can even behave differently between executions of the program.
The idea is to split the matrix into n by n blocks; each block runs SEGMENT_SIZE threads, and each thread processes one column of its block. For now I am assuming that MATRIX_SIZE is divisible by SEGMENT_SIZE.
Invocation and parameters:
#define MATRIX_SIZE 64
#define SEGMENT_SIZE 32
int n_block = (MATRIX_SIZE % SEGMENT_SIZE) ? MATRIX_SIZE / SEGMENT_SIZE + 1 : MATRIX_SIZE / SEGMENT_SIZE;
dim3 blocksPerGrid(n_block, n_block);
dim3 threadsPerBlock(SEGMENT_SIZE);
transposeMatrix<<<blocksPerGrid, threadsPerBlock, threadsPerBlock.x * threadsPerBlock.x * sizeof(float)>>>(d_data, MATRIX_SIZE);
Kernel:
__global__ void transposeMatrix(float *d_data, int mat_dim)
{
    // Array in Shared Memory
    extern __shared__ float sdata[];

    // copy this block's tile into shared memory, one row per loop iteration
    for (int i = 0; i < blockDim.x; i++)
    {
        sdata[blockDim.x * i + threadIdx.x] = d_data[(i + blockIdx.y * blockDim.x) * mat_dim + (blockDim.x * blockIdx.x + threadIdx.x)];
    }
    __syncthreads();

    // write the tile back transposed, into the mirrored block position
    for (int i = 0; i < blockDim.x; i++)
    {
        d_data[(i + blockIdx.y * blockDim.x) + (threadIdx.x + blockIdx.x * blockDim.x) * mat_dim] = sdata[threadIdx.x + i * blockDim.x];
    }
}
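A minimal host-side harness sketch for reproducing this setup (the consecutive-integer fill, the launch error check, and the verification loop are assumptions added for testing; the kernel definition above is assumed to sit in the same .cu file):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define MATRIX_SIZE 64
#define SEGMENT_SIZE 32

// transposeMatrix(float *d_data, int mat_dim) as defined above

int main()
{
    size_t bytes = MATRIX_SIZE * MATRIX_SIZE * sizeof(float);
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < MATRIX_SIZE * MATRIX_SIZE; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    int n_block = (MATRIX_SIZE % SEGMENT_SIZE) ? MATRIX_SIZE / SEGMENT_SIZE + 1 : MATRIX_SIZE / SEGMENT_SIZE;
    dim3 blocksPerGrid(n_block, n_block);
    dim3 threadsPerBlock(SEGMENT_SIZE);
    transposeMatrix<<<blocksPerGrid, threadsPerBlock, threadsPerBlock.x * threadsPerBlock.x * sizeof(float)>>>(d_data, MATRIX_SIZE);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));

    float *h_result = (float *)malloc(bytes);
    cudaMemcpy(h_result, d_data, bytes, cudaMemcpyDeviceToHost);

    int errors = 0;                                   // compare against the expected transpose
    for (int r = 0; r < MATRIX_SIZE; r++)
        for (int c = 0; c < MATRIX_SIZE; c++)
            if (h_result[c * MATRIX_SIZE + r] != h_data[r * MATRIX_SIZE + c]) errors++;
    printf("%d mismatches\n", errors);

    cudaFree(d_data);
    free(h_data);
    free(h_result);
    return 0;
}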
For MATRIX_SIZE 64 and SEGMENT_SIZE 32 it works as expected and never fails, but for bigger values of MATRIX_SIZE (like 480) it starts behaving unexpectedly, and on some executions the resulting matrix is not correct.
It's doubtful that an in-place transpose could work the way you have it designed here.
__syncthreads() does not synchronize the entire grid, only each block. Therefore, in a larger test case, you will have some threadblocks writing their output over input data that has not yet been read in by some other threadblock. Two options would be to:

do an out-of-place transpose, writing the result to a separate output matrix so that no block can overwrite input that another block still needs,
or:

keep the transpose in-place but have each block handle both its own tile and the mirror tile across the diagonal, so that everything a block will overwrite is already in its shared memory before any writes happen.
(A third option not covered here would be to replace your usage of __syncthreads() with an appropriate cooperative-kernels grid-wide sync.) Here is a worked example demonstrating both ideas:
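A minimal sketch of such an example, combining an out-of-place kernel with an in-place pair-swap kernel. The kernel names transposeOutOfPlace/transposeInPlace, the TILE constant, the fill pattern, and the verification harness are assumptions of mine, and the matrix edge is assumed to be a multiple of TILE (4000 = 125 * 32):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE  32       // tile edge and threads per block (assumption)
#define MSIZE 4000     // 4000*4000 = 16,000,000 stays below float's 2^24 integer limit

// Idea 1: out-of-place transpose. Input and output are separate arrays, so no
// block can overwrite data that another block has not read yet.
// Assumes n is a multiple of TILE.
__global__ void transposeOutOfPlace(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE + 1];               // +1 column avoids bank conflicts
    int col = blockIdx.x * TILE + threadIdx.x;
    for (int i = 0; i < TILE; i++)                       // load one tile, row by row
        tile[i][threadIdx.x] = in[(blockIdx.y * TILE + i) * n + col];
    __syncthreads();
    int tcol = blockIdx.y * TILE + threadIdx.x;
    for (int i = 0; i < TILE; i++)                       // store it transposed, coalesced
        out[(blockIdx.x * TILE + i) * n + tcol] = tile[threadIdx.x][i];
}

// Idea 2: in-place transpose. Only blocks on or below the diagonal act; each of
// them loads both its own tile and the mirror tile across the diagonal (hence
// twice the shared memory) before writing either back, so there is no
// inter-block dependency. Diagonal blocks do a single-tile operation.
__global__ void transposeInPlace(float *data, int n)
{
    __shared__ float tileA[TILE][TILE + 1];
    __shared__ float tileB[TILE][TILE + 1];
    if (blockIdx.y > blockIdx.x) return;                 // upper-triangle blocks stay idle
    int colA = blockIdx.x * TILE + threadIdx.x;          // tile at block (y, x)
    int colB = blockIdx.y * TILE + threadIdx.x;          // mirror tile at block (x, y)
    for (int i = 0; i < TILE; i++) {
        tileA[i][threadIdx.x] = data[(blockIdx.y * TILE + i) * n + colA];
        if (blockIdx.x != blockIdx.y)
            tileB[i][threadIdx.x] = data[(blockIdx.x * TILE + i) * n + colB];
    }
    __syncthreads();
    for (int i = 0; i < TILE; i++) {
        data[(blockIdx.x * TILE + i) * n + colB] = tileA[threadIdx.x][i];
        if (blockIdx.x != blockIdx.y)
            data[(blockIdx.y * TILE + i) * n + colA] = tileB[threadIdx.x][i];
    }
}

int main()
{
    size_t bytes = (size_t)MSIZE * MSIZE * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < MSIZE * MSIZE; i++) h[i] = (float)i;   // every value exactly representable
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);
    dim3 grid(MSIZE / TILE, MSIZE / TILE);
    dim3 block(TILE);
    transposeOutOfPlace<<<grid, block>>>(d_in, d_out, MSIZE);  // idea 1
    transposeInPlace<<<grid, block>>>(d_in, MSIZE);            // idea 2 (runs after idea 1 has read d_in)
    float *r1 = (float *)malloc(bytes);
    float *r2 = (float *)malloc(bytes);
    cudaMemcpy(r1, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r2, d_in, bytes, cudaMemcpyDeviceToHost);
    for (int r = 0; r < MSIZE; r++)
        for (int c = 0; c < MSIZE; c++)
            if (r1[c * MSIZE + r] != h[r * MSIZE + c] || r2[c * MSIZE + r] != h[r * MSIZE + c]) {
                printf("mismatch at %d,%d\n", r, c);
                return 1;
            }
    printf("success\n");
    return 0;
}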
This in-place approach has each block handling 2 tiles, so we only need the blocks on the diagonal or on one side of it to act. Those blocks each load 2 tiles into shared memory and then write out those two tiles, so twice as much shared memory per block is needed. For the blocks on the diagonal (blockIdx.x == blockIdx.y) we only need one tile operation per block, not two. Also note that the matrix size (4000) here is chosen carefully to avoid overflowing the integer-storage capacity of a float quantity. If you make this test case much larger, it will probably fail due to the larger integers involved in the test case, not because the method is flawed.
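A quick standalone illustration of that float limit (host-only arithmetic, not part of the transpose example): a float carries a 24-bit significand, so consecutive integers are represented exactly only up to 2^24 = 16777216, and 4000 * 4000 = 16,000,000 stays just below that boundary.

#include <cstdio>

int main()
{
    float below = 16000000.0f;        // 4000 * 4000, still exactly representable
    float limit = 16777216.0f;        // 2^24, the last point where every integer fits
    printf("%.1f\n", below + 1.0f);   // prints 16000001.0
    printf("%.1f\n", limit + 1.0f);   // prints 16777216.0: 16777217 cannot be stored
    return 0;
}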