Memory padding vs coalesced access


I am a little confused about bank conflicts, avoiding them with memory padding, and coalesced memory access. What I've read so far: coalesced access to global memory is optimal. If that isn't achievable, shared memory can be used to reorder the data needed by the current block, thus making coalesced access possible. However, when using shared memory one has to look out for bank conflicts. One strategy to avoid them is to pad each row of an array stored in shared memory by one element. Consider the example from this blog post, where each row of a 16x16 matrix is padded by 1, making it a 16x17 matrix in shared memory.
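To make the padding arithmetic concrete, here is a small host-side C++ sketch (not device code). It assumes the usual bank model, where consecutive 4-byte words map to consecutive banks, and counts how many threads land in the same bank when they walk down one column of a tile. The function name `maxThreadsPerBank` and its parameters are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side model only (an assumption for illustration, not device code):
// shared memory is split into numBanks banks of one 4-byte word each, and
// word index w lives in bank w % numBanks.
//
// Thread t reads element [t][0] of a row-major float tile whose rows are
// rowStride words apart, i.e. the threads walk down one column.
// Returns the largest number of threads that land in the same bank
// (1 means conflict-free).
int maxThreadsPerBank(int numBanks, int numThreads, int rowStride) {
    assert(numBanks > 0 && numThreads > 0);
    std::vector<int> hits(numBanks, 0);
    for (int t = 0; t < numThreads; ++t) {
        int word = t * rowStride;  // linear word index of element [t][0]
        ++hits[word % numBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under this model, `maxThreadsPerBank(16, 16, 16)` gives 16 (every thread hits bank 0), while `maxThreadsPerBank(16, 16, 17)` gives 1. The same pattern holds for 32 banks with row strides of 32 and 33.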

Now, I understand that memory padding can avoid bank conflicts, but doesn't it also mean the memory is no longer aligned? E.g. if I shift global memory by 1, thus misaligning it, one warp would need to access two memory lanes instead of one, because the last number is not in the same lane as all the others. So as I understand it, coalesced memory access and memory padding are contradictory concepts, aren't they? Any clarification is much appreciated!


There are 1 best solutions below

SimonH On

Too long for a comment so I'm putting it here. Still not a complete answer though.

In the meantime I found this post by Mark Harris, which demonstrates the use of shared memory to facilitate coalesced memory access. The important takeaway for this question seems to be:

The reason shared memory is used in this example is to facilitate global memory coalescing on older CUDA devices (Compute Capability 1.1 or earlier). Optimal global memory coalescing is achieved for both reads and writes because global memory is always accessed through the linear, aligned index t. The reversed index tr is only used to access shared memory, which does not have the sequential access restrictions of global memory for optimal performance. The only performance issue with shared memory is bank conflicts, which we will discuss later.

My initial understanding was that if coalesced access to global memory is not possible, the data is read uncoalesced and then reordered in shared memory so that later accesses from shared memory are coalesced. Instead, the data is read contiguously from global memory, and the elements actually needed can then be read from shared memory in a non-coalesced way. Harris also states that uncoalesced access to shared memory is not a problem, but unfortunately the post doesn't explain why.
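The separation between the two kinds of indexing can be checked with a small host-side C++ model. This is only a sketch: it assumes the common tiled-transpose pattern with a 32-wide tile and a square matrix whose width is a multiple of 32, and all names (`globalIndex`, `sharedReadIndex`, `warpIsCoalesced`) are hypothetical. The point it illustrates is that the padding term appears only in the shared-memory index, so the global-memory addresses of a warp stay consecutive and aligned whether or not the tile is padded.

```cpp
#include <cassert>

constexpr int kTile = 32;  // assumed tile width = warp size

// Global word index touched by thread (tx, ty) of block (bx, by) in a
// tiled transpose of a square width x width float matrix.
// writePhase == false: load from the input; true: store to the output.
// Note that no padding term appears here at all.
int globalIndex(int width, int bx, int by, int tx, int ty, bool writePhase) {
    int col = (writePhase ? by : bx) * kTile + tx;
    int row = (writePhase ? bx : by) * kTile + ty;
    return row * width + col;
}

// Shared-memory word index read in the store phase (transposed access).
// pad is the number of extra words per tile row; it only shows up here.
int sharedReadIndex(int pad, int tx, int ty) {
    return tx * (kTile + pad) + ty;
}

// True if the warp tx = 0..kTile-1 (fixed ty) touches kTile consecutive
// words starting at a multiple of kTile, i.e. one aligned segment,
// assuming the base pointer itself is suitably aligned.
bool warpIsCoalesced(int width, int bx, int by, int ty, bool writePhase) {
    assert(width % kTile == 0);
    int first = globalIndex(width, bx, by, 0, ty, writePhase);
    if (first % kTile != 0) return false;
    for (int tx = 1; tx < kTile; ++tx) {
        if (globalIndex(width, bx, by, tx, ty, writePhase) != first + tx) {
            return false;
        }
    }
    return true;
}
```

In this model, `warpIsCoalesced` holds for both the load and the store phase, while the stride between neighbouring threads in shared memory changes from 32 to 33 words once `pad` is 1. That matches the quoted statement that global memory is always accessed through a linear, aligned index and only the shared-memory index is rearranged.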