For perfectly coalesced accesses to an array of 4096 doubles, each 8 bytes, nvprof reports the following metrics on an NVIDIA Tesla V100:
global_load_requests: 128
gld_transactions: 1024
gld_transactions_per_request: 8.000000
I cannot find a specific definition of what a transaction and a request to global memory are exactly, so I am having trouble understanding these metrics. Therefore my questions:
- How is a memory request defined?
- How is a memory transaction defined?
- Does gld_transactions_per_request = 8.00000 indicate perfectly coalesced accesses to doubles?
In an attempt to explain it to myself, this is what I have come up with:
- Request: a load on the warp-level, i.e. one warp-level instruction merged from 32 threads. In this scenario a 32 threads * 8 bytes = 256 byte load. -- Is this correct?
- Transaction: a 32 byte load instruction. In this scenario one transaction defined in this way is able to load 32 bytes / 8 bytes = 4 doubles. -- Is this correct? If so, is this the largest load instruction CUDA implements?
Using these definitions, I arrive at the same values as nvprof: accessing 4096 array items requires 4096 / 32 = 128 warp-level instructions (=requests) with 32 threads each. Using 32 byte loads (=transactions), the 4096 * 8 bytes = 32768 bytes require 32768 / 32 = 1024 transactions.
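The arithmetic above can be written as a small host-side C++ sketch. It is only a check of the numbers, not a description of how the hardware or nvprof counts. The 32-thread warp and the 32-byte transaction size are the assumptions made in this question, and the names load_requests and load_transactions are hypothetical helpers invented for illustration.

```cpp
#include <cstddef>

// Assumed sizes, taken from the reasoning in the question (not from NVIDIA documentation).
constexpr std::size_t kWarpSize = 32;          // threads per warp
constexpr std::size_t kTransactionBytes = 32;  // assumed size of one transaction

// Hypothetical helper: warp-level load instructions needed when each thread
// loads exactly one element. Rounds up to cover a partially filled last warp.
constexpr std::size_t load_requests(std::size_t n_elements) {
    return (n_elements + kWarpSize - 1) / kWarpSize;
}

// Hypothetical helper: 32-byte transactions needed to move a contiguous,
// aligned array of n_elements items of element_bytes each.
constexpr std::size_t load_transactions(std::size_t n_elements,
                                        std::size_t element_bytes) {
    return (n_elements * element_bytes + kTransactionBytes - 1) / kTransactionBytes;
}

// Compile-time checks against the values nvprof reported.
static_assert(load_requests(4096) == 128, "global_load_requests");
static_assert(load_transactions(4096, 8) == 1024, "gld_transactions");
static_assert(load_transactions(4096, 8) / load_requests(4096) == 8,
              "gld_transactions_per_request");
```

Under these assumptions the ratio of 8 follows directly from 256 bytes per request divided by 32 bytes per transaction.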
A memory "request" is an instruction which accesses memory, and a "transaction" is the movement of a unit of data between two regions of memory.
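To make the distinction concrete, here is a minimal host-side C++ sketch that models one request (a single warp-level load) and counts how many 32-byte segments its 32 threads touch. It assumes, as the question does, that a transaction moves one aligned 32-byte segment; the function name transactions_for_request and its parameters are hypothetical, and this is a simplified model rather than what the profiler actually does.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical model: count the distinct 32-byte segments touched by one
// warp-level load, where lane i reads elem_bytes bytes starting at
// base + i * stride_elems * elem_bytes.
// Assumption: one transaction moves one aligned 32-byte segment.
std::size_t transactions_for_request(std::uint64_t base,
                                     std::uint64_t stride_elems,
                                     std::uint64_t elem_bytes) {
    constexpr std::uint64_t kWarpSize = 32;      // threads per warp
    constexpr std::uint64_t kSegmentBytes = 32;  // assumed transaction size

    std::set<std::uint64_t> segments;  // indices of the segments touched
    for (std::uint64_t lane = 0; lane < kWarpSize; ++lane) {
        const std::uint64_t first = base + lane * stride_elems * elem_bytes;
        const std::uint64_t last = first + elem_bytes - 1;
        // An element may straddle a segment boundary if it is misaligned.
        for (std::uint64_t seg = first / kSegmentBytes;
             seg <= last / kSegmentBytes; ++seg) {
            segments.insert(seg);
        }
    }
    return segments.size();
}
```

In this model, transactions_for_request(0, 1, 8) gives 8 for contiguous doubles, which matches the reported ratio, while a stride of 4 doubles (32 bytes between lanes) gives 32 because every lane lands in its own segment.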