We are using dynamic shared memory in our CUDA kernels. We set the shared memory size for each kernel through the driver API, using cuFuncSetAttribute with CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES.
The kernel is then launched using cuLaunchKernel, where, according to the docs, one of the parameters is unsigned int sharedMemBytes. This parameter is defined to set
Dynamic shared-memory size per thread block in bytes
This means I can set the dynamic shared memory size as a kernel attribute and, additionally, I can set the shared memory size per kernel launch.
Does this mean I can override the kernel attribute? Which one wins?
Says so right in the name:
MAX_DYNAMIC_SHARED_SIZE_BYTES vs sharedMemBytes. Note the MAX prefix :-) The attribute is an upper bound; sharedMemBytes is the amount actually requested for one launch, and it must not exceed that bound.

Setting a different maximum value may affect the GPU's behavior when running the kernel, e.g. the allocation of regular L1 cache for use by the kernel (in some/most NVIDIA GPU micro-architectures, shared memory is repurposed L1 cache, and their total amount is fixed but the proportions aren't; see also §16.6.4 of the CUDA C++ Programming Guide).
Now, it's true that passing a specific amount of shared memory at launch could have implicitly done whatever setting the maximum does; but either that would have some overhead, or it's just how NVIDIA has chosen to do things.