CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I tried running export CUDA_LAUNCH_BLOCKING=1 in bash, and it worked. However, I am not sure whether this affects GPU efficiency. Is there a setting I can change to make this the default, so that I do not have to type the command every time?
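To make concrete what I mean by "default", here is a minimal sketch of setting the variable from inside the script instead of the shell. The helper name enable_blocking_launches is hypothetical (my own, not part of PyTorch), and I am assuming the variable has to be set before CUDA is initialized, so the call would go before the first import of torch.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Hypothetical helper: default CUDA_LAUNCH_BLOCKING to "1".

    setdefault keeps any value already exported in the shell,
    so an explicit CUDA_LAUNCH_BLOCKING=0 is still respected.
    """
    env.setdefault("CUDA_LAUNCH_BLOCKING", "1")
    return env["CUDA_LAUNCH_BLOCKING"]


# Assumption: this must run before CUDA is initialized, so call it
# at the very top of the script, ahead of `import torch`.
enable_blocking_launches()
```

The shell equivalent would be adding the export line to ~/.bashrc. Either way, blocking launches force each kernel launch to wait for completion, which generally costs throughput, so this is probably only worth leaving on while debugging.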