Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause one of the running kernels with an instruction issued from the host code, and later resume it with another instruction from the host code? I have no idea how to write sample code for this case, so I cannot give an example to go with the question.
More precisely: is there an instruction in CUDA that can pause a kernel running in a CUDA stream and then resume it?
You can use dynamic parallelism, passing kernel parameters that point to memory the host can write, so the host can send signals. Launch a parent kernel with only one CUDA thread and let it launch child kernels continuously until the work is done or a signal is received. If a child kernel does not fully occupy the GPU, you will lose performance.
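A minimal sketch of that idea, assuming CUDA 12 or later (for the device-side `cudaStreamFireAndForget` stream) and a GPU that supports dynamic parallelism. The names `parentKernel`, `childKernel`, `runDemo` and the `RUN`/`PAUSE`/`STOP` protocol are hypothetical. The signal is kept in mapped pinned memory (`cudaHostAlloc` with `cudaHostAllocMapped`), which is just one possible choice for the host-visible parameter. The parent checks it only between child launches, so a pause takes effect at the next chunk boundary; a child that is already running is not interrupted.

```cuda
// Sketch only. Assumes CUDA 12+ (CDP2) and dynamic parallelism support.
// Build: nvcc -rdc=true pausable.cu -o pausable -lcudadevrt
#include <cstdio>
#include <chrono>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical host -> device protocol for this sketch.
enum Command : int { RUN = 0, PAUSE = 1, STOP = 2 };

// Child grid: placeholder work on [begin, end), then each block reports completion.
__global__ void childKernel(float* data, int begin, int end, unsigned int* doneBlocks) {
    int i = begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < end) data[i] = data[i] * 2.0f + 1.0f;
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();            // publish this block's writes first
        atomicAdd(doneBlocks, 1u);  // then tell the parent this block is done
    }
}

// Parent grid, launched <<<1, 1>>>: one child per chunk, host command checked between chunks.
__global__ void parentKernel(float* data, int total, int chunk, const volatile int* command,
                             unsigned int* doneBlocks, int* chunksDone) {
    const int threads = 256;
    unsigned int launchedBlocks = 0;
    for (int begin = 0; begin < total; begin += chunk) {
        while (*command == PAUSE) { }  // spin until the host resumes or stops
        if (*command == STOP) break;
        int end = min(begin + chunk, total);
        unsigned int blocks = (end - begin + threads - 1) / threads;
        launchedBlocks += blocks;
        childKernel<<<blocks, threads, 0, cudaStreamFireAndForget>>>(data, begin, end, doneBlocks);
        // Wait for this child. CUDA does not formally guarantee that a child
        // runs while its parent spins; in practice it does when resources are free.
        while (atomicAdd(doneBlocks, 0u) < launchedBlocks) { }
        *chunksDone += 1;
    }
}

// Runs the whole job, optionally pausing and resuming it from the host.
// Returns the number of chunks processed, or -1 if the results are wrong.
int runDemo(bool pauseMidway) {
    const int total = 1 << 22, chunk = 1 << 16;  // 64 chunks
    float* data; unsigned int* doneBlocks; int* chunksDone;
    cudaMalloc(&data, total * sizeof(float));
    cudaMemset(data, 0, total * sizeof(float));
    cudaMalloc(&doneBlocks, sizeof(unsigned int));
    cudaMemset(doneBlocks, 0, sizeof(unsigned int));
    cudaMalloc(&chunksDone, sizeof(int));
    cudaMemset(chunksDone, 0, sizeof(int));

    // Mapped pinned memory: host stores reach the running kernel without a memcpy.
    int* hostCommand; int* devCommand;
    cudaHostAlloc((void**)&hostCommand, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&devCommand, hostCommand, 0);
    volatile int* cmd = hostCommand;
    *cmd = RUN;

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    parentKernel<<<1, 1, 0, stream>>>(data, total, chunk, devCommand, doneBlocks, chunksDone);
    if (pauseMidway) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        *cmd = PAUSE;  // parent stops launching after the current child
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        *cmd = RUN;    // resume
    }
    cudaStreamSynchronize(stream);

    int processed = 0;
    cudaMemcpy(&processed, chunksDone, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<float> h(total);
    cudaMemcpy(h.data(), data, total * sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (float v : h) ok = ok && (v == 1.0f);  // 0 * 2 + 1 for every element

    cudaStreamDestroy(stream);
    cudaFreeHost(hostCommand);
    cudaFree(data); cudaFree(doneBlocks); cudaFree(chunksDone);
    return ok ? processed : -1;
}

int main() {
    printf("chunks processed with pause/resume: %d\n", runDemo(true));
    return 0;
}
```

Setting the flag to `STOP` instead of `RUN` makes the parent return early, leaving the remaining chunks unprocessed.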
There is no command that pauses a stream. If multiple GPUs are involved, use unified memory allocations for the communication between the GPUs.
To work around the GPU utilization issue, you could build a task queue for the child kernels. The parent first pushes N work items (N roughly large enough to keep the GPU efficient in power and compute). Every completed child kernel then increments a dedicated counter that the parent kernel watches, and the parent pushes a new work item for each completion until all work is done, trying to keep N child kernels running concurrently.
Maybe something like this:
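A minimal sketch of that task queue, under the same assumptions as before (CUDA 12+, dynamic parallelism, device-side `cudaStreamFireAndForget`). The names `taskKernel`, `schedulerKernel`, `runTaskQueue`, `maxInFlight` (the N above) and the per-task `blocksDone` array are hypothetical. Each child grid detects its own completion with a "last block to finish" counter and then increments the shared `tasksCompleted` counter; the single-thread parent polls that counter and tops the queue back up to `maxInFlight` tasks.

```cuda
// Sketch only. Assumes CUDA 12+ (CDP2) and dynamic parallelism support.
// Build: nvcc -rdc=true taskqueue.cu -o taskqueue -lcudadevrt
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One task = one child grid working on `taskSize` consecutive elements.
// The last block of the grid to finish bumps the global completion counter.
__global__ void taskKernel(float* data, int task, int taskSize,
                           unsigned int* blocksDone, unsigned int* tasksCompleted) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < taskSize) data[task * taskSize + j] += 1.0f;  // placeholder work
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();  // make this block's writes visible before reporting
        if (atomicAdd(&blocksDone[task], 1u) == gridDim.x - 1)
            atomicAdd(tasksCompleted, 1u);  // the whole task has finished
    }
}

// Producer/consumer parent, launched <<<1, 1>>>: keeps at most maxInFlight
// tasks running and pushes new tasks as the completion counter grows.
__global__ void schedulerKernel(float* data, int numTasks, int taskSize, int maxInFlight,
                                unsigned int* blocksDone, unsigned int* tasksCompleted) {
    const int threads = 256;
    const unsigned int blocks = (taskSize + threads - 1) / threads;
    int launched = 0;
    while (true) {
        int completed = (int)atomicAdd(tasksCompleted, 0u);  // fresh read
        if (completed == numTasks) break;
        // Top up so that (launched - completed) stays at maxInFlight.
        while (launched < numTasks && launched - completed < maxInFlight) {
            taskKernel<<<blocks, threads, 0, cudaStreamFireAndForget>>>(
                data, launched, taskSize, blocksDone, tasksCompleted);
            ++launched;
        }
    }
}

// Returns true if every element was incremented exactly once.
bool runTaskQueue(int numTasks, int taskSize, int maxInFlight) {
    const int total = numTasks * taskSize;
    float* data; unsigned int* blocksDone; unsigned int* tasksCompleted;
    cudaMalloc(&data, total * sizeof(float));
    cudaMemset(data, 0, total * sizeof(float));
    cudaMalloc(&blocksDone, numTasks * sizeof(unsigned int));
    cudaMemset(blocksDone, 0, numTasks * sizeof(unsigned int));
    cudaMalloc(&tasksCompleted, sizeof(unsigned int));
    cudaMemset(tasksCompleted, 0, sizeof(unsigned int));

    schedulerKernel<<<1, 1>>>(data, numTasks, taskSize, maxInFlight, blocksDone, tasksCompleted);
    cudaDeviceSynchronize();

    std::vector<float> h(total);
    cudaMemcpy(h.data(), data, total * sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (float v : h) ok = ok && (v == 1.0f);
    cudaFree(data); cudaFree(blocksDone); cudaFree(tasksCompleted);
    return ok;
}

int main() {
    bool ok = runTaskQueue(256, 1 << 14, 8);  // 256 tasks, at most 8 in flight
    printf("task queue %s\n", ok ? "completed correctly" : "FAILED");
    return ok ? 0 : 1;
}
```

Keeping `maxInFlight` small also keeps the number of pending device-side launches well below the device runtime's pending-launch limit. A pause/stop check like the one in the earlier sketch could be added at the top of the scheduler loop.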
If the total work takes more than a few seconds, these atomic counter updates should not be a performance problem. If you have far too many child kernels to launch from one thread, you can launch more producer/consumer (parent) CUDA threads.