I'm trying to use torch.autograd.profiler to profile the runtime of the different steps in a multi-head attention block. I added profiler.record_function in several places, but the reported runtimes change every time I add another record_function.
For example, I wrapped the softmax step in with profiler.record_function("SOFTMAX PASS"):, then ran the profiling and printed the results with:
import torch.autograd.profiler as profiler

with profiler.profile(use_cuda=True, record_shapes=True, profile_memory=True) as prof:
    # with profiler.record_function("model_inference"):
    a = multihead_attention(queries, keys, values)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
The profiler prints both CPU and CUDA runtimes. However, only the step wrapped in record_function is attributed any CUDA runtime; all the other steps appear on the CPU side but report no CUDA time. With the code above, SOFTMAX PASS takes 100% of the CUDA runtime (295us), while other events such as cudaDeviceSynchronize take 0us of CUDA time.
After I also wrapped one matrix-multiplication step in with profiler.record_function("QK MATMUL PASS"):, the results changed: QK MATMUL PASS now takes 79.2% of the total CUDA time (811us) and SOFTMAX PASS takes 20.8% (213us).
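For context, the wrapping looks roughly like this (a minimal sketch of my attention forward assuming standard scaled dot-product attention; the actual code is in the colab linked below):

import torch
import torch.nn.functional as F
import torch.autograd.profiler as profiler

def multihead_attention(queries, keys, values):
    # Score computation wrapped so it shows up as "QK MATMUL PASS" in the table
    with profiler.record_function("QK MATMUL PASS"):
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / (queries.size(-1) ** 0.5)
    # Softmax wrapped so it shows up as "SOFTMAX PASS"
    with profiler.record_function("SOFTMAX PASS"):
        attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, values)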
So I would like to know why the CUDA time profile changes as record_function calls are added. Does the profiler only show CUDA runtime for steps wrapped in record_function? The most important question: what should I do to get the CUDA runtime of all the steps? Is there an example or tutorial on this? I tried the examples in the PyTorch documentation, but they are not very detailed. Also, what do these events mean, and where can I find documentation for them?
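For reference, this is the kind of doc example I tried (a sketch based on the torch.profiler recipe, sorting by CUDA time instead of CPU time; multihead_attention and the input tensors are from my notebook):

import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Record both CPU and CUDA activity explicitly
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True, profile_memory=True) as prof:
    with record_function("model_inference"):
        a = multihead_attention(queries, keys, values)

# Sort by total CUDA time rather than self CPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))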

My full program is in this colab notebook: https://colab.research.google.com/drive/1FGQ_tHNDxEDAGiaYilGIj5HPlEMiT5pp#scrollTo=AIHpP5g1SXSS