I'm trying to use torch.autograd.profiler to profile the run time of the different steps in a multi-head attention block. I added profiler.record_function in several places, but the reported run times change every time I add another record_function.

For example, I added one "with profiler.record_function("SOFTMAX PASS"):" around the softmax step.
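
Inside the attention forward pass it looks roughly like this (a simplified sketch; the variable names are illustrative, not exactly the ones in my code):

with profiler.record_function("SOFTMAX PASS"):
    attn_weights = torch.softmax(scores, dim=-1)

Then I ran the profiling and printed the results with: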

import torch
from torch.autograd import profiler

with profiler.profile(use_cuda=True, record_shapes=True, profile_memory=True) as prof:
    # with profiler.record_function("model_inference"):
    a = multihead_attention(queries, keys, values)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))

This profiler prints both CPU and CUDA run times. However, only the step wrapped in record_function is attributed any CUDA run time; all the other steps show up on the CPU side but report no CUDA time. In my results, SOFTMAX PASS takes 100% of the CUDA run time (295us), while other events such as cudaDeviceSynchronize take 0us of CUDA time.

After I added "with profiler.record_function("QK MATMUL PASS"):" around one matrix multiplication step, the results changed: QK MATMUL PASS now takes 79.2% of the total CUDA time (811us), and SOFTMAX PASS takes 20.8% (213us).
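
That step looks roughly like this (again a sketch; q and k stand for the projected query and key tensors in my code):

with profiler.record_function("QK MATMUL PASS"):
    # scaled dot-product attention scores; q, k: (batch, heads, seq_len, d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)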

So I would like to know why the CUDA time profile changes with each record_function I add. Does the profiler only show CUDA run time for steps wrapped in record_function? The most important question is: what should I do if I want to get every step's run time on CUDA? Is there an example or tutorial for this? I tried the examples in the PyTorch documentation, but they are not very detailed. Also, what do these events (e.g. cudaDeviceSynchronize) mean, and where can I find documentation for them?

[Screenshot: the profiler key_averages() output table]
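
For reference, here is a self-contained sketch of what I am doing overall, with every step annotated (simplified from my notebook; the "AV MATMUL PASS" label and the tensor shapes are just illustrative):

import torch
from torch.autograd import profiler

def multihead_attention(queries, keys, values):
    # queries/keys/values: (batch, heads, seq_len, d_k), already projected and split into heads
    with profiler.record_function("QK MATMUL PASS"):
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / (queries.size(-1) ** 0.5)
    with profiler.record_function("SOFTMAX PASS"):
        weights = torch.softmax(scores, dim=-1)
    with profiler.record_function("AV MATMUL PASS"):
        out = torch.matmul(weights, values)
    return out

queries = keys = values = torch.randn(2, 8, 64, 32, device="cuda")

with profiler.profile(use_cuda=True) as prof:
    a = multihead_attention(queries, keys, values)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))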

My full program is in this Colab notebook: https://colab.research.google.com/drive/1FGQ_tHNDxEDAGiaYilGIj5HPlEMiT5pp#scrollTo=AIHpP5g1SXSS
