I want to know how many clock cycles instructions cost on an NVIDIA GPU, such as add, mul, ld/st, and so on. How can I measure this?
I wrote some code to test this and ran it on a 2080 Ti:
#include <cstdio>
#include <cstdint>

__global__ void mul_test()
{
    const int test_cnt = 1000;
    auto lastT = clock64();   // start timestamp from the per-SM 64-bit clock
    uint32_t res;             // note: uninitialized (the answer below flags this)
    #pragma unroll
    for (int i = 0; i < test_cnt; ++i) {
        asm volatile("mul.lo.u32 %0, %0, %1;"
                     : "+r"(res)
                     : "r"(i));
        asm volatile("mul.hi.u32 %0, %0, %1;"
                     : "+r"(res)
                     : "r"(i));
    }
    printf("in gpu phase 1 :%lld %u\n", clock64() - lastT, res);
}

int main()
{
    mul_test<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
But the result confused me. The output is:
in gpu phase 1 :6 0
Why is the reported clock cost only 6 when so many mul instructions were issued? Is there some optimization in the nvcc compiler?
I entered the command cuobjdump --dump-ptx ./cutest and got the following assembly (PTX) instructions:
mov.u64 %rd2, %clock64;
mov.u32 %r38002, 0;
mul.lo.u32 %r7, %r7, %r38002;
mul.hi.u32 %r7, %r7, %r38002;
mov.u32 %r38005, 1;
mul.lo.u32 %r7, %r7, %r38005;
mul.hi.u32 %r7, %r7, %r38005;
mov.u32 %r38008, 2;
mul.lo.u32 %r7, %r7, %r38008;
mul.hi.u32 %r7, %r7, %r38008;
mov.u32 %r38011, 3;
mul.lo.u32 %r7, %r7, %r38011;
mul.hi.u32 %r7, %r7, %r38011;
mov.u32 %r38014, 4;
mul.lo.u32 %r7, %r7, %r38014;
mul.hi.u32 %r7, %r7, %r38014;
mov.u32 %r38017, 5;
mul.lo.u32 %r7, %r7, %r38017;
...
The assembly above shows that all the instructions are present, so nothing was optimized away. Why, then, is the reported clock cost so small?
Also, is there another way to get instruction costs on an NVIDIA GPU? Is there documentation specifying these details?
Probably the most important takeaway here is do not use PTX for this kind of analysis.
When I compile the code you have shown, the SASS code (what the GPU actually executes) bears little resemblance to the PTX code you have shown:
The SASS code shows no evidence of your loop, nor any unrolling.
Yes, the tool that converts PTX to SASS (ptxas) is an optimizing compiler; you now have an example of this above.
The biggest reason your measurement is so small is that the loop has been optimized away entirely.
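For reference, you can dump both representations from the same binary and compare them, using the same cuobjdump tool you already used:

cuobjdump --dump-ptx ./cutest    # PTX: the virtual ISA, input to ptxas
cuobjdump --dump-sass ./cutest   # SASS: what the GPU actually executes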
For the most part, NVIDIA doesn't publish anything like that.
People who are interested in these things usually end up writing microbenchmarking codes, something like the one you wrote. Some notable example reports are published by the Citadel group; here is one. That one covers the T4 GPU which, for instruction latencies, should be similar to your 2080 Ti (both are Turing-class).
I made a simple change to your code that "breaks" the compiler's ability to optimize:
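I won't reproduce the exact edit here, but a minimal sketch of the idea (the names and specifics below are my own illustration, not necessarily the change I actually made) is to make the loop count and the initial value runtime kernel arguments, and to make the result observable, so that neither nvcc nor ptxas can fold or discard the multiply chain:

#include <cstdio>
#include <cstdint>

__global__ void mul_timing(int test_cnt, uint32_t seed, uint32_t *out)
{
    uint32_t res = seed;            // runtime initial value: the chain can't be constant-folded
    long long lastT = clock64();
    for (int i = 0; i < test_cnt; ++i) {
        asm volatile("mul.lo.u32 %0, %0, %1;" : "+r"(res) : "r"(i));
        asm volatile("mul.hi.u32 %0, %0, %1;" : "+r"(res) : "r"(i));
    }
    long long elapsed = clock64() - lastT;
    *out = res;                     // result escapes to memory, so it can't be discarded
    printf("clocks: %lld  res: %u\n", elapsed, res);
}

Launched as, say, mul_timing<<<1, 1>>>(1000, 12345u, d_out), the loop survives into the SASS. Since each mul reads the result of the previous one, dividing the elapsed clocks by the number of chained multiplies (2 * test_cnt) then gives a rough per-instruction latency estimate.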
You now know how to compare the PTX and SASS. If you study the SASS for the above case, you will observe the existence of a loop in the SASS code, consistent with the loop in your source code.
As an aside, your initial code did arithmetic on, and printed, an uninitialized variable (uint32_t res;).
AFAIK that invokes UB in C++. My general understanding is that if your code contains UB, the results may be unpredictable or confusing, and that a compiler may make "unexpected" optimizations in the presence of UB, although I'm not stating that is happening in this case. So my suggestion is to make sure your code is not invoking UB before you start microbenchmarking.
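The fix is a one-line initialization; for benchmarking purposes, any value that isn't a compile-time constant works well, for example:

uint32_t res = threadIdx.x + 1;   // initialized (no UB), and not known at compile time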