I am using CUDA 11.2 and the __half type to operate on 16-bit floating-point values.
I am surprised that nvcc does not emit a fused multiply-add instruction when I write:
__half a,b,c;
...
__half x = a * b + c;
Instead of a fused multiply-add, the generated PTX contains separate mul and add instructions:
mul.f16 %rs164,%rs1,%rs306;
add.f16 %rs167,%rs164,%rs65;
Note that this is despite using the --fmad=true compiler option.
Whereas an explicit __hfma( a,b,c ) will emit:
fma.rn.f16 %rs164,%rs1,%rs300,%rs65;
Is using explicit intrinsics the only way to get a 16-bit floating-point fused multiply-add?
The instructions that are actually executed by the GPU are SASS, not PTX. PTX is an intermediate format, and the tool that converts PTX to SASS is an optimizing compiler.
When I compile the operation you show and inspect the SASS (CUDA 11.1), I see a fused multiply-add instruction being generated.
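I don't recommend PTX analysis to answer questions like this, because the PTX-to-SASS compiler can still fuse a separate mul and add; inspect the SASS instead (for example with cuobjdump -sass).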
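For anyone who wants to repeat the check, here is a minimal sketch of a test case, not the exact code used above. The kernel names, the `run_fma_demo` helper, and the `sm_70` target are assumptions chosen for illustration; the `__half` arithmetic operators need compute capability 5.3 or higher, so an explicit `-arch` is required. It puts the plain operator form and the `__hfma` form in separate kernels so their SASS can be compared side by side.

```cuda
// half_fma.cu -- hypothetical file name, for illustration only.
// Build and dump SASS (assumes an sm_70 or newer GPU; adjust -arch as needed):
//   nvcc -arch=sm_70 --fmad=true -o half_fma half_fma.cu
//   cuobjdump -sass half_fma
#include <cuda_fp16.h>
#include <cstdio>

// Plain operator form: relies on the compiler to contract mul + add.
__global__ void fma_operator(const __half* a, const __half* b,
                             const __half* c, __half* out)
{
    *out = (*a) * (*b) + (*c);
}

// Explicit intrinsic form, for comparison in the SASS dump.
__global__ void fma_intrinsic(const __half* a, const __half* b,
                              const __half* c, __half* out)
{
    *out = __hfma(*a, *b, *c);
}

// Host helper (hypothetical): runs one of the kernels on a single triple
// of values and returns the result as float. Error checking is omitted
// to keep the sketch short.
float run_fma_demo(float a, float b, float c, bool use_intrinsic)
{
    __half h[3] = { __float2half(a), __float2half(b), __float2half(c) };
    __half* d = nullptr;
    cudaMalloc(&d, 4 * sizeof(__half));   // 3 inputs + 1 output
    cudaMemcpy(d, h, 3 * sizeof(__half), cudaMemcpyHostToDevice);

    if (use_intrinsic)
        fma_intrinsic<<<1, 1>>>(d, d + 1, d + 2, d + 3);
    else
        fma_operator<<<1, 1>>>(d, d + 1, d + 2, d + 3);

    __half r;
    cudaMemcpy(&r, d + 3, sizeof(__half), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return __half2float(r);
}

int main()
{
    printf("operator : %f\n", run_fma_demo(2.0f, 3.0f, 1.0f, false));
    printf("intrinsic: %f\n", run_fma_demo(2.0f, 3.0f, 1.0f, true));
    return 0;
}
```

In the `cuobjdump -sass` output, look at the body of each kernel for a half-precision fused multiply-add (on recent architectures this typically shows up with a mnemonic such as HFMA2) rather than a separate multiply followed by an add. The exact instruction names and registers depend on the target architecture and toolkit version.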