CUDA half float operations without explicit intrinsics

Question

CUDA half float operations without explicit intrinsics

738 Views Asked by Bram At 07 January 2021 at 19:46

I am using CUDA 11.2 and I use the __half type to do operations on 16 bit floating point values.

I am surprised that the nvcc compiler will not properly invoke fused multiply add instructions when I do:

__half a,b,c;
...
__half x = a * b + c;

Instead of emitting a fused multiply add, it emits separate mul and add instructions.

mul.f16 %rs164,%rs1,%rs306;
add.f16 %rs167,%rs164,%rs65;

Note that this is despite using the --fmad=true compiler option.

Whereas an explicit __hfma( a,b,c ) will emit:

fma.rn.f16 %rs164,%rs1,%rs300,%rs65;

Is the only way to utilize 16 bit floating point multiply-add to use explicit intrinsics?

Original Q&A

There are 1 best solutions below

**Robert Crovella** · Accepted Answer · 2021-01-07T20:24:15.573000

The instructions that are actually executed by the GPU are SASS, not PTX. PTX is an intermediate format, and the tool that converts PTX to SASS is an optimizing compiler.

When I perform an operation as you suggest, and study the SASS, I witness a fused-multiply-add instruction being generated:

$ cat t111.cu
#include <cuda_fp16.h>
__global__ void k(__half *x, __half a, __half b, __half c){
        *x = a*b+c;
}
$ nvcc -arch=sm_75 -c t111.cu
$ cuobjdump -ptx t111.o

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_75
code version = [7,1]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

.version 7.1
.target sm_75
.address_size 64



.visible .entry _Z1kP6__halfS_S_S_(
.param .u64 _Z1kP6__halfS_S_S__param_0,
.param .align 2 .b8 _Z1kP6__halfS_S_S__param_1[2],
.param .align 2 .b8 _Z1kP6__halfS_S_S__param_2[2],
.param .align 2 .b8 _Z1kP6__halfS_S_S__param_3[2]
)
{
.reg .b16 %rs<7>;
.reg .b64 %rd<3>;


ld.param.u64 %rd1, [_Z1kP6__halfS_S_S__param_0];
ld.param.u16 %rs2, [_Z1kP6__halfS_S_S__param_1];
ld.param.u16 %rs3, [_Z1kP6__halfS_S_S__param_2];
ld.param.u16 %rs6, [_Z1kP6__halfS_S_S__param_3];
cvta.to.global.u64 %rd2, %rd1;

        {mul.f16 %rs1,%rs2,%rs3;
}


        {add.f16 %rs4,%rs1,%rs6;
}

        st.global.u16 [%rd2], %rs4;
ret;
}


$ cuobjdump -sass t111.o

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_75
                Function : _Z1kP6__halfS_S_S_
        .headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;                                /* 0x00000a0000017a02 */
                                                                                         /* 0x000fd00000000f00 */
        /*0010*/                   LDC.U16 R0, c[0x0][0x168] ;                           /* 0x00005a00ff007b82 */
                                                                                         /* 0x000e220000000400 */
        /*0020*/                   ULDC.64 UR4, c[0x0][0x160] ;                          /* 0x0000580000047ab9 */
                                                                                         /* 0x000fce0000000a00 */
        /*0030*/                   LDC.U16 R3, c[0x0][0x16a] ;                           /* 0x00005a80ff037b82 */
                                                                                         /* 0x000e240000000400 */
        /*0040*/                   HFMA2 R0, R0.H0_H0, R3.H0_H0, c[0x0] [0x16c].H0_H0 ;  /* 0x20005b0000007631 */
                                                                                         /* 0x001fd00000040803 */
        /*0050*/                   STG.E.U16.SYS [UR4], R0 ;                             /* 0x00000000ff007986 */
                                                                                         /* 0x000fe2000c10e504 */
        /*0060*/                   EXIT ;                                                /* 0x000000000000794d */
                                                                                         /* 0x000fea0003800000 */
        /*0070*/                   BRA 0x70;                                             /* 0xfffffff000007947 */
                                                                                         /* 0x000fc0000383ffff */
                ..........



Fatbin ptx code:
================
arch = sm_75
code version = [7,1]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$

(CUDA 11.1)

I don't recommend PTX analysis to answer questions like this.

CUDA half float operations without explicit intrinsics

There are 1 best solutions below

Related Questions in CUDA

Related Questions in INTRINSICS

Related Questions in NVCC

Related Questions in FMA

Related Questions in HALF-PRECISION-FLOAT

Trending Questions

Popular # Hahtags

Popular Questions