cuBLAS GEMMs not respecting NaN inputs


I came across a strange behaviour with cuBLAS.

Consider the following code:

#include <cstdio>
#include <limits>

#include <cublas_v2.h>
#include <cuda_runtime.h>

void test(float matrixValue) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Pinned host buffers, one 2x2 matrix each
    float *A, *B, *C;
    cudaMallocHost(&A, sizeof(float) * 4);
    cudaMallocHost(&B, sizeof(float) * 4);
    cudaMallocHost(&C, sizeof(float) * 4);

    // Fill A and B with the test value (a NaN), C with a sentinel
    for (int i = 0; i < 4; i++) {
        A[i] = matrixValue;
        B[i] = matrixValue;
        C[i] = 123.0f;
    }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * 4);
    cudaMalloc(&dB, sizeof(float) * 4);
    cudaMalloc(&dC, sizeof(float) * 4);
    
    cudaMemcpy(dA, A, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(float) * 4, cudaMemcpyHostToDevice);

    float alpha = 0.0f;
    float beta = 0.0f;

    // C = alpha * A * B + beta * C on 2x2 column-major matrices
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 2, 2, 2, &alpha, dA, 2, dB, 2, &beta, dC, 2);

    cudaMemcpy(C, dC, sizeof(float) * 4, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 4; i++) {
        printf("%f ", C[i]);
    }
    printf("\n");

    cublasDestroy(handle);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

int main() {
    test(std::numeric_limits<float>::signaling_NaN());
    test(std::numeric_limits<float>::quiet_NaN());
}

We first allocate three buffers on the CPU (A, B, C), fill A and B with NaNs and C with a sentinel value (123), upload them to the GPU, perform a GEMM on them using cuBLAS, download the result into the C buffer, and then print the values in C. So far so good.

If we use alpha = 1.0 when performing the GEMM, the C buffer will be full of NaNs, which is to be expected, as any calculation involving a NaN results in a NaN.

But if we use alpha = 0, the C buffer will be full of zeros even though 0 * NaN should result in NaN. The dC matrix is still overwritten, which means that the GEMM (at least the accumulation part) is not skipped. I could not find any statement in the NVIDIA documentation that the multiplication part of the GEMM operation may be skipped when alpha equals zero. And in that case, doesn't this make cuBLAS non-IEEE-754-compliant when either A or B contains a NaN and alpha is set to zero?
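For reference, a plain host-side multiplication does follow IEEE-754 here (a minimal check, independent of cuBLAS):

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    float nan = std::numeric_limits<float>::quiet_NaN();
    float product = 0.0f * nan;  // IEEE-754: 0 * NaN is NaN, not 0
    printf("0 * NaN = %f (isnan: %d)\n", product, (int)std::isnan(product));
}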

Accepted answer, by Homer512:

This is not a bug as far as I can tell, but expected behavior.

Note that the cuBLAS docs read:

For references please refer to: sgemm, dgemm, cgemm, zgemm

Those links lead to the reference BLAS Fortran code. If you look at it, you will find lines such as:

*
*     Quick return if possible.
*
      IF ((M.EQ.0) .OR. (N.EQ.0) .OR.
     +    (((ALPHA.EQ.ZERO).OR. (K.EQ.0)).AND. (BETA.EQ.ONE))) RETURN
*
*     And if  alpha.eq.zero.
*
      IF (ALPHA.EQ.ZERO) THEN
          IF (BETA.EQ.ZERO) THEN
              DO 20 J = 1,N
                  DO 10 I = 1,M
                      C(I,J) = ZERO
   10             CONTINUE
   20         CONTINUE
          ELSE
              DO 40 J = 1,N
                  DO 30 I = 1,M
                      C(I,J) = BETA*C(I,J)
   30             CONTINUE
   40         CONTINUE
          END IF
          RETURN
      END IF

I'm not good at reading Fortran, but that very much looks like the observed behavior: when ALPHA equals zero, C is simply zeroed (or scaled by BETA) and the routine returns without ever reading A or B. So CUDA just follows the de facto standard.
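In C++ terms, that special case amounts to something like the following (a rough sketch of the reference semantics for the no-transpose, column-major case; the function name is made up and this is not cuBLAS's actual implementation):

// Hypothetical sketch of the reference BLAS alpha == 0 fast path.
// A and B are never dereferenced, so any NaNs in them cannot propagate.
void sgemm_alpha_zero_path(int m, int n, float beta, float *C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            if (beta == 0.0f)
                C[i + j * ldc] = 0.0f;                  // overwrite: old contents ignored
            else
                C[i + j * ldc] = beta * C[i + j * ldc]; // scale in place
        }
    }
    // the general path (C = alpha*A*B + beta*C) is skipped entirely
}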

Personally, I also find this behavior appropriate, since it is the same thing you expect for beta. For beta you definitely need this special case, because otherwise you could never reuse memory without explicitly zeroing it first: stale data in the output buffer could be interpreted as NaN or Inf.
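To illustrate, here is a small host-only sketch of the two possible beta == 0 semantics (the NaN stands in for whatever stale bits happen to sit in a freshly allocated buffer):

#include <cstdio>
#include <limits>

int main() {
    // Stands in for the indeterminate contents of a freshly allocated C buffer.
    float stale = std::numeric_limits<float>::quiet_NaN();
    float beta = 0.0f;

    float scaled    = beta * stale; // literal IEEE-754 scaling: the NaN leaks through
    float overwrite = 0.0f;         // BLAS semantics: previous contents are ignored

    printf("beta * C = %f, overwritten C = %f\n", scaled, overwrite); // nan vs 0.000000
}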