Performance decrease when calling a CUDA kernel inside a loop


I was trying to improve the performance of some slow code. The code used CBLAS, and I tried to speed it up with MAGMA and CUDA. At first I simply replaced the CBLAS calls with MAGMA calls, but that requires CPU <-> GPU copies inside the loop, so the program ran even slower than the CBLAS version. Then, thanks to a suggestion from a Stack Overflow member, I moved part of the computation into a CUDA kernel, which saves one copy per iteration and improved performance a bit. However, my code is still much slower than the CPU version. Is this caused by launching the kernel inside the loop? Is there a way to eliminate all of the CPU <-> GPU copies inside the loop? I'm starting to think this code may simply not be worth parallelizing.

Here is my code:

__global__ void calculateGamma(double *d_delta, double *d_gamma_xi,
                               double *dotresult, double *gamma_output) {
  // Launched with <<<1,1>>>, so index is always 0: this kernel only
  // combines three device-resident scalars.
  int index = blockIdx.x;
  gamma_output[index] = -(*d_gamma_xi + *dotresult) / *d_delta;
}

for (i = 0; i < m - 1; i++) {
    if (i == 0) {
        gamma = -gamma_x[i + 1] / delta;
        // host -> device copy
        cudaMemcpy(d_gammaOutput, &gamma, sizeof(double), cudaMemcpyHostToDevice);
    } else {
        // dot product computed on the GPU; dotresult is a device pointer
        cublasDdot(h, i, &d_gamma_x[1], 1, &d_l2[1], 1, dotresult);
        cudaDeviceSynchronize();
        cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST);

        calculateGamma<<<1, 1>>>(d_delta, &d_gamma_x[i + 1], dotresult, d_gammaOutput);
        // device -> host copy in every iteration: the remaining transfer
        cudaMemcpy(get_gamma_output, d_gammaOutput, sizeof(double), cudaMemcpyDeviceToHost);

        gamma = *get_gamma_output;
        magma_dcopy(i, &d_l2[1], 1, &d_l1[2], 1, queue);
        magma_daxpy(i, gamma, &d_l2[1], -1, &d_l1[2], 1, queue);

        magma_dswap(ny, d_l1, 1, d_l2, 1, queue);
    }
    magma_dcopy(1, d_gammaOutput, 1, &d_l2[1], 1, queue);
    // magma_ddot blocks and returns its result on the host
    delta = gamma_x[0] + magma_ddot(i + 1, &d_gamma_x[1], 1, &d_l2[1], -1, queue);

    ln_determinant_C += log(delta);
}
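For what it's worth, one way to remove the per-iteration device-to-host copy of gamma is to keep the scalar on the device end to end: cuBLAS can both write the dot-product result to device memory and read the axpy scaling factor from device memory when the pointer mode is `CUBLAS_POINTER_MODE_DEVICE`. The sketch below is untested and reuses the variable names from the question (`h`, `d_gamma_x`, `d_l1`, `d_l2`, `dotresult`, `d_gammaOutput`); the offsets and the negative increment from the original MAGMA calls may need adjusting, since `cublasDaxpy` is used here in place of `magma_daxpy`.

```cuda
// Single-thread kernel that combines three device-resident scalars,
// so gamma never has to travel back to the host.
__global__ void calculateGamma(const double *d_delta, const double *d_gamma_xi,
                               const double *dotresult, double *gamma_output) {
    *gamma_output = -(*d_gamma_xi + *dotresult) / *d_delta;
}

// Once, before the loop: scalar arguments/results live in device memory.
cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE);

// Inside the loop body:
cublasDdot(h, i, &d_gamma_x[1], 1, &d_l2[1], 1, dotresult);  // result stays on the GPU
calculateGamma<<<1, 1>>>(d_delta, &d_gamma_x[i + 1], dotresult, d_gammaOutput);
// In device pointer mode, alpha is read from device memory (d_gammaOutput):
cublasDaxpy(h, i, d_gammaOutput, &d_l2[1], 1, &d_l1[2], 1);
```

Note that the `ln_determinant_C += log(delta)` accumulation still forces one scalar back to the host each iteration (or a device-side `log` and a final reduction), so the transfers cannot be eliminated entirely without restructuring that part as well.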

1 Answer

eldev09:

Update: this code was slower because of my weak GPU. Running it on a more capable GPU made it much faster than the CBLAS version of the code.