Performance decrease when calling a CUDA kernel inside a loop


I was trying to improve the performance of some slow code. The code used CBLAS, and I tried to speed it up with MAGMA and CUDA. At first I simply replaced the CBLAS calls with MAGMA calls, but that requires CPU <-> GPU copies inside the loop, so the program ran even slower than the CBLAS version. Then, thanks to a suggestion from a Stack Overflow member, I moved part of the computation into a CUDA kernel, which saves one copy per iteration and improved performance a bit. However, my code is still much slower than the CPU version. Is this caused by launching the kernel inside the loop? Is there a way to eliminate all of the CPU <-> GPU copies inside the loop? I'm starting to think this code may simply not be worth parallelizing.

Here is my code:

__global__ void calculateGamma(double *d_delta, double *d_gamma_xi,
                               double *dotresult, double *gamma_output) {
  // Launched with <<<1,1>>>, so index is always 0: this kernel only
  // combines three device-resident scalars.
  int index = blockIdx.x;
  gamma_output[index] = -(*d_gamma_xi + *dotresult) / *d_delta;
}

for (i = 0; i < m - 1; i++) {
    if (i == 0) {
        gamma = -gamma_x[i + 1] / delta;
        // host -> device copy
        cudaMemcpy(d_gammaOutput, &gamma, sizeof(double), cudaMemcpyHostToDevice);
    } else {
        // dot product computed on the GPU; dotresult is a device pointer
        cublasDdot(h, i, &d_gamma_x[1], 1, &d_l2[1], 1, dotresult);
        cudaDeviceSynchronize();
        cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST);

        calculateGamma<<<1, 1>>>(d_delta, &d_gamma_x[i + 1], dotresult, d_gammaOutput);
        // device -> host copy in every iteration: the remaining transfer
        cudaMemcpy(get_gamma_output, d_gammaOutput, sizeof(double), cudaMemcpyDeviceToHost);

        gamma = *get_gamma_output;
        magma_dcopy(i, &d_l2[1], 1, &d_l1[2], 1, queue);
        magma_daxpy(i, gamma, &d_l2[1], -1, &d_l1[2], 1, queue);

        magma_dswap(ny, d_l1, 1, d_l2, 1, queue);
    }
    magma_dcopy(1, d_gammaOutput, 1, &d_l2[1], 1, queue);
    // magma_ddot blocks and returns its result on the host
    delta = gamma_x[0] + magma_ddot(i + 1, &d_gamma_x[1], 1, &d_l2[1], -1, queue);

    ln_determinant_C += log(delta);
}
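For what it's worth, one way to remove the per-iteration device-to-host copy of gamma is to keep the scalar on the device end to end: cuBLAS can both write the dot-product result to device memory and read the axpy scaling factor from device memory when the pointer mode is `CUBLAS_POINTER_MODE_DEVICE`. The sketch below is untested and reuses the variable names from the question (`h`, `d_gamma_x`, `d_l1`, `d_l2`, `dotresult`, `d_gammaOutput`); the offsets and the negative increment from the original MAGMA calls may need adjusting, since `cublasDaxpy` is used here in place of `magma_daxpy`.

```cuda
// Single-thread kernel that combines three device-resident scalars,
// so gamma never has to travel back to the host.
__global__ void calculateGamma(const double *d_delta, const double *d_gamma_xi,
                               const double *dotresult, double *gamma_output) {
    *gamma_output = -(*d_gamma_xi + *dotresult) / *d_delta;
}

// Once, before the loop: scalar arguments/results live in device memory.
cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE);

// Inside the loop body:
cublasDdot(h, i, &d_gamma_x[1], 1, &d_l2[1], 1, dotresult);  // result stays on the GPU
calculateGamma<<<1, 1>>>(d_delta, &d_gamma_x[i + 1], dotresult, d_gammaOutput);
// In device pointer mode, alpha is read from device memory (d_gammaOutput):
cublasDaxpy(h, i, d_gammaOutput, &d_l2[1], 1, &d_l1[2], 1);
```

Note that the `ln_determinant_C += log(delta)` accumulation still forces one scalar back to the host each iteration (or a device-side `log` and a final reduction), so the transfers cannot be eliminated entirely without restructuring that part as well.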

1 Answer

eldev09:

Update: this code was slower because of my weak GPU. Running it on a more capable GPU made it much faster than the CBLAS version of the code.