GPU offloading of triangular loop

31 Views Asked by At

I have triangular loop to calculate forces, which I ported to GPU via OpenMP offloading:

!$omp target teams distribute parallel do simd &
!$omp        private(i,j,ii,jj,rinvdenom,tp,index,fix0) &
!$omp        map(to:ap1,ap2,ap2_2,ap1_2,ap1p2,ap2p2,ri_2,gi3,mass) &
!$omp        map(tofrom:ugrav,ugrav_aux)

                do kk=0,nn1*(nn1-1)/2-1
                  ii=kk/nn1
                  jj=mod(kk,nn1)
                  if(jj.le.ii)then
                    ii=nn1-ii-2
                    jj=nn1-jj-1
                  endif
                  i=ii+1
                  j=jj+1
                  rinvdenom=1.d0/(ap1_2(j)+ri_2(i)+ap2_2(j)-ap2p2(i)*ap2(j))
                  tp=min(0.9995d0,ap1p2(i)*ap1(j)*rinvdenom)*10000.d0
                  index=1+int(tp)
                  fix0=gi3(index)*sqrt(rinvdenom)
                  ugrav(i)=ugrav(i)+mass(j)*fix0
                  !$omp atomic
                  ugrav_aux(j)=ugrav_aux(j)+mass(i)*fix0
                enddo

!$omp end target teams distribute parallel do simd

                do i=1,n1
                  ugrav(i)=-g/2./pi*(ugrav(i)+ugrav_aux(i))
                enddo

The code works and yields the correct result in the ugrav array. Yet, it only works because of the atomic directive before calculating ugrav_aux(j) and it is very slow, because I'm serializing the loop.

I would think that a reduction on ugrav and ugrav_aux would be needed, such as:

!$omp target teams distribute parallel do simd &
!$omp        private(i,j,ii,jj,rinvdenom,tp,index,fix0) reduction(+:ugrav_aux,ugrav)&
!$omp        map(to:ap1,ap2,ap2_2,ap1_2,ap1p2,ap2p2,ri_2,gi3,mass) &
!$omp        map(tofrom:ugrav,ugrav_aux)

but if I do that I get an error:

libgomp: cuCtxSynchronize error: an illegal memory access was encountered

Any suggestion on how to get rid of the atomic to speed the calculation up?

NOTE: I compiled with: gfortran -fopenmp -foffload=nvptx-none <source>

0

There are 0 best solutions below