I have triangular loop to calculate forces, which I ported to GPU via OpenMP offloading:
!$omp target teams distribute parallel do simd &
!$omp private(i,j,ii,jj,rinvdenom,tp,index,fix0) &
!$omp map(to:ap1,ap2,ap2_2,ap1_2,ap1p2,ap2p2,ri_2,gi3,mass) &
!$omp map(tofrom:ugrav,ugrav_aux)
do kk=0,nn1*(nn1-1)/2-1
ii=kk/nn1
jj=mod(kk,nn1)
if(jj.le.ii)then
ii=nn1-ii-2
jj=nn1-jj-1
endif
i=ii+1
j=jj+1
rinvdenom=1.d0/(ap1_2(j)+ri_2(i)+ap2_2(j)-ap2p2(i)*ap2(j))
tp=min(0.9995d0,ap1p2(i)*ap1(j)*rinvdenom)*10000.d0
index=1+int(tp)
fix0=gi3(index)*sqrt(rinvdenom)
ugrav(i)=ugrav(i)+mass(j)*fix0
!$omp atomic
ugrav_aux(j)=ugrav_aux(j)+mass(i)*fix0
enddo
!$omp end target teams distribute parallel do simd
do i=1,n1
ugrav(i)=-g/2./pi*(ugrav(i)+ugrav_aux(i))
enddo
The code works and yields the correct result in the ugrav array. Yet, it only works because of the atomic directive before calculating ugrav_aux(j) and it is very slow, because I'm serializing the loop.
I would think that a reduction on ugrav and ugrav_aux would be needed, such as:
!$omp target teams distribute parallel do simd &
!$omp private(i,j,ii,jj,rinvdenom,tp,index,fix0) reduction(+:ugrav_aux,ugrav)&
!$omp map(to:ap1,ap2,ap2_2,ap1_2,ap1p2,ap2p2,ri_2,gi3,mass) &
!$omp map(tofrom:ugrav,ugrav_aux)
but if I do that I get an error:
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
Any suggestion on how to get rid of the atomic to speed the calculation up?
NOTE: I compiled with: gfortran -fopenmp -foffload=nvptx-none <source>