Here is my problem: I have a fortran code with a certain amount of nested loops and I would like to know if it's possible to use OpenMP to parallelize them?
Here is the fortran code:
ia = 0
do i=1,nO
do a=nO+1,N
ia = ia + 1
jb = 0
do j=1,nO
do b=nO+1,N
jb = jb + 1
chi = 0d0
do kc=1,nS
eps = B(kc)**2 + eta**2
chi = chi + A(i,j,kc)*A(a,b,kc)*B(kc)/eps
enddo
C(ia,jb) = C(ia,jb) + D(i,b,j,a) - 4d0*chi
enddo
enddo
enddo
enddo
Where nS = nO*(N-nO). For the dimensions of the arrays: B(nS), A(N,N,nS), C(nS,nS), D(N,N,N,N) . An example for the different integers: nO=88, N=240, nS=13376. Maybe it is worth mentioning that I run it on the cluster of my lab and not my laptop.
I know nothing about OpenMP and I tried to do it myself with examples that I found but I did not succeed. I have seen many different commands with OpenMP like PRIVATE, SHARED, REDUCTION and I tried to use them without success.
Yes, it should be possible. But it will also be good to rewrite the code somewhat. Notice that you can always compute the value of
iafrom the known values ofianda. There is no need to do theia = ia + 1additions. Similarly, you can always computejbfromjandb.As Ian Bush reminded me in the comment, the memory access pattern is bad. In Fortran, the memory order of the storage of array elements is column-major. That means that the first index should be in the innermost loop and the last index in the outermost loop for best efficiency. That can be achieved by declaring your arrays transposed or by restructuring the loop nests - which may not always be possible.
In the code above you can switch the order of the
i,j,a,bloops without any change to the logicbut it would help even more if the
kcindex was the first one.If re-declaring your arrays - this may have a large impact on the run time of your program. This also concerns other parts of you whole program. You have to take the whole program into account.
After that you can apply OpenMP.
I would start with a simple:
Later, if it runs correctly, you try different
scheduleto find out what is faster and possibly also trycollapse(4).