Optimizing the addition of a symmetric banded matrix stored as a 1D array to a general 2D array

74 Views Asked by At

I have a matrix B, a symmetric banded matrix stored as a 1D array, that has to be added to a general matrix A. To illustrate the example let ccc=3, which would imply B is a pentadiagonal matrix, thus the (i,j) elements would be stored as follows:

matrix_B(i,j)  [(1,1), (1,2), (1,3), (2,2), (2,3), (2,4), (3,3), ...]
array_B(idx)   [    1,     2,     3,     4,     5,     6,     7, ...]

see here how there is no need to store (2,1), (3,1) and (3,2) since (i,j) = (j,i).

I have the following code, which is an oversimplified version of the actual working one:

subroutine iter(n,ccc,matrix_A,array_B,const)
    implicit none
    integer,intent(in)::n,ccc
    complex*16,intent(inout)::matrix_A(n,n)
    complex*16,intent(in)::array_B(n*ccc)
    complex*16,intent(in)::const
    integer::i,j,idx

    do i=1,n
        do j=max(i-ccc+1,1),min(i+ccc-1,n) ! elements outside this range are 0 and thus not stored
            matrix_A(i,j)=matrix_A(i,j)+array_B(idx(i,j,ccc))*const
        end do
    end do

end subroutine iter

function idx(i,j,ccc)
    integer idx,i,j,ccc
    ! if(abs(i-j).gt.(ccc-1)) then
    !     write(*,*) 'wrong values i,j',i,j
    ! end if
    if(i.le.j) then
        idx=(i-1)*(ccc)+1+j-i
    else
        idx=(j-1)*(ccc)+1+i-j
    endif
    return
end

The problem is, this is a rather slow routine, and it is the bottleneck of a large program. Besides looping over the fast index i (is not so trivial in the real code), I thought about doing something like this:

    do i=1,n
        j=i
        matrix_A(i,j)=matrix_A(i,j)+array_B((i-1)*(ccc)+1+j-i)*const
        do j=i+1,min(i+ccc-1,n)
            matrix_A(i,j)=matrix_A(i,j)+array_B((i-1)*(ccc)+1+j-i)*const
            matrix_A(j,i)=matrix_A(j,i)+array_B((i-1)*(ccc)+1+j-i)*const
        end do
    end do

Here the function is inlined manually (now i<=j always), and the inner loop is halved, but this does not give significant increase in performance. Is there a way to optimize this further? More specifically, knowing the same value has to be added to matrix_A(i,j) and matrix_A(j,i) (which is not symmetric).

Thank you.

Trying to improve this solution. Using fortran95 and gfortran-8.

1

There are 1 best solutions below

0
PierU On

I propose 2 things:

  • strictly process A column by column in order to minimize the cache misses if A is a large matrix
  • minimize the number of index calculations

I've tested this piece of code (your iter routine) and the results look OK, but I leave the benchmarking to you :)

    idx0 = 1;
    do j=1,n
        i1 = max(j - ccc + 1,1)
        matrix_A(i1:j,j) = matrix_A(i1:j,j) &
                         + array_B(idx0-(j-i1)*(ccc-1):idx0:ccc-1)
        i2 = min(j + ccc -1, n)
        matrix_A(j+1:i2,j) = matrix_A(j+1:i2,j) &
                           + array_B(idx0+1:idx0+(i2-j))
        idx0 = idx0 + ccc
    end do