Hello! Godbolt link: https://godbolt.org/z/dEn9xEWvs

I'm starting to look at optimising some C++ code. I first tried to use some SIMD instructions, but it quickly turned out that my compiler was probably more capable than me at handling that stuff. So I tried at least making things easy for it: I am solving a sparse linear system. I used the Eigen library to test all my stuff, but am now trying (let's say it's an exercise) to re-implement some of it. I have some naïve sparse matrix and vector structures looking like this:
#include <vector>

struct triplet
{
    unsigned int i;  // row index
    unsigned int j;  // column index
    float val;       // coefficient value
};

typedef std::vector<triplet> matrix;  // COO-style coefficient list
typedef std::vector<float> vector;
I also have this naïve sparse matrix/vector multiplication:

void naiveMatMul(const matrix &A, const vector &B, vector &result)
{
    // Accumulate every coefficient's contribution; no assumption on ordering.
    for (const auto &coef : A)
    {
        result[coef.i] += coef.val * B[coef.j];
    }
}
This code runs fine: for matrices of around 100k coefficients, running around 3000 matrix/vector products takes around 200 ms on my machine.
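For reference, the measurement is done roughly like this; this is a sketch, bench is just an illustrative name, and A, B and result are assumed to already hold my real data:

#include <chrono>
#include <iostream>

void bench(const matrix &A, const vector &B, vector &result)
{
    const auto t0 = std::chrono::steady_clock::now();
    for (int iter = 0; iter < 3000; ++iter)
        naiveMatMul(A, B, result);
    const auto t1 = std::chrono::steady_clock::now();
    // ~200 ms on my machine for ~100k coefficients
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
}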
I actually know that this sparse system is composed of a set number of rows with 8 coefficients each, followed by another number of rows with 2 coefficients each. So I tried implementing this to "help" my compiler:
struct packedMatrix
{
    struct packedCoef { unsigned int col; float val; };
    std::vector<packedCoef> coefs;  // sorted and grouped by row
    unsigned int row_8;             // rows [0, row_8) have 8 coefficients
    unsigned int row_2;             // rows [row_8, row_2) have 2 coefficients
};
void packedMatMul(const packedMatrix &A, const vector &B, vector &result)
{
    unsigned int currentLine = 0;
    unsigned int currentCoefIdx = 0;
    // First the rows holding 8 coefficients each...
    while (currentLine < A.row_8)
    {
        for (int i = 0; i < 8; i++)
        {
            result[currentLine] += A.coefs[currentCoefIdx].val * B[A.coefs[currentCoefIdx].col];
            currentCoefIdx++;
        }
        currentLine++;
    }
    // ...then the rows holding 2 coefficients each.
    while (currentLine < A.row_2)
    {
        for (int i = 0; i < 2; i++)
        {
            result[currentLine] += A.coefs[currentCoefIdx].val * B[A.coefs[currentCoefIdx].col];
            currentCoefIdx++;
        }
        currentLine++;
    }
}
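For context, this is roughly how the packed structure relates to the triplet list; pack and its parameters are illustrative names, and it assumes the coefficients are already sorted by row with the 8-coefficient rows first:

// Illustrative construction (assumes A is sorted by row, 8-coef rows first).
// nRows8 / nRows2 are the counts of 8- and 2-coefficient rows.
packedMatrix pack(const matrix &A, unsigned int nRows8, unsigned int nRows2)
{
    packedMatrix P;
    P.coefs.reserve(A.size());
    for (const auto &t : A)
        P.coefs.push_back({t.j, t.val});
    P.row_8 = nRows8;           // first nRows8 rows have 8 coefficients
    P.row_2 = nRows8 + nRows2;  // one past the last 2-coefficient row
    return P;
}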
I didn't take the time to unroll the inner for loops manually, as I believe the compiler should do that easily.
Not only was that code not faster in any way, it actually ran around 50% slower (~300 ms for the same number of matrix/vector multiplications). I'm surprised, especially as the naïve implementation does not presuppose anything about the order of the coefficients, while they are actually sorted and grouped by row in the "packed" implementation. I was under the impression that this, especially the grouping of 8 values, could somehow help the automatic vectorisation of that code, even though it's more about horizontal summing than vertical.

Also, looking at the assembly (cf. the godbolt link), the naïve version seems a lot shorter, but I believe that is just because the packed one did unroll the "8" and "2" loops. Is there anything obvious I am missing? Thank you for your help :)
EDIT: after reading the comments of Peter Cordes and Homer512 (thanks a lot!), I made two additional changes (and both were required for any vectorisation to appear):

- used -ffast-math
- used a temporary accumulator inside each inner for loop, then assigned it to result[currentLine] after the loop (the accumulator removes the store/reload dependency on result[currentLine] across iterations, and -ffast-math lets the compiler reassociate the floating-point additions; see the sketch at the end of this post)

I did that in godbolt, and afterwards we do see assembly looking like this:
movss xmm0, dword ptr [r9 + 4*rdx] # xmm0 = mem[0],zero,zero,zero
lea edx, [rcx + 5]
movss xmm1, dword ptr [r9 + 4*rsi] # xmm1 = mem[0],zero,zero,zero
lea esi, [rcx + 6]
...
# then later
unpcklps xmm1, xmm0 # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
unpcklps xmm2, xmm3 # xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
movlhps xmm2, xmm1 # xmm2 = xmm2[0],xmm1[0]
unpcklps xmm5, xmm4 # xmm5 = xmm5[0],xmm4[0],xmm5[1],xmm4[1]
...
which looks like vectorisation, so at least this is having an effect! Unfortunately, I still don't see any improvement in the computation time of my program when making those changes. I'm using Visual Studio, for which I believe the equivalent of -ffast-math should be /fp:fast? But no luck so far. Updated godbolt: https://godbolt.org/z/PnYjKGrd9
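For completeness, here is roughly what the accumulator change looks like on the 8-coefficient loop (the 2-coefficient loop is modified the same way):

// Modified 8-wide loop (the 2-wide one is analogous). The local accumulator
// avoids storing to and reloading result[currentLine] on every iteration,
// and -ffast-math (or /fp:fast on MSVC) allows reassociating the additions.
while (currentLine < A.row_8)
{
    float acc = 0.0f;
    for (int i = 0; i < 8; i++)
    {
        acc += A.coefs[currentCoefIdx].val * B[A.coefs[currentCoefIdx].col];
        currentCoefIdx++;
    }
    result[currentLine] += acc;
    currentLine++;
}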