The following code multiplies a complex float matrix (stored as separate real and imaginary arrays) by a real float matrix.
I'm quite sure it can be optimized by reordering the code to account for the latency of loads, stores, and multiplies. Are there rules for how to optimize code to handle this latency?
/***************************************************************************************/
void CVector::MatrixMultiply(float* pReA, float* pImA,
                             float* pTranB,
                             float* pOutRe, float* pOutIm,
                             uint32_t RowsA, uint32_t ColsA,
                             uint32_t RowsB, uint32_t ColsB)
{
    float* pSrcReA;
    float* pSrcImA;
    float* pSrcB;
    float* pDstRe = pOutRe;
    float* pDstIm = pOutIm;
    float* pRowReA, * pRowImA;
    __m256 ReSum, ImSum, VecReA, VecImA;
    __m256 *pAvec, *pBvec;
    __m256 VecA, VecB;
    __m128 Low, High, Sum128;
    __m128 Zero128 = _mm_set_ps1(0);
    uint32_t Offset;
    for (int i = 0; i < RowsA; i++)
    {
        Offset = ColsA * i;
        pSrcReA = pReA + Offset;
        pSrcImA = pImA + Offset;
        for (int j = 0; j < ColsB; j++)
        {
            ReSum = _mm256_set1_ps(0);
            ImSum = ReSum;
            pRowReA = pSrcReA;
            pRowImA = pSrcImA;
            pSrcB = pTranB + RowsB * j;
            for (int k = 0; k < (ColsA >> 3); k++)
            {
                VecReA = _mm256_load_ps((float*)pRowReA);
                VecImA = _mm256_load_ps((float*)pRowImA);
                VecB = _mm256_load_ps((float*)pSrcB);
                ReSum = _mm256_fmadd_ps (VecReA, VecB, ReSum);
                ImSum = _mm256_fmadd_ps (VecImA, VecB, ImSum);
                pRowReA += 8;
                pRowImA += 8;
            }
            Low = _mm256_extractf128_ps(ReSum, 0);
            High = _mm256_extractf128_ps(ReSum, 1);
            Sum128 = _mm_add_ps(Low, High);
            Sum128 = _mm_hadd_ps(Sum128, Zero128);
            Sum128 = _mm_hadd_ps(Sum128, Zero128);
            *pDstRe = _mm_cvtss_f32(Sum128);
            Low = _mm256_extractf128_ps(ImSum, 0);
            High = _mm256_extractf128_ps(ImSum, 1);
            Sum128 = _mm_add_ps(Low, High);
            Sum128 = _mm_hadd_ps(Sum128, Zero128);
            Sum128 = _mm_hadd_ps(Sum128, Zero128);
            *pDstIm = _mm_cvtss_f32(Sum128);
            pDstRe++;
            pDstIm++;
        }
    }
}
The biggest performance issue with your code is that, on most CPUs, fmadd has a latency of 4-5 cycles but a reciprocal throughput of 0.5 cycles, i.e., two independent FMAs can be started every cycle (source: uops.info). Your inner loop has only two dependency chains (ReSum and ImSum), so each FMA has to wait for the previous FMA on the same accumulator to finish.
To get the full throughput, you need latency x FMAs per cycle = 4 x 2 = 8 (or, on some CPUs, 5 x 2 = 10) independent FMA operations in flight inside the inner loop. E.g., have 8 independent accumulators ReSum0..3, ImSum0..3 and accumulate into them the 8 products {VecReA, VecImA} * VecB0..3.
I'm not writing this out, since I don't fully understand your code. E.g., why are you not incrementing pSrcB in the k-loop? And are you sure that ColsA == RowsB and that both are a multiple of 8?