x86 Intrinsic: optimize a Matrix multiply of complex floats

70 Views Asked by Zvi Vered At 23 December 2023 at 05:59

The following code multiplies a complex float matrix (stored as separate real and imaginary arrays) by a real float matrix.

I'm quite sure it can be optimized by reordering the code to account for the latency of loads, stores and multiplies. Are there rules for optimizing code to hide this latency?

/***************************************************************************************/
void CVector::MatrixMultiply(float* pReA, float* pImA,
                            float* pTranB,
                            float* pOutRe, float* pOutIm,
                            uint32_t RowsA, uint32_t ColsA,
                            uint32_t RowsB, uint32_t ColsB)
{
    float *pSrcReA;
    float* pSrcImA;
    float* pSrcB;
    float* pDstRe = pOutRe;
    float* pDstIm = pOutIm;
    float* pRowReA, * pRowImA;

    __m256 ReSum, ImSum, VecReA, VecImA;
    __m256 *pAvec, *pBvec;
    __m256 VecA, VecB;
    __m128 Low, High, Sum128;
    __m128 Zero128 = _mm_set_ps1(0);

    uint32_t Offset;

    for (int i = 0; i < RowsA; i++)
    {
        Offset = ColsA * i;
        pSrcReA = pReA + Offset;
        pSrcImA = pImA + Offset;
        for (int j = 0; j < ColsB; j++)
        {
            ReSum = _mm256_set1_ps(0);
            ImSum = ReSum;
            pRowReA = pSrcReA;
            pRowImA = pSrcImA;
            pSrcB = pTranB + RowsB * j;

            for (int k = 0; k < (ColsA >> 3); k++)
            {
                VecReA = _mm256_load_ps((float*)pRowReA);
                VecImA = _mm256_load_ps((float*)pRowImA);
                VecB = _mm256_load_ps((float*)pSrcB);

                ReSum = _mm256_fmadd_ps (VecReA, VecB, ReSum);
                ImSum = _mm256_fmadd_ps (VecImA, VecB, ImSum);

                pRowReA += 8;
                pRowImA += 8;
            }

            Low = _mm256_extractf128_ps(ReSum, 0);
            High = _mm256_extractf128_ps(ReSum, 1);
            Sum128 = _mm_add_ps(Low, High);
            Sum128 = _mm_hadd_ps(Sum128, Zero128);
            Sum128 = _mm_hadd_ps(Sum128, Zero128);
            *pDstRe = _mm_cvtss_f32(Sum128);

            Low = _mm256_extractf128_ps(ImSum, 0);
            High = _mm256_extractf128_ps(ImSum, 1);
            Sum128 = _mm_add_ps(Low, High);
            Sum128 = _mm_hadd_ps(Sum128, Zero128);
            Sum128 = _mm_hadd_ps(Sum128, Zero128);
            *pDstIm = _mm_cvtss_f32(Sum128);

            pDstRe++;
            pDstIm++;
        }
    }
}

There is 1 best solution below

chtz On 23 December 2023 at 18:34

The biggest performance issue with your code is that (on most CPUs) fmadd has a latency of 4-5 cycles, but a reciprocal throughput of 0.5 cycles (i.e., two independent FMAs can be started every cycle) -- source: uops.info. Your inner loop has only two dependency chains (ReSum and ImSum), and each FMA has to wait for the previous result of its own chain, so the loop completes at most 2 FMAs every 4-5 cycles.

To get the full throughput, you need to have 8 (or on some CPUs 10) independent FMA operations in flight inside the inner loop, i.e., latency divided by reciprocal throughput: 4 / 0.5 = 8, or 5 / 0.5 = 10. E.g., have 8 independent accumulators ReSum0..3, ImSum0..3 and accumulate into them the 8 products {VecReA, VecImA} * VecB0..3. I'm not writing this out, since I don't fully understand your code, e.g., why are you not incrementing pSrcB in the k-loop? And are you sure that ColsA == RowsB and that they are a multiple of 8?
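For illustration only, here is one way the 8-accumulator idea could look. This is a sketch under assumptions that the question does not confirm: ColsA == RowsB and is a multiple of 8, ColsB is a multiple of 4, all matrices are row-major, and pTranB holds B transposed (ColsB rows of RowsB floats). The names MatrixMultiplyUnrolled, HSum256 and TARGET_FMA are hypothetical, the B pointers are indexed by k (unlike the question's loop), and unaligned loads are used so no alignment is assumed. The target attribute is GCC/Clang-specific; other compilers would need their own AVX/FMA switch.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: GCC/Clang. Lets these functions use AVX+FMA without -mavx -mfma.
#if defined(__GNUC__)
#define TARGET_FMA __attribute__((target("avx,fma")))
#else
#define TARGET_FMA
#endif

// Horizontal sum of the 8 floats in v.
TARGET_FMA static inline float HSum256(__m256 v)
{
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(v), _mm256_extractf128_ps(v, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         // lanes 0,1 hold pair sums
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));  // lane 0 += lane 1
    return _mm_cvtss_f32(s);
}

// Hypothetical variant: handles 4 columns of B per pass, so the inner loop
// carries 8 independent FMA chains (4 real + 4 imaginary accumulators).
TARGET_FMA void MatrixMultiplyUnrolled(const float* pReA, const float* pImA,
                                       const float* pTranB,
                                       float* pOutRe, float* pOutIm,
                                       uint32_t RowsA, uint32_t ColsA,
                                       uint32_t RowsB, uint32_t ColsB)
{
    assert(ColsA == RowsB);   // assumed, see lead-in
    assert(ColsA % 8 == 0);   // no scalar tail handling in this sketch
    assert(ColsB % 4 == 0);   // columns of B are processed 4 at a time

    for (uint32_t i = 0; i < RowsA; i++)
    {
        const float* pRowReA = pReA + (std::size_t)ColsA * i;
        const float* pRowImA = pImA + (std::size_t)ColsA * i;
        for (uint32_t j = 0; j < ColsB; j += 4)
        {
            // Four consecutive rows of the transposed B (= columns j..j+3 of B).
            const float* pB0 = pTranB + (std::size_t)RowsB * j;
            const float* pB1 = pB0 + RowsB;
            const float* pB2 = pB1 + RowsB;
            const float* pB3 = pB2 + RowsB;

            __m256 ReSum0 = _mm256_setzero_ps(), ImSum0 = ReSum0;
            __m256 ReSum1 = ReSum0, ImSum1 = ReSum0;
            __m256 ReSum2 = ReSum0, ImSum2 = ReSum0;
            __m256 ReSum3 = ReSum0, ImSum3 = ReSum0;

            for (uint32_t k = 0; k < ColsA; k += 8)
            {
                // A is loaded once and reused for all four B columns.
                __m256 VecReA = _mm256_loadu_ps(pRowReA + k);
                __m256 VecImA = _mm256_loadu_ps(pRowImA + k);
                __m256 VecB0 = _mm256_loadu_ps(pB0 + k);
                __m256 VecB1 = _mm256_loadu_ps(pB1 + k);
                __m256 VecB2 = _mm256_loadu_ps(pB2 + k);
                __m256 VecB3 = _mm256_loadu_ps(pB3 + k);

                // 8 FMAs, none of which depends on another in this iteration.
                ReSum0 = _mm256_fmadd_ps(VecReA, VecB0, ReSum0);
                ImSum0 = _mm256_fmadd_ps(VecImA, VecB0, ImSum0);
                ReSum1 = _mm256_fmadd_ps(VecReA, VecB1, ReSum1);
                ImSum1 = _mm256_fmadd_ps(VecImA, VecB1, ImSum1);
                ReSum2 = _mm256_fmadd_ps(VecReA, VecB2, ReSum2);
                ImSum2 = _mm256_fmadd_ps(VecImA, VecB2, ImSum2);
                ReSum3 = _mm256_fmadd_ps(VecReA, VecB3, ReSum3);
                ImSum3 = _mm256_fmadd_ps(VecImA, VecB3, ImSum3);
            }

            float* pDstRe = pOutRe + (std::size_t)ColsB * i + j;
            float* pDstIm = pOutIm + (std::size_t)ColsB * i + j;
            pDstRe[0] = HSum256(ReSum0);  pDstIm[0] = HSum256(ImSum0);
            pDstRe[1] = HSum256(ReSum1);  pDstIm[1] = HSum256(ImSum1);
            pDstRe[2] = HSum256(ReSum2);  pDstIm[2] = HSum256(ImSum2);
            pDstRe[3] = HSum256(ReSum3);  pDstIm[3] = HSum256(ImSum3);
        }
    }
}
```

Blocking over j rather than unrolling k means each pair of A loads feeds four B columns, so the loop issues 6 loads per 8 FMAs instead of 3 loads per 2 FMAs. Whether this actually reaches the FMA throughput limit on a given CPU would still need to be measured.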
