How to Optimize 1024x1024 Matrix Multiplication in C to Match NumPy's Performance on Apple M1 Silicon


I am working on optimizing a 1024x1024 single-precision matrix multiplication in C to match the performance of NumPy, which runs at approximately 90 GFlop/s on a single thread. My current C implementation reaches about 55 GFlop/s. I'm using ARM NEON intrinsics for vectorization and loop blocking (tiling), with various block sizes, for better cache utilization.

Here's my C code:

#include <arm_neon.h>

#define N 1024
#define BLOCK_X 16
#define BLOCK_K 8
#define BLOCK_Y 8

/* C must be zero-initialized by the caller: each BLOCK_Y x BLOCK_X tile of C
   is accumulated across the bk loop. */
void matmul(const float* A, const float* B, float* C){
    for(int by = 0; by < N; by+=BLOCK_Y) {
        for(int bx = 0; bx < N; bx+=BLOCK_X) {
            for(int bk = 0; bk < N; bk+=BLOCK_K) {
                for (int y=by; y<by+BLOCK_Y; y++){
                    for (int x=bx; x<bx+BLOCK_X; x+=4){
                        // Running sum for a 1x4 strip of C
                        float32x4_t sum_vec = vld1q_f32(&C[y * N + x]);
                        __builtin_prefetch(&C[y * N + x + 4], 1, 0);
                        for (int k=bk; k<bk+BLOCK_K; k++){
                            // Broadcast A[y][k] and multiply-accumulate
                            // against 4 consecutive floats of row k of B
                            float32x4_t a_vec = vdupq_n_f32(A[y * N + k]);
                            float32x4_t b_vec = vld1q_f32(&B[k*N+x]);
                            sum_vec = vmlaq_f32(sum_vec, a_vec, b_vec);
                        }
                        vst1q_f32(&C[y * N + x], sum_vec);
                    }
                }
            }
        }
    }
}

I expected the performance to be closer to NumPy's, but it is not there yet. What bottlenecks or issues in my current implementation could be holding back its performance?
