I am optimizing a 1024x1024 single-precision matrix multiplication in C, aiming to match NumPy, which hits approximately 90 GFlop/s on a single thread on my machine. My current C implementation reaches about 55 GFlop/s. I'm using ARM NEON intrinsics for vectorization and cache blocking (tiling) for better locality.
Here's my C code:
#include <arm_neon.h>

#define N 1024
#define BLOCK_X 16
#define BLOCK_K 8
#define BLOCK_Y 8

// Assumes C is zero-initialized by the caller, since partial sums are
// accumulated into it across the bk blocks.
void matmul(float* const A, float* const B, float* const C) {
    for (int by = 0; by < N; by += BLOCK_Y) {
        for (int bx = 0; bx < N; bx += BLOCK_X) {
            for (int bk = 0; bk < N; bk += BLOCK_K) {
                for (int y = by; y < by + BLOCK_Y; y++) {
                    for (int x = bx; x < bx + BLOCK_X; x += 4) {
                        // Reload the running 4-wide partial sum for C[y][x..x+3].
                        float32x4_t sum_vec = vld1q_f32(&C[y * N + x]);
                        __builtin_prefetch(&C[y * N + x + 4], 1, 0);
                        for (int k = bk; k < bk + BLOCK_K; k++) {
                            // Broadcast A[y][k], then multiply-accumulate a
                            // 4-element slice of row k of B.
                            float32x4_t a_vec = vdupq_n_f32(A[y * N + k]);
                            float32x4_t b_vec = vld1q_f32(&B[k * N + x]);
                            sum_vec = vmlaq_f32(sum_vec, a_vec, b_vec);
                        }
                        vst1q_f32(&C[y * N + x], sum_vec);
                    }
                }
            }
        }
    }
}
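For context, the GFLOP/s figures above come from a harness along these lines (a minimal sketch, assuming clock_gettime is available and counting the standard 2*N^3 flops for an NxN matmul; the fill values are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    float *A = malloc(N * N * sizeof(float));
    float *B = malloc(N * N * sizeof(float));
    float *C = calloc(N * N, sizeof(float)); // matmul accumulates into C, so start it zeroed
    for (int i = 0; i < N * N; i++) { A[i] = 1.0f; B[i] = 0.5f; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    matmul(A, B, C);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.2f GFlop/s\n", 2.0 * N * N * N / secs / 1e9);

    free(A); free(B); free(C);
    return 0;
}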
I expected to get much closer to NumPy's throughput. What bottlenecks or issues in this implementation could account for the remaining gap?