What is causing the store latency in this program?


Given the following C program (MSVC does not optimize away the "work" for me; with other compilers you may need to add an asm statement):

#include <inttypes.h>
#include <stdlib.h>

#define SIZE 10000

typedef struct {
    int32_t a, b, c;
} Struct;

void do_work(Struct* data) {
    int32_t* a = malloc(sizeof(int32_t) * SIZE),
        * b = malloc(sizeof(int32_t) * SIZE),
        * c = malloc(sizeof(int32_t) * SIZE);

    int32_t* a_ptr = a, * b_ptr = b, * c_ptr = c;
    for (size_t i = 0; i < SIZE; i++, a_ptr++, b_ptr++, c_ptr++, data++) {
        *a_ptr = data->a;
        *b_ptr = data->b;
        *c_ptr = data->c;
    }

    free(a);
    free(b);
    free(c);
}

int main() {
    Struct* data = malloc(sizeof(Struct) * SIZE);
    for (size_t i = 0; i < SIZE; i++) {
        data[i].a = i;
        data[i].b = i;
        data[i].c = i;
    }

    for (int i = 0; i < 500000; i++) {
        do_work(data);
    }

    free(data);
}
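With compilers other than MSVC, a minimal GNU-style barrier like the sketch below should keep the copy loop from being optimized away (the helper name is my own):

/* Sketch for GCC/Clang only: an empty asm statement that tells the compiler
   the memory reachable through p is used, so the stores cannot be elided. */
static inline void keep_alive(void* p) {
    __asm__ __volatile__("" : : "r"(p) : "memory");
}

/* e.g. call keep_alive(a); keep_alive(b); keep_alive(c); just before the free()s */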

Edit: Disassembly of do_work():

do_work PROC                        ; COMDAT
$LN12:
    mov QWORD PTR [rsp+8], rbx
    mov QWORD PTR [rsp+16], rbp
    mov QWORD PTR [rsp+24], rsi
    push    rdi
    sub rsp, 32                 ; 00000020H
    mov rbx, rcx
    mov ecx, 40000              ; 00009c40H
    call    QWORD PTR __imp_malloc
    mov ecx, 40000              ; 00009c40H
    mov rsi, rax
    call    QWORD PTR __imp_malloc
    mov ecx, 40000              ; 00009c40H
    mov rbp, rax
    call    QWORD PTR __imp_malloc
    mov r10, rsi
    lea rcx, QWORD PTR [rbx+8]
    sub r10, rax
    mov r11, rbp
    sub r11, rax
    mov rdi, rax
    mov r8, rax
    mov r9d, 10000              ; 00002710H
    npad    6
$LL4@do_work:
    mov edx, DWORD PTR [rcx-8]
    lea rcx, QWORD PTR [rcx+12]
    mov DWORD PTR [r10+r8], edx
    lea r8, QWORD PTR [r8+4]
    mov eax, DWORD PTR [rcx-16]
    mov DWORD PTR [r11+r8-4], eax
    mov eax, DWORD PTR [rcx-12]
    mov DWORD PTR [r8-4], eax
    sub r9, 1
    jne SHORT $LL4@do_work
    mov rcx, rsi
    call    QWORD PTR __imp_free
    mov rcx, rbp
    call    QWORD PTR __imp_free
    mov rcx, rdi
    mov rbx, QWORD PTR [rsp+48]
    mov rbp, QWORD PTR [rsp+56]
    mov rsi, QWORD PTR [rsp+64]
    add rsp, 32                 ; 00000020H
    pop rdi
    rex_jmp QWORD PTR __imp_free
do_work ENDP

Here's a godbolt: https://godbolt.org/z/nWTPfKPd6.

(I have a similar program in Rust with the same conclusions).

Intel VTune reports that this program is 63.1% memory bound and 52.4% store bound, with a store latency of 26%. It recommends searching for false sharing, but I fail to see how there could be false sharing here: there is no concurrency, all the data is owned by one core, and the access patterns should be easy to predict and prefetch. I don't see why the CPU needs to stall on the stores.

I thought that maybe the low and high bits of the three allocations' addresses match and cause them to map to the same cache lines, but I remember reading that modern CPUs don't simply drop some address bits to pick a cache line; they do more complex calculations.
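For reference, this is the kind of aliasing I had in mind, under the simple-indexing assumption (a 48 KB, 12-way L1D with 64-byte lines gives 64 sets, which I believe matches recent Intel P-cores); the helper below is my own:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Set index under plain bit-slicing: bits 6..11 of the address
   (64-byte lines, 64 sets). Addresses that differ by a multiple of
   4 KB land in the same set. */
static unsigned l1_set(const void* p) {
    return (unsigned)(((uintptr_t)p >> 6) & 63u);
}

int main(void) {
    void* a = malloc(40000);
    void* b = malloc(40000);
    void* c = malloc(40000);
    printf("%u %u %u\n", l1_set(a), l1_set(b), l1_set(c));
    free(a); free(b); free(c);
}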

Another thought was that maybe after the allocations are freed the CPU is still busy flushing the stores, and on the next run the allocator hands back the same (or a nearby) address, so the CPU has to wait before it can store new data. So I tried not freeing the allocations, but that made the code much slower.
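A cleaner version of that experiment, which I have not measured, would be to allocate the buffers once and reuse them across calls. A sketch, using the Struct and SIZE from above:

/* Sketch only: hoist the allocations out of do_work() and reuse them. */
void do_work_reuse(const Struct* data, int32_t* a, int32_t* b, int32_t* c) {
    for (size_t i = 0; i < SIZE; i++) {
        a[i] = data[i].a;
        b[i] = data[i].b;
        c[i] = data[i].c;
    }
}

/* Caller allocates once:
     int32_t* a = malloc(sizeof(int32_t) * SIZE);  // same for b and c
     for (int i = 0; i < 500000; i++) do_work_reuse(data, a, b, c);
     free(a); free(b); free(c);
*/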

I'm on Windows 11, laptop Intel Core i9-13900HX, 32 logical cores, 8 Performance Cores and 16 Efficient Cores.


2 Answers

AudioBubble:

Most likely it's the sizeof(int32_t) * SIZE in the malloc calls. If you use a bit-shift like SIZE << 2 instead, your code should be much faster and more efficient.

OK, in all seriousness: a_ptr, b_ptr, and c_ptr point into three separate blocks of memory. In your loop you write to one, then the next, then the next, and repeat. That is bad for the cache, because you are jumping around in memory instead of writing to it sequentially.

Even if a human could predict that pattern, I am skeptical that a CPU could.

selbie:

Looks like there's already an accepted answer, but @user23919330 was onto something. The most obvious idea would be to convert this:

for (size_t i = 0; i < SIZE; i++, a_ptr++, b_ptr++, c_ptr++, data++) {
    *a_ptr = data->a;
    *b_ptr = data->b;
    *c_ptr = data->c;
}

To this:

for (size_t i = 0; i < SIZE; i++) {
    a_ptr[i] = data[i].a;
}

for (size_t i = 0; i < SIZE; i++) {
    b_ptr[i] = data[i].b;
}

for (size_t i = 0; i < SIZE; i++) {
    c_ptr[i] = data[i].c;
}

That is, let each loop read from and write to sequential addresses, and take advantage of caching. Trying this out now...

Update: I modified the OP's code so that it runs the test multiple times and prints the timings (in milliseconds):

#include <windows.h> // GetTickCount
#include <stdio.h>   // printf

void testrun(void) {
    DWORD start, end;
    Struct* data = malloc(sizeof(Struct) * SIZE);
    for (size_t i = 0; i < SIZE; i++) {
        data[i].a = i;
        data[i].b = i;
        data[i].c = i;
    }

    start = GetTickCount();
    for (int i = 0; i < 500000; i++) {
        do_work(data);
    }
    end = GetTickCount();

    printf("%u\n", end - start); // print the number of milliseconds

    free(data);
}

int main() {
    for (int j = 0; j < 10; j++) {
        testrun();
    }
}
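(GetTickCount typically ticks only every 10-16 ms; if finer resolution is wanted, a QueryPerformanceCounter-based helper like this sketch could stand in for it. The helper name is mine.)

#include <windows.h>

/* Milliseconds from the high-resolution performance counter. */
static double now_ms(void) {
    LARGE_INTEGER freq, counter;
    QueryPerformanceFrequency(&freq);   /* ticks per second */
    QueryPerformanceCounter(&counter);  /* current tick count */
    return 1000.0 * (double)counter.QuadPart / (double)freq.QuadPart;
}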

Compiled for release and got these run times:

5344 5953 5438 5718 5282 4531 4578 4609 4532 4500

Or just about 5 seconds on average per test run. Now let's see what happens when I change do_work to this:

void do_work(Struct* data) {
    int32_t* a = malloc(sizeof(int32_t) * SIZE),
        *b = malloc(sizeof(int32_t) * SIZE),
        *c = malloc(sizeof(int32_t) * SIZE);

    int32_t* a_ptr = a, *b_ptr = b, *c_ptr = c;

    for (size_t i = 0; i < SIZE; i++) {
        a_ptr[i] = data[i].a;
    }

    for (size_t i = 0; i < SIZE; i++) {
        b_ptr[i] = data[i].b;
    }

    for (size_t i = 0; i < SIZE; i++) {
        c_ptr[i] = data[i].c;
    }

    free(a);
    free(b);
    free(c);
}

Recompiled and ran and got these times:

2406 2344 2328 2266 2344 2359 2266 2328 2359 2266

Or about 2.3 seconds on average. I'd say there's a case to be made for taking advantage of sequential addresses.