Why do fast memory writes take much more time when run across multiple threads than on a single thread?


I have a program which allocates some memory (200 million integers), does some quick compute and then writes the data to the allocated memory.

When run on a single thread the process takes about 1 second. When run across 10 threads the process takes about 77 seconds. There is no concurrency between the threads. Each thread allocates its own data and writes to it.

The cache lines on my machine are 64 bytes. I write the data into a 16-integer array (allocated on the stack) and then flush this array to memory via memcpy.

My hypothesis is along these lines:

The aggregate processing across CPUs is faster than the speed of memory. Hence, memory speed is the bottleneck here.

I am deliberately writing to memory in 64-byte chunks (tmp_arr) so that cache lines are not shared across threads. But I suspect the cache lines may be thrashing, since multiple threads are trying to write to memory very quickly (the CPU compute is small). When I wrote a single integer at a time to memory, instead of an array of 16 integers via memcpy, the time was 250 seconds. Hence pushing 16 integers (64 bits) at a time to memory via memcpy improved speed about 3x. But the time of 77 seconds is nearly 77x that of single-threaded execution. Hence, some thrashing must be going on.

Another observation is that single-threaded execution shows high stalled-cycles-backend (30%), whereas when run on ten threads we see high stalled-cycles-frontend (39%). Can someone please explain why there are backend stalls with a single thread while we see frontend stalls with multiple threads?

I also tried to use _mm_store_si128 (vectorized instructions) but that didn't improve speed.

Here are screenshots of the times taken, along with hardware counters:

one-thread timings, ten-thread timings, machine diagram

I am running on Linux with an AMD Ryzen 9 3900XT 12-core processor and DDR4 memory operating at 3600 MT/s.

This code is written to understand how fast memory writes can be made in a multithreaded application that works on large data sets.

Here is the code:

//Had tried vectorized instructions but they showed no noticeable performance improvement
//__m128i tmp_arr ;
//__m128i b = _mm_set_epi32(1,2,3,4);
//_mm_store_si128( (__m128i*) (arr + (i*16)), tmp_arr);
void runLargeMemoryAlloOutside(benchmark::State & s) {
    long size = 200'000'000; //200 million integers
    //each array is about 0.8GB
    //tried an alignas(64) specifier here; with or without it there was
    //no noticeable effect on performance.
    alignas(64) int * arr = new int [size];
    //write into this temporary array on the stack, then flush it to arr via memcpy.
    int tmp_arr[16];
    for (auto _ : s) {
        for (long i = 0; i < size/16; i++) {
            for (int j = 0; j < 16; j++) {
                //very small compute.
                int tmp = rand();
                tmp = log(tmp);
                //arr[i] = tmp;
                tmp_arr[j] = tmp;
            }
            //i*16, not i*64: each iteration fills the next 16-int (64-byte) block;
            //an i*64 stride would write past the end of arr.
            memcpy (arr + (i*16), tmp_arr, sizeof(tmp_arr));
        }
    }
    delete [] arr;
}

//running on 10 threads via Google benchmark
BENCHMARK(runLargeMemoryAlloOutside)->Threads(10);

The code is compiled with GCC (C++20) using the flags "-Ofast -O5".

I don't fully understand why the runtime on 10 threads grows to 77x that of single-threaded execution, and I hope someone can shed some light on the increase. What else can be done to speed up memory writes?

I was expecting multithreaded execution not to be this slow.


2 Answers

Answer by Peter Cordes

int tmp = rand(); - There's your problem. GNU/Linux (glibc) rand() shares one seed state program-wide, protected by a lock. So all your threads contend for that lock, making overall throughput much worse than with a single thread.

Uncontended locks are not free, but they are relatively cheap when the cache line holding the lock stays in Modified state on the core doing the locking and unlocking; contention for a lock is much slower.

Storing the result to memory vs. adding it to a per-thread temporary would run about the same speed because your loop body is way slower than memory bandwidth, especially with the contention of all threads competing for a lock.
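
(An aside, not from the original answer: if you want to keep the libc generator, POSIX rand_r stores its state in a caller-supplied variable, so giving each thread its own seed removes the shared lock. A minimal sketch, assuming a POSIX system; the thread_local seed and the helper name threadLocalRand are illustrative, not from the question:)

#include <cstdlib>  // rand_r is POSIX, exposed by glibc

// Each thread gets its own seed, so rand_r never touches shared state.
thread_local unsigned int rng_seed = 1234;  // arbitrary per-thread seed

inline int threadLocalRand() {
    return rand_r(&rng_seed);  // no program-wide lock, unlike rand()
}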

Floating-point log is also slow-ish, but maybe that's the point: you want per-element work slow enough that one thread can't come close to saturating memory bandwidth, while all cores together might.

If you want random numbers, a simple cheap PRNG like xorshift* can be good. (https://en.wikipedia.org/wiki/Xorshift#xorshift*). Or some of the later variations. That should be enough to stop the compiler from auto-vectorizing the random-number generation, if you're trying to make your code intentionally slow instead of e.g. just filling with arr[i] = i;. If you want fast random numbers, xorshift+ is easy to vectorize.
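
(For reference, a minimal sketch of the xorshift* variant from the linked Wikipedia page; the shift amounts and multiplier constant are taken from that page, and the seed value is arbitrary but must be nonzero:)

#include <cstdint>

thread_local uint64_t xs_state = 0x9E3779B97F4A7C15ULL;  // any nonzero seed

inline uint64_t xorshiftStar() {
    // xorshift64 state update, then a multiplicative scramble of the output
    uint64_t x = xs_state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    xs_state = x;
    return x * 0x2545F4914F6CDD1DULL;
}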


     alignas(64) int * arr = new int [size];

I think that just aligns the storage for the pointer object itself, not the allocated space. (alignas() does count in bytes, not bits, so that is correct for the size of a cache line. You wrote "bits" in the text of your question, but 16x 32-bit ints are 64 bytes.)
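
(If the intent was to align the buffer itself to a cache line, one option, as a sketch rather than anything from the question, is C++17 std::aligned_alloc. Its size argument must be a multiple of the alignment, and 800'000'000 bytes is a multiple of 64:)

#include <cstdlib>  // std::aligned_alloc, std::free (C++17)

int main() {
    const long size = 200'000'000;
    // Aligns the allocation itself (not just the pointer) to 64 bytes.
    int * arr = static_cast<int*>(std::aligned_alloc(64, size * sizeof(int)));
    // ... fill arr as in the benchmark ...
    std::free(arr);  // aligned_alloc memory is released with free, not delete[]
}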

Check the compiler-generated asm to see what the compiler did, whether it's actually storing to a local buffer and then copying to the big array, or whether it optimizes away the memcpy. (If it copies, it probably inlines the memcpy since the size is small and fixed. Like four xmm copies since you didn't use -march=native or -march=x86-64-v3 to let it use AVX 32-byte vectors.)

Answer by zcoderz

Expanding on Peter's excellent response.

Thank you for pointing out the rand() lock. Moving to xorshift mostly solved the issue.

Time across 30 threads with the code below is between 3 and 4 seconds, while a single thread takes around 0.8 seconds.

Regarding log: I am deliberately simulating some compute on the CPU so as to mimic a realistic workload.

I had a typo in my text where I mentioned 64 bits instead of 64 bytes.

Yes, alignas on int *arr appears to just align the pointer itself. Adding alignas here doesn't have any noticeable effect on speed.

Here are the assembly-level instructions. I have not looked at assembly code in a very long time; it appears the values are stored into the stack buffer with movl and the flush to the big array happens via the call to memcpy@PLT.

# main.cpp:79:                 tmp = log(tmp);
    cvttsd2sil  %xmm0, %eax # _1, tmp90
    movl    %eax, -108(%rbp)    # tmp90, tmp
# main.cpp:82:                 tmp_arr[j] = tmp;
    movl    -112(%rbp), %eax    # j, tmp92
    cltq
    movl    -108(%rbp), %edx    # tmp, tmp93
    movl    %edx, -80(%rbp,%rax,4)  # tmp93, tmp_arr[j_6]
# main.cpp:76:             for (int j =0; j < 16; j++) {
    addl    $1, -112(%rbp)  #, j
.L5:
# main.cpp:76:             for (int j =0; j < 16; j++) {
    cmpl    $15, -112(%rbp) #, j
    jle .L6 #,
# main.cpp:84:             memcpy (arr + (i*64), tmp_arr, sizeof(tmp_arr));
    movq    -104(%rbp), %rax    # i, i.0_2
    salq    $8, %rax    #, i.0_2
    movq    %rax, %rdx  # i.0_2, _3
    movq    -88(%rbp), %rax # arr, tmp94
    leaq    (%rdx,%rax), %rcx   #, _4
# main.cpp:84:             memcpy (arr + (i*64), tmp_arr, sizeof(tmp_arr));
    leaq    -80(%rbp), %rax #, tmp95
    movl    $64, %edx   #,
    movq    %rax, %rsi  # tmp95,
    movq    %rcx, %rdi  # _4,
    call    memcpy@PLT  #

I tried running with each of -march=native and -march=x86-64-v3, but I don't see any change in execution times.

Here is the updated code:

#include <climits>  // for INT_MAX

// Note: every thread starts from the same seed, so all threads generate
// the identical sequence.
thread_local unsigned int random_state = 1234;

inline int randomNumber(void)
{
    // xorshift32: three shift-xor steps on the per-thread state
    unsigned int r = random_state;
    r ^= r << 13;
    r ^= r >> 17;
    r ^= r << 5;
    random_state = r;
    return (int)(r & INT_MAX);
}



void runLargeMemoryAlloNonVectorized(benchmark::State & s) {
    const long size = 200'000'000; //200 million integers
    //each array is about 0.8GB
    //tried an alignas(64) specifier here; with or without it there was
    //no noticeable effect on performance.
    //alignas(64)
    int * arr = new int [size];
    //alignas(64) std::array<int, size> arr{};
    //write into this temporary array on the stack, then flush it to arr via memcpy.
    int tmp_arr[16];
    for (auto _ : s) {
        for (long i = 0; i < size/16; i++) {
            for (int j = 0; j < 16; j++) {
                //very small compute.
                int tmp = randomNumber();
                tmp = log(tmp);
                //arr[i] = tmp;
                //int tmp = j+i/2;
                tmp_arr[j] = tmp;
            }
            //i*16, not i*64: fill the next 16-int (64-byte) block of arr.
            memcpy (arr + (i*16), tmp_arr, sizeof(tmp_arr));
        }
    }
    delete [] arr;
}

//running on 30 threads via Google Benchmark
BENCHMARK(runLargeMemoryAlloNonVectorized)->Threads(30);

Trying to understand what causes the 3x+ increase in time with 30 threads. There should not be any thrashing of cache lines here, given there is no shared data between threads, right? Is the increase in multithreaded time because the CPUs are collectively faster than the memory bandwidth? Other thoughts?
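
(One more thing worth trying for large streaming writes, my suggestion rather than something from the answers above: non-temporal stores such as _mm_stream_si128, which bypass the cache and avoid the read-for-ownership of each destination line, so threads are not pulling lines into cache just to overwrite them. A sketch of the flush step, assuming arr is at least 16-byte aligned, which glibc's new provides; the helper name flushBlockNT is hypothetical:)

#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence

// Hypothetical replacement for the memcpy flush in the benchmark loop:
// copies the 16-int stack buffer into the big array with non-temporal stores.
inline void flushBlockNT(int * arr, const int * tmp_arr, long i) {
    for (int k = 0; k < 16; k += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(tmp_arr + k));
        _mm_stream_si128(reinterpret_cast<__m128i*>(arr + i*16 + k), v);
    }
}
// Call _mm_sfence() once after the outer loop so the streaming stores are
// ordered before anything that later reads arr.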