Why doesn't the compiler optimise RAM lookups?

https://godbolt.org/z/dK9v7En5v

For the following C++ code

#include <stdint.h>
#include <cstdlib>

void Send(uint32_t);

void SendBuffer(uint32_t* __restrict__ buff, size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        Send(buff[0]);
        Send(buff[1]);  
        for (size_t j = 0; j < i; ++j) {
            Send(buff[j]);   
        }
    }
}

we get the following assembly listing:

SendBuffer(unsigned int*, unsigned long):
        test    rsi, rsi
        je      .L15
        push    r13
        mov     r13, rsi
        push    r12
        mov     r12, rdi
        push    rbp
        xor     ebp, ebp
        push    rbx
        sub     rsp, 8
.L5:
        mov     edi, DWORD PTR [r12]
        call    Send(unsigned int)
        mov     edi, DWORD PTR [r12+4]
        call    Send(unsigned int)
        test    rbp, rbp
        je      .L3
        xor     ebx, ebx
.L4:
        mov     edi, DWORD PTR [r12+rbx*4]
        add     rbx, 1
        call    Send(unsigned int)
        cmp     rbx, rbp
        jne     .L4
.L3:
        add     rbp, 1
        cmp     r13, rbp
        jne     .L5
        add     rsp, 8
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        ret
.L15:
        ret

On each iteration of the outer loop the values of buff[0] and buff[1] are read from memory again (mov edi, DWORD PTR [r12] and [r12+4] at .L5), even though they could be loaded into registers once, before the loop.

It doesn't matter whether the inner loop is present or not; the compiler does not optimise this construction either way. I added the inner loop to demonstrate that the compiler cannot simply rely on the processor cache.

Is it valid, according to the C++ standard, for the compiler to load from memory into a register once before the loop (with or without the __restrict__ keyword)? If it is valid, why doesn't the compiler do that optimisation? And how can I tell the compiler that nobody will change that memory, if that is what currently makes the optimisation invalid?
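
To make the intent concrete, the hand-hoisted version I'd expect is roughly this (a sketch; the early return for n == 0 is my addition, so the loads happen only when the loop body would actually run):

void SendBuffer(uint32_t* __restrict__ buff, size_t n)
{
    if (n == 0)
        return;                       // original loop body never runs in this case
    const uint32_t first  = buff[0];  // loaded from memory once...
    const uint32_t second = buff[1];  // ...and kept in registers
    for (size_t i = 0; i < n; ++i)
    {
        Send(first);
        Send(second);
        for (size_t j = 0; j < i; ++j)
            Send(buff[j]);
    }
}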

Answered by Thomas Matthews

You could help the compiler by rearranging your code so that you can see the impact of the RAM optimizations.

void SendBuffer(uint32_t* __restrict__ buff, size_t n)
{
    // Access RAM sequentially to take advantage of the data cache.
    const uint32_t a = buff[0];  // hoisted out of the loop (assumes n >= 2)
    const uint32_t b = buff[1];

    for (size_t i = 0; i < n; ++i)
    {
        Send(a);
        Send(b);

        // Start at the third buffer slot.
        for (size_t j = 2; j < n; ++j)
        {
            Send(buff[j]);   
        }
    }
}

In the above code, the bottleneck is the call to Send; accessing the buff array is much faster by comparison. Even the branch evaluations in the loops cost more than the array accesses do.

The true optimization here would be to modify Send so that it transfers blocks rather than individual words. Most device communication interfaces have a block-transfer capability; see the sketch below.
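
For example (SendBlock is a hypothetical name, not an existing function; the real signature depends on your device driver):

// Hypothetical block-transfer interface; the actual API depends on the device.
void SendBlock(const uint32_t* data, size_t count);

// The word-by-word inner loop
//     for (size_t j = 2; j < n; ++j) Send(buff[j]);
// then collapses into a single driver call (assuming n >= 2):
//     SendBlock(buff + 2, n - 2);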

Otherwise you can try unrolling the loop yourself. (The compiler may perform loop unrolling on its own at higher optimization levels, e.g. GCC with -funroll-loops.)

size_t j;
for (j = 2u; j + 4u <= n; j += 4u)  // full blocks of four
{
     // Optimization:  load consecutively from data cache to reduce
     // the quantity of cache reloads.  
     const uint32_t a = buff[j + 0u];
     const uint32_t b = buff[j + 1u];
     const uint32_t c = buff[j + 2u];
     const uint32_t d = buff[j + 3u];

     // Send a "block" of data:
     Send(a);
     Send(b);
     Send(c);
     Send(d);
}
// Send the remaining words:
for (; j < n; ++j)
{
    Send(buff[j]);   
}

Examining the generated assembly should show better organized, more optimized code.

Edit 1: Included outer loop, corrected index variable usage.