Performance cost of accessing memory using calculated addresses (base + offset) vs. a plain register


Is there any performance cost to accessing data through a calculated address like vmovupd ymm13, YMMWORD PTR [rbp+r14*8+78D0h] versus using an address stored in a register like

vmovapd ymm13, YMMWORD PTR [rdi]

or vmovupd ymm0, ymmword ptr [r9] versus vmovupd ymm0, ymmword ptr [r9+60h]

More precisely: does the address arithmetic in [rbp+r14*8+78D0h] or [r9+60h] cost anything, and if so, what is the mechanism behind it?
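For concreteness, here is a small illustration of how such a fully populated operand can arise from plain C (my own sketch, compiled with AVX enabled; the exact instruction is compiler-dependent, so the comment shows a plausible outcome rather than guaranteed output):

#include <stdint.h>
#include <immintrin.h>

/* With 'base' in one register and 'i' in another, compilers typically fold
 * the scaled index and the constant byte offset straight into the memory
 * operand, producing a fully populated form such as
 *     vmovupd ymm0, YMMWORD PTR [rdi+rsi*8+3E8h]
 * instead of computing the address with separate instructions. */
__m256d load_block(const double *base, uint64_t i)
{
    return _mm256_loadu_pd(&base[i + 125]);   /* 125 * 8 bytes = 3E8h displacement */
}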

Imagine a loop with a counter that serves as the base offset per iteration for accessing various blocks of memory, like this example in C:

for (uint64_t i = 0; i < n; i++)
{
    doSomethingWith (&data0[i],&otherData[i]);
    doSomethingDifferentWith (&data1[i+4],&otherData1[i+8]);
    doSomethingElseWith (&data2[i+8],&otherData2[i+4]);
}

This example produces exactly that kind of offset addressing. I wonder whether it might be beneficial to iterate using stored addresses instead (as sketched below), which comes at the cost of the extra instructions generated for pData0++; pOtherData += 4; pData2 += 8; ... such as lea, add, etc.
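To make that concrete, here is one way the pointer-based variant could look, reusing the hypothetical functions and arrays from the loop above and assuming double elements. Since the offsets in the loop are fixed relative to i, they are applied once up front and each pointer then advances by one element per iteration:

double *pData0 = &data0[0], *pOtherData  = &otherData[0];
double *pData1 = &data1[4], *pOtherData1 = &otherData1[8];
double *pData2 = &data2[8], *pOtherData2 = &otherData2[4];

for (uint64_t i = 0; i < n; i++)
{
    /* the callees now receive plain pointers, so their loads and stores can
       use simple [reg] operands, at the price of the extra pointer updates */
    doSomethingWith          (pData0++, pOtherData++);
    doSomethingDifferentWith (pData1++, pOtherData1++);
    doSomethingElseWith      (pData2++, pOtherData2++);
}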

This is not a question about how to observe the effect with a profiler; my aim is to understand the theory and the mechanisms under the hood.


Answer by fuz:

The specifics depend on the microarchitecture of the processor you are programming for. Generally speaking, there is a penalty for using a SIB operand when all fields of the operand are populated, i.e. when there is a base, an index, and a displacement. The penalty is typically one extra cycle of latency for the address computation, and on some microarchitectures such an operand also costs an extra µop, because the load can no longer stay micro-fused with the operation it belongs to.

Refer to Agner Fog's microarchitecture guide for a more detailed explanation.
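As an illustration of that rule of thumb (a sketch, assuming the penalty only bites when all three fields are present): a constant displacement can be folded into the base pointer once, outside the loop, so the memory operand inside the loop only needs a base and a scaled index.

#include <stddef.h>

/* Sketch: 'buffer' plus a fixed byte offset (78D0h, as in the question) is
 * hoisted into a separate pointer before the loop, so each load inside the
 * loop needs only [base + index*8] rather than [base + index*8 + disp]. */
double sum_block(const double *buffer, size_t n)
{
    const double *block = buffer + 0x78D0 / sizeof(double);  /* fold the displacement once */
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += block[i];
    return s;
}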