Is there any performance cost of accessing data by a calculated address like vmovupd ymm13, YMMWORD PTR [rbp+r14*8+78D0h] versus using an address stored in a register like
vmovapd ymm13, YMMWORD PTR [rdi]
or
vmovupd ymm0, ymmword ptr [r9]
vs.
vmovupd ymm0, ymmword ptr [r9+60h]
More precisely: Does the arithmetic in [rbp+r14*8+78D0h] or [r9+60h] cost something and if so, what is the background?
Imagine a loop whose counter serves as the base offset for accessing various blocks of memory on each iteration, like this example in C:
for (uint64_t i = 0; i < n; i++)
{
    doSomethingWith (&data0[i], &otherData[i]);
    doSomethingDifferentWith (&data1[i+4], &otherData1[i+8]);
    doSomethingElseWith (&data2[i+8], &otherData2[i+4]);
}
This example produces exactly the kind of offset addressing shown above.
I wonder if it might be beneficial to iterate using stored addresses instead, which comes with the cost of the extra instructions (lea, add, etc.) produced by pointer updates like pData0++; pOtherData += 4; pData2 += 8; ...
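For illustration, a minimal sketch of that pointer-based variant (assuming the same hypothetical doSomething* routines and arrays as above, and assuming double elements); the constant offsets are folded into the pointer initialization once, so each pointer only has to be advanced per iteration:

#include <stdint.h>

/* hypothetical helpers and arrays from the example above, defined elsewhere */
void doSomethingWith(double *a, double *b);
void doSomethingDifferentWith(double *a, double *b);
void doSomethingElseWith(double *a, double *b);
extern double data0[], data1[], data2[];
extern double otherData[], otherData1[], otherData2[];

void loopWithPointers(uint64_t n)
{
    /* fold the constant offsets into the start addresses once */
    double *pData0 = data0,     *pOtherData  = otherData;
    double *pData1 = data1 + 4, *pOtherData1 = otherData1 + 8;
    double *pData2 = data2 + 8, *pOtherData2 = otherData2 + 4;

    for (uint64_t i = 0; i < n; i++)
    {
        doSomethingWith          (pData0++, pOtherData++);
        doSomethingDifferentWith (pData1++, pOtherData1++);
        doSomethingElseWith      (pData2++, pOtherData2++);
        /* each access can now use a plain [reg] operand, at the cost of
           one pointer increment (add/lea) per array per iteration */
    }
}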
This is not about how to visualize effects using profilers. My aim is to understand the theory and mechanisms under the hood.
The specifics depend on the microarchitecture of the processor you are programming for. Generally speaking, there is a penalty for using a SIB operand when all fields of the operand are filled in, i.e. when there is a base, an index, and a displacement. The penalty is typically one extra clock cycle of latency for computing the address.
Refer to Agner Fog's microarchitecture guide for a more detailed explanation.
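As a concrete illustration (a hypothetical loop body reusing the registers from your question, not actual compiler output), the two forms look like this:

; indexed form: the AGU has to compute base + index*8 + displacement every iteration
vmovupd ymm13, YMMWORD PTR [rbp+r14*8+78D0h]

; pointer form: the address is already complete in rdi,
; at the cost of one extra add (or lea) per iteration to advance it
vmovupd ymm13, YMMWORD PTR [rdi]
add     rdi, 32        ; step to the next 32-byte block

Whether folding the address arithmetic into a pointer increment actually pays off then depends on whether that extra cycle of address-computation latency sits on the loop's critical path.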