Who actually does the out-of-order reordering of memory accesses in MPCore?


As per my current understanding of the ARM Cortex-A57 and Cortex-A78 TRMs, micro-ops can be issued out of order to one of several execution pipelines.

As far as I understand, this is instruction reordering for independent instructions.

Memory-access reordering means that observers and slaves in the system may observe memory accesses in a different sequence than program order. This could mean one of the following:

1 - The CPU reordered the memory-access micro-ops and issued them to the load and store pipelines; the interconnect (ACE/CHI) did not do any reordering.

2 - The CPU issued the micro-ops in program order, but the interconnect (ACE/CHI) reordered them.

Is my understanding correct? If yes, does a barrier instruction stall the CPU pipeline by stopping further instruction issue, or does the interconnect throttle the CPU master interface until a response to the barrier is received?

I asked on the Arm community forum, but there has been no response so far:

https://community.arm.com/support-forums/f/architectures-and-processors-forum/54529/who-actually-does-the-out-of-ordering-of-the-memory-accesses-in-mpcore

EDIT 1

As suggested by Peter, I want to state the following preconditions for my question:

1 - A multi-cluster ARM SoC along with other ACE masters such as DMA engines, an iGPU, etc.

2 - The question is about inner-shareable as well as outer-shareable memory (e.g. memory accessed by threads running in different CPU clusters).

3 - The question is about Cacheable Normal memory (which Peter has clarified to a great extent) and Non-cacheable Normal memory, since I want to understand how the observation of memory accesses by other observers relates to ordering in the CPU pipeline of an out-of-order architecture such as the ARM Cortex-A78.


Answer by Peter Cordes:

Memory reordering (of access to globally-visible cache state) happens inside the CPU core, not the interconnect. A barrier instruction doesn't send any messages to other cores.

(At least not dmb ish. I don't know about outer-shareable / non-cache coherent stuff, but those barriers might just order things wrt. cache-control instructions that you also need in those cases. The A32/T32 and A64 docs sound to me like even for stronger orders, it's still just about waiting for completion of things that were already going to happen because of other instructions, including loads or stores. There are probably more detailed docs somewhere, but maybe an ARM expert can shed some more light on this with another answer if this answer is missing anything important.)


Issuing a load micro-op to an execution unit attempts to read from cache right then. But issuing a store just copies the data+address to the store buffer. Memory reordering (of their accesses to coherent shared cache) happens inside each core, by various mechanisms including the store buffer and hit-under-miss non-blocking caches.

Out-of-order execution is one significant mechanism for LoadLoad reordering (if load addresses are ready in a different order), but all major kinds of memory reordering can happen on an in-order pipeline, due to cache miss loads and a store buffer. (And if the store buffer allows out of order commit of stores, which ARM normally would since its memory model doesn't guarantee StoreStore ordering.)

My understanding is that interconnects generally don't introduce reordering themselves. So memory barriers just have to make things inside this core wait until earlier loads have completed and/or the store buffer drains.

See also: