Are there CPUs doing speculative execution that virtualize memory locations?


Consider the classical reuse of a register after an expensive computation, in pseudo assembly:

r2 = cos(r1)
*(r3) = r2
r2 = r5 + r6
*(r4) = r2

To be able to use the arithmetic units fully, the execution unit might do:

r2 = cos(r1)
*(r3) = r2

and in parallel:

r2bis = r5 + r6
*(r4) = r2bis

where r2bis is the virtualized (or renamed) r2 register.

Now imagine we are working on a register-poor CPU (or one with many registers that are all in use already) and put the data in a temporary stack location:

*(sp+C) = cos(r1)
*(r3) = *(sp+C)
*(sp+C) = r5 + r6
*(r4) = *(sp+C)

Are there cases where a memory location whose address is already known (since (sp+C) can be computed early) is virtualized by the execution unit, to allow the same two executions to proceed in parallel?

That case may seem silly, as the compiler could be tasked with finding another location in the not-so-constrained stack space (unlike the very constrained register space). But other cases may not be so silly: virtualized memory could allow speculative execution of a conditional branch that has to store short-term data in memory. This is especially important for languages where there is no easy way to keep object fields in registers, like Java in all but the simplest cases: you have to rule out "reference" (pointer) escape to avoid a dynamic allocation and turn the Java class instance into the equivalent of a C++ automatic instance (which can live on the stack or in registers). (And even C++ has difficulty avoiding a real this pointer in apparently simple uses of simple flat classes.)

Answered by Peter Cordes:

Yes, store-to-load forwarding via the store buffer allows multiple independent store/reload chains to be in flight at once on the same location. Like register renaming, this allows an independent store to execute without a WAW or WAR hazard on previous stores and loads to the same location.

CPUs can even speculate on whether a load depends on a previous store (when the addresses of earlier stores aren't ready yet), and if they predict that it doesn't, take the data from L1d cache instead of waiting. This is Memory Disambiguation. A rollback is necessary if they guess wrong. (On Intel, the perf event machine_clears.memory_ordering counts this as well as cases where stores from another core violate the core's assumption that loading early would be OK, per x86's memory-ordering rules.)


Most in-order CPUs and all out-of-order CPUs have a store buffer; it's necessary to decouple speculative execution from cache state that's visible to other cores: Can a speculatively executed CPU branch contain opcodes that access RAM? Also a cheap way to not stall on cache-miss stores.

Fast-path store-forwarding (when the reload fully overlaps the store) has full throughput on normal CPUs. Store-forwarding stalls (when this doesn't happen) on Intel CPUs can pipeline with fast-path store-forwarding but not with each other. Presumably it has to do a more detailed scan of the store buffer instead of just taking the first matching entry that's older than the load, since the load data might have to come from more than one recent store and maybe some bytes from L1d cache.


Does a series of x86 call/ret instructions form a dependent chain? has experimental results from perf showing this in action, with a couple of add qword [rsp], 0 instructions lengthening the dependency chains even more. Branch prediction + speculative execution mean each call starts storing a new return address while the previous add's load/store and the previous ret's load are still in flight.


Zero latency memory renaming

Some CPUs can actually rename memory with zero latency, by matching the addressing modes separately from the store buffer. https://www.agner.org/forum/viewtopic.php?t=41 - Zen 2 and Zen 4 have it. https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memory-subsystem-and-conclusion/ says Zen 3 has it, too, and so can Alder Lake P-cores (Golden Cove).

So does Ice Lake, it seems: https://www.realworldtech.com/forum/?threadid=186393&curpostid=186393

Pass-by-reference for small integers across boundaries where something couldn't inline (e.g. a virtual function) can benefit from this. Or stack args (especially in 32-bit code).