What's the impact of hyperthreading on low-latency development?


I've read a post regarding low-latency development, quoted below.

"We always avoid (relatively) slow language features like exceptions, memory allocation and virtual function calls on the critical path. I/O is done on separate threads and triggered through a message queue. Hyperthreading is disabled. We prepare as much data as possible in advance, laid out to minimise the number of cache lines we need to read on the critical path"

I wonder why hyperthreading is a disadvantage in this context.

I can't find any relevant argument online, and have always assumed that "modern CPUs are almost never slower with hyperthreading enabled because the OS is fairly competent in scheduling workloads to free cores."

Answered by Peter Cordes:

If an interrupt happens on the other logical core, the CPU has to transition out of one-thread-active mode. (There's a perf event cpu_clk_unhalted.one_thread_active that counts how much of the time the CPU is in single-thread mode.) This takes some clock cycles to at least partially drain the ROB (ReOrder Buffer) since modern mainstream designs statically partition it and some other resources.
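As a sketch of how you'd measure this: the event can be counted with perf stat. (The event name is Intel-specific; availability varies by microarchitecture and perf version, and system-wide counting needs root or a lowered kernel.perf_event_paranoid.)

```shell
# System-wide for 5 seconds: compare cycles where only this logical
# core was active against total cycles, to see how often the CPU got
# to stay in single-thread mode.
perf stat -a -e cycles,cpu_clk_unhalted.one_thread_active -- sleep 5
```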

The ROB is physically a circular buffer, and I wouldn't be surprised if the partitioning cared where the split was in terms of the array of physical entries, so it's not necessarily just a matter of letting the ROB drain until half capacity. (This is a guess on my part; perhaps there's enough indirection in the ROB indexing by issue and retirement that they can just wrap back to an arbitrary point.)

Some other resources are also statically partitioned, like the L1iTLB (at least in Sandybridge: https://www.realworldtech.com/sandy-bridge/3/). So the other logical core waking could evict some iTLB entries and cause extra delays, potentially while this thread was in the middle of something latency-critical.

The store buffer is also statically partitioned, so it would also have to get drained, potentially waiting for cache-miss stores to finish their RFOs (Read For Ownership) and commit to L1d cache. (In program order on x86, since it has a strongly ordered memory model.)

See https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client) and search for "fixed partition" or "static" to find resources that are divided in half between logical cores, as opposed to replicated (duplicated) or competitively shared (dynamically partitioned). Wikichip has articles about various microarchitectures. Agner Fog's microarchitecture guide (https://agner.org/optimize/) also mentions some things being statically partitioned vs. replicated or competitively shared.

https://chipsandcheese.com/2022/12/25/golden-coves-lopsided-vector-register-file/ mentions some details, too; that site has good architecture deep-dives for recent stuff, filling a void since David Kanter stopped doing them on realworldtech.


Plus, of course, the thread just runs slower while the other logical core is active (even for a short while): it gets half the front-end throughput, has to compete for some out-of-order exec resources, and gets only half the capacity of the statically partitioned ones. That alone, hitting in the middle of something you wanted to be low-latency, would be a problem.

Caches like L1d and L2 are competitively shared, so data this thread is using could be randomly evicted by the other logical core even just waking up to run a timer interrupt.

Or if the system was heavily loaded then the scheduler would start running tasks on both logical cores of the same physical core.

If you never want that to happen in the first place, just disable hyperthreading. Then the kernel only has to keep track of half the number of cores. It makes no sense to have all the per-CPU stuff replicated for a set of cores that you want to never use.
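On a reasonably recent kernel (4.19 or later, as an assumption about your setup), you don't even need a reboot for this; there's a sysfs knob:

```shell
# Turn SMT off at runtime (needs root): the sibling logical CPUs go
# offline and the scheduler stops considering them entirely.
echo off > /sys/devices/system/cpu/smt/control

# Read it back to confirm; prints one of: on / off / forceoff / notsupported.
# 'forceoff' additionally prevents re-enabling until the next boot.
cat /sys/devices/system/cpu/smt/control
```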

Plus, if this is Linux or similar, there's RCU run_on having to schedule a task on every core in turn to make sure there isn't a thread still waiting to access a copy of something you're about to free. IDK if isolcpus=4,5,6,7 could make that not happen for those SMT siblings of cores 0-3, but again, if you don't want the cores to be used, make them not exist in the first place.
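If you do go the isolation route instead, the boot-time parameters would look something like this (a sketch, assuming cores 4-7 are the SMT siblings you want left alone, and a kernel built with CONFIG_NO_HZ_FULL):

```shell
# Kernel command line fragment (e.g. appended to GRUB_CMDLINE_LINUX):
#   isolcpus  - keep the general scheduler off these CPUs
#   nohz_full - stop the periodic scheduler tick on them when possible
#   rcu_nocbs - offload RCU callback processing to other CPUs
isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7
```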

You don't necessarily have to disable SMT in the BIOS; you could just boot with nosmt to have the kernel only start one logical core per physical core. (That's what the documentation sounds like it's saying, rather than not detecting that two cores are actually SMT siblings of each other, which is what would happen if the kernel were fully unaware of hyperthreading.)
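Either way, you can check what the kernel actually brought up by looking at the CPU topology in sysfs (standard paths on Linux):

```shell
# Each line lists the logical CPUs sharing one physical core.
# Pairs like "0,4" mean SMT siblings are online; bare IDs like "0"
# mean one logical CPU per physical core (SMT disabled or nosmt).
sort -u /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
```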

(SMT = Simultaneous Multithreading, the generic computer-architecture term for what Intel's Hyperthreading is.)