I wonder whether two threads belonging to the same program, and therefore using the same PCID, can share TLB entries when they are scheduled to run on the same physical CPU.
I already looked into the SDM (https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html); page 3115 (TLB and HT) does not mention any sharing mechanism. However, another part of the document states that when a TLB entry is accessed, its PCID is checked first, and the entry is used only if it matches the current PCID. There is also a bit identifying the current hardware thread stored next to the PCID.
My question: does the PCID take priority over the CPU-thread bit, or must both values match?
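To make the two alternatives concrete, here is a toy software model of a PCID-tagged TLB lookup. It is only an illustration, not how Intel implements the TLB: struct tlb_entry, tlb_lookup and the require_thread_match switch are hypothetical names. The SDM does say that a logical processor uses only TLB entries associated with the current PCID (global translations aside); the thread field stands in for the per-thread bit the question is about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy TLB entry. The thread field models the per-thread bit mentioned in
 * the question; whether real hardware checks it is exactly what is unknown. */
struct tlb_entry {
    bool valid;
    uint64_t vpn;     /* virtual page number */
    uint64_t pfn;     /* physical frame number */
    uint16_t pcid;    /* 12-bit PCID of the address space that filled it */
    uint8_t thread;   /* hypothetical: logical thread that filled it */
};

/* Looks up vpn for the current PCID and logical thread. On a hit, stores
 * the frame number in *pfn and returns true.
 *   require_thread_match == false: a matching PCID is enough, so both
 *     hyperthreads of a core could use each other's entries.
 *   require_thread_match == true: PCID and thread must both match, so
 *     entries stay private to the hyperthread that created them.
 * Global pages, which ignore the PCID, are not modeled. */
bool tlb_lookup(const struct tlb_entry *tlb, size_t n, uint64_t vpn,
                uint16_t cur_pcid, uint8_t cur_thread,
                bool require_thread_match, uint64_t *pfn)
{
    assert(pfn != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct tlb_entry *e = &tlb[i];
        if (!e->valid || e->vpn != vpn || e->pcid != cur_pcid)
            continue;
        if (require_thread_match && e->thread != cur_thread)
            continue;
        *pfn = e->pfn;
        return true;
    }
    return false;
}
```

With an entry filled by thread 0 under PCID 1, a lookup from thread 1 with the same PCID hits in the first model and misses in the second; the measurements below are an attempt to tell these two cases apart.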
From my observations, it is not possible (at least for the dTLB), even though it would bring performance benefits.

How I came to that conclusion
As suggested by Peter, I wrote a small program that consists of two worker threads that access the same heap region over and over again.
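The original program is not reproduced in this post. A minimal sketch of the same setup, assuming POSIX threads, might look like this; struct worker_arg, worker and run_two_workers are hypothetical names, and the region size and repetition count are left to the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Per-thread arguments and result. */
struct worker_arg {
    const uint8_t *region;   /* shared heap buffer read by both threads */
    size_t size;             /* bytes in the buffer */
    int repetitions;         /* how many full passes to make */
    uint64_t sum;            /* result; unsigned, so overflow simply wraps */
};

/* Sums every byte of the shared region `repetitions` times. */
static void *worker(void *p)
{
    struct worker_arg *a = p;
    uint64_t sum = 0;
    for (int r = 0; r < a->repetitions; r++)
        for (size_t i = 0; i < a->size; i++)
            sum += a->region[i];   /* every byte is loaded and used */
    a->sum = sum;
    return NULL;
}

/* Starts two workers on the same region, waits for both, and copies
 * their sums into sums[0] and sums[1]. Returns 0 on success, -1 if a
 * thread could not be created. */
int run_two_workers(const uint8_t *region, size_t size, int repetitions,
                    uint64_t sums[2])
{
    pthread_t tid[2];
    struct worker_arg args[2];
    int created = 0;

    assert(region != NULL || size == 0);
    for (; created < 2; created++) {
        args[created] = (struct worker_arg){ region, size, repetitions, 0 };
        if (pthread_create(&tid[created], NULL, worker, &args[created]) != 0)
            break;
    }
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    if (created < 2)
        return -1;
    sums[0] = args[0].sum;
    sums[1] = args[1].sum;
    return 0;
}
```

For the actual experiment, region would be a malloc'd buffer much larger than the TLB reach, and each worker would pin itself to the logical CPU under test before entering its loop.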
Compile with -O0 to prevent the compiler from optimizing the loops away.

I decided to sum up all the values in the memory region (obviously, the accumulator value will overflow) to prevent the CPU from doing microarchitectural optimizations. (The other idea was to simply dereference the memory region byte by byte and load each value into RAX.)

We go over the memory region repetitions times (repetitions in the code) to reduce the noise within one run caused by the slightly different start-up times of the threads, as well as by other processes and interrupts on the system.

Results
My machine has four physical and eight logical cores. Logical cores x and x+4 are located on the same physical core (according to lstopo).
CPU: Intel Core i5-8250U
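To reproduce the three configurations, each worker has to be pinned to a specific logical CPU. A minimal Linux-only sketch using pthread_setaffinity_np (a GNU extension), assuming the x / x+4 sibling numbering that lstopo reports on this machine; N_PHYSICAL, sibling_cpu and pin_self_to_cpu are hypothetical helpers, and the original program may have used taskset or another method instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Assumption (from lstopo on this i5-8250U): logical CPUs x and x + 4 are
 * the two hyperthreads of one physical core. Other machines may number
 * their CPUs differently. */
#define N_PHYSICAL 4

/* Hyperthread sibling of logical CPU x under the numbering above. */
int sibling_cpu(int x)
{
    return (x + N_PHYSICAL) % (2 * N_PHYSICAL);
}

/* Pins the calling thread to a single logical CPU; returns 0 on success
 * or an error number from pthread_setaffinity_np. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

Under this numbering, "same logical core" means both workers call pin_self_to_cpu(0), "different physical cores" means CPUs 0 and 1, and "same physical core" means CPUs 0 and sibling_cpu(0), i.e. 4.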
Running on the same logical core
Since both threads share the same address space, and thus the same PCID, a context switch between them should not invalidate the TLB entries.
Running on two different physical cores
No TLB sharing or interference whatsoever.
Running on the same physical core
If TLB sharing were possible, I would expect the lowest number of sTLB hits and a low number of dTLB page walks here. Instead, both counts are the highest of the three configurations.

Conclusion
As you can see, we get the most sTLB hits and dTLB page walks when running on the same physical core. From this I conclude that there is no sharing mechanism for the same PCID on the same physical core. Running the process on the same logical core and on two different physical cores results in roughly the same number of sTLB hits and misses. This further supports the thesis that TLB entries are shared on the same logical core but not between the two logical cores of one physical core.

Update
As suggested by Peter, I also used a linked-list approach to rule out the effects of THP and hardware prefetching. The modified data is shown below.
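The linked-list program is not included in the post. Here is a minimal sketch of what such a pointer-chasing benchmark could look like, assuming 4 KiB base pages and Linux; PAGE_SIZE, struct node, build_chase_list and chase are hypothetical names. madvise(MADV_NOHUGEPAGE) is one Linux-specific way to opt a range out of transparent huge pages; whether the original used it is not stated.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096u  /* assumption: 4 KiB base pages (x86-64 default) */

/* One node per page, so every hop touches a different 4 KiB page. */
struct node {
    struct node *next;
    char pad[PAGE_SIZE - sizeof(struct node *)];
};

/* Builds a circular list through n page-aligned nodes in random order.
 * Returns the first node, or NULL on failure. */
struct node *build_chase_list(size_t n, unsigned seed)
{
    if (n == 0)
        return NULL;
    size_t bytes = n * sizeof(struct node);
    struct node *nodes = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (nodes == MAP_FAILED)
        return NULL;
    /* Ask the kernel not to use THP here; the result is ignored because
     * kernels built without THP reject this advice. */
    madvise(nodes, bytes, MADV_NOHUGEPAGE);

    size_t *order = malloc(n * sizeof *order);
    if (order == NULL) {
        munmap(nodes, bytes);
        return NULL;
    }
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    /* Fisher-Yates shuffle: a random visiting order keeps the hardware
     * prefetchers from predicting the next page. */
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    for (size_t i = 0; i < n; i++)
        nodes[order[i]].next = &nodes[order[(i + 1) % n]];
    struct node *head = &nodes[order[0]];
    free(order);
    return head;
}

/* Follows the list for `steps` hops. Each load's address depends on the
 * previous load, so the hops cannot be overlapped or issued early. */
struct node *chase(struct node *p, size_t steps)
{
    assert(p != NULL);
    while (steps--)
        p = p->next;
    return p;
}
```

Each worker would call chase on the shared list for a large number of steps; with n pages, n hops bring it back to the head, so a lap touches every page exactly once.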
Compile with -O0 to prevent optimization.

Same Logical Core
Different Physical Cores
Same Physical Core / Different Logical Cores