How to get rid of interrupts on isolated cores (caused by simple Go app) during low-latency work?


Summary

LOC, IWI, RES and CAL interrupts are observed on isolated cores while a low-latency benchmark runs on them. The interrupts are caused by a simple Go application (printing "Hello world" every second) which runs on different, non-isolated cores. A similar Python application doesn't cause such problems.

Tested on Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64), kernel compiled with "Full Dynticks System (tickless)" and "No Forced Preemption (Server)".

Configuration

Hardware:

2 x Intel(R) Xeon(R) Gold 6438N (32 cores each)

BIOS:

Hyperthreading disabled

OS and configuration:

Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64), kernel compiled with "Full Dynticks System (tickless)" and "No Forced Preemption (Server)", built from https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/log/?h=lowlatency-next
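For reference, those two build options correspond to CONFIG_NO_HZ_FULL and CONFIG_PREEMPT_NONE; a quick way to confirm the running kernel has them enabled (assuming the config file is shipped under /boot, as Ubuntu does):

$ grep -E 'CONFIG_NO_HZ_FULL=|CONFIG_PREEMPT_NONE=' /boot/config-$(uname -r)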

irqbalance stopped and disabled:

$ systemctl stop irqbalance.service
$ systemctl disable irqbalance.service

Based on the workload type, experiments and knowledge found on the Internet, the following kernel parameters were used:

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.0-68-lowlatency root=UUID=5c9c2ea3-e0c6-4dd8-ae70-57e0c0af20d3 ro ro rhgb quiet ipv6.disable=1 audit=0 selinux=0 hugepages=256 hugepagesz=1G intel_iommu=on iommu=pt nmi_watchdog=0 mce=off tsc=reliable nosoftlockup hpet=disable skew_tick=1 acpi_pad.disable=1 nowatchdog nomce numa_balancing=disable irqaffinity=0 rcu_nocb_poll processor.max_cstate=0 clocksource=tsc nosmt nohz=on nohz_full=20-23 rcu_nocbs=20-23 isolcpus=nohz,domain,managed_irq,20-23
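To double-check that the isolation-related parameters took effect, the resulting CPU lists can be read back from sysfs; with the cmdline above, both should report 20-23:

$ cat /sys/devices/system/cpu/isolated
$ cat /sys/devices/system/cpu/nohz_full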

For every core/socket, Cx states (x > 0) were disabled, a particular power governor was selected and fixed uncore frequencies were set. The power.py script from https://github.com/intel/CommsPowerManagement was used for that.

prepare_cpus.sh script which sets this up and shows the results:

#!/bin/bash

#Disable all possible Cx states (x > 0)
./power.py -d C1_ACPI -r 0-63
./power.py -d C2_ACPI -r 0-63
./power.py -d C1 -r 0-63
./power.py -d C1E -r 0-63
./power.py -d C6 -r 0-63

#Powersave governor
./power.py -g powersave

#Set uncore min and max freq to 1100
./power.py -u 1100
./power.py -U 1100

sleep 3

#Show current status
./power.py -l

(screenshot: power.py -l output showing the applied C-state, governor and uncore settings)

CPUs 20-23 are "isolated" (via the kernel parameters above); the benchmark/workload will be run on them.

Kernel workqueue threads were moved away from CPUs 20-23:

$ cat /sys/devices/virtual/workqueue/cpumask
ffffffff,ff0fffff
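For reference, a sketch of how that mask can be set (ff0fffff has bits 20-23 cleared, so unbound workqueue workers stay off the isolated CPUs; needs root):

$ echo ffffffff,ff0fffff | sudo tee /sys/devices/virtual/workqueue/cpumask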

get_irqs.sh script which checks which target CPUs are permitted for each IRQ source:

#!/bin/bash

for I in $(ls /proc/irq)
do
    if [[ -d "/proc/irq/$I" ]]
    then
        echo -n "$I:"
        cat "/proc/irq/$I/smp_affinity_list"
    fi
done

Output of the above script:

0:0
1:0
10:0
11:0
12:0
124:0
13:0
133:0
134:0
135:0
136:0
137:0
138:0
14:0
15:0
16:0
2:0
203:0
212:0
24:0
25:0
26:0
27:0
277:0
278:0
279:0
28:0
280:1
281:2
282:3
283:4
284:5
285:6
286:7
287:8
288:9
289:10
29:0
290:11
291:12
292:13
293:14
294:15
295:16
296:17
297:18
298:19
299:20
3:0
30:0
300:21
301:22
302:23
303:24
304:25
305:26
306:27
307:28
308:29
309:30
31:0
310:31
311:32
312:33
313:34
314:35
315:36
316:37
317:38
318:39
319:40
32:0
320:41
321:42
322:43
323:44
324:45
325:46
326:47
327:48
328:49
329:50
33:0
330:51
331:52
332:53
333:54
334:55
335:56
336:57
337:58
338:59
339:60
34:0
340:61
341:62
342:63
343:0
344:0
345:0
347:0
348:0
349:0
35:0
350:0
351:0
352:0
353:0
354:0
355:0
356:0
357:0
358:0
359:0
36:0
360:0
361:0
362:0
363:0
364:0
365:0
37:0
38:0
39:0
4:0
40:0
41:0
42:0
43:0
44:0
45:0
46:0
47:0
48:0
49:0
5:0
50:0
51:0
52:0
53:0
54:0
55:0
56:0
57:0
58:0
59:0
6:0
7:0
8:0
9:0
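All listed IRQs already exclude CPUs 20-23 from their affinity. If any non-managed IRQ still pointed at an isolated CPU, it could be repinned manually; a minimal sketch (IRQ 24 is used only as an example, writes to kernel-managed IRQs are rejected):

$ echo 0 | sudo tee /proc/irq/24/smp_affinity_list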

lscpu output:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6438N
    CPU family:          6
    Model:               143
    Thread(s) per core:  1
    Core(s) per socket:  32
    Socket(s):           2
    Stepping:            8
    CPU max MHz:         3600.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4000.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pn
                         i pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd 
                         mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsav
                         ec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni
                          avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   3 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    128 MiB (64 instances)
  L3:                    120 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-31
  NUMA node1 CPU(s):     32-63
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

JITTER tool - Baseline

jitter is a benchmarking tool meant for measuring the "jitter" in execution time caused by the OS and/or the underlying architecture.

git clone https://github.com/FDio/archived-pma_tools
cd archived-pma_tools/jitter

Put "run_jitter.sh" script inside above directory.

#!/bin/bash

printf "Start time:\n"
date
#Print current state of interrupts (for: cpu20, cpu21, cpu22, cpu23, cpu24), discard interrupts with only zeros (or empty)
printf "\nInterrupts before:\n"
sed '1s/^/INT:/' /proc/interrupts | awk '{print $1 " " $22 " " $23 " " $24 " " $25 " " $26}' | grep -v "0 0 0 0 0" | grep -v ":  "

printf "\nRun jitter... "
#Start on cpu21, benchmark thread will be run on isolated cpu20
taskset -c 21 ./jitter -c 20 -f -i 1450 -l 15500 > jit.log

printf "finished!\n"

printf "\nInterrupts after:\n"
sed '1s/^/INT:/' /proc/interrupts | awk '{print $1 " " $22 " " $23 " " $24 " " $25 " " $26}' | grep -v "0 0 0 0 0" | grep -v ":  "

printf "\nEnd time:\n"
date

printf "\nResults:\n"
cat jit.log

printf "\n"

Run:

make
./run_jitter.sh

Results:

  • output from "run_jitter.sh" script: jitter_base.txt
  • chart created from above output:

(screenshot: jitter chart for the baseline run)

Comment: the jitter tool reports intervals and jitter in CPU core cycles. The benchmark is run on a 2000 MHz core, so on the chart the values are divided by 2 and presented in nanoseconds. Very stable results, no significant jitters (max jitter: 51 ns) during 335 seconds. No interrupts hit the isolated CPU20 during the benchmark.

JITTER tool - Python

hello.py - a simple Python app which prints "Hello world" every second:

import time

while True:
    print("Hello world")
    time.sleep(1)

run_python_hello.sh - script to run the Python app on a particular (non-isolated) core:

#!/bin/bash

taskset -a -c 8 python3 hello.py

$ python3 --version
Python 3.10.6

In the first console ./run_python_hello.sh was started; in the second console ./run_jitter.sh was run.

Results:

  • output from run_jitter.sh script: jitter_python.txt
  • chart created from above output: (screenshot: jitter chart for the Python run)

Comment: an acceptable result, one noticeable jitter (1190 ns); the remaining jitters did not exceed 60 ns during 336 seconds. No interrupts hit the isolated CPU20 during the benchmark.

JITTER tool - Golang

hello.go - a simple Go app which prints "Hello world" every second:

package main

import (
    "fmt"
    "time"
)

func main() {
    for {
        fmt.Println("Hello world")
        time.Sleep(1 * time.Second)
    }
}

go.mod - Go module definition:

module hello

go 1.20

run_go_hello.sh - script to run the Go app on a particular (non-isolated) core:

#!/bin/bash

taskset -a -c 8 ./hello

$ go version
go version go1.20.5 linux/amd64

In the first console the Go app was built (go build) and started (./run_go_hello.sh); in the second console ./run_jitter.sh was run.

Results:

  • output from "run_jitter.sh" script: jitter_go.txt
  • chart created from above output: (screenshot: jitter chart for the Go run)

Comment: 34 significant jitters (the worst one: 44961 ns) during 335 seconds. The following interrupts hit the isolated CPU20 during the benchmark:

  • LOC: 67
  • IWI: 34
  • RES: 34
  • CAL: 34

The jitters seem to occur roughly every ~10 s.

What is also interesting is that the idle, isolated CPU22 and CPU23 received no interrupts during the benchmark. CPU24 (not isolated) received only LOC interrupts (335283 of them).

INTERRUPTS:

jitter runs for about 335 s and I collect interrupts from /proc/interrupts just before jitter starts and just after it ends, for the isolated CPUs 20-23 and the non-isolated CPU24 (for reference). Only these kinds of interrupts appear on these cores in all scenarios: LOC, IWI, RES, CAL and TLB (I believe the last one is caused by jitter when it ends). No other types show up for these cores.
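For reference, a minimal sketch of how those before/after deltas can be computed (columns $22-$26 of /proc/interrupts correspond to CPU20-CPU24 on this 64-CPU box, matching the awk filter in run_jitter.sh; the script name and temporary file names are placeholders):

#!/bin/bash
# diff_interrupts.sh - snapshot the CPU20-24 columns of /proc/interrupts,
# run the command given as arguments, snapshot again and print per-row deltas.
sed '1s/^/INT:/' /proc/interrupts | awk '{print $1, $22, $23, $24, $25, $26}' > before.txt
"$@"
sed '1s/^/INT:/' /proc/interrupts | awk '{print $1, $22, $23, $24, $25, $26}' > after.txt
paste before.txt after.txt | awk 'NR > 1 {printf "%s", $1; for (i = 2; i <= 6; i++) printf " %d", $(i+6) - $i; print ""}'

Usage would be e.g. ./diff_interrupts.sh taskset -c 21 ./jitter -c 20 -f -i 1450 -l 15500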

So here is the difference in these interrupts for "BASELINE" (only jitter running): (screenshot: interrupt deltas per CPU, baseline)

Here it is for jitter with the Python app: (screenshot: interrupt deltas per CPU, Python)

And here it is for jitter with the Go app: (screenshot: interrupt deltas per CPU, Go)

We can see that the Python app doesn't influence the isolated cores at all, while its Go equivalent unfortunately does. As a Go engineer, I'd like to find out how to write applications in Go that have as little impact on the isolated cores as the Python one ;)

What and where should be tuned/configured/changed/fixed to achieve that?

Notes

  1. Instead of static isolation (using kernel parameters) I also tried cpuset with its shield turned on. Unfortunately, the results were even worse (jitters were bigger and more interrupts hit the shielded cores); moreover, cset was not able to move kernel threads out of the shielded pool.
  2. I also checked it on a realtime kernel (GNU/Linux 5.15.65-rt49 x86_64 -> https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.15.65.tar.gz patched with https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.15/older/patch-5.15.65-rt49.patch.gz) and the problem with interrupts and jitters caused by the Go app doesn't exist there. However, an RT kernel is not the best solution for everyone, and it would be great not to have the jitters on the lowlatency tickless kernel either.
  3. I also experimented a lot with different kernel parameters; this combination seemed the best (however, maybe I missed something).
  4. The situation is the same with the Go app built using Go 1.19.x and 1.20.2.
  5. I'm aware that this kind of benchmark should be executed for hours, but for now these results are pretty meaningful.
  6. Also checked with GODEBUG=asyncpreemptoff=1, GOGC=off and GOMAXPROCS=1 - pretty much the same results (see the sketch below the list).
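For reference, a sketch of such an invocation with all three knobs set at once (an assumption: they can also be applied separately, as in the individual checks above):

$ GODEBUG=asyncpreemptoff=1 GOGC=off GOMAXPROCS=1 taskset -a -c 8 ./hello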