perf stat displays some interesting statistics that can be gathered from examining hardware and software counters.
In my research, I couldn't find any reliable information about what counts as a context-switch in perf stat. In spite of my efforts, I was unable to understand the kernel code in its entirety.
Suppose my InfiniBand network application calls a blocking read system call in event mode 2000 times, and perf stat counts 1,241 context switches. Does that count refer to schedule-in events, schedule-out events, or both?
The __schedule() function (kernel/sched/core.c) increments the switch_count counter whenever prev != next.
It seems that perf stat's context-switches count includes involuntary switches as well as voluntary ones.
If the current context runs the schedule code and increments its own nvcsw or nivcsw counter in the task_struct, then it seems to me that only deschedule events are counted.
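To make that reading concrete, here is a toy model of the counting rule described above. It is only a sketch and not the kernel code: the `Task` class, the `schedule` function, and the `preempted` flag are hypothetical names, and the real kernel decides voluntary versus involuntary from the preemption mode and task state rather than from a flag passed in.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    nvcsw: int = 0   # voluntary switches away from this task
    nivcsw: int = 0  # involuntary switches away from this task


def schedule(prev, nxt, preempted):
    """Toy model: charge one switch to prev only if a different task is picked."""
    if prev is not nxt:
        if preempted:
            prev.nivcsw += 1
        else:
            prev.nvcsw += 1
    return nxt


app = Task("my_application")
other = Task("other")

current = app
current = schedule(current, other, preempted=False)  # app blocks: voluntary
current = schedule(current, app, preempted=False)    # other sleeps, app resumes
current = schedule(current, app, preempted=True)     # app re-picked: no count
current = schedule(current, other, preempted=True)   # app preempted: involuntary
current = schedule(current, app, preempted=False)    # other sleeps again
current = schedule(current, other, preempted=False)  # app blocks again

print(app.nvcsw, app.nivcsw)  # 2 1
```

In this model only the outgoing task is charged, so `my_application` ends with three switches in total (two voluntary, one involuntary), even though it was scheduled back in twice. An odd total is possible precisely because schedule-in is not counted separately.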
output from perf stat -- my_application:
1,241 context-switches
Meanwhile, if I count only the sched:sched_switch event, the output is close to the expected number.
output from perf stat -e sched:sched_switch -- my_application:
2,168 sched:sched_switch
Is there a difference between context-switches and the sched_switch event?
I think you only get a count for context-switches if a different task actually runs on a core that was running one of your threads. A read() that blocks, but resumes before any user-space code from any other task runs on the core, probably won't count.
Just entering the kernel at all for a system call clearly doesn't count; perf stat ls only counts one context-switch in a largish directory for me, or zero if I ls a smaller directory like /. I get much higher counts, like 711, for a recursive ls of a directory that I hadn't accessed recently, on a magnetic HDD. So it spent significant time waiting for I/O, and maybe running bottom-half interrupt handlers.
The fact that the count can be odd means it's not counting both deschedule and re-schedule separately; since I'm looking at counts for a single-threaded process that eventually exited, if it was counting both, the count would have to be even.
I expect the counting is done when schedule() decides that current should change to point to a new task that isn't this one. (current is the Linux kernel's per-core variable that points to the task_struct of the current task, e.g. a user-space thread.) So every time that happens to a thread that's part of your process, you get 1 count.
Indeed, the OP helpfully tracked down the source code; it's in __schedule in kernel/sched/core.c, for example in Linux 6.1.
I would guess the context-switches perf event sums both involuntary and voluntary switches away from a thread. (Assuming that's what nv and niv stand for.)
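One way to check the voluntary/involuntary split from user space, without perf, is to read the process's own counters through getrusage(). The sketch below uses Python's standard `resource` module, whose `ru_nvcsw` and `ru_nivcsw` fields report voluntary and involuntary context switches. I'm assuming these are fed from the same per-task counters discussed above; the helper name `switches` and the 20 short sleeps are arbitrary choices for illustration.

```python
import resource
import time


def switches():
    """Return (voluntary, involuntary) context-switch counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


before = switches()
for _ in range(20):
    time.sleep(0.001)  # blocking sleep; each one usually gives up the CPU
after = switches()

voluntary = after[0] - before[0]
involuntary = after[1] - before[1]
print(f"voluntary: {voluntary}, involuntary: {involuntary}")
```

If the guess is right, the sum of the two deltas for a run should be in the same ballpark as what perf stat reports for context-switches over the same interval. The exact numbers depend on system load and on what else the interpreter does, so none are shown here.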