I am trying to diagnose low instructions-per-cycle (IPC) in a small program. Using perf stat -e cycles,stalled-cycles-backend ./myprogram, I see that around 90% of backend cycles are idle.
Performance counter stats for './myprogram':
5207951548 cycles:u
4704755172 stalled-cycles-backend:u # 90.34% backend cycles idle
3.485708464 seconds time elapsed
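As a sanity check, the percentage perf prints can be reproduced from the two raw counts. A minimal Python sketch, with the numbers copied from the output above (the variable names are mine, not perf's):

```python
# Raw counts copied from the perf stat output above.
cycles = 5_207_951_548
stalled_backend = 4_704_755_172

# Share of all cycles that perf attributes to backend stalls.
backend_idle_pct = 100.0 * stalled_backend / cycles
print(f"{backend_idle_pct:.2f}% backend cycles idle")
```

This matches the figure in perf's annotation, so that percentage is simply stalled-cycles-backend divided by cycles.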
My understanding is that the hardware counter behind stalled-cycles-backend is incremented on every cycle in which fetched instructions cannot be dispatched because of backend resource constraints, such as waiting on the result of a memory load or the unavailability of a floating-point unit, so that the pipeline fills up with bubbles. I would like to drill down further into why exactly the backend is stalling.
On looking through some other events using perf list, I came across the following, among others.
--snip--
stall_slot [Kernel PMU event]
stall_slot_backend [Kernel PMU event]
stall_slot_frontend [Kernel PMU event]
--snip--
Using perf stat -e cycles,stall_slot_backend ./myprogram, I see the following result.
Performance counter stats for './myprogram':
5207682200 cycles:u
28628535485 stall_slot_backend:u
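To put a number on the discrepancy, here is a small Python sketch that divides the two counts. The values are copied from the output above, and the variable names are only for illustration:

```python
# Raw counts copied from the second perf stat run above.
cycles = 5_207_682_200
stall_slot_backend = 28_628_535_485

# Average number of stall_slot_backend events counted per clock cycle.
events_per_cycle = stall_slot_backend / cycles
print(f"{events_per_cycle:.2f} stall_slot_backend events per cycle")
```

That works out to roughly 5.5 stall_slot_backend events per cycle on average.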
So this event supposedly occurred more times than there were clock cycles, and I do not understand what that means. My processor's technical reference manual (Arm Cortex-A78AE) gives the following description for stall_slot_backend:
No operation sent for execution on a slot due to the backend
This is not very illuminating, and I could not find a definition of "slot" elsewhere in the manual. So I have the following questions.
- What exactly does a slot mean in this context?
- Why is the count for this event more than the number of clock cycles? Is it possible for a counter to be incremented more than once per cycle? Or could these results be erroneous or misleading?
- What is the difference between the hardware events/event groups stall_slot_backend and stalled-cycles-backend?