Command "strace -T" introducing too much overhead

211 Views Asked by At

I'm trying to check the execution time of each system call invoked by my application using strace -T.

I got the trace output. But the time calculated by strace for each system call seemed inaccurate.

To confirm the accuracy, I added few usleep(100) calls in my application. usleep is calling nanosleep() syscall. I expected the strace to show the time taken by nanosleep(100000) to be around 100 microseconds. But the values shown by strace for nanosleep(100000) is around 300 microseconds. For some calls, it is even showing 7 milliseconds(as seen in the 3rd line of below attached screenshot). enter image description here

  1. What is the reason for so much fluctuation in the execution time of a system call. Is this the issue with strace calculation or nanosleep(100000) is really taking a lot more than 100 microseconds to execute?
  2. If the issue is because of overhead introduced by strace, then the overhead is too much and strace is not useful to calculate the time taken by system call. Is there any way to avoid the overhead of strace? If not, is there any other tool to achieve the same?
2

There are 2 best solutions below

0
Zazaeil On

strace is quite invasive because it is built upon the ptrace, the man pages. The immediate implication is that strace suffers from an interaction between the user-space and kernel-space - that's because the ptrace is nothing but yet another system call with all the consequences of that.

The whole thing about eBPF is that it runs in the kernel space entirely, without costly interactions between the two worlds. You could try to experiment with the bpftrace and compare the outcome (strongly biasing to the latter to be more trustworthy than the former one). In my experience, eBPF is not only much faster and less resource-hungry but also much more accurate as well.

If you're new to eBPF and have no idea what it is, then start here.

0
pchaigno On

strace

This seems likely to be caused by strace. By default, strace intercepts all syscall entries and exits, causing two context switches per syscall.

If you want to trace a subset of syscalls, there is an experimental option to reduce the overhead, --seccomp-bpf. See https://pchaigno.github.io/strace/2019/10/02/introducing-strace-seccomp-bpf.html for an explanation.

bcc

In your case, one option is to use the existing bcc tool funclatency. See https://github.com/iovisor/bcc/blob/master/tools/funclatency_example.txt for examples of its usage.

You can also install funclatency via packages (bpfcc-tools on Ubuntu). Tracing the latency of the nanosleep(2) syscall would then be:

# funclatency-bpfcc __x64_sys_nanosleep
Tracing 1 functions for "__x64_sys_nanosleep"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 5        |*                                       |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 0        |                                        |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 132      |****************************************|

avg = 9766470 nsecs, total: 1338006501 nsecs, count: 137

Detaching...

If you want to display the latency as a function of the arguments, then you could either edit the funclatency.py script or rewrite something with bpftrace.