I'm trying to check the execution time of each system call invoked by my application using strace -T.
I got the trace output. But the time calculated by strace for each system call seemed inaccurate.
To check the accuracy, I added a few usleep(100) calls to my application. usleep calls the nanosleep() syscall.
I expected strace to show the time taken by nanosleep(100000) as around 100 microseconds. But the values shown by strace for nanosleep(100000) are around 300 microseconds. For some calls, it even shows 7 milliseconds (as seen in the 3rd line of the attached screenshot below).

- What is the reason for so much fluctuation in the execution time of a system call? Is this a problem with strace's calculation, or is nanosleep(100000) really taking much longer than 100 microseconds to execute?
- If the problem is overhead introduced by strace, then the overhead is too high for strace to be useful for timing system calls. Is there any way to avoid the overhead of strace? If not, is there another tool that can do the same?
strace is quite invasive because it is built on ptrace (see the man pages). The immediate implication is that strace pays for interaction between user space and kernel space: ptrace is itself just another system call, with all the consequences of that.

The whole point of eBPF is that it runs entirely in kernel space, without costly transitions between the two worlds. You could experiment with bpftrace and compare the results (I would strongly expect the latter to be more trustworthy than the former). In my experience, eBPF is not only much faster and less resource-hungry but also much more accurate.
If you're new to eBPF and have no idea what it is, then start here.