Bash script creates zombie and uses 100% CPU

I have a bash script which runs in a loop. Occasionally, the script gets into a state where it is using 100% of the CPU. Looking at pidtree, when that happens, the process has launched a child process to call date.

my_script(401)---my_script(463)---date(15804)

The PID of that child process never changes. Moreover, the child process is somehow a zombie.

1 R root       463   401 51  80   0 -  4997 -      Mar22 ?        6-17:43:01 /bin/bash -eu /usr/sbin/my_script
0 Z root     15804   463  0  80   0 -     0 -      Mar28 ?        00:00:00 [date] <defunct>
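
For anyone wanting to spot this state quickly, here is a minimal sketch that filters ps output down to zombie rows and shows each zombie's parent PID. The helper names filter_zombies and list_zombies are hypothetical, and the column layout assumes the procps `ps -eo pid,ppid,stat,comm` format.

```shell
# Keep only rows whose STAT column starts with Z (zombie/defunct).
# Expects input in the layout of `ps -eo pid,ppid,stat,comm`:
# a header line, then PID, PPID, STAT, COMMAND.
# filter_zombies is a hypothetical helper name, not part of any tool.
filter_zombies() {
  awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Print "pid ppid command" for every zombie on the live system.
list_zombies() {
  ps -eo pid,ppid,stat,comm | filter_zombies
}
```

Calling list_zombies on the affected machine should print one line per defunct process, with the second field being the parent that has not yet reaped it.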

Luckily, my code has exactly one place where date is called. In a simplified version of the script (updated to include flock), that line looks like this:

LOG="/tmp/foo"
(
flock -e -n 200 || exit 1
while true; do
  do_something_that_includes_sleep
  vals=("$(date --iso-8601=seconds --utc)")
  echo ${vals} >> ${LOG}
done
) 200>>${LOG}

How can this possibly cause date to become a zombie? Even if it did somehow become a zombie, why would the main script be in the Running state, consuming 100% of the CPU, instead of blocking on a pipe read from the child?

Answer by Paul Grinberg

After much hair pulling, and several additional failure instances that I investigated with GDB, I now have an answer. There is nothing wrong with the Bash script itself. The problem is the Bash version. What I didn't mention originally is that this problem was only seen on Debian Stretch, which has Bash version 4.4.11 (from the 4.4-5 DEB package). This version has a known bug that was reported back in 2017 and has since been fixed, which explains why my other test systems, which run newer OS releases, didn't see the same failure.
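
As a rough way to tell whether a given machine is running the version named above, a version string can be compared against 4.4.11. This is only a sketch: is_reported_buggy_bash is a hypothetical helper, and flagging just this exact patch level is an assumption, since I only confirmed the problem on 4.4.11 and have not checked which other 4.4 patch levels are affected.

```shell
# Succeed if the version string is exactly 4.4.11, optionally followed by
# a non-digit suffix such as "(1)-release" (the format of $BASH_VERSION).
# Flagging only this patch level is an assumption, not a confirmed range.
is_reported_buggy_bash() {
  case "$1" in
    4.4.11|4.4.11[!0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Warn if the shell running this snippet matches the reported version.
if is_reported_buggy_bash "${BASH_VERSION:-}"; then
  echo "bash ${BASH_VERSION}: matches the version reported as affected" >&2
fi
```

On Debian, the package revision (4.4-5 in my case) can be checked separately with the package manager, since a distribution may backport fixes without changing the upstream version string.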

Original bug report - https://lists.gnu.org/archive/html/bug-bash/2017-02/msg00025.html

A second bug report explicitly documents the 100% CPU utilization while waiting for a zombie child: https://lists.gnu.org/archive/html/bug-bash/2017-03/msg00141.html. It ultimately ties back to the original bug report.