How to narrow down Docker error message "Failed to compute size of container rootfs <ID>: mount does not exist"?


For a while now we've been encountering the error message Failed to rm container <ID> in our Jenkins build logs, with no obvious root cause.

The containers which fail to be removed are started from Groovy with the Docker Pipeline plugin (version 563.vd5d2e5c4007f), resulting in the following command line:

docker run -t -d -u 1001:1001 --ulimit nofile=16384:32768 \
    --hostname debian-11 \
    -v <git-ref-repo>:<git-ref-repo>:ro \
    -v <git-repo>:<git-repo>:ro \
    -v <workspace>:<workspace>:rw,z -v <workspace_tmp>:<workspace_tmp>:rw,z \
    -w <workspace> \
    (... bunch of -e ...) \
    <image-id> cat

Grepping for the container ID in /var/log/syslog we find:

Jan 18 03:24:29 <machine> containerd[1296]: time="2024-01-18T03:24:29.944843743+01:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/<ID> pid=947691 runtime=io.containerd.runc.v2
Jan 18 03:24:29 <machine> systemd[1]: Started libcontainer container <ID>.
Jan 18 04:35:41 <machine> dockerd[958853]: time="2024-01-18T04:35:41.041899333+01:00" level=info msg="Container failed to exit within 1s of signal 15 - using the force" container=<ID>
Jan 18 04:35:41 <machine> systemd[1]: docker-<ID>.scope: Deactivated successfully.
Jan 18 04:35:41 <machine> systemd[1]: docker-<ID>.scope: Consumed 8h 38min 52.709s CPU time.
Jan 18 04:35:41 <machine> containerd[1296]: time="2024-01-18T04:35:41.529534693+01:00" level=info msg="shim disconnected" id=<ID>
Jan 18 04:35:41 <machine> containerd[1296]: time="2024-01-18T04:35:41.529647257+01:00" level=warning msg="cleaning up after shim disconnected" id=<ID> namespace=moby
Jan 18 04:35:41 <machine> dockerd[958853]: time="2024-01-18T04:35:41.529711109+01:00" level=info msg="ignoring event" container=<ID> module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 18 04:46:14 <machine> dockerd[958853]: time="2024-01-18T04:46:14.564180677+01:00" level=error msg="Failed to compute size of container rootfs <ID>: mount does not exist"
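
To take Jenkins out of the picture and to collect all daemon-side messages for one container, the cleanup can be replayed manually. A minimal sketch (the 1-second stop timeout mirrors the "failed to exit within 1s of signal 15" line above; the exact invocation used by the plugin and the systemd unit names docker/containerd are assumptions):

CID=<ID>   # full container ID taken from the build log

# mimic the plugin's cleanup: short stop timeout, then forced removal
docker stop --time=1 "$CID"
docker rm -f "$CID"

# all dockerd/containerd messages mentioning that container
journalctl -u docker -u containerd --no-pager | grep "$CID"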

Update: the following is just for reference; the mentioned error message turned out to be the result of a monitoring attempt.


While the warning cleaning up after shim disconnected shows up for other containers too, the error Failed to compute size of container rootfs looks like it is coupled with the unsuccessful removal of the container by Jenkins (and I suspect it to be the cause).
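
One way to back that suspicion up is to extract every container ID for which dockerd logged the size error and check whether that container is still present, i.e. whether its removal failed. A sketch (the journalctl unit name and the exact log wording as shown above are assumed):

journalctl -u docker --no-pager \
    | grep 'Failed to compute size of container rootfs' \
    | grep -o 'rootfs [0-9a-f]\{12,\}' | awk '{print $2}' | sort -u \
    | while read -r id; do
          if docker ps -aq --no-trunc | grep -q "^$id"; then
              echo "$id still present - removal apparently failed"
          else
              echo "$id gone"
          fi
      done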

I've searched for the error message Failed to compute size of container rootfs, found this issue, and in the source I lost the trace at engine.layer.layerStore.GetRWLayer:

func (ls *layerStore) GetRWLayer(id string) (RWLayer, error) {
    ls.locker.Lock(id)
    defer ls.locker.Unlock(id)

    ls.mountL.Lock()
    mount := ls.mounts[id]
    ls.mountL.Unlock()
    if mount == nil {
        return nil, ErrMountDoesNotExist // this seems to produce the observable error
    }

(same in moby)
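
ls.mounts is the layer store's in-memory map of container RW layers; its on-disk counterpart can be inspected directly. A sketch, assuming the overlay2 storage driver and the default data root /var/lib/docker (the in-memory map can of course diverge from what is on disk):

CID=<full-container-ID>

# layer-store bookkeeping for the container's RW layer
ls -l /var/lib/docker/image/overlay2/layerdb/mounts/"$CID"/

# is the container's rootfs overlay still mounted?
MERGED=$(docker inspect -f '{{ .GraphDriver.Data.MergedDir }}' "$CID")
mountpoint "$MERGED"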


Trying to check for suspicious processes on container shutdown, Bazel came into focus. Currently it looks like we're encountering something like https://github.com/bazelbuild/bazel/issues/16907 or https://github.com/bazelbuild/bazel/issues/13823, i.e. the Bazel server process inside the container can't be terminated, which makes docker rm run into issues.
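
To see whether a lingering Bazel server is what keeps the container from terminating, the container's process table can be checked from the host shortly before the stage ends, and Bazel can be told to stop its server explicitly. A sketch (the grep pattern and the idea of calling bazel shutdown as a final build step are assumptions):

# processes still running in the container, as seen from the host
docker top "$CID" | grep -i -E 'bazel|java'

# last step inside the build container: stop the Bazel server explicitly
bazel shutdown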

Is there a Docker/Go/Bazel expert here who can help me spot the root cause of this? Currently I'm watching the number of running/existing containers, system load, open files, the number of connections to docker.sock, and the number of processes, but I haven't found any obvious similarities among the situations that show this error. This also happens on multiple machines (all running Ubuntu 22.04 with Docker v24.0.5, now on v25.0.0 with the same symptoms).
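
For reference, the ad-hoc watching described above can be put into a small sampling loop. A sketch (interval, log path and the use of pidof/ss are assumptions):

# sample the suspects once a minute
while sleep 60; do
    {
        date -Is
        echo "containers : $(docker ps -aq | wc -l)"
        echo "load       : $(cut -d' ' -f1-3 /proc/loadavg)"
        echo "dockerd fds: $(ls /proc/"$(pidof dockerd)"/fd | wc -l)"
        echo "sock conns : $(ss -x | grep -c docker.sock)"
        echo "processes  : $(ps -e --no-headers | wc -l)"
    } >> /var/tmp/docker-monitor.log
done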

1 Answer

Answer by VonC:

The error 'mount does not exist' should mean that Docker expects a mount point that is not present.
Verify the container's mount points using docker inspect <container_id> | grep -i mount.
And make sure the underlying filesystem where Docker stores its data (/var/lib/docker by default) is intact and has no issues.
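
A slightly more targeted variant of those checks, as a sketch (the .GraphDriver fields assume the overlay2 storage driver and the default /var/lib/docker data root):

# structured view of the container's mounts and its overlay directories
docker inspect <container_id> --format '{{ json .Mounts }}'
docker inspect <container_id> --format '{{ json .GraphDriver }}'

# health and free space of the filesystem backing Docker's data root
df -h /var/lib/docker
dmesg | grep -i -E 'overlay|i/o error' | tail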

Check also the Docker logs (journalctl -u docker), assuming Linux, since you mention Ubuntu.
And since containerd is involved in managing container life cycles, its logs should be checked (journalctl -u containerd).
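
For example, restricted to the failure window and to the messages that matter here (the times are placeholders taken from the log excerpt above):

journalctl -u docker -u containerd --since "04:30" --until "04:50" \
    | grep -E '<ID>|Failed to compute size|mount does not exist'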

With Jenkins involved, there might be race conditions where Jenkins attempts to interact with a container that is in an inconsistent state. That could be due to parallel jobs or premature termination attempts.
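
If such a race is the culprit, a pragmatic mitigation is to retry the removal instead of failing on the first attempt. A hypothetical sketch (the number of retries and the delay are arbitrary):

for attempt in 1 2 3; do
    docker rm -f "<container_id>" && break
    sleep 10
done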