zombie containerd-shim processes #318

Open
tianon opened this issue Jul 19, 2021 · 7 comments

Comments

@tianon
Member

tianon commented Jul 19, 2021

$ docker pull docker:20-dind
20-dind: Pulling from library/docker
Digest: sha256:4e1e22f471afc7ed5e024127396f56db392c1b6fc81fc0c05c0e072fb51909fe
Status: Image is up to date for docker:20-dind
docker.io/library/docker:20-dind

$ docker run -dit --privileged --name test docker:20-dind dockerd
1ee25dc98bf4bc5e232abe27a9e651b18cbfb8b3f6ca981c3ae64c894584e7b4
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   33 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  154 root      0:00 ps faux
$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
c53fb220cbad: Pulling fs layer
c53fb220cbad: Verifying Checksum
c53fb220cbad: Download complete
c53fb220cbad: Pull complete
Digest: sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b
Status: Downloaded newer image for tianon/true:latest
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   33 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  220 root      0:00 [containerd-shim]
  294 root      0:00 ps faux
$ docker exec test docker run --rm tianon/true
$ docker exec test docker run --rm tianon/true
$ docker exec test docker run --rm tianon/true
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   33 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  220 root      0:00 [containerd-shim]
  331 root      0:00 [containerd-shim]
  429 root      0:00 [containerd-shim]
  529 root      0:00 [containerd-shim]
  600 root      0:00 ps faux

If I do the same test with --init or ... docker:20-dind docker-init dockerd, then we get no zombies.

I think this is technically a bug in containerd, because I can reproduce with bare containerd as pid1 as well, but it doesn't seem quite the same as containerd/containerd#5708 (although perhaps related).

cc @thaJeztah @cpuguy83

$ docker run -dit --privileged --name test --volume /var/lib/containerd docker:20-dind containerd
2fa1f7a0b543808572a7a2da7ad28fd165d783f1ac8f3e9c59ebb30417f43b9f
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
   44 root      0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  110 root      0:00 [containerd-shim]
  152 root      0:00 ps faux
@tianon
Member Author

tianon commented Jul 19, 2021

The simplest "fix" (workaround) for this repository is something like adjusting ENTRYPOINT ["dockerd-entrypoint.sh"] to ENTRYPOINT ["docker-init", "dockerd-entrypoint.sh"].
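One way to confirm that a change like this took effect is to look at what is actually running as PID 1 inside the container. The following is a minimal sketch and not part of the image: `pid1_command` is a hypothetical helper name, and the optional proc-root argument is an assumption added only so the function can be pointed at a fake directory tree.

```shell
#!/bin/sh
# pid1_command (hypothetical helper): print the command line of PID 1.
# /proc/<pid>/cmdline separates arguments with NUL bytes, so they are
# converted to spaces and the trailing space is trimmed.
# $1 is an optional alternative proc root (defaults to /proc).
pid1_command() {
    proc_root="${1:-/proc}"
    tr '\000' ' ' < "$proc_root/1/cmdline" | sed 's/ $//'
}
```

Run inside the dind container, this would be expected to print a command line that starts with docker-init once the entrypoint has been adjusted, and plain dockerd otherwise.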

@tianon
Member Author

tianon commented Jul 19, 2021

(If you don't trust our entrypoint script [which, fair], you can also reproduce just the same with --entrypoint dockerd 😅)

@tianon
Member Author

tianon commented Jul 20, 2021

Temporary workaround is up in #319 (to just throw docker-init on top of dockerd).

@thaJeztah
Contributor

Did you open a ticket in containerd as well? (Or do the existing ones not match this scenario?)

@tianon
Member Author

tianon commented Jul 20, 2021

I didn't file an issue there yet, but I've commented at containerd/containerd#5708 (comment) now (because it feels way too similar to be coincidence, IMO).

@tianon
Member Author

tianon commented Jul 23, 2021

Quoting containerd/containerd#5708 (comment) here for posterity:

I'm facing something that seems really closely related (and IMO it doesn't feel like it can be pure coincidence), although maybe not exactly the same? When running Docker in Docker (or even just raw containerd-in-Docker), I'm seeing 100% reliable behavior where every invocation of a container ends up in a containerd-shim zombie, and it goes away if I run the container with tini as pid1 instead:

$ docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
2fa1f7a0b543808572a7a2da7ad28fd165d783f1ac8f3e9c59ebb30417f43b9f
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
   44 root      0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  110 root      0:00 [containerd-shim]
  152 root      0:00 ps faux
$ docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd --init docker:20-dind
5d2d6ac195d6fdbb0646b6df8d64de3ac00c4ae3fc0dce62bdd8eb59ac20a322
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 /sbin/docker-init -- containerd
    8 root      0:00 containerd
   32 root      0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 /sbin/docker-init -- containerd
    8 root      0:00 containerd
  142 root      0:00 ps faux

(See also docker-library/docker#318.)

@tianon ctr currently uses containerd-shim-runc-v2 by default. The shim v2 binary re-execs itself to start the long-running shim server, so the parent PID of that shim server is 1. But containerd, running as PID 1, does not act as the reaper for exited child processes. That is why there is a zombie shim in dind.
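To check the parent PID claim above without depending on the ps output format, the zombie entries can be read straight from /proc. This is a rough sketch under stated assumptions: `list_zombies` is a hypothetical helper, not something containerd or Docker ships, and it assumes a Linux /proc whose per-process status files carry the usual Name, State, Pid and PPid fields.

```shell
#!/bin/sh
# list_zombies (hypothetical helper): print "PID PPID NAME" for every
# process whose State in /proc/<pid>/status is Z (zombie).
# $1 is an optional alternative proc root (defaults to /proc), which is
# an assumption added so the function can be tried on canned data.
list_zombies() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        # Skip unmatched globs and processes that vanished meanwhile.
        [ -r "$status" ] || continue
        awk '
            $1 == "Name:"  { name = $2 }   # first word of the name only
            $1 == "State:" { state = $2 }  # single letter, e.g. S, R, Z
            $1 == "Pid:"   { pid = $2 }
            $1 == "PPid:"  { ppid = $2 }
            END { if (state == "Z") print pid, ppid, name }
        ' "$status" 2>/dev/null
    done
}
```

In the dind reproduction, each leftover shim would be expected to appear here with a PPid of 1, which matches the explanation that the orphaned shim server ends up as a child of containerd.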

When io.containerd.runtime.v1.linux is used as the runtime instead, the runtime calls the containerd binary to publish the exit event:

https://github.com/containerd/containerd/blob/a963242f78c8a05967dfe050cab1016ac7aeabee/cmd/containerd-shim/main_unix.go#L287-L318

However, ctr run deletes the task as soon as the task stops:

https://github.com/containerd/containerd/blob/a963242f78c8a05967dfe050cab1016ac7aeabee/runtime/v1/shim/service.go#L509-L541

The call to p.SetExited(e.Status) notifies ctr that the task has exited, so the task.Delete in ctr and the event publish action are handled at the same time. containerd then force-kills the shim, so the containerd process created by the shim is left as a zombie.

➜  vagrant docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
82f541cbb604077d99f76da45d9b866e03de577ffb209bf88b437e41ddca8440
➜  vagrant docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null
➜  vagrant docker exec test ctr run --runtime io.containerd.runtime.v1.linux docker.io/tianon/true:latest foo

➜  vagrant docker exec test ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  107 root      0:00 [containerd]
  122 root      0:00 ps -ef

If you run the foo container in detached mode, the shim reaps that containerd process.

➜  vagrant docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
97243d2c9667a246827a07eca736f666dc9f0864744f532fb7bf16f7d80dda08
➜  vagrant docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null
➜  vagrant docker exec test ctr run -d --runtime io.containerd.runtime.v1.linux docker.io/tianon/true:latest foo

➜  vagrant docker exec test ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
   74 root      0:00 containerd-shim -namespace default -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/default/foo -address /run/containerd/containerd.sock -containerd-binary /usr/local/bin/containerd
  112 root      0:00 ps -ef

➜  vagrant docker exec test ctr c rm foo

➜  vagrant docker exec test ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  140 root      0:00 ps -ef

@tianon
Member Author

tianon commented Jun 11, 2022

FWIW, I can still reproduce (using --entrypoint this time to avoid #319): 😞

$ docker run -dit --privileged --name test --entrypoint dockerd --pull=always docker:dind
dind: Pulling from library/docker
Digest: sha256:a7a9383d0631b5f6b59f0a8138912d20b63c9320127e3fb065cb9ca0257a58b2
Status: Downloaded newer image for docker:dind
41749ef585c457ff1e737f7ef2efc6ac8d3395219a6526c25f042c31bc43ca01
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   22 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  138 root      0:00 ps faux
$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
c53fb220cbad: Pulling fs layer
c53fb220cbad: Download complete
c53fb220cbad: Pull complete
Digest: sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b
Status: Downloaded newer image for tianon/true:latest
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   22 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  196 root      0:00 [containerd-shim]
  270 root      0:00 ps faux
$ docker exec test docker run --rm tianon/true
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   22 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  196 root      0:00 [containerd-shim]
  303 root      0:00 [containerd-shim]
  376 root      0:00 ps faux
$ docker exec test docker version
Client:
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 22:56:42 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:01:45 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.6
  GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc:
  Version:          1.1.2
  GitCommit:        v1.1.2-0-ga916309f
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

0lmi added a commit to 0lmi/mender-qa that referenced this issue Jan 4, 2023
dockerd might fail from time to time which looks related to the
known issue docker-library/docker#318
and using docker-init is the workaround used by the community

Changelog: None
Ticket: QA-508
Signed-off-by: Alex Miliukov <oleksandr.miliukov@northern.tech>
@tianon tianon mentioned this issue May 15, 2023