containerd-shim process isn't reaped for some killed containers #5708

Open
dany74q opened this issue Jul 9, 2021 · 14 comments

dany74q commented Jul 9, 2021

Description
We have several EKS clusters which autoscale throughout the day - they handle burst workloads,
and in a given day the underlying ASGs may scale down to 0 nodes or scale up to tens of nodes.

We've noticed that once in a while, some nodes have pods stuck in a 'Terminating' status
for days on end, until we manually intervene and force-delete them.

I SSH-ed into one of the nodes which exhibited this behavior and tried to introspect it;
here's what I've gathered, hopefully covering most abstraction layers, but unfortunately I could not find the root cause -
I'd love to know what I can do to debug this further.

Quick summary (more info below):

  • Kubelet is trying to kill an already dead container indefinitely
  • Docker thinks the container is running, although it is not
  • Containerd shows the task is stopped
  • The containerd-shim process wasn't reaped and is still alive
  • The shim receives a kill request from containerd and execve's runc to kill the container; runc returns "container not running", and the shim responds to containerd with "process already finished: not found"
  • The above repeats for every one of kubelet's kill retries

The above leads me to conclude that there's something fishy in the distributed orchestration of killing a container; I'd assume it's somewhere between containerd and runc, but I'm not entirely sure - I'd love to know how I can better pinpoint the exact cause.
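
For reference, here's roughly the cross-layer comparison I ran, condensed into a few commands. This assumes Docker's containerd namespace is 'moby' and runc's root is /var/run/docker/runtime-runc/moby, which is what this node uses - adjust for other setups:

# The container ID reported by kubelet / docker
CID=cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a

# What docker thinks
docker inspect --format '{{.State.Status}} pid={{.State.Pid}}' "$CID"

# What containerd thinks (docker runs its containers in the "moby" namespace)
ctr --namespace moby task ls | grep "$CID"

# What runc thinks
runc --root /var/run/docker/runtime-runc/moby state "$CID"

# Is the container's init process actually alive?
ps -p "$(docker inspect --format '{{.State.Pid}}' "$CID")" -o pid,stat,cmd

In my case these disagree: docker reports 'running', containerd and runc report 'stopped', and the init process is gone.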

Steps to reproduce the issue:
I'm not entirely sure how to reproduce the behavior yet, as it happens sporadically in arbitrary nodes.

Describe the results you received:
Containers are dead and containerd is aware they are stopped, but the shim isn't reaped; docker thinks the container is still running and misleads kubelet into keeping the pod in 'Terminating' until someone manually intervenes (force-deletes the pod).

Describe the results you expected:
The shim should go down with the container, and docker should be notified that the container has stopped so that kubelet can update the pod's status accordingly.

What version of containerd are you using:

$ containerd --version
containerd github.com/containerd/containerd 1.4.1 c623d1b36f09f8ef6536a057bd658b3aa8632828

Any other relevant information (runC version, CRI configuration, OS/Kernel version, etc.):
Deeper dive per abstraction layer -

Kubernetes / kubelet:

  • The pod is stuck in a 'Terminating state'
  • There are no finalizers on the pod
  • When I describe the pod, I see something similar to 'Normal Killing 2m49s (x1715 over 2d6h) kubelet Stopping container csi-node'
  • In journalctl -u kubelet, I see the following near the time the pod started to terminate:
Jul 07 03:40:25 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:40:20.474218    4811 kubelet.go:1848] SyncLoop (DELETE, "api"): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)"
Jul 07 03:40:32 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:40:32.052286    4811 kubelet.go:1870] SyncLoop (PLEG): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)", event: &pleg.PodLifecycleEvent{ID:"6f2f36c4-06f0-406d-9681-b92fa0106441", Type:"ContainerDied", Data:"789845b4d5dc620ce62b36ff5a6d2ef380a725226264b0170d33ea645bb837f1"}
Jul 07 03:41:18 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:41:18.375163    4811 kubelet.go:1870] SyncLoop (PLEG): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)", event: &pleg.PodLifecycleEvent{ID:"6f2f36c4-06f0-406d-9681-b92fa0106441", Type:"ContainerDied", Data:"650c0009c27bfd04f4578e3a5fe2ce0eea300088acb5e669f1d71c5d187139ff"}
  • Then, I see indefinite "killing container" messages (a journalctl filter for these is sketched right after this list):
Jul 09 10:32:23 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:32:23.591613    4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
Jul 09 10:34:24 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:34:24.591670    4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
Jul 09 10:36:09 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:36:09.591654    4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
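
These retries can be followed (and counted) straight from the kubelet journal; a minimal sketch, using the short container ID from the messages above:

# How many times has kubelet tried to kill this container so far?
journalctl -u kubelet --no-pager | grep -c 'Killing container "docker://cd7ed93ae2d1'

# Follow new kill retries as they happen
journalctl -u kubelet -f | grep 'Killing container "docker://cd7ed93ae2d1'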

Docker:

  • When I run docker container ls / docker inspect, the container status is 'Running' (one-liners for pulling this are sketched at the end of this section):
 {
        "Id": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
        "Created": "2021-07-07T00:11:27.14530101Z",
        "Path": "/entrypoint.sh",
        "Args": [
            "--endpoint=unix:///csi/csi.sock",
            "--v=4",
            "--volume-attach-limit=5"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 7064,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2021-07-07T00:11:36.641265855Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
	...
}
  • In journalctl, looking near the time the pod started terminating, I see the following:
-- Logs begin at Wed 2021-07-07 00:10:20 UTC, end at Fri 2021-07-09 10:46:56 UTC. --
Jul 07 03:30:43 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:30:43.252002398Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:30:44 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:30:44.542537801Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:39:32 ip-10-0-73-87.ec2.internal dockerd[4325]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
on returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Jul 07 03:40:35 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:35.735692652Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:40:44 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:44.165635762Z" level=error msg="stream copy error: reading from a closed fifo"
Jul 07 03:40:47 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:47.504174793Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:40:56 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:55.687736472Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
Jul 07 03:41:07 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:06.802768003Z" level=info msg="Container cd7ed93ae2d1 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Jul 07 03:41:17 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:17.058149021Z" level=warning msg="Published ports are discarded when using host network mode"
Jul 07 03:41:17 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:17.066107174Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:41:48 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:48.397339761Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
Jul 07 03:41:58 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:58.415920505Z" level=info msg="Container cd7ed93ae2d1 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Jul 07 03:42:28 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:42:28.634636074Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
  • The SIGKILL / 'using the force' messages continue indefinitely
  • I've uploaded the dockerd stacktrace here - docker-stacktrace.txt
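
For completeness, the Docker-side information above can be pulled with a couple of one-liners (the short ID is enough; the journal unit name 'docker' is an assumption that holds on Amazon Linux 2):

# Just the State block docker reports for the container
docker inspect --format '{{json .State}}' cd7ed93ae2d1

# dockerd's own view of the kill attempts, from its journal
journalctl -u docker --no-pager | grep cd7ed93ae2d1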

Containerd:

  • In containerd, the task of the container is 'STOPPED' - cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a 7064 STOPPED
  • The task's metrics print the following (the ctr invocations used here are sketched at the end of this section):
ID                                                                  TIMESTAMP                                  
cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a    2021-07-09 10:52:16.172595382 +0000 UTC    

METRIC                   VALUE                                                
memory.usage_in_bytes    20029440                                             
memory.limit_in_bytes    9223372036854771712                                  
memory.stat.cache        11411456                                             
cpuacct.usage            69532053147                                          
cpuacct.usage_percpu     [14916245050 21098017259 14389165228 19128625610]    
pids.current             0                                                    
pids.limit               0  
  • The container's info doesn't show anything particularly helpful:
{
    "ID": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
    "Labels": {
        "com.docker/engine.bundle.path": "/var/run/docker/containerd/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a"
    },
    "Image": "",
    "Runtime": {
        "Name": "io.containerd.runtime.v1.linux",
        "Options": {
            "type_url": "containerd.linux.runc.RuncOptions",
            "value": "CgRydW5jEhwvdmFyL3J1bi9kb2NrZXIvcnVudGltZS1ydW5j"
        }
    },
    "SnapshotKey": "",
    "Snapshotter": "",
    "CreatedAt": "2021-07-07T00:11:36.498772067Z",
    "UpdatedAt": "2021-07-07T00:11:36.498772067Z",
    "Extensions": null,
    "Spec": {
        "ociVersion": "1.0.1-dev",
        ...
    },
    ...
}
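
The containerd data above was gathered with ctr against Docker's 'moby' namespace, roughly as follows (exact sub-command spellings may differ slightly between ctr versions):

CID=cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a

# Task state as containerd sees it (STOPPED here, while docker still reports running)
ctr --namespace moby task ls | grep "$CID"

# Cgroup metrics for the task
ctr --namespace moby task metrics "$CID"

# Static container metadata
ctr --namespace moby container info "$CID"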

containerd-shim:

  • The containerd-shim of the container is still running:
root      7023  0.0  0.0 710748  6268 ?        Sl   Jul07   0:12 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
  • I tried to peek at the stdout / stderr of the shim, but either the process hadn't flushed any data, or it is simply empty
  • To find the shim ID, I looked at the process's fds and correlated the open sockets with ss:
[root@ip-10-0-73-87 ~]# ls -al /proc/$pid/fd | awk '/socket/ { print $NF }' | grep -o '[0-9]*' | xargs -I{} sh -c "ss -apn | grep {}"
u_str  LISTEN  0  4096  /run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b 25584  * 0      users:(("containerd-shim",pid=7023,fd=12),("containerd",pid=3538,fd=80))
u_str  ESTAB   0  0     /run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b 24369  * 25595  users:(("containerd-shim",pid=7023,fd=3))
u_str  ESTAB   0  0     * 25595                                                                                   * 24369  users:(("containerd",pid=3538,fd=87))
  • Then I looked at the containerd journal entries for the shim id (6e99a6...) and the only thing that came up was:
    Jul 07 00:11:36 ip-10-0-73-87.ec2.internal containerd[3538]: time="2021-07-07T00:11:36.521930693Z" level=info msg="shim containerd-shim started" address="unix:///run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b" debug=false pid=7023

  • I've managed to retrieve the shim's Go stack trace by strace-ing it to a file and sending it a kill -USR1
    (a sketch of this follows at the end of this section), but I don't see anything of particular interest there:

shim stacktrace

write(8</var/lib/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a/shim.stdout.log>, "time="2021-07-09T11:56:32Z" level=info msg="=== BEGIN goroutine stack dump ===
goroutine 9 [running]:
main.dumpStacks(0xc000064150)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:276 +0x74
main.executeShim.func1(0xc00004a1e0, 0xc000064150)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:186 +0x3d
created by main.executeShim
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:184 +0x5e9

goroutine 1 [select]:
main.handleSignals(0xc000064150, 0xc0000a8720, 0xc00005c090, 0xc00006e000, 0xc0000d9e50, 0x0)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:239 +0x119
main.executeShim(0xc0000d2540, 0x7fe26c883088)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:189 +0x625
main.main()
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:118 +0x20a

goroutine 18 [chan receive]:
main.main.func1()
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:89 +0x85
created by main.main
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:88 +0x74

goroutine 6 [syscall]:
syscall.Syscall6(0xe8, 0xb, 0xc00003b9b8, 0x80, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x4)
        /usr/lib/golang/src/syscall/asm_linux_amd64.s:41 +0x5
github.com/containerd/containerd/vendor/golang.org/x/sys/unix.EpollWait(0xb, 0xc00003b9b8, 0x80, 0x80, 0xffffffffffffffff, 0xffffffffffffffff, 0x85a8e0, 0xa5cb40)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux_amd64.go:76 +0x72
github.com/containerd/containerd/vendor/github.com/containerd/console.(*Epoller).Wait(0xc00005c2d0, 0xc0000d9a78, 0x8)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/console/console_linux.go:111 +0x7a
created by github.com/containerd/containerd/runtime/v1/shim.(*Service).initPlatform
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service_linux.go:113 +0xbb

goroutine 5 [chan receive]:
github.com/containerd/containerd/runtime/v1/shim.(*Service).processExits(0xc00006e000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:501 +0xd6
created by github.com/containerd/containerd/runtime/v1/shim.NewService
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:92 +0x40d

goroutine 24 [runnable]:
os/signal.process(0x85ec60, 0xa5cb70)
        /usr/lib/golang/src/os/signal/signal.go:240 +0x10a
os/signal.loop()
        /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x45
created by os/signal.Notify.func1.1
        /usr/lib/golang/src/os/signal/signal.go:150 +0x45

goroutine 7 [chan receive, 3375 minutes]:
github.com/containerd/containerd/runtime/v1/shim.(*Service).forward(0xc00006e000, 0x85a3e0, 0xc00001a030)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:579 +0x71
created by github.com/containerd/containerd/runtime/v1/shim.NewService
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:96 +0x4bf

goroutine 8 [IO wait, 23 minutes]:
internal/poll.runtime_pollWait(0x7fe26c882de8, 0x72, 0x0)
        /usr/lib/golang/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc000074098, 0x72, 0x0, 0x0, 0x7fee5f)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Accept(0xc000074080, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_unix.go:394 +0x1fc
net.(*netFD).accept(0xc000074080, 0xc00009e458, 0xc0000121b8, 0x79ccc0)
        /usr/lib/golang/src/net/fd_unix.go:172 +0x45
net.(*UnixListener).accept(0xc00005c330, 0xc00008fe50, 0xc00008fe58, 0x18)
        /usr/lib/golang/src/net/unixsock_posix.go:162 +0x32
net.(*UnixListener).Accept(0xc00005c330, 0x812878, 0xc000012190, 0x862f20, 0xc0000a0000)
        /usr/lib/golang/src/net/unixsock.go:260 +0x65
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve(0xc00005c090, 0x862f20, 0xc0000a0000, 0x861720, 0xc00005c330, 0x0, 0x0)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:87 +0x107
main.serve.func1(0x861720, 0xc00005c330, 0xc00005c090, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:224 +0x88
created by main.serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:222 +0x1fe

goroutine 25 [select]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc00009e3c0, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d

goroutine 10 [IO wait]:
internal/poll.runtime_pollWait(0x7fe26c882d08, 0x72, 0x85a8e0)
        /usr/lib/golang/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc0000e2318, 0x72, 0x85a800, 0xa4b248, 0x0)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0000e2300, 0xc000077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_unix.go:159 +0x1a5
net.(*netFD).Read(0xc0000e2300, 0xc000077000, 0x1000, 0x1000, 0xc000056270, 0xc00018fc28, 0x20)
        /usr/lib/golang/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0000a6078, 0xc000077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/net.go:182 +0x8e
bufio.(*Reader).Read(0xc00004a420, 0xc000078060, 0xa, 0xa, 0xc00018fd98, 0x447f74, 0xc00018ff18)
        /usr/lib/golang/src/bufio/bufio.go:227 +0x222
io.ReadAtLeast(0x859fa0, 0xc00004a420, 0xc000078060, 0xa, 0xa, 0xa, 0xa, 0x2, 0x789f60)
        /usr/lib/golang/src/io/io.go:314 +0x87
io.ReadFull(...)
        /usr/lib/golang/src/io/io.go:333
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.readMessageHeader(0xc000078060, 0xa, 0xa, 0x859fa0, 0xc00004a420, 0x73, 0xc00002ec00, 0x0, 0x6b0aae)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:53 +0x69
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*channel).recv(0xc000078040, 0xc00018fe2c, 0x3, 0x2, 0xc00014a400, 0x0, 0x0, 0x0)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:101 +0x6b
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run.func1(0xc00005e120, 0xc00009e3c0, 0xc00005e1e0, 0xc000078040, 0xc00005e180, 0xc00004a480)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:362 +0x1b0
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:332 +0x2c5

goroutine 17533 [select, 23 minutes]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc00009e1e0, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d

goroutine 17545 [select, 23 minutes]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc000012190, 0x862f20, 0xc0000a0000)
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
        /builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d

=== END goroutine stack dump ===" namespace=moby path=/run/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a pid=7023"

  • When strace-ing the shim, I saw that it execve's runc with the following:
/usr/sbin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a/log.json --log-format json kill cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a 9

to which runc returns "container not running", and in turn the shim reports to containerd - "process already finished: not found"
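
For reference, roughly how the stack dump above was captured - the shim writes the dump to its shim.stdout.log, which wasn't readable directly here, so strace was used to catch the write (the pid and paths are from this node):

SHIM_PID=7023   # pid of the stuck containerd-shim (from ps)

# Trace the shim's write() syscalls to a file, with a large string limit so the dump isn't truncated
strace -f -p "$SHIM_PID" -e trace=write -s 1000000 -o /tmp/shim-strace.txt &
STRACE_PID=$!

# containerd-shim (runtime v1) dumps its goroutine stacks on SIGUSR1
kill -USR1 "$SHIM_PID"

sleep 2
kill "$STRACE_PID"
grep 'BEGIN goroutine stack dump' /tmp/shim-strace.txt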

runc:

  • There are no hanging runc processes that I could observe, nor any related journal logs;
    I'm not sure how to introspect this layer after the fact (one thing worth checking is runc's log.json, sketched after this list)
  • runc --root /var/run/docker/runtime-runc/moby events cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a shows "container with id ... is not running"
  • runc list shows the container as stopped
  • runc state shows the following:
{
  "ociVersion": "1.0.2-dev",
  "id": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
  "pid": 0,
  "status": "stopped",
  "bundle": "/run/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
  "rootfs": "/var/lib/docker/overlay2/91052c9a423d3bee16e65973b097e7d0eb35baa5200e2c8f05f3484850e4c1c1/merged",
  "created": "2021-07-07T00:11:36.60742938Z",
  "owner": ""
}
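
Since the shim invokes runc with --log .../log.json --log-format json (see the execve above), runc's own error output for the failing kill should end up in that file; a quick way to inspect it (jq is optional):

CID=cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a
BUNDLE=/run/containerd/io.containerd.runtime.v1.linux/moby/$CID

# runc appends one JSON object per line; the "container not running" error should show up here
cat "$BUNDLE/log.json"

# or, with jq, just the messages
jq -r '.msg' "$BUNDLE/log.json"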

os:

  • The container's process (PID 7064, per docker inspect and the containerd task) is gone, meaning it was actually killed; a quick check is sketched below
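
A minimal sanity check for this (7064 is the init PID both docker and containerd report for this container):

# The init pid recorded by docker/containerd no longer exists
ps -p 7064 -o pid,ppid,stat,cmd
ls /proc/7064 2>/dev/null || echo "pid 7064 is gone"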

What you expected to happen:
Pods should terminate once their underlying container had died.

How to reproduce it (as minimally and precisely as possible):
I'm not sure how to reproduce it consistently yet - I'd assume it happens when creating and destroying nodes
rapidly.

Environment

- Kubernetes version: 1.19
- Docker version output:
Client:
 Version:           19.03.13-ce
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46
 Built:             Mon Oct 12 18:51:20 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46
  Built:            Mon Oct 12 18:51:50 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.1
  GitCommit:        c623d1b36f09f8ef6536a057bd658b3aa8632828
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
- Docker info output:
Client:
 Debug Mode: false

Server:
 Containers: 14
  Running: 9
  Paused: 0
  Stopped: 5
 Images: 8
 Server Version: 19.03.13-ce
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: c623d1b36f09f8ef6536a057bd658b3aa8632828
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0 (expected: fec3683)
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.4.117-58.216.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.46GiB
 Name: ip-10-0-73-87.ec2.internal
 ID: BFFR:6SUN:2BSZ:4MB4:K5NO:OBN2:6VHK:Z2YQ:LS3U:KPEW:5TUV:AEBW
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true

runc --version
$ runc --version
runc version 1.0.0-rc93
commit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
spec: 1.0.2-dev
go: go1.15.8
libseccomp: 2.4.1
uname -a
$ uname -a
Linux ip-10-0-73-87.ec2.internal 5.4.117-58.216.amzn2.x86_64 #1 SMP Tue May 11 20:50:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

zouyee commented Jul 19, 2021

moby/moby#41586

tianon commented Jul 20, 2021

I'm facing something that seems really closely related (and IMO it doesn't feel like it can be pure coincidence), although maybe not exactly the same? When running Docker in Docker (or even just raw containerd-in-Docker), I'm seeing 100% reliable behavior where every invocation of a container ends up as a containerd-shim zombie, and it goes away if I run the container with tini as PID 1 instead:

$ docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
2fa1f7a0b543808572a7a2da7ad28fd165d783f1ac8f3e9c59ebb30417f43b9f
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
   44 root      0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  110 root      0:00 [containerd-shim]
  152 root      0:00 ps faux
$ docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd --init docker:20-dind
5d2d6ac195d6fdbb0646b6df8d64de3ac00c4ae3fc0dce62bdd8eb59ac20a322
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 /sbin/docker-init -- containerd
    8 root      0:00 containerd
   32 root      0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 /sbin/docker-init -- containerd
    8 root      0:00 containerd
  142 root      0:00 ps faux

(See also docker-library/docker#318.)

fuweid commented Jul 21, 2021

Maybe related to opencontainers/runc#2575. Will take a look

fuweid commented Jul 21, 2021

@dany74q I'm not sure whether the containerd-shim for cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a is still around. If so, would you mind providing the content of /var/run/docker/runtime-runc/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a/state.json? Thanks
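
(For anyone else collecting this: a minimal way to pull the likely-interesting fields out of runc's saved state, assuming jq is available - otherwise attaching the whole file, as below, works just as well.)

STATE=/var/run/docker/runtime-runc/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a/state.json

# The init pid, its recorded start time and the creation timestamp are the fields most likely of interest
jq '{id, init_process_pid, init_process_start, created}' "$STATE"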


dany74q commented Jul 21, 2021

Hey @fuweid - the containerd-shim mentioned here is already dead (I killed it manually), but I did manage to find another container in the same state; here's its state.json:

{
    "id": "56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
    "init_process_pid": 6707,
    "init_process_start": 8486,
    "created": "2021-07-20T12:40:36.126738308Z",
    "config": {
        "no_pivot_root": false,
        "parent_death_signal": 0,
        "rootfs": "/var/lib/docker/overlay2/4882ba8996023a0374dcc3ebed51dac3d19d40f8f7e095e93310a8ceb83f98f0/merged",
        "umask": null,
        "readonlyfs": false,
        "rootPropagation": 0,
        "mounts": [
            {
                "source": "proc",
                "destination": "/proc",
                "device": "proc",
                "flags": 14,
                "propagation_flags": null,
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "tmpfs",
                "destination": "/dev",
                "device": "tmpfs",
                "flags": 16777218,
                "propagation_flags": null,
                "data": "mode=755,size=65536k",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "devpts",
                "destination": "/dev/pts",
                "device": "devpts",
                "flags": 10,
                "propagation_flags": null,
                "data": "newinstance,ptmxmode=0666,mode=0620,gid=5",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "sysfs",
                "destination": "/sys",
                "device": "sysfs",
                "flags": 15,
                "propagation_flags": null,
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "cgroup",
                "destination": "/sys/fs/cgroup",
                "device": "cgroup",
                "flags": 15,
                "propagation_flags": null,
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "mqueue",
                "destination": "/dev/mqueue",
                "device": "mqueue",
                "flags": 14,
                "propagation_flags": null,
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/plugins/ebs.csi.aws.com",
                "destination": "/csi",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/plugins_registry",
                "destination": "/registration",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/pods/ec951463-d7bb-4c9a-b86c-df9f5360b563/containers/csi-node-driver-registrar/7a3fce3f",
                "destination": "/dev/termination-log",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/docker/containers/19c45783df5ea0410cd69352bd570ac6ea2f94fa43776801045cae094d659d9c/resolv.conf",
                "destination": "/etc/resolv.conf",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/docker/containers/19c45783df5ea0410cd69352bd570ac6ea2f94fa43776801045cae094d659d9c/hostname",
                "destination": "/etc/hostname",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/pods/ec951463-d7bb-4c9a-b86c-df9f5360b563/etc-hosts",
                "destination": "/etc/hosts",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/docker/containers/19c45783df5ea0410cd69352bd570ac6ea2f94fa43776801045cae094d659d9c/mounts/shm",
                "destination": "/dev/shm",
                "device": "bind",
                "flags": 20480,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            },
            {
                "source": "/var/lib/kubelet/pods/ec951463-d7bb-4c9a-b86c-df9f5360b563/volumes/kubernetes.io~secret/csi-node-sa-token-fqxtk",
                "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
                "device": "bind",
                "flags": 20481,
                "propagation_flags": [
                    278528
                ],
                "data": "",
                "relabel": "",
                "extensions": 0,
                "premount_cmds": null,
                "postmount_cmds": null
            }
        ],
        "devices": [
            {
                "type": 99,
                "major": 1,
                "minor": 3,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/null",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 1,
                "minor": 8,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/random",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 1,
                "minor": 7,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/full",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 5,
                "minor": 0,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/tty",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 1,
                "minor": 5,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/zero",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "type": 99,
                "major": 1,
                "minor": 9,
                "permissions": "rwm",
                "allow": true,
                "path": "/dev/urandom",
                "file_mode": 438,
                "uid": 0,
                "gid": 0
            }
        ],
        "mount_label": "",
        "hostname": "",
        "namespaces": [
            {
                "type": "NEWNS",
                "path": ""
            },
            {
                "type": "NEWNET",
                "path": "/proc/6509/ns/net"
            },
            {
                "type": "NEWPID",
                "path": ""
            },
            {
                "type": "NEWIPC",
                "path": "/proc/6509/ns/ipc"
            }
        ],
        "capabilities": {
            "Bounding": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "Effective": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "Inheritable": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "Permitted": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "Ambient": null
        },
        "networks": null,
        "routes": null,
        "cgroups": {
            "path": "/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
            "scope_prefix": "",
            "Paths": null,
            "devices": [
                {
                    "type": 97,
                    "major": -1,
                    "minor": -1,
                    "permissions": "rwm",
                    "allow": false
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 5,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 3,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 9,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 8,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 5,
                    "minor": 0,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 5,
                    "minor": 1,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 10,
                    "minor": 229,
                    "permissions": "rwm",
                    "allow": false
                },
                {
                    "type": 99,
                    "major": -1,
                    "minor": -1,
                    "permissions": "m",
                    "allow": true
                },
                {
                    "type": 98,
                    "major": -1,
                    "minor": -1,
                    "permissions": "m",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 3,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 8,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 7,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 5,
                    "minor": 0,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 5,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 1,
                    "minor": 9,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 136,
                    "minor": -1,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 5,
                    "minor": 2,
                    "permissions": "rwm",
                    "allow": true
                },
                {
                    "type": 99,
                    "major": 10,
                    "minor": 200,
                    "permissions": "rwm",
                    "allow": true
                }
            ],
            "memory": 0,
            "memory_reservation": 0,
            "memory_swap": 0,
            "kernel_memory": 0,
            "kernel_memory_tcp": 0,
            "cpu_shares": 2,
            "cpu_quota": 0,
            "cpu_period": 100000,
            "cpu_rt_quota": 0,
            "cpu_rt_period": 0,
            "cpuset_cpus": "",
            "cpuset_mems": "",
            "pids_limit": 0,
            "blkio_weight": 0,
            "blkio_leaf_weight": 0,
            "blkio_weight_device": null,
            "blkio_throttle_read_bps_device": null,
            "blkio_throttle_write_bps_device": null,
            "blkio_throttle_read_iops_device": null,
            "blkio_throttle_write_iops_device": null,
            "freezer": "",
            "hugetlb_limit": null,
            "oom_kill_disable": false,
            "memory_swappiness": null,
            "net_prio_ifpriomap": null,
            "net_cls_classid_u": 0,
            "cpu_weight": 1,
            "unified": null,
            "skip_devices": false
        },
        "oom_score_adj": 1000,
        "uid_mappings": null,
        "gid_mappings": null,
        "mask_paths": [
            "/proc/acpi",
            "/proc/kcore",
            "/proc/keys",
            "/proc/latency_stats",
            "/proc/timer_list",
            "/proc/timer_stats",
            "/proc/sched_debug",
            "/proc/scsi",
            "/sys/firmware"
        ],
        "readonly_paths": [
            "/proc/asound",
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger"
        ],
        "sysctl": null,
        "seccomp": null,
        "Hooks": {
            "createContainer": null,
            "createRuntime": null,
            "poststart": null,
            "poststop": null,
            "prestart": null,
            "startContainer": null
        },
        "version": "1.0.2-dev",
        "labels": [
            "bundle=/run/containerd/io.containerd.runtime.v1.linux/moby/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b"
        ],
        "no_new_keyring": false
    },
    "rootless": false,
    "cgroup_paths": {
        "blkio": "/sys/fs/cgroup/blkio/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "cpu": "/sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "cpuset": "/sys/fs/cgroup/cpuset/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "devices": "/sys/fs/cgroup/devices/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "freezer": "/sys/fs/cgroup/freezer/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "hugetlb": "/sys/fs/cgroup/hugetlb/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "memory": "/sys/fs/cgroup/memory/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "name=systemd": "/sys/fs/cgroup/systemd/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "net_cls": "/sys/fs/cgroup/net_cls,net_prio/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "net_prio": "/sys/fs/cgroup/net_cls,net_prio/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "perf_event": "/sys/fs/cgroup/perf_event/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b",
        "pids": "/sys/fs/cgroup/pids/kubepods/besteffort/podec951463-d7bb-4c9a-b86c-df9f5360b563/56a6ebb29b6e732738afc865df65bc25db7ab424fe200b543a10f246ef2cef6b"
    },
    "namespace_paths": {
        "NEWCGROUP": "/proc/6707/ns/cgroup",
        "NEWIPC": "/proc/6707/ns/ipc",
        "NEWNET": "/proc/6707/ns/net",
        "NEWNS": "/proc/6707/ns/mnt",
        "NEWPID": "/proc/6707/ns/pid",
        "NEWUSER": "/proc/6707/ns/user",
        "NEWUTS": "/proc/6707/ns/uts"
    },
    "external_descriptors": [
        "/dev/null",
        "pipe:[25372]",
        "pipe:[25373]"
    ],
    "intel_rdt_path": ""
}

I have access to this machine, and to others in a similar state - so feel free to request any additional information. Thanks a lot, I appreciate your time!

In the meantime, an update: in order to test the hypothesis that dockerd (and not containerd/runc) is what triggers this behavior, I'll soon transition our EKS AMIs to the one launched yesterday, which supports using containerd as the CRI without dockerd.

fuweid commented Jul 22, 2021

(Quoting @tianon's Docker-in-Docker report above in full.)

@tianon ctr uses containerd-shim-runc-v2 by default right now. The shim v2 binary re-execs itself to start the running shim server, which makes the parent pid of the running shim server 1. But containerd isn't the reaper for exited child processes - that is why there is a zombie shim in dind.

And when io.containerd.runtime.v1.linux is used as the runtime, the runtime will call the containerd binary to publish the exit event:

func (l *remoteEventsPublisher) Publish(ctx context.Context, topic string, event events.Event) error {
    ns, _ := namespaces.Namespace(ctx)
    encoded, err := typeurl.MarshalAny(event)
    if err != nil {
        return err
    }
    data, err := encoded.Marshal()
    if err != nil {
        return err
    }
    cmd := exec.CommandContext(ctx, containerdBinaryFlag, "--address", l.address, "publish", "--topic", topic, "--namespace", ns)
    cmd.Stdin = bytes.NewReader(data)
    b := bufPool.Get().(*bytes.Buffer)
    defer func() {
        b.Reset()
        bufPool.Put(b)
    }()
    cmd.Stdout = b
    cmd.Stderr = b
    c, err := reaper.Default.Start(cmd)
    if err != nil {
        return err
    }
    status, err := reaper.Default.Wait(cmd, c)
    if err != nil {
        return errors.Wrapf(err, "failed to publish event: %s", b.String())
    }
    if status != 0 {
        return errors.Errorf("failed to publish event: %s", b.String())
    }
    return nil
}

But ctr run will delete the task when the task stops:

func (s *Service) checkProcesses(e runc.Exit) {
    var p process.Process
    s.mu.Lock()
    for _, proc := range s.processes {
        if proc.Pid() == e.Pid {
            p = proc
            break
        }
    }
    s.mu.Unlock()
    if p == nil {
        log.G(s.context).Debugf("process with id:%d wasn't found", e.Pid)
        return
    }
    if ip, ok := p.(*process.Init); ok {
        // Ensure all children are killed
        if shouldKillAllOnExit(s.context, s.bundle) {
            if err := ip.KillAll(s.context); err != nil {
                log.G(s.context).WithError(err).WithField("id", ip.ID()).
                    Error("failed to kill init's children")
            }
        }
    }
    p.SetExited(e.Status)
    s.events <- &eventstypes.TaskExit{
        ContainerID: s.id,
        ID:          p.ID(),
        Pid:         uint32(e.Pid),
        ExitStatus:  uint32(e.Status),
        ExitedAt:    p.ExitedAt(),
    }
}

The p.SetExited(e.Status) call notifies ctr that the task has exited. So the task.Delete in ctr and the exit-event publish are handled at the same time, and containerd force-kills the shim, so the containerd publish process created by the shim becomes a zombie.

➜  vagrant docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
82f541cbb604077d99f76da45d9b866e03de577ffb209bf88b437e41ddca8440
➜  vagrant docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null
➜  vagrant docker exec test ctr run --runtime io.containerd.runtime.v1.linux docker.io/tianon/true:latest foo

➜  vagrant docker exec test ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  107 root      0:00 [containerd]
  122 root      0:00 ps -ef

If you run the foo container in detached mode, the shim will reap that containerd command:

➜  vagrant docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
97243d2c9667a246827a07eca736f666dc9f0864744f532fb7bf16f7d80dda08
➜  vagrant docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null
➜  vagrant docker exec test ctr run -d --runtime io.containerd.runtime.v1.linux docker.io/tianon/true:latest foo

➜  vagrant docker exec test ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
   74 root      0:00 containerd-shim -namespace default -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/default/foo -address /run/containerd/containerd.sock -containerd-binary /usr/local/bin/containerd
  112 root      0:00 ps -ef

➜  vagrant docker exec test ctr c rm foo

➜  vagrant docker exec test ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  140 root      0:00 ps -ef

fuweid commented Jul 23, 2021

@tianon it seems your case is not related to this one.

tianon commented Jul 23, 2021

My bad, thanks for checking! (and the detailed reply)

fuweid commented Aug 4, 2021

Update: I didn't find any clue about this issue. It seems that the docker daemon lost the exit event, so the status of the container is still 'running'.

@cpuguy83

Based on the OP's description, this sounds like docker not handling errors from kill correctly, but it may be because the error from runc is obfuscated by containerd and docker is not able to understand the current state. We've squashed at least one of these issues before.

@cpuguy83

I notice the OP is using really old versions of docker and containerd, which could be related to why this isn't fixed for them, but I haven't yet found the last fix we did to know whether it's in that release.

Kyslik commented Jan 3, 2023

We experience something similar in our EKS (1.22) cluster(s) (containerd 1.6.6-1.amzn2, and runc 1.1.3-1.amzn2).

k describe po POD
// SNIP
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  74s (x6975 over 49d)  kubelet  Readiness probe failed:
ctr --namespace k8s.io task ls
// SNIP
TASK                                                                PID      STATUS
f6ce7f2f529d921b42b6414b68849104e04fb6526f791d13403bf3d7bae3b878    0        UNKNOWN
// SNIP
ctr --namespace k8s.io container ls
CONTAINER                                                           IMAGE
// SNIP
f6ce7f2f529d921b42b6414b68849104e04fb6526f791d13403bf3d7bae3b878    800184023465.dkr.ecr.ap-east-1.amazonaws.com/eks/pause:3.5                                 io.containerd.runc.v2
// SNIP
systemctl status
// SNIP
               ├─kubepods-burstable-pod3ca53026_1ea8_44e5_a3c9_c9377d17a6d3.slice
               │ └─cri-containerd-f6ce7f2f529d921b42b6414b68849104e04fb6526f791d13403bf3d7bae3b878.scope
               │   └─12678 /pause
// SNIP
ps -aef --forest
// SNIP
root     12650     1  0 Nov30 ?        00:42:18 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id f6ce7f2f529d921b42b6414b68849104e04fb6526f791d13403bf3d7bae3b878 -address /run/containerd/con
65535    12678 12650  0 Nov30 ?        00:00:00  \_ /pause
// SNIP

We detect this by checking for logs that are similar to this one:

E1221 23:47:35.467999       1 scraper.go:140] "Failed to scrape node" err="Get \"https://ip-xxx-xxx-xxx-xxx.REGION.compute.internal:10250/metrics/resource\": context deadline exceeded" node="ip-xxx-xxx-xxx-xxx.REGION.compute.internal"

It's because ctr --namespace k8s.io task metrics TASKID fails (and thus metrics-server fails as well, on that particular node).

We have to manually kill the containerd-shim-runc-v2 process, after which things return to the expected / normal state.
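
Roughly, the manual remediation looks like this (a sketch based on the outputs above; the UNKNOWN-status filter and the pgrep pattern are assumptions about how to identify the stuck shim):

# Find tasks containerd can no longer talk to (STATUS UNKNOWN, PID 0)
ctr --namespace k8s.io task ls | awk '$3 == "UNKNOWN" { print $1 }'

TASK_ID=f6ce7f2f529d921b42b6414b68849104e04fb6526f791d13403bf3d7bae3b878

# Locate the matching shim process and kill it; things then return to normal for us
SHIM_PID=$(pgrep -f "containerd-shim-runc-v2 .* -id $TASK_ID")
kill "$SHIM_PID"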

@arnaldo2792

Hey @Kyslik, are you using Amazon Linux or Bottlerocket on your EKS clusters? We have experienced similar behavior in Bottlerocket, and I want to rule out Bottlerocket being the source of the problems here.

Kyslik commented Jan 21, 2023

We bake the AMI ourselves, based on Amazon Linux.
