Containerd v1.6.12 slow memory leak when pod readiness probe gets stuck forever #7802
Most likely the issue arises because of the following line in the code: Line 165 in 12f30e6
The documentation says the process is forcefully killed, but when there is an error it returns without executing p.kill on the next line.
or the line you mention is an async wait request that returns successfully, and we end up waiting a few lines down ( Line 186 in 12f30e6
@cpuguy83 note line 186.. should that line read: @dhiman360 if you have some time, maybe add some log output and see which path is/isn't happening for your two cases? Cheers, Mike
@mikebrow, I have modified the function in a patch like the one below, removing the wait etc., and installed the patched containerd in my cluster, but it didn't help; the background process with sleep did not get killed:
And I can see the following lines filling the containerd logs.
Also kubelet logs like:
By default, the child processes spawned by the exec process inherit its standard io file descriptors. The shim server creates a pipe as the data channel. Both the exec process and its children write data into the write end of the pipe, and the shim server reads data from the pipe. If the write end is still open, the shim server will continue to wait for data from the pipe. So, if the exec command is like `bash -c "sleep 365d &"`, the exec process is bash and quits after creating `sleep 365d`. But `sleep 365d` will hold the write end of the pipe for a year! It doesn't make sense for the CRI plugin to wait for it. For this case, we should use a timeout to drain the exec process's io instead of waiting for it.
Fixes: containerd#7802
Signed-off-by: Wei Fu <fuweid89@gmail.com>
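The pipe-inheritance behaviour described in the commit message can be reproduced with plain shell, no containerd involved. This is a minimal sketch in which `cat` stands in for the shim reading the exec process's output pipe:

```shell
#!/bin/sh
# A backgrounded child inherits stdout, so it keeps the pipe's write end
# open after its parent exits. The reader (here: cat, standing in for the
# shim) only sees EOF once *every* writer has closed the fd.
start=$(date +%s)
sh -c 'sleep 2 & echo parent done' | cat >/dev/null   # blocks ~2s, not 0s
end=$(date +%s)
echo "reader blocked for $((end - start))s"

# Redirecting the background job's stdio breaks the inheritance, and the
# reader gets EOF as soon as the parent exits:
start=$(date +%s)
sh -c 'sleep 2 >/dev/null 2>&1 & echo parent done' | cat >/dev/null
end=$(date +%s)
echo "reader blocked for $((end - start))s"
```

With a `sleep 365d` instead of `sleep 2`, the first reader would block for a year, which is exactly the wait the fix replaces with a timed io drain.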
@fuweid , Code fetch:
Containerd Version:
Exec into the pod and run ps -ef; the number of sleep processes keeps increasing:
@dhiman360 Currently, this is unsupported. The child processes created by the exec process will be reparented to pid 1 after the exec process exits. It is impractical to trace, on the containerd side, all the processes created by an exec process. If pid 1 doesn't reap the child processes, they will become zombies. So I don't suggest running this pattern in your cluster, and I don't think we should support it.
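The reparenting mentioned above is plain POSIX behaviour and can be sketched in shell: once the parent exits, the orphaned child belongs to pid 1 (or the nearest subreaper), so the link back to the exec process that created it is gone.

```shell
#!/bin/sh
# Sketch: a child that outlives its parent is reparented to PID 1 (or to
# the nearest subreaper), so it can no longer be found via the exec
# process that created it.
pid=$(sh -c 'sleep 3 >/dev/null 2>&1 & echo $!')
sleep 1                                # let the sh -c parent exit
ps -o pid=,ppid=,comm= -p "$pid"       # PPID is no longer the exited sh
kill "$pid"
```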
@fuweid ,
Now, because this is allowed by kubernetes, we can't stop pod developers from writing such code. The question is that a single bad pod causing the whole framework to fail looks terribly bad. Do you think we can handle this in containerd in some other way, like a workaround?
I see this is a somewhat old problem (moby/moby#9098) with some proposed self-help workarounds.
If the script may get stuck forever, the owner of the script should take care of this. For example, use timeout to ensure that it will receive the kill signal.
It is all about the process manager. As I mentioned before, even if containerd could kill all the processes created by one exec, there would still be zombie-process issues. The pid-1 process should be aware of dead processes and reap them. Unfortunately, most processes don't watch for the SIGCHLD signal. For example,
So, users should take care of their scripts themselves.
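The `timeout` suggestion above can be sketched as a defensive probe helper; the command names and limits here are hypothetical, not from the issue:

```shell
#!/bin/sh
# Hypothetical defensive probe helper: bound every external command with
# timeout(1) so nothing the probe starts can outlive the probe window.
# --signal=KILL guarantees termination even if the command ignores SIGTERM.
check() {
    timeout --signal=KILL 2 "$@"
}

check sleep 100        # a hung probe command is killed after 2s
echo "hung probe command exited with status $?"
# check curl -sf http://127.0.0.1:8080/healthz   # hypothetical real probe
```

Because `timeout` stays in the foreground and signals its child directly, a command wrapped this way cannot be left behind the way the backgrounded `sleep` in this issue is.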
Description
Test with a non-responding curl command
Pod.yaml
readiness-probe.sh
When the curl command is not responding, the exec readiness probe's bash script starts curl as a foreground process to probe the http server. With periodSeconds: 5 and timeoutSeconds: 5, the following happens: the bash process times out after 5 s, and containerd/the shim deletes the bash process it started, but not the foreground curl process.
After the timeout, the next probe runs after 2 minutes and 5 seconds, not after 5 s as per the spec (periodSeconds). This is due to the kubelet configuration variable runtimeRequestTimeout in /var/lib/kubelet/config.yaml: when runtimeRequestTimeout is set to 0s, an internal kubelet timer defaults to 2 minutes plus the spec's timeoutSeconds, in this case 2m + 5s.
This leads to a slow memory leak in containerd.
Steps to reproduce the issue
To simulate the same scenario and allow easy reproduction, I have replaced the stuck curl command with a very long-running sleep. If you deploy this pod, the problem will be reproduced.
Pod with the probe stuck at sleep.
Describe the results you received and expected
With the above pod deployed, monitoring containerd memory shows the slow memory leak on the worker where the pod is running:
worker:~> echo; date; ps -e -o rss,args | grep containerd| grep log-level
Mon 12 Dec 2022 12:52:45 PM UTC
81656 /usr/local/bin/containerd --log-level=warn
Mon 12 Dec 2022 01:03:09 PM UTC
82180 /usr/local/bin/containerd --log-level=warn
Mon 12 Dec 2022 01:09:06 PM UTC
82964 /usr/local/bin/containerd --log-level=warn
...
...
Mon 12 Dec 2022 02:34:44 PM UTC
90264 /usr/local/bin/containerd --log-level=warn
Mon 12 Dec 2022 03:03:13 PM UTC
92312 /usr/local/bin/containerd --log-level=warn
Mon 12 Dec 2022 03:25:04 PM UTC
94360 /usr/local/bin/containerd --log-level=warn
Mon 12 Dec 2022 04:00:28 PM UTC
96408 /usr/local/bin/containerd --log-level=warn
Mon 12 Dec 2022 04:28:40 PM UTC
100624 /usr/local/bin/containerd --log-level=warn
...
...
Tue 13 Dec 2022 02:42:26 AM UTC
150236 /usr/local/bin/containerd --log-level=warn
An exec to the pod
ps -ef shows sleeping processes, with the number of processes increasing over time, one added every 2m 5s as explained above.
ps -ef
PID USER COMMAND
1 root /bin/sh -c touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600000000
15 root sleep 100000000
16 root sh
24 root sleep 600000000
37 root sleep 100000000
46 root sleep 100000000
53 root sleep 100000000
60 root sleep 100000000
67 root sleep 100000000
75 root sleep 100000000
82 root sleep 100000000
89 root sleep 100000000
96 root sleep 100000000
103 root sleep 100000000
110 root sleep 100000000
117 root sleep 100000000
124 root sleep 100000000
132 root sleep 100000000
138 root sleep 100000000
144 root sleep 100000000
151 root sleep 100000000
...
...
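The listing above can be reduced to a single number for tracking the leak over time; this sketch, run inside the affected pod, counts the orphaned sleep processes (one more should appear every 2m 5s cycle):

```shell
#!/bin/sh
# Count leaked `sleep` processes inside the pod. With the stuck-probe pod
# from this issue deployed, this count grows by one per probe cycle.
leaked=$(ps -e -o comm= | grep -c '^sleep$')
echo "leaked sleep processes: $leaked"
```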
What version of containerd are you using?
containerd github.com/containerd/containerd v1.6.12 a05d175
Any other relevant information
worker:~> runc -version
worker:~> sudo crictl info
worker:~> uname -a
Linux worker-pool1-xxxxxx 5.14.21-150400.24.33-default #1 SMP PREEMPT_DYNAMIC Fri Nov 4 13:55:06 UTC 2022 (76cfe60) x86_64 x86_64 x86_64 GNU/Linux
worker:~> cat /var/lib/kubelet/config.yaml
Show configuration if it is related to the CRI plugin.
version = 2
root = "/var/lib/docker/containerd/root"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = -999
[grpc]
address = "/run/containerd/containerd.sock"
tcp_address = ""
tcp_tls_cert = ""
tcp_tls_key = ""
uid = 0
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216
[ttrpc]
address = ""
uid = 0
gid = 0
[debug]
address = ""
uid = 0
gid = 0
level = ""
[metrics]
address = ""
grpc_histogram = false
[cgroup]
path = ""
[timeouts]
"io.containerd.timeout.shim.cleanup" = "5s"
"io.containerd.timeout.shim.load" = "5s"
"io.containerd.timeout.shim.shutdown" = "3s"
"io.containerd.timeout.task.state" = "2s"
[plugins]
[plugins."io.containerd.gc.v1.scheduler"]
pause_threshold = 0.02
deletion_threshold = 0
mutation_threshold = 100
schedule_delay = "0s"
startup_delay = "100ms"
[plugins."io.containerd.grpc.v1.cri"]
disable_tcp_service = true
stream_server_address = "127.0.0.1"
stream_server_port = "0"
stream_idle_timeout = "4h0m0s"
enable_selinux = false
sandbox_image = "registry.eccd.local:5000/pause:3.8-1-3e405cfb"
stats_collect_period = 10
systemd_cgroup = false
enable_tls_streaming = false
max_container_log_line_size = 16384
disable_cgroup = false
disable_apparmor = false
restrict_oom_score_adj = false
max_concurrent_downloads = 3
device_ownership_from_security_context = true
disable_proc_mount = false
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
default_runtime_name = "runc"
no_pivot = false
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
max_conf_num = 1
conf_template = ""
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
tls_cert_file = ""
tls_key_file = ""
[plugins."io.containerd.internal.v1.opt"]
path = "/opt/containerd"
[plugins."io.containerd.internal.v1.restart"]
interval = "10s"
[plugins."io.containerd.metadata.v1.bolt"]
content_sharing_policy = "shared"
[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false
[plugins."io.containerd.runtime.v1.linux"]
shim = "containerd-shim"
runtime = "runc"
runtime_root = ""
no_shim = false
shim_debug = false
[plugins."io.containerd.runtime.v2.task"]
platforms = ["linux/amd64"]
[plugins."io.containerd.service.v1.diff-service"]
default = ["walking"]
[plugins."io.containerd.snapshotter.v1.devmapper"]
root_path = ""
pool_name = ""
base_image_size = ""