Containerd v1.6.12 slow memory leak when pod readiness probe gets stuck forever #7802

Closed
dhiman360 opened this issue Dec 13, 2022 · 9 comments · Fixed by #7832
Labels
area/cri Container Runtime Interface (CRI) kind/bug

Comments


dhiman360 commented Dec 13, 2022

Description

Test with a non-responding curl command

Pod.yaml

         readinessProbe:
           initialDelaySeconds: 5
           periodSeconds: 5
           timeoutSeconds: 5
           exec:
             command:
               - /app/readiness-probe.sh

readiness-probe.sh

#!/bin/bash
curl -sS -X GET "http://localhost:9000/health/readiness"

When the curl command is not responding, the exec readiness probe's bash script starts curl as a foreground process to probe the HTTP server. With periodSeconds: 5 and timeoutSeconds: 5, the following happens: the bash process times out after 5 s and containerd/the shim deletes the bash process it started, but not the foreground curl process.

After the timeout, the next probe runs after 2 minutes and 5 seconds, not after 5 s as specified by periodSeconds. This is due to the kubelet configuration variable runtimeRequestTimeout in /var/lib/kubelet/config.yaml: runtimeRequestTimeout is set to 0s, and when it is 0, an internal kubelet timer falls back to a default of 2 minutes plus the spec's timeoutSeconds, in this case 2m + 5s.

This leads to a slow memory leak in containerd.

Steps to reproduce the issue

To simulate the same scenario and make reproduction easy, I have replaced the stuck curl command with a very long-running sleep. Deploying the following pod reproduces the problem.

Pod with the probe stuck on a sleep:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: readiness
  name: readiness-exec
spec:
  containers:
  - name: readiness
    image: registry.k8s.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600000000 # A long huge sleep to keep the pod running.
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - sleep 100000000 & # Forked a very long running sleep to simulate stuck probe
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 5
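
To reproduce, apply the manifest above and leave the pod running (the manifest file name readiness-exec.yaml is assumed here, not taken from the report):

# Deploy the pod; a new stuck probe child is left behind roughly every 2m 5s,
# as described above.
kubectl apply -f readiness-exec.yaml
kubectl get pod readiness-exec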

Describe the results you received and expected

With the above pod deployed, monitoring containerd memory shows the slow leak on the worker where the pod is running:

worker:~> echo; date; ps -e -o rss,args | grep containerd| grep log-level

Mon 12 Dec 2022 12:52:45 PM UTC
81656 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 01:03:09 PM UTC
82180 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 01:09:06 PM UTC
82964 /usr/local/bin/containerd --log-level=warn
...
...
Mon 12 Dec 2022 02:34:44 PM UTC
90264 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 03:03:13 PM UTC
92312 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 03:25:04 PM UTC
94360 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 04:00:28 PM UTC
96408 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 04:28:40 PM UTC
100624 /usr/local/bin/containerd --log-level=warn
...
...
Tue 13 Dec 2022 02:42:26 AM UTC
150236 /usr/local/bin/containerd --log-level=warn

Exec into the pod:

ps -ef shows sleep processes whose number keeps increasing over time, one added every 2m 5s as explained above.

ps -ef
PID USER COMMAND
1 root /bin/sh -c touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600000000
15 root sleep 100000000
16 root sh
24 root sleep 600000000
37 root sleep 100000000
46 root sleep 100000000
53 root sleep 100000000
60 root sleep 100000000
67 root sleep 100000000
75 root sleep 100000000
82 root sleep 100000000
89 root sleep 100000000
96 root sleep 100000000
103 root sleep 100000000
110 root sleep 100000000
117 root sleep 100000000
124 root sleep 100000000
132 root sleep 100000000
138 root sleep 100000000
144 root sleep 100000000
151 root sleep 100000000
...
...
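
A quick way to follow the growth from outside the pod is to count the leaked probe children (a hypothetical helper command, not part of the original report):

# Count the stuck 'sleep 100000000' children left behind by the exec probe;
# the bracketed grep pattern keeps grep from matching itself.
kubectl exec readiness-exec -- sh -c 'ps -ef | grep "[s]leep 100000000" | wc -l'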

What version of containerd are you using?

containerd github.com/containerd/containerd v1.6.12 a05d175

Any other relevant information

worker:~> runc -version

runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d1
spec: 1.0.2-dev
go: go1.17.6
libseccomp: 2.5.3

worker:~> sudo crictl info

{
  "status": {
    "conditions": [
      {
        "type": "RuntimeReady",
        "status": true,
        "reason": "",
        "message": ""
      },
      {
        "type": "NetworkReady",
        "status": true,
        "reason": "",
        "message": ""
      }
    ]
  },
  "cniconfig": {
    "PluginDirs": [
      "/opt/cni/bin"
    ],
    "PluginConfDir": "/etc/cni/net.d",
    "PluginMaxConfNum": 1,
    "Prefix": "eth",
    "Networks": [
      {
        "Config": {
          "Name": "cni-loopback",
          "CNIVersion": "0.3.1",
          "Plugins": [
            {
              "Network": {
                "type": "loopback",
                "ipam": {},
                "dns": {}
              },
              "Source": "{\"type\":\"loopback\"}"
            }
          ],
          "Source": "{\n\"cniVersion\": \"0.3.1\",\n\"name\": \"cni-loopback\",\n\"plugins\": [{\n  \"type\": \"loopback\"\n}]\n}"
        },
        "IFName": "lo"
      },
      {
        "Config": {
          "Name": "k8s-pod-network",
          "CNIVersion": "0.3.1",
          "Plugins": [
            {
              "Network": {
                "type": "calico",
                "ipam": {
                  "type": "calico-ipam"
                },
                "dns": {}
              },
              "Source": "{\"container_settings\":{\"allow_ip_forwarding\":true},\"etcd_ca_cert_file\":\"/etc/cni/net.d/calico-tls/etcd-ca\",\"etcd_cert_file\":\"/etc/cni/net.d/calico-tls/etcd-cert\",\"etcd_endpoints\":\"https://10.0.16.2:2379,https://10.0.16.20:2379,https://10.0.16.4:2379\",\"etcd_key_file\":\"/etc/cni/net.d/calico-tls/etcd-key\",\"ipam\":{\"assign_ipv4\":\"true\",\"assign_ipv6\":\"false\",\"type\":\"calico-ipam\"},\"kubernetes\":{\"kubeconfig\":\"/etc/cni/net.d/calico-kubeconfig\"},\"log_level\":\"error\",\"mtu\":2070,\"policy\":{\"k8s_api_root\":\"https://[10.96.0.1]:443\",\"type\":\"k8s\"},\"type\":\"calico\"}"
            },
            {
              "Network": {
                "type": "tuning",
                "ipam": {},
                "dns": {}
              },
              "Source": "{\"sysctl\":{\"net.ipv4.tcp_mtu_probing\":\"1\"},\"type\":\"tuning\"}"
            },
            {
              "Network": {
                "type": "portmap",
                "capabilities": {
                  "portMappings": true
                },
                "ipam": {},
                "dns": {}
              },
              "Source": "{\"capabilities\":{\"portMappings\":true},\"snat\":true,\"type\":\"portmap\"}"
            }
          ],
          "Source": "{\n  \"name\": \"k8s-pod-network\",\n  \"cniVersion\": \"0.3.1\",\n  \"plugins\": [\n    {\n      \"type\": \"calico\",\n      \"log_level\": \"error\",\n      \"etcd_endpoints\": \"https://10.0.16.2:2379,https://10.0.16.20:2379,https://10.0.16.4:2379\",\n      \"etcd_key_file\": \"/etc/cni/net.d/calico-tls/etcd-key\",\n      \"etcd_cert_file\": \"/etc/cni/net.d/calico-tls/etcd-cert\",\n      \"etcd_ca_cert_file\": \"/etc/cni/net.d/calico-tls/etcd-ca\",\n      \"mtu\": 2070,\n      \"ipam\": {\n        \"type\": \"calico-ipam\",\n        \"assign_ipv4\": \"true\",\n        \"assign_ipv6\": \"false\"\n      },\n      \"container_settings\": {\n        \"allow_ip_forwarding\": true\n      },\n      \"policy\": {\n        \"type\": \"k8s\",\n        \"k8s_api_root\": \"https://[10.96.0.1]:443\"\n      },\n      \"kubernetes\": {\n          \"kubeconfig\": \"/etc/cni/net.d/calico-kubeconfig\"\n      }\n    },\n    {\n      \"type\": \"tuning\",\n      \"sysctl\": {\"net.ipv4.tcp_mtu_probing\": \"1\"}\n    },\n    {\n      \"type\": \"portmap\",\n      \"snat\": true,\n      \"capabilities\": {\"portMappings\": true}\n    }\n  ]\n}"
        },
        "IFName": "eth0"
      }
    ]
  },
  "config": {
    "containerd": {
      "snapshotter": "overlayfs",
      "defaultRuntimeName": "runc",
      "defaultRuntime": {
        "runtimeType": "",
        "runtimePath": "",
        "runtimeEngine": "",
        "PodAnnotations": null,
        "ContainerAnnotations": null,
        "runtimeRoot": "",
        "options": null,
        "privileged_without_host_devices": false,
        "baseRuntimeSpec": "",
        "cniConfDir": "",
        "cniMaxConfNum": 0
      },
      "untrustedWorkloadRuntime": {
        "runtimeType": "",
        "runtimePath": "",
        "runtimeEngine": "",
        "PodAnnotations": null,
        "ContainerAnnotations": null,
        "runtimeRoot": "",
        "options": null,
        "privileged_without_host_devices": false,
        "baseRuntimeSpec": "",
        "cniConfDir": "",
        "cniMaxConfNum": 0
      },
      "runtimes": {
        "runc": {
          "runtimeType": "io.containerd.runc.v2",
          "runtimePath": "",
          "runtimeEngine": "",
          "PodAnnotations": null,
          "ContainerAnnotations": null,
          "runtimeRoot": "",
          "options": {
            "SystemdCgroup": true
          },
          "privileged_without_host_devices": false,
          "baseRuntimeSpec": "",
          "cniConfDir": "",
          "cniMaxConfNum": 0
        }
      },
      "noPivot": false,
      "disableSnapshotAnnotations": true,
      "discardUnpackedLayers": false,
      "ignoreRdtNotEnabledErrors": false
    },
    "cni": {
      "binDir": "/opt/cni/bin",
      "confDir": "/etc/cni/net.d",
      "maxConfNum": 1,
      "confTemplate": "",
      "ipPref": ""
    },
    "registry": {
      "configPath": "/etc/containerd/certs.d",
      "mirrors": {},
      "configs": null,
      "auths": null,
      "headers": null
    },
    "imageDecryption": {
      "keyModel": "node"
    },
    "disableTCPService": true,
    "streamServerAddress": "127.0.0.1",
    "streamServerPort": "0",
    "streamIdleTimeout": "4h0m0s",
    "enableSelinux": false,
    "selinuxCategoryRange": 1024,
    "sandboxImage": "registry.eccd.local:5000/pause:3.8-1-3e405cfb",
    "statsCollectPeriod": 10,
    "systemdCgroup": false,
    "enableTLSStreaming": false,
    "x509KeyPairStreaming": {
      "tlsCertFile": "",
      "tlsKeyFile": ""
    },
    "maxContainerLogSize": 16384,
    "disableCgroup": false,
    "disableApparmor": false,
    "restrictOOMScoreAdj": false,
    "maxConcurrentDownloads": 3,
    "disableProcMount": false,
    "unsetSeccompProfile": "",
    "tolerateMissingHugetlbController": true,
    "disableHugetlbController": true,
    "device_ownership_from_security_context": true,
    "ignoreImageDefinedVolumes": false,
    "netnsMountsUnderStateDir": false,
    "enableUnprivilegedPorts": false,
    "enableUnprivilegedICMP": false,
    "containerdRootDir": "/var/lib/docker/containerd/root",
    "containerdEndpoint": "/run/containerd/containerd.sock",
    "rootDir": "/var/lib/docker/containerd/root/io.containerd.grpc.v1.cri",
    "stateDir": "/run/containerd/io.containerd.grpc.v1.cri"
  },
  "golang": "go1.17.6",
  "lastCNILoadStatus": "OK",
  "lastCNILoadStatus.default": "OK"
}

worker:~> uname -a
Linux worker-pool1-xxxxxx 5.14.21-150400.24.33-default #1 SMP PREEMPT_DYNAMIC Fri Nov 4 13:55:06 UTC 2022 (76cfe60) x86_64 x86_64 x86_64 GNU/Linux

worker:~> cat /var/lib/kubelet/config.yaml

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 169.254.20.10
clusterDomain: cluster.local
containerLogMaxFiles: 5
containerLogMaxSize: 50Mi
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
featureGates:
  AllAlpha: false
  LegacyServiceAccountTokenNoAutoGeneration: false
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageGCHighThresholdPercent: 80
imageGCLowThresholdPercent: 75
imageMinimumGCAge: 0s
kind: KubeletConfiguration
kubeletCgroups: /ccd.slice/kubelet.service
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
serverTLSBootstrap: true
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
tlsCipherSuites:
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- TLS_RSA_WITH_AES_256_GCM_SHA384
- TLS_RSA_WITH_AES_128_GCM_SHA256
volumeStatsAggPeriod: 0s
cpuManagerPolicy: static
reservedSystemCPUs: "1,3"
systemReserved: {'ephemeral-storage': '1Gi', 'cpu': '1000m', 'memory': '500Mi'}

Show configuration if it is related to CRI plugin.

version = 2
root = "/var/lib/docker/containerd/root"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = -999

[grpc]
address = "/run/containerd/containerd.sock"
tcp_address = ""
tcp_tls_cert = ""
tcp_tls_key = ""
uid = 0
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216

[ttrpc]
address = ""
uid = 0
gid = 0

[debug]
address = ""
uid = 0
gid = 0
level = ""

[metrics]
address = ""
grpc_histogram = false

[cgroup]
path = ""

[timeouts]
"io.containerd.timeout.shim.cleanup" = "5s"
"io.containerd.timeout.shim.load" = "5s"
"io.containerd.timeout.shim.shutdown" = "3s"
"io.containerd.timeout.task.state" = "2s"

[plugins]
[plugins."io.containerd.gc.v1.scheduler"]
pause_threshold = 0.02
deletion_threshold = 0
mutation_threshold = 100
schedule_delay = "0s"
startup_delay = "100ms"
[plugins."io.containerd.grpc.v1.cri"]
disable_tcp_service = true
stream_server_address = "127.0.0.1"
stream_server_port = "0"
stream_idle_timeout = "4h0m0s"
enable_selinux = false
sandbox_image = "registry.eccd.local:5000/pause:3.8-1-3e405cfb"
stats_collect_period = 10
systemd_cgroup = false
enable_tls_streaming = false
max_container_log_line_size = 16384
disable_cgroup = false
disable_apparmor = false
restrict_oom_score_adj = false
max_concurrent_downloads = 3
device_ownership_from_security_context = true
disable_proc_mount = false
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
default_runtime_name = "runc"
no_pivot = false
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
max_conf_num = 1
conf_template = ""
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
tls_cert_file = ""
tls_key_file = ""
[plugins."io.containerd.internal.v1.opt"]
path = "/opt/containerd"
[plugins."io.containerd.internal.v1.restart"]
interval = "10s"
[plugins."io.containerd.metadata.v1.bolt"]
content_sharing_policy = "shared"
[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false
[plugins."io.containerd.runtime.v1.linux"]
shim = "containerd-shim"
runtime = "runc"
runtime_root = ""
no_shim = false
shim_debug = false
[plugins."io.containerd.runtime.v2.task"]
platforms = ["linux/amd64"]
[plugins."io.containerd.service.v1.diff-service"]
default = ["walking"]
[plugins."io.containerd.snapshotter.v1.devmapper"]
root_path = ""
pool_name = ""
base_image_size = ""

@dhiman360 (Author)

CC: @nikitar @dcantah.

@dhiman360 (Author)

Most likely the issue arises because of the following lines in the code:

	// ignore errors to wait and kill as we are forcefully killing
	// the process and don't care about the exit status
	s, err := p.Wait(ctx)
	if err != nil {
		return err
	}
	if err := p.Kill(ctx, syscall.SIGKILL, WithKillAll); err != nil {

The documentation says it is forcefully killing the process, but when there is an error it returns and does not execute p.Kill on the next line.

@mikebrow (Member)

Most likely the issue arises because of the following lines in the code:

	// ignore errors to wait and kill as we are forcefully killing
	// the process and don't care about the exit status
	s, err := p.Wait(ctx)
	if err != nil {
		return err
	}
	if err := p.Kill(ctx, syscall.SIGKILL, WithKillAll); err != nil {

The documentation says it is forcefully killing the process, but when there is an error it returns and does not execute p.Kill on the next line.

Or, that line you mention is an async wait request that returns successfully, and we end up waiting a few lines down, after the SIGKILL has been sent, on a process that is uninterruptible, so we don't back out of this call.

@cpuguy83, note line 186. Should that line read:
select {
case <-ctx.Done():
	return ctx.Err()
case <-s:
	return nil
}

@dhiman360 if you have some time, maybe add some log output and see which path is happening/not happening for your two cases? Cheers, Mike

@dhiman360 (Author)

@mikebrow, I have modified the function in a patch as below, removing the wait etc., and installed the patched containerd in my cluster, but it didn't help; the background sleep process did not get killed:

// WithProcessKill will forcefully kill and delete a process
func WithProcessKill(ctx context.Context, p Process) error {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel()
        if err := p.Kill(ctx, syscall.SIGKILL, WithKillAll); err != nil {
                return err
        }
        return nil
}

And I can see the following lines filling the containerd logs:

Dec 17 17:48:40 worker-pool1-a1h73b4d-ezghodh-ibd-containerd-1612-patch-stack containerd[10626]: time="2022-12-17T17:48:40.907001331Z" level=error msg="Failed to delete exec process \"fcc7ec96ef5a60e436ab227592b8046fbc7bd284da5cd717cd94bf63d15dcb43\" for container \"fa80a637beaa361c8223605b5d34ce11f99e1f5556bbdc5ea065bd13756f437a\"" error="process already finished: not found"
Dec 17 17:48:41 worker-pool1-a1h73b4d-ezghodh-ibd-containerd-1612-patch-stack containerd[10626]: time="2022-12-17T17:48:41.474752081Z" level=error msg="Failed to delete exec process \"c2b954de7c65dcde8f0b20b8fbbddc5f168e10577fc72a6e3fcb8f6921d5820b\" for container \"b04a56a829844b7ec597d401f4f96b4f63427504ac289230b728cb592f03b93c\"" error="process already finished: not found"
Dec 17 17:48:41 worker-pool1-a1h73b4d-ezghodh-ibd-containerd-1612-patch-stack containerd[10626]: time="2022-12-17T17:48:41.474963830Z" level=error msg="Failed to delete exec process \"a248e4cbda93853e0956fb585586fd1968936aeb6a9581c2f73ea1877bbe9d8b\" for container \"b04a56a829844b7ec597d401f4f96b4f63427504ac289230b728cb592f03b93c\"" error="process already finished: not found"

Also kubelet logs like:

Dec 17 18:03:02 worker-pool1-a1h73b4d-ezghodh-ibd-containerd-1612-patch-stack kubelet[11623]: E1217 18:03:02.826261   11623 remote_runtime.go:734] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="2ba9ba473d65a99cce897c3eeffb598670cdaef0ee906e8a46120bf48824ab5e" cmd=[/bin/sh -c sleep 100000000 &]

fuweid added a commit to fuweid/containerd that referenced this issue Dec 18, 2022
By default, the child processes spawned by an exec process inherit its standard
io file descriptors. The shim server creates a pipe as the data channel; both the
exec process and its children write into the write end of the pipe, and the shim
server reads from it. As long as the write end is still open, the shim server
keeps waiting for data from the pipe.

So, if the exec command is something like `bash -c "sleep 365d &"`, the exec
process is bash, which quits after creating `sleep 365d`. But the `sleep 365d`
will hold the write end of the pipe for a year! It doesn't make sense for the
CRI plugin to wait for it.

For this case, we should use a timeout to drain the exec process's io instead
of waiting for it.

Fixes: containerd#7802

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added the area/cri Container Runtime Interface (CRI) label Dec 18, 2022
@dhiman360 (Author)

@fuweid,
Hi, I have pulled your changes and tested them today. It seems they have not fixed the issue.

Code fetch:

git clone https://github.com/containerd/containerd containerd
cd containerd && git fetch origin pull/7832/head:fuweid-fix-7802 && git checkout fuweid-fix-7802

Containerd Version:

eccd@director-0-ezghodh-containerd-patch-stack:~> containerd -version
containerd github.com/containerd/containerd v1.7.0-beta.1-28-gc851f1169 c851f1169922c6455a944b34995650e3d22ab4b2

Exec into the pod and run ps -ef; the sleep processes keep increasing:

eccd@director-0-ezghodh-containerd-patch-stack:~> kubectl exec -it readiness-exec -- sh
/ # ps -ef
PID   USER     COMMAND
    1 root     /bin/sh -c touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600000000
   15 root     sleep 100000000
   22 root     sleep 100000000
   24 root     sleep 600000000
   31 root     sleep 100000000
   38 root     sleep 100000000
   45 root     sleep 100000000
   51 root     sleep 100000000
   58 root     sleep 100000000
   64 root     sleep 100000000
   71 root     sleep 100000000
   78 root     sleep 100000000
   85 root     sleep 100000000
   92 root     sleep 100000000
   99 root     sleep 100000000
  106 root     sleep 100000000
  113 root     sleep 100000000
  114 root     sh
  120 root     ps -ef


fuweid commented Dec 19, 2022

@dhiman360 Currently, this is unsupported. The child processes created by the exec process are reparented to pid 1 after the exec process exits. It is not feasible to trace all the processes created by an exec process on the containerd side. If pid 1 doesn't reap the child processes, they become zombies. So I don't suggest running this pattern in your cluster, and I don't think we should support it.

@dhiman360 (Author)

@fuweid,
So you are saying we should not use scripts like the one below, which may get stuck forever, in readiness and liveness probes.

#!/bin/bash
curl -sS -X GET "http://localhost:9000/health/readiness"

Now, because this is allowed by Kubernetes, we can't stop pod developers from writing such code. The problem is that a bad pod causing the whole framework to fail looks terribly bad. Do you think we can handle this in containerd in some other way, e.g. with a workaround?

@mikebrow (Member)

From: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes

Caution: Incorrect implementation of readiness probes may result in an ever growing number of processes in the container, and resource starvation if this is left unchecked.

I see this is a somewhat old problem (moby/moby#9098) with some proposed self-help workarounds.


fuweid commented Dec 19, 2022

So you are saying we should not use scripts like the one below, which may get stuck forever, in readiness and liveness probes.

If the script may get stuck forever, the owner of the script should take care of this. For example, use timeout to ensure that it receives the kill signal.
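
For example, a hardened variant of the probe script from the description might look like this (a sketch of the suggestion above, not taken from the thread; the 4-second limit is an assumption chosen to finish before timeoutSeconds: 5):

#!/bin/bash
# Bound the probe command with `timeout`; -s KILL guarantees that even a hung
# curl is gone before the kubelet's timeoutSeconds expires, so nothing is left
# behind between probe runs.
timeout -s KILL 4 curl -sS -X GET "http://localhost:9000/health/readiness"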

Do you think we can handle this in containerd in some other way, e.g. with a workaround?

It is all about the process manager. As I mentioned before, even if containerd could kill all the processes created by one exec, there would still be zombie process issues. The pid-1 process should be aware of the dead processes and reap them. Unfortunately, most processes don't watch for the SIGCHLD signal.

For example,

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: qos-demo-ctr
    image: ubuntu
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
    command: ["sleep", "1d"]
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - sleep 100000000 & # Forked a very long running sleep to simulate stuck probe
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 5

/bin/sh exits and sleep 100000000's parent changes to sleep 1d. After killing sleep 100000000, it becomes a zombie.

➜  containerd git:(main) sudo crictl exec -it 37395f8a03e99 bash
root@qos-demo:/# ps -fe
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 15:14 ?        00:00:00 sleep 1d
root          13       1  0 15:15 ?        00:00:00 sleep 100000000
root          14       0  0 15:15 pts/0    00:00:00 bash
root          22      14  0 15:15 pts/0    00:00:00 ps -fe
root@qos-demo:/# exit
exit

➜  containerd git:(main) sudo kill -9 654442
➜  containerd git:(main) sudo crictl exec -it 37395f8a03e99 bash
root@qos-demo:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 15:14 ?        00:00:00 sleep 1d
root          13       1  0 15:15 ?        00:00:00 [sleep] <defunct>
root          29       1  0 15:15 ?        00:00:00 sleep 100000000
root          30       0  0 15:15 pts/0    00:00:00 bash
root          38      30  0 15:15 pts/0    00:00:00 ps -ef
root@qos-demo:/#

So, the user should take care of the script themselves.

fuweid added a commit to fuweid/containerd that referenced this issue Mar 2, 2023
thaJeztah pushed a commit to thaJeztah/containerd that referenced this issue Jul 14, 2023
jsturtevant pushed a commit to jsturtevant/containerd that referenced this issue Sep 21, 2023
juliusl pushed a commit to juliusl/containerd that referenced this issue Jan 26, 2024
fuweid added a commit to fuweid/containerd that referenced this issue Feb 6, 2024
fuweid added a commit to fuweid/containerd that referenced this issue Feb 16, 2024