Containerd v1.6.12 slow memory leak when pod readiness probe gets stuck forever #7802

Closed
dhiman360 opened this issue Dec 13, 2022 · 9 comments · Fixed by #7832
Labels
area/cri Container Runtime Interface (CRI) kind/bug

Comments


dhiman360 commented Dec 13, 2022

Description

Test with a non-responding curl command

Pod.yaml

         readinessProbe:
           initialDelaySeconds: 5
           periodSeconds: 5
           timeoutSeconds: 5
           exec:
             command:
               - /app/readiness-probe.sh

readiness-probe.sh

#!/bin/bash
curl -sS -X GET "http://localhost:9000/health/readiness"

When the curl command is not responding, the exec readiness probe's bash script starts curl as a foreground process to probe the HTTP server. With periodSeconds: 5 and timeoutSeconds: 5, the following happens: the bash process times out after 5 s and containerd/the shim deletes the bash process it started, but not the foreground curl process.

After the timeout, the next probe runs after 2 minutes and 5 seconds, not after 5 s as specified by periodSeconds. This is due to the kubelet configuration variable runtimeRequestTimeout in /var/lib/kubelet/config.yaml: runtimeRequestTimeout is set to 0s, and when it is 0, an internal kubelet timer falls back to a default of 2 minutes plus the spec's timeoutSeconds, in this case 2m + 5s.

This leads to a slow memory leak in containerd.

Steps to reproduce the issue

To simulate the same scenario and make reproduction easy, I have replaced the stuck curl command with a very long-running sleep. Deploying the following pod reproduces the problem.

Pod with the probe stuck on a sleep:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: readiness
  name: readiness-exec
spec:
  containers:
  - name: readiness
    image: registry.k8s.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600000000 # A long huge sleep to keep the pod running.
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - sleep 100000000 & # Forked a very long running sleep to simulate stuck probe
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 5
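
To reproduce, apply the manifest above and leave the pod running (the manifest file name readiness-exec.yaml is assumed here, not taken from the report):

# Deploy the pod; a new stuck probe child is left behind roughly every 2m 5s,
# as described above.
kubectl apply -f readiness-exec.yaml
kubectl get pod readiness-exec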

Describe the results you received and expected

With the above pod deployed, monitoring containerd memory shows the slow leak on the worker where the pod is running:

worker:~> echo; date; ps -e -o rss,args | grep containerd| grep log-level

Mon 12 Dec 2022 12:52:45 PM UTC
81656 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 01:03:09 PM UTC
82180 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 01:09:06 PM UTC
82964 /usr/local/bin/containerd --log-level=warn
...
...
Mon 12 Dec 2022 02:34:44 PM UTC
90264 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 03:03:13 PM UTC
92312 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 03:25:04 PM UTC
94360 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 04:00:28 PM UTC
96408 /usr/local/bin/containerd --log-level=warn

Mon 12 Dec 2022 04:28:40 PM UTC
100624 /usr/local/bin/containerd --log-level=warn
...
...
Tue 13 Dec 2022 02:42:26 AM UTC
150236 /usr/local/bin/containerd --log-level=warn

Exec into the pod:

ps -ef shows sleep processes whose number keeps increasing over time, one added every 2m 5s as explained above.

ps -ef
PID USER COMMAND
1 root /bin/sh -c touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600000000
15 root sleep 100000000
16 root sh
24 root sleep 600000000
37 root sleep 100000000
46 root sleep 100000000
53 root sleep 100000000
60 root sleep 100000000
67 root sleep 100000000
75 root sleep 100000000
82 root sleep 100000000
89 root sleep 100000000
96 root sleep 100000000
103 root sleep 100000000
110 root sleep 100000000
117 root sleep 100000000
124 root sleep 100000000
132 root sleep 100000000
138 root sleep 100000000
144 root sleep 100000000
151 root sleep 100000000
...
...
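
A quick way to follow the growth from outside the pod is to count the leaked probe children (a hypothetical helper command, not part of the original report):

# Count the stuck 'sleep 100000000' children left behind by the exec probe;
# the bracketed grep pattern keeps grep from matching itself.
kubectl exec readiness-exec -- sh -c 'ps -ef | grep "[s]leep 100000000" | wc -l'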

What version of containerd are you using?

containerd github.com/containerd/containerd v1.6.12 a05d175

Any other relevant information

worker:~> runc -version

runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d1
spec: 1.0.2-dev
go: go1.17.6
libseccomp: 2.5.3

worker:~> sudo crictl info

{
  "status": {
    "conditions": [
      {
        "type": "RuntimeReady",
        "status": true,
        "reason": "",
        "message": ""
      },
      {
        "type": "NetworkReady",
        "status": true,
        "reason": "",
        "message": ""
      }
    ]
  },
  "cniconfig": {
    "PluginDirs": [
      "/opt/cni/bin"
    ],
    "PluginConfDir": "/etc/cni/net.d",
    "PluginMaxConfNum": 1,
    "Prefix": "eth",
    "Networks": [
      {
        "Config": {
          "Name": "cni-loopback",
          "CNIVersion": "0.3.1",
          "Plugins": [
            {
              "Network": {
                "type": "loopback",
                "ipam": {},
                "dns": {}
              },
              "Source": "{\"type\":\"loopback\"}"
            }
          ],
          "Source": "{\n\"cniVersion\": \"0.3.1\",\n\"name\": \"cni-loopback\",\n\"plugins\": [{\n  \"type\": \"loopback\"\n}]\n}"
        },
        "IFName": "lo"
      },
      {
        "Config": {
          "Name": "k8s-pod-network",
          "CNIVersion": "0.3.1",
          "Plugins": [
            {
              "Network": {
                "type": "calico",
                "ipam": {
                  "type": "calico-ipam"
                },
                "dns": {}
              },
              "Source": "{\"container_settings\":{\"allow_ip_forwarding\":true},\"etcd_ca_cert_file\":\"/etc/cni/net.d/calico-tls/etcd-ca\",\"etcd_cert_file\":\"/etc/cni/net.d/calico-tls/etcd-cert\",\"etcd_endpoints\":\"https://10.0.16.2:2379,https://10.0.16.20:2379,https://10.0.16.4:2379\",\"etcd_key_file\":\"/etc/cni/net.d/calico-tls/etcd-key\",\"ipam\":{\"assign_ipv4\":\"true\",\"assign_ipv6\":\"false\",\"type\":\"calico-ipam\"},\"kubernetes\":{\"kubeconfig\":\"/etc/cni/net.d/calico-kubeconfig\"},\"log_level\":\"error\",\"mtu\":2070,\"policy\":{\"k8s_api_root\":\"https://[10.96.0.1]:443\",\"type\":\"k8s\"},\"type\":\"calico\"}"
            },
            {
              "Network": {
                "type": "tuning",
                "ipam": {},
                "dns": {}
              },
              "Source": "{\"sysctl\":{\"net.ipv4.tcp_mtu_probing\":\"1\"},\"type\":\"tuning\"}"
            },
            {
              "Network": {
                "type": "portmap",
                "capabilities": {
                  "portMappings": true
                },
                "ipam": {},
                "dns": {}
              },
              "Source": "{\"capabilities\":{\"portMappings\":true},\"snat\":true,\"type\":\"portmap\"}"
            }
          ],
          "Source": "{\n  \"name\": \"k8s-pod-network\",\n  \"cniVersion\": \"0.3.1\",\n  \"plugins\": [\n    {\n      \"type\": \"calico\",\n      \"log_level\": \"error\",\n      \"etcd_endpoints\": \"https://10.0.16.2:2379,https://10.0.16.20:2379,https://10.0.16.4:2379\",\n      \"etcd_key_file\": \"/etc/cni/net.d/calico-tls/etcd-key\",\n      \"etcd_cert_file\": \"/etc/cni/net.d/calico-tls/etcd-cert\",\n      \"etcd_ca_cert_file\": \"/etc/cni/net.d/calico-tls/etcd-ca\",\n      \"mtu\": 2070,\n      \"ipam\": {\n        \"type\": \"calico-ipam\",\n        \"assign_ipv4\": \"true\",\n        \"assign_ipv6\": \"false\"\n      },\n      \"container_settings\": {\n        \"allow_ip_forwarding\": true\n      },\n      \"policy\": {\n        \"type\": \"k8s\",\n        \"k8s_api_root\": \"https://[10.96.0.1]:443\"\n      },\n      \"kubernetes\": {\n          \"kubeconfig\": \"/etc/cni/net.d/calico-kubeconfig\"\n      }\n    },\n    {\n      \"type\": \"tuning\",\n      \"sysctl\": {\"net.ipv4.tcp_mtu_probing\": \"1\"}\n    },\n    {\n      \"type\": \"portmap\",\n      \"snat\": true,\n      \"capabilities\": {\"portMappings\": true}\n    }\n  ]\n}"
        },
        "IFName": "eth0"
      }
    ]
  },
  "config": {
    "containerd": {
      "snapshotter": "overlayfs",
      "defaultRuntimeName": "runc",
      "defaultRuntime": {
        "runtimeType": "",
        "runtimePath": "",
        "runtimeEngine": "",
        "PodAnnotations": null,
        "ContainerAnnotations": null,
        "runtimeRoot": "",
        "options": null,
        "privileged_without_host_devices": false,
        "baseRuntimeSpec": "",
        "cniConfDir": "",
        "cniMaxConfNum": 0
      },
      "untrustedWorkloadRuntime": {
        "runtimeType": "",
        "runtimePath": "",
        "runtimeEngine": "",
        "PodAnnotations": null,
        "ContainerAnnotations": null,
        "runtimeRoot": "",
        "options": null,
        "privileged_without_host_devices": false,
        "baseRuntimeSpec": "",
        "cniConfDir": "",
        "cniMaxConfNum": 0
      },
      "runtimes": {
        "runc": {
          "runtimeType": "io.containerd.runc.v2",
          "runtimePath": "",
          "runtimeEngine": "",
          "PodAnnotations": null,
          "ContainerAnnotations": null,
          "runtimeRoot": "",
          "options": {
            "SystemdCgroup": true
          },
          "privileged_without_host_devices": false,
          "baseRuntimeSpec": "",
          "cniConfDir": "",
          "cniMaxConfNum": 0
        }
      },
      "noPivot": false,
      "disableSnapshotAnnotations": true,
      "discardUnpackedLayers": false,
      "ignoreRdtNotEnabledErrors": false
    },
    "cni": {
      "binDir": "/opt/cni/bin",
      "confDir": "/etc/cni/net.d",
      "maxConfNum": 1,
      "confTemplate": "",
      "ipPref": ""
    },
    "registry": {
      "configPath": "/etc/containerd/certs.d",
      "mirrors": {},
      "configs": null,
      "auths": null,
      "headers": null
    },
    "imageDecryption": {
      "keyModel": "node"
    },
    "disableTCPService": true,
    "streamServerAddress": "127.0.0.1",
    "streamServerPort": "0",
    "streamIdleTimeout": "4h0m0s",
    "enableSelinux": false,
    "selinuxCategoryRange": 1024,
    "sandboxImage": "registry.eccd.local:5000/pause:3.8-1-3e405cfb",
    "statsCollectPeriod": 10,
    "systemdCgroup": false,
    "enableTLSStreaming": false,
    "x509KeyPairStreaming": {
      "tlsCertFile": "",
      "tlsKeyFile": ""
    },
    "maxContainerLogSize": 16384,
    "disableCgroup": false,
    "disableApparmor": false,
    "restrictOOMScoreAdj": false,
    "maxConcurrentDownloads": 3,
    "disableProcMount": false,
    "unsetSeccompProfile": "",
    "tolerateMissingHugetlbController": true,
    "disableHugetlbController": true,
    "device_ownership_from_security_context": true,
    "ignoreImageDefinedVolumes": false,
    "netnsMountsUnderStateDir": false,
    "enableUnprivilegedPorts": false,
    "enableUnprivilegedICMP": false,
    "containerdRootDir": "/var/lib/docker/containerd/root",
    "containerdEndpoint": "/run/containerd/containerd.sock",
    "rootDir": "/var/lib/docker/containerd/root/io.containerd.grpc.v1.cri",
    "stateDir": "/run/containerd/io.containerd.grpc.v1.cri"
  },
  "golang": "go1.17.6",
  "lastCNILoadStatus": "OK",
  "lastCNILoadStatus.default": "OK"
}

worker:~> uname -a
Linux worker-pool1-xxxxxx 5.14.21-150400.24.33-default #1 SMP PREEMPT_DYNAMIC Fri Nov 4 13:55:06 UTC 2022 (76cfe60) x86_64 x86_64 x86_64 GNU/Linux

worker:~> cat /var/lib/kubelet/config.yaml

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 169.254.20.10
clusterDomain: cluster.local
containerLogMaxFiles: 5
containerLogMaxSize: 50Mi
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
featureGates:
  AllAlpha: false
  LegacyServiceAccountTokenNoAutoGeneration: false
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageGCHighThresholdPercent: 80
imageGCLowThresholdPercent: 75
imageMinimumGCAge: 0s
kind: KubeletConfiguration
kubeletCgroups: /ccd.slice/kubelet.service
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
serverTLSBootstrap: true
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
tlsCipherSuites:
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- TLS_RSA_WITH_AES_256_GCM_SHA384
- TLS_RSA_WITH_AES_128_GCM_SHA256
volumeStatsAggPeriod: 0s
cpuManagerPolicy: static
reservedSystemCPUs: "1,3"
systemReserved: {'ephemeral-storage': '1Gi', 'cpu': '1000m', 'memory': '500Mi'}

Show configuration if it is related to CRI plugin.

version = 2
root = "/var/lib/docker/containerd/root"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = -999

[grpc]
address = "/run/containerd/containerd.sock"
tcp_address = ""
tcp_tls_cert = ""
tcp_tls_key = ""
uid = 0
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216

[ttrpc]
address = ""
uid = 0
gid = 0

[debug]
address = ""
uid = 0
gid = 0
level = ""

[metrics]
address = ""
grpc_histogram = false

[cgroup]
path = ""

[timeouts]
"io.containerd.timeout.shim.cleanup" = "5s"
"io.containerd.timeout.shim.load" = "5s"
"io.containerd.timeout.shim.shutdown" = "3s"
"io.containerd.timeout.task.state" = "2s"

[plugins]
[plugins."io.containerd.gc.v1.scheduler"]
pause_threshold = 0.02
deletion_threshold = 0
mutation_threshold = 100
schedule_delay = "0s"
startup_delay = "100ms"
[plugins."io.containerd.grpc.v1.cri"]
disable_tcp_service = true
stream_server_address = "127.0.0.1"
stream_server_port = "0"
stream_idle_timeout = "4h0m0s"
enable_selinux = false
sandbox_image = "registry.eccd.local:5000/pause:3.8-1-3e405cfb"
stats_collect_period = 10
systemd_cgroup = false
enable_tls_streaming = false
max_container_log_line_size = 16384
disable_cgroup = false
disable_apparmor = false
restrict_oom_score_adj = false
max_concurrent_downloads = 3
device_ownership_from_security_context = true
disable_proc_mount = false
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
default_runtime_name = "runc"
no_pivot = false
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
max_conf_num = 1
conf_template = ""
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
tls_cert_file = ""
tls_key_file = ""
[plugins."io.containerd.internal.v1.opt"]
path = "/opt/containerd"
[plugins."io.containerd.internal.v1.restart"]
interval = "10s"
[plugins."io.containerd.metadata.v1.bolt"]
content_sharing_policy = "shared"
[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false
[plugins."io.containerd.runtime.v1.linux"]
shim = "containerd-shim"
runtime = "runc"
runtime_root = ""
no_shim = false
shim_debug = false
[plugins."io.containerd.runtime.v2.task"]
platforms = ["linux/amd64"]
[plugins."io.containerd.service.v1.diff-service"]
default = ["walking"]
[plugins."io.containerd.snapshotter.v1.devmapper"]
root_path = ""
pool_name = ""
base_image_size = ""

@dhiman360 (Author)

CC: @nikitar @dcantah.

@dhiman360 (Author)

Most likely the issue arises because of the following lines in the code:

	// ignore errors to wait and kill as we are forcefully killing
	// the process and don't care about the exit status
	s, err := p.Wait(ctx)
	if err != nil {
		return err
	}
	if err := p.Kill(ctx, syscall.SIGKILL, WithKillAll); err != nil {

The documentation says it is forcefully killing the process, but when there is an error it returns and does not execute p.Kill on the next line.

@mikebrow (Member)

Most likely the issue arises because of the following lines in the code:

	// ignore errors to wait and kill as we are forcefully killing
	// the process and don't care about the exit status
	s, err := p.Wait(ctx)
	if err != nil {
		return err
	}
	if err := p.Kill(ctx, syscall.SIGKILL, WithKillAll); err != nil {

The documentation says it is forcefully killing the process, but when there is an error it returns and does not execute p.Kill on the next line.

Or, that line you mention is an async wait request that returns successfully, and we end up waiting a few lines down, after the SIGKILL has been sent, on a process that is uninterruptible, so we don't back out of this call.

@cpuguy83, note line 186. Should that line read:
select {
case <-ctx.Done():
	return ctx.Err()
case <-s:
	return nil
}

@dhiman360 if you have some time, maybe add some log output and see which path is happening/not happening for your two cases? Cheers, Mike

@dhiman360 (Author)

@mikebrow, I have modified the function in a patch as below, removing the wait etc., and installed the patched containerd in my cluster, but it didn't help; the background sleep process did not get killed:

// WithProcessKill will forcefully kill and delete a process
func WithProcessKill(ctx context.Context, p Process) error {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel()
        if err := p.Kill(ctx, syscall.SIGKILL, WithKillAll); err != nil {
                return err
        }
        return nil
}

And I can see the following lines filling the containerd logs:

Dec 17 17:48:40 worker-pool1-a1h73b4d-ezghodh-ibd-containerd-1612-patch-stack containerd[10626]: time="2022-12-17T17:48:40.907001331Z" level=error msg="Failed to delete exec process \"fcc7ec96ef5a60e436ab227592b8046fbc7bd284da5cd717cd94bf63d15dcb43\" for container \"fa80a637beaa361c8223605b5d34ce11f99e1f5556bbdc5ea065bd13756f437a\"" error="process already finished: not found"
Dec 17 17:48:41 worker-pool1-a1h73b4d-ezghodh-ibd-containerd-1612-patch-stack containerd[10626]: time="2022-12-17T17:48:41.474752081Z" level=error msg="Failed to delete exec process \"c2b954de7c65dcde8f0b20b8fbbddc5f168e10577fc72a6e3fcb8f6921d5820b\" for container \"b04a56a829844b7ec597d401f4f96b4f63427504ac289230b728cb592f03b93c\"" error="process already finished: not found"
Dec 17 17:48:41 worker-pool1-a1h73b4d-ezghodh-ibd-containerd-1612-patch-stack containerd[10626]: time="2022-12-17T17:48:41.474963830Z" level=error msg="Failed to delete exec process \"a248e4cbda93853e0956fb585586fd1968936aeb6a9581c2f73ea1877bbe9d8b\" for container \"b04a56a829844b7ec597d401f4f96b4f63427504ac289230b728cb592f03b93c\"" error="process already finished: not found"

Also kubelet logs like:

Dec 17 18:03:02 worker-pool1-a1h73b4d-ezghodh-ibd-containerd-1612-patch-stack kubelet[11623]: E1217 18:03:02.826261   11623 remote_runtime.go:734] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="2ba9ba473d65a99cce897c3eeffb598670cdaef0ee906e8a46120bf48824ab5e" cmd=[/bin/sh -c sleep 100000000 &]

fuweid added a commit to fuweid/containerd that referenced this issue Dec 18, 2022
By default, the child processes spawned by an exec process inherit its standard
io file descriptors. The shim server creates a pipe as the data channel; both the
exec process and its children write into the write end of the pipe, and the shim
server reads from it. As long as the write end is still open, the shim server
keeps waiting for data from the pipe.

So, if the exec command is something like `bash -c "sleep 365d &"`, the exec
process is bash, which quits after creating `sleep 365d`. But the `sleep 365d`
will hold the write end of the pipe for a year! It doesn't make sense for the
CRI plugin to wait for it.

For this case, we should use a timeout to drain the exec process's io instead
of waiting for it.

Fixes: containerd#7802

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added the area/cri Container Runtime Interface (CRI) label Dec 18, 2022
@dhiman360 (Author)

@fuweid,
Hi, I have pulled your changes and tested them today. It seems they have not fixed the issue.

Code fetch:

git clone https://github.com/containerd/containerd containerd
cd containerd && git fetch origin pull/7832/head:fuweid-fix-7802 && git checkout fuweid-fix-7802

Containerd Version:

eccd@director-0-ezghodh-containerd-patch-stack:~> containerd -version
containerd github.com/containerd/containerd v1.7.0-beta.1-28-gc851f1169 c851f1169922c6455a944b34995650e3d22ab4b2

Exec into the pod and run ps -ef; the sleep processes keep increasing:

eccd@director-0-ezghodh-containerd-patch-stack:~> kubectl exec -it readiness-exec -- sh
/ # ps -ef
PID   USER     COMMAND
    1 root     /bin/sh -c touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600000000
   15 root     sleep 100000000
   22 root     sleep 100000000
   24 root     sleep 600000000
   31 root     sleep 100000000
   38 root     sleep 100000000
   45 root     sleep 100000000
   51 root     sleep 100000000
   58 root     sleep 100000000
   64 root     sleep 100000000
   71 root     sleep 100000000
   78 root     sleep 100000000
   85 root     sleep 100000000
   92 root     sleep 100000000
   99 root     sleep 100000000
  106 root     sleep 100000000
  113 root     sleep 100000000
  114 root     sh
  120 root     ps -ef


fuweid commented Dec 19, 2022

@dhiman360 Currently, this is unsupported. The child processes created by the exec process are reparented to pid 1 after the exec process exits. It is not feasible to trace all the processes created by an exec process on the containerd side. If pid 1 doesn't reap the child processes, they become zombies. So I don't suggest running this pattern in your cluster, and I don't think we should support it.

@dhiman360 (Author)

@fuweid,
So you are saying we should not use scripts like the one below, which may get stuck forever, in readiness and liveness probes.

#!/bin/bash
curl -sS -X GET "http://localhost:9000/health/readiness"

Now, because this is allowed by Kubernetes, we can't stop pod developers from writing such code. The problem is that a bad pod causing the whole framework to fail looks terribly bad. Do you think we can handle this in containerd in some other way, e.g. with a workaround?

@mikebrow (Member)

From: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes

Caution: Incorrect implementation of readiness probes may result in an ever growing number of processes in the container, and resource starvation if this is left unchecked.

I see this is a somewhat old problem (moby/moby#9098) with some proposed self-help workarounds.


fuweid commented Dec 19, 2022

So you are saying we should not use scripts like the one below, which may get stuck forever, in readiness and liveness probes.

If the script may get stuck forever, the owner of the script should take care of this. For example, use timeout to ensure that it receives the kill signal.
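
For example, a hardened variant of the probe script from the description might look like this (a sketch of the suggestion above, not taken from the thread; the 4-second limit is an assumption chosen to finish before timeoutSeconds: 5):

#!/bin/bash
# Bound the probe command with `timeout`; -s KILL guarantees that even a hung
# curl is gone before the kubelet's timeoutSeconds expires, so nothing is left
# behind between probe runs.
timeout -s KILL 4 curl -sS -X GET "http://localhost:9000/health/readiness"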

Do you think we can handle this in containerd in some other way, e.g. with a workaround?

It is all about the process manager. As I mentioned before, even if containerd could kill all the processes created by one exec, there would still be zombie process issues. The pid-1 process should be aware of the dead processes and reap them. Unfortunately, most processes don't watch for the SIGCHLD signal.

For example,

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: qos-demo-ctr
    image: ubuntu
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
    command: ["sleep", "1d"]
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - sleep 100000000 & # Forked a very long running sleep to simulate stuck probe
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 5

/bin/sh exits and sleep 100000000's parent changes to sleep 1d. After killing sleep 100000000, it becomes a zombie.

➜  containerd git:(main) sudo crictl exec -it 37395f8a03e99 bash
root@qos-demo:/# ps -fe
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 15:14 ?        00:00:00 sleep 1d
root          13       1  0 15:15 ?        00:00:00 sleep 100000000
root          14       0  0 15:15 pts/0    00:00:00 bash
root          22      14  0 15:15 pts/0    00:00:00 ps -fe
root@qos-demo:/# exit
exit

➜  containerd git:(main) sudo kill -9 654442
➜  containerd git:(main) sudo crictl exec -it 37395f8a03e99 bash
root@qos-demo:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 15:14 ?        00:00:00 sleep 1d
root          13       1  0 15:15 ?        00:00:00 [sleep] <defunct>
root          29       1  0 15:15 ?        00:00:00 sleep 100000000
root          30       0  0 15:15 pts/0    00:00:00 bash
root          38      30  0 15:15 pts/0    00:00:00 ps -ef
root@qos-demo:/#

So, the user should take care of the script themselves.

fuweid added a commit to fuweid/containerd that referenced this issue Mar 2, 2023
thaJeztah pushed a commit to thaJeztah/containerd that referenced this issue Jul 14, 2023
jsturtevant pushed a commit to jsturtevant/containerd that referenced this issue Sep 21, 2023
juliusl pushed a commit to juliusl/containerd that referenced this issue Jan 26, 2024
fuweid added a commit to fuweid/containerd that referenced this issue Feb 6, 2024
fuweid added a commit to fuweid/containerd that referenced this issue Feb 16, 2024