
Pods stuck on terminating #51835

Closed
igorleao opened this issue Sep 1, 2017 · 191 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@igorleao

igorleao commented Sep 1, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
Pods stuck on terminating for a long time

What you expected to happen:
Pods get terminated

How to reproduce it (as minimally and precisely as possible):

  1. Run a deployment
  2. Delete it
  3. Pods are still terminating

Anything else we need to know?:
Kubernetes pods stuck as Terminating for a few hours after getting deleted.

Logs:
kubectl describe pod my-pod-3854038851-r1hc3

Name:				my-pod-3854038851-r1hc3
Namespace:			container-4-production
Node:				ip-172-16-30-204.ec2.internal/172.16.30.204
Start Time:			Fri, 01 Sep 2017 11:58:24 -0300
Labels:				pod-template-hash=3854038851
				release=stable
				run=my-pod-3
Annotations:			kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"container-4-production","name":"my-pod-3-3854038851","uid":"5816c...
				prometheus.io/scrape=true
Status:				Terminating (expires Fri, 01 Sep 2017 14:17:53 -0300)
Termination Grace Period:	30s
IP:
Created By:			ReplicaSet/my-pod-3-3854038851
Controlled By:			ReplicaSet/my-pod-3-3854038851
Init Containers:
  ensure-network:
    Container ID:	docker://guid-1
    Image:		XXXXX
    Image ID:		docker-pullable://repo/ensure-network@sha256:guid-0
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		True
    Restart Count:	0
    Environment:	<none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Containers:
  container-1:
    Container ID:	docker://container-id-guid-1
    Image:		XXXXX
    Image ID:		docker-pullable://repo/container-1@sha256:guid-2
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	100m
      memory:	1G
    Requests:
      cpu:	100m
      memory:	1G
    Environment:
      XXXX
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-2:
    Container ID:	docker://container-id-guid-2
    Image:		alpine:3.4
    Image ID:		docker-pullable://alpine@sha256:alpine-container-id-1
    Port:		<none>
    Command:
      X
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	20m
      memory:	40M
    Requests:
      cpu:		10m
      memory:		20M
    Environment:	<none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-3:
    Container ID:	docker://container-id-guid-3
    Image:		XXXXX
    Image ID:		docker-pullable://repo/container-3@sha256:guid-3
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	100m
      memory:	200M
    Requests:
      cpu:	100m
      memory:	100M
    Readiness:	exec [nc -zv localhost 80] delay=1s timeout=1s period=5s #success=1 #failure=3
    Environment:
      XXXX
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-4:
    Container ID:	docker://container-id-guid-4
    Image:		XXXX
    Image ID:		docker-pullable://repo/container-4@sha256:guid-4
    Port:		9102/TCP
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	600m
      memory:	1500M
    Requests:
      cpu:	600m
      memory:	1500M
    Readiness:	http-get http://:8080/healthy delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      XXXX
    Mounts:
      /app/config/external from volume-2 (ro)
      /data/volume-1 from volume-1 (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Conditions:
  Type		Status
  Initialized 	True
  Ready 	False
  PodScheduled 	True
Volumes:
  volume-1:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	volume-1
    Optional:	false
  volume-2:
    Type:	ConfigMap (a volume populated by a ConfigMap)
    Name:	external
    Optional:	false
  default-token-xxxxx:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-xxxxx
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	<none>

sudo journalctl -u kubelet | grep "my-pod"

[...]
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing address using workloadID" Workload=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing all IPs with handle 'my-pod-3854038851-r1hc3'"
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=warning msg="Asked to release address but it doesn't exist. Ignoring" Workload=my-pod-3854038851-r1hc3 workloadId=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Teardown processing complete." Workload=my-pod-3854038851-r1hc3 endpoint=<nil>
Sep 01 17:19:06 ip-172-16-30-204 kubelet[9619]: I0901 17:19:06.591946    9619 kubelet.go:1824] SyncLoop (DELETE, "api"):my-pod-3854038851(b8cf2ecd-8f25-11e7-ba86-0a27a44c875)"

sudo journalctl -u docker | grep "docker-id-for-my-pod"

Sep 01 17:17:55 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:55.695834447Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"
Sep 01 17:17:56 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:56.698913805Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3", GitCommit:"2c2fe6e8278a5db2d15a013987b53968c743f2a1", GitTreeState:"clean", BuildDate:"2017-08-03T15:13:53Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
    AWS

  • OS (e.g. from /etc/os-release):
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Kernel (e.g. uname -a):
    Linux ip-172-16-30-204 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:
    Kops

  • Others:
    Docker version 1.12.6, build 78d1802

@kubernetes/sig-aws @kubernetes/sig-scheduling

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 1, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 1, 2017
@igorleao
Author

igorleao commented Sep 1, 2017

@kubernetes/sig-aws @kubernetes/sig-scheduling

@resouer
Contributor

resouer commented Sep 3, 2017

Volume and network cleanup usually take the most time during termination. Can you find out in which phase your pod is stuck? Volume cleanup, for example?

@dixudx
Member

dixudx commented Sep 3, 2017

Usually volume and network cleanup consume more time in termination.

Correct. They are the usual suspects.

@igorleao You can try kubectl delete pod xxx --now as well.

@igorleao
Author

igorleao commented Sep 4, 2017

Hi @resouer and @dixudx
I'm not sure. Looking at kubelet logs for a different pod with the same problem, I found:

Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=info msg="Releasing address using workloadID" Workload=my-pod-969733955-rbxhn
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=info msg="Releasing all IPs with handle 'my-pod-969733955-rbxhn'"
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=warning msg="Asked to release address but it doesn't exist. Ignoring" Workload=my-pod-969733955-rbxhn workloadId=my-pod-969733955-rbxhn
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=info msg="Teardown processing complete." Workload=my-pod-969733955-rbxhn endpoint=<nil>
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.496132    9620 qos_container_manager_linux.go:285] [ContainerManager]: Updated QoS cgroup configuration
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.968147    9620 reconciler.go:201] UnmountVolume operation started for volume "kubernetes.io/secret/GUID-default-token-wrlv3" (spec.Name: "default-token-wrlv3") from pod "GUID" (UID: "GUID").
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.968245    9620 reconciler.go:201] UnmountVolume operation started for volume "kubernetes.io/secret/GUID-token-key" (spec.Name: "token-key") from pod "GUID" (UID: "GUID").
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968537    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-token-key\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968508761 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-token-key" (volume.spec.Name: "token-key") pod "GUID" (UID: "GUID") with: rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_token-key.deleting~818780979: device or resource busy
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968744    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-default-token-wrlv3\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968719924 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-default-token-wrlv3" (volume.spec.Name: "default-token-wrlv3") pod "GUID" (UID: "GUID") with: rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/default-token-wrlv3 /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_default-token-wrlv3.deleting~940140790: device or resource busy
--
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778742    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_default-token-wrlv3.deleting~940140790" (spec.Name: "wrapped_default-token-wrlv3.deleting~940140790") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778753    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~850807831" (spec.Name: "wrapped_token-key.deleting~850807831") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778764    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~413655961" (spec.Name: "wrapped_token-key.deleting~413655961") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778774    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~818780979" (spec.Name: "wrapped_token-key.deleting~818780979") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778784    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~348212189" (spec.Name: "wrapped_token-key.deleting~348212189") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778796    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~848395852" (spec.Name: "wrapped_token-key.deleting~848395852") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778808    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_default-token-wrlv3.deleting~610264100" (spec.Name: "wrapped_default-token-wrlv3.deleting~610264100") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778820    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~960022821" (spec.Name: "wrapped_token-key.deleting~960022821") devicePath: ""
Sep 02 15:33:05 ip-172-16-30-208 kubelet[9620]: I0902 15:33:05.081380    9620 server.go:778] GET /stats/summary/: (37.027756ms) 200 [[Go-http-client/1.1] 10.0.46.202:54644]
Sep 02 15:33:05 ip-172-16-30-208 kubelet[9620]: I0902 15:33:05.185367    9620 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/GUID-calico-token-w8tzx" (spec.Name: "calico-token-w8tzx") pod "GUID" (UID: "GUID").
Sep 02 15:33:07 ip-172-16-30-208 kubelet[9620]: I0902 15:33:07.187953    9620 kubelet.go:1824] SyncLoop (DELETE, "api"): "my-pod-969733955-rbxhn_container-4-production(GUID)"
Sep 02 15:33:13 ip-172-16-30-208 kubelet[9620]: I0902 15:33:13.879940    9620 aws.go:937] Could not determine public DNS from AWS metadata.
Sep 02 15:33:20 ip-172-16-30-208 kubelet[9620]: I0902 15:33:20.736601    9620 server.go:778] GET /metrics: (53.063679ms) 200 [[Prometheus/1.7.1] 10.0.46.198:43576]
Sep 02 15:33:23 ip-172-16-30-208 kubelet[9620]: I0902 15:33:23.898078    9620 aws.go:937] Could not determine public DNS from AWS metadata.

As you can see, this cluster has Calico for CNI.
The following lines caught my attention:

Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.968245    9620 reconciler.go:201] UnmountVolume operation started for volume "kubernetes.io/secret/GUID-token-key" (spec.Name: "token-key") from pod "GUID" (UID: "GUID").
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968537    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-token-key\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968508761 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-token-key" (volume.spec.Name: "token-key") pod "GUID" (UID: "GUID") with: rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_token-key.deleting~818780979: device or resource busy
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968744    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-default-token-wrlv3\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968719924 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-default-token-wrlv3" (volume.spec.Name: "default-token-wrlv3") pod "GUID" (UID: "GUID") with: rename 

Is there a better way to find out in which phase a pod is stuck?

kubectl delete pod xxx --now seems to work pretty well, but I would really like to find the root cause and avoid manual intervention.

@dixudx
Member

dixudx commented Sep 4, 2017

rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_token-key.deleting~818780979: device or resource busy

It seems kubelet failed to tear down the secret volume because that rename hit "device or resource busy".

@igorleao Is this reproducible, or does it only happen occasionally? Just asking to make sure; I've run into such errors before.

@igorleao
Author

igorleao commented Sep 4, 2017

@dixudx It happens several times a day on a certain cluster. Other clusters created with the same version of kops and Kubernetes, in the same week, work just fine.

@jingxu97
Contributor

@igorleao As the log shows, the volume manager failed to remove the secret directory because the device is busy.
Could you please check whether the directory /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key is still mounted or not? Thanks!
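
One way to check this from the node, as a rough sketch (GUID here is a placeholder for the real pod UID):

# Replace GUID with the UID of the stuck pod
POD_UID=GUID
# Is the secret directory still a mountpoint?
findmnt /var/lib/kubelet/pods/$POD_UID/volumes/kubernetes.io~secret/token-key
# Or grep the mount table for anything that still belongs to the pod:
mount | grep $POD_UID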

@r7vme

r7vme commented Sep 26, 2017

@igorleao How do you run kubelet? In a container? If so, can you please post your systemd unit or Docker config for kubelet?

We see similar behaviour. We run kubelet as a container, and the problem was partially mitigated by mounting /var/lib/kubelet as shared (by default Docker mounts volumes as rslave). But we still see similar issues, just less frequently. Currently I suspect that some other mounts should be done differently (e.g. /var/lib/docker or /rootfs).

@r7vme

r7vme commented Sep 28, 2017

@stormltf Can you please post your kubelet container configuration?

@r7vme

r7vme commented Sep 29, 2017

@stormltf You're running kubelet in a container without the --containerized flag (which does some tricks with mounts). That basically means all mounts kubelet performs are done in the container's mount namespace. The good news is that they get propagated back to the host's namespace (since you have /var/lib/kubelet as shared), but I'm not sure what happens when that namespace is removed (i.e. when the kubelet container is removed).

For the stuck pods, can you please do the following:

On the node where the pod is running:

  • docker exec -ti /kubelet /bin/bash -c "mount | grep STUCK_POD_UUID"
  • and the same on the node itself: mount | grep STUCK_POD_UUID.

Please also do the same for a freshly created pod. I expect to see some /var/lib/kubelet mounts (e.g. the default secret).

@r7vme

r7vme commented Oct 11, 2017

@stormltf did you restart kubelet after first two pods were created?

@r7vme

r7vme commented Oct 12, 2017

@stormltf You can also try making /var/lib/docker and /rootfs shared mountpoints (which I don't see in your docker inspect, but do see inside the container).

@ianchakeres
Contributor

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Oct 22, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 22, 2017
@r7vme

r7vme commented Oct 23, 2017

This might help some of you. We are running kubelet in a Docker container with the --containerized flag and were able to solve this issue by mounting /rootfs, /var/lib/docker and /var/lib/kubelet as shared mounts. The final mounts look like this:

      -v /:/rootfs:ro,shared \
      -v /sys:/sys:ro \
      -v /dev:/dev:rw \
      -v /var/log:/var/log:rw \
      -v /run/calico/:/run/calico/:rw \
      -v /run/docker/:/run/docker/:rw \
      -v /run/docker.sock:/run/docker.sock:rw \
      -v /usr/lib/os-release:/etc/os-release \
      -v /usr/share/ca-certificates/:/etc/ssl/certs \
      -v /var/lib/docker/:/var/lib/docker:rw,shared \
      -v /var/lib/kubelet/:/var/lib/kubelet:rw,shared \
      -v /etc/kubernetes/ssl/:/etc/kubernetes/ssl/ \
      -v /etc/kubernetes/config/:/etc/kubernetes/config/ \
      -v /etc/cni/net.d/:/etc/cni/net.d/ \
      -v /opt/cni/bin/:/opt/cni/bin/ \

Some more details: this does not properly solve the problem, because for every bind mount you get three mounts inside the kubelet container (two of them parasitic). But at least the shared mounts let you unmount them all in one shot.

CoreOS does not have this problem, because it uses rkt rather than Docker for the kubelet container. In our case kubelet runs in Docker, and every mount inside the kubelet container gets propagated into /var/lib/docker/overlay/... and /rootfs, which is why we end up with two parasitic mounts for every bind-mounted volume (a quick propagation check is sketched after this list):

  • one from /rootfs in /rootfs/var/lib/kubelet/<mount>
  • one from /var/lib/docker in /var/lib/docker/overlay/.../rootfs/var/lib/kubelet/<mount>
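
A rough way to verify the propagation mode of these mountpoints, assuming findmnt is available on the host and in the kubelet image (the container name "kubelet" is an assumption):

# On the host: shared/slave/private propagation of the kubelet and docker dirs
findmnt -o TARGET,PROPAGATION /var/lib/kubelet
findmnt -o TARGET,PROPAGATION /var/lib/docker
# The same from inside the kubelet container, to compare both sides
docker exec kubelet findmnt -o TARGET,PROPAGATION /var/lib/kubelet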

@stormltf

stormltf commented Oct 25, 2017

-v /dev:/dev:rw 
-v /etc/cni:/etc/cni:ro 
-v /opt/cni:/opt/cni:ro 
-v /etc/ssl:/etc/ssl:ro 
-v /etc/resolv.conf:/etc/resolv.conf 
-v /etc/pki/tls:/etc/pki/tls:ro 
-v /etc/pki/ca-trust:/etc/pki/ca-trust:ro
-v /sys:/sys:ro 
-v /var/lib/docker:/var/lib/docker:rw 
-v /var/log:/var/log:rw
-v /var/lib/kubelet:/var/lib/kubelet:shared 
-v /var/lib/cni:/var/lib/cni:shared 
-v /var/run:/var/run:rw 
-v /www:/www:rw 
-v /etc/kubernetes:/etc/kubernetes:ro 
-v /etc/os-release:/etc/os-release:ro 
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro

@tadas-subonis

I have the same issue with Kubernetes 1.8.1 on Azure: after the deployment is changed and new pods have been started, the old pods are stuck at terminating.

@wardhane

I have the same issue on Kubernetes 1.8.2 on IBM Cloud. After new pods are started the old pods are stuck in terminating.

kubectl version
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.2-1+d150e4525193f1", GitCommit:"d150e4525193f1c79569c04efc14599d7deb5f3e", GitTreeState:"clean", BuildDate:"2017-10-27T08:15:17Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

I have used kubectl delete pod xxx --now as well as kubectl delete pod foo --grace-period=0 --force to no avail.

@r7vme

r7vme commented Nov 24, 2017

If the root cause is still the same (improperly propagated mounts), then this is a distribution-specific bug, imo.

Please describe how you run kubelet in IBM Cloud. As a systemd unit? Does it have the --containerized flag?

@wardhane

It is run with the --containerized flag set to false.

   kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2017-11-19 21:48:48 UTC; 4 days ago


--containerized flag:  No

@r7vme

r7vme commented Nov 24, 2017

OK, I need more info. Please see my comment above: #51835 (comment)

Also, please show the contents of /lib/systemd/system/kubelet.service, and if there is anything about kubelet in /etc/systemd/system, please share that too.

In particular, if kubelet runs in Docker I want to see all the bind mounts (-v).

@knisbet

knisbet commented Nov 29, 2017

Today I encountered an issue that may be the same as the one described here: pods on one of our customer systems were stuck in the terminating state for several days. We were also seeing the "Error: UnmountVolume.TearDown failed for volume" errors with "device or resource busy" repeated for each of the stuck pods.

In our case, it appears to be an issue with Docker on RHEL/CentOS 7.4 based systems covered in this moby issue: moby/moby#22260 and this moby PR: https://github.com/moby/moby/pull/34886/files

For us, once we set the sysctl option fs.may_detach_mounts=1, all our Terminating pods cleaned up within a couple of minutes.
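
For reference, a minimal sketch of applying that setting, assuming a CentOS/RHEL 7.4 host with a kernel that exposes the option (the file name under /etc/sysctl.d is arbitrary):

# Apply immediately
sudo sysctl -w fs.may_detach_mounts=1
# Persist across reboots
echo 'fs.may_detach_mounts = 1' | sudo tee /etc/sysctl.d/99-may-detach-mounts.conf
sudo sysctl --system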

@nmakhotkin

nmakhotkin commented Nov 29, 2017

I'm also facing this problem: Pods got stuck in Terminating state on 1.8.3.

Relevant kubelet logs from the node:

Nov 28 22:48:51 <my-node> kubelet[1010]: I1128 22:48:51.616749    1010 reconciler.go:186] operationExecutor.UnmountVolume started for volume "nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw" (UniqueName: "kubernetes.io/nfs/58dc413c-d4d1-11e7-870d-3c970e298d91-nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw") pod "58dc413c-d4d1-11e7-870d-3c970e298d91" (UID: "58dc413c-d4d1-11e7-870d-3c970e298d91")
Nov 28 22:48:51 <my-node> kubelet[1010]: W1128 22:48:51.616762    1010 util.go:112] Warning: "/var/lib/kubelet/pods/58dc413c-d4d1-11e7-870d-3c970e298d91/volumes/kubernetes.io~nfs/nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw" is not a mountpoint, deleting
Nov 28 22:48:51 <my-node> kubelet[1010]: E1128 22:48:51.616828    1010 nestedpendingoperations.go:264] Operation for "\"kubernetes.io/nfs/58dc413c-d4d1-11e7-870d-3c970e298d91-nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw\" (\"58dc413c-d4d1-11e7-870d-3c970e298d91\")" failed. No retries permitted until 2017-11-28 22:48:52.616806562 -0800 PST (durationBeforeRetry 1s). Error: UnmountVolume.TearDown failed for volume "nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw" (UniqueName: "kubernetes.io/nfs/58dc413c-d4d1-11e7-870d-3c970e298d91-nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw") pod "58dc413c-d4d1-11e7-870d-3c970e298d91" (UID: "58dc413c-d4d1-11e7-870d-3c970e298d91") : remove /var/lib/kubelet/pods/58dc413c-d4d1-11e7-870d-3c970e298d91/volumes/kubernetes.io~nfs/nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw: directory not empty
Nov 28 22:48:51 <my-node> kubelet[1010]: W1128 22:48:51.673774    1010 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "<pod>": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "f58ab11527aef5133bdb320349fe14fd94211aa0d35a1da006aa003a78ce0653"

Kubelet is running as a systemd unit (not in a container) on Ubuntu 16.04.
As you can see, there was a mount to an NFS server, and somehow kubelet tried to delete the mount directory because it considered the directory not to be a mountpoint.

Volumes spec from the pod:

volumes:
  - name: nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw
    nfs:
      path: /<path>
      server: <IP>
  - name: default-token-rzqtt
    secret:
      defaultMode: 420
      secretName: default-token-rzqtt

Update: I faced this problem on 1.6.6 as well.
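
A quick sanity check on the node, sketched from the pod UID in the log above, to see whether the NFS volume is actually still mounted before kubelet tries to remove the directory:

# Any NFS mounts still registered for this pod?
findmnt -t nfs,nfs4 | grep 58dc413c-d4d1-11e7-870d-3c970e298d91
# What is left in the directory kubelet refuses to delete?
ls -la /var/lib/kubelet/pods/58dc413c-d4d1-11e7-870d-3c970e298d91/volumes/kubernetes.io~nfs/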

@sabbour

sabbour commented Nov 29, 2017

Experiencing the same on Azure.

NAME                        READY     STATUS        RESTARTS   AGE       IP             NODE
busybox2-7db6d5d795-fl6h9   0/1       Terminating   25         1d        10.200.1.136   worker-1
busybox3-69d4f5b66c-2lcs6   0/1       Terminating   26         1d        <none>         worker-2
busybox7-797cc644bc-n5sv2   0/1       Terminating   26         1d        <none>         worker-2
busybox8-c8f95d979-8lk27    0/1       Terminating   25         1d        10.200.1.137   worker-1
nginx-56ccc998dd-hvpng      0/1       Terminating   0          2h        <none>         worker-1
nginx-56ccc998dd-nnsvj      0/1       Terminating   0          2h        <none>         worker-2
nginx-56ccc998dd-rsrvq      0/1       Terminating   0          2h        <none>         worker-1

kubectl version

Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:46:41Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

describe pod nginx-56ccc998dd-nnsvj

Name:                      nginx-56ccc998dd-nnsvj
Namespace:                 default
Node:                      worker-2/10.240.0.22
Start Time:                Wed, 29 Nov 2017 13:33:39 +0400
Labels:                    pod-template-hash=1277755488
                           run=nginx
Annotations:               kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-56ccc998dd","uid":"614f71db-d4e8-11e7-9c45-000d3a25e3c0","...
Status:                    Terminating (expires Wed, 29 Nov 2017 15:13:44 +0400)
Termination Grace Period:  30s
IP:
Created By:                ReplicaSet/nginx-56ccc998dd
Controlled By:             ReplicaSet/nginx-56ccc998dd
Containers:
  nginx:
    Container ID:   containerd://d00709dfb00ed5ac99dcd092978e44fc018f44cca5229307c37d11c1a4fe3f07
    Image:          nginx:1.12
    Image ID:       docker.io/library/nginx@sha256:5269659b61c4f19a3528a9c22f9fa8f4003e186d6cb528d21e411578d1e16bdb
    Port:           <none>
    State:          Terminated
      Exit Code:    0
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jm7h5 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-jm7h5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jm7h5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type    Reason   Age   From               Message
  ----    ------   ----  ----               -------
  Normal  Killing  41m   kubelet, worker-2  Killing container with id containerd://nginx:Need to kill Pod

sudo journalctl -u kubelet | grep "nginx-56ccc998dd-nnsvj"

Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.124779   64794 kubelet.go:1837] SyncLoop (ADD, "api"): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)"
Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.160444   64794 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-jm7h5" (UniqueName: "kubernetes.io/secret/6171e2a7-d4e8-11e7-9c45-000d3a25e3c0-default-token-jm7h5") pod "nginx-56ccc998dd-nnsvj" (UID: "6171e2a7-d4e8-11e7-9c45-000d3a25e3c0")
Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.261128   64794 reconciler.go:257] operationExecutor.MountVolume started for volume "default-token-jm7h5" (UniqueName: "kubernetes.io/secret/6171e2a7-d4e8-11e7-9c45-000d3a25e3c0-default-token-jm7h5") pod "nginx-56ccc998dd-nnsvj" (UID: "6171e2a7-d4e8-11e7-9c45-000d3a25e3c0")
Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.286574   64794 operation_generator.go:484] MountVolume.SetUp succeeded for volume "default-token-jm7h5" (UniqueName: "kubernetes.io/secret/6171e2a7-d4e8-11e7-9c45-000d3a25e3c0-default-token-jm7h5") pod "nginx-56ccc998dd-nnsvj" (UID: "6171e2a7-d4e8-11e7-9c45-000d3a25e3c0")
Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.431485   64794 kuberuntime_manager.go:370] No sandbox for pod "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)" can be found. Need to start a new one
Nov 29 09:33:42 worker-2 kubelet[64794]: I1129 09:33:42.449592   64794 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerStarted", Data:"0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af"}
Nov 29 09:33:47 worker-2 kubelet[64794]: I1129 09:33:47.637988   64794 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerStarted", Data:"d00709dfb00ed5ac99dcd092978e44fc018f44cca5229307c37d11c1a4fe3f07"}
Nov 29 11:13:14 worker-2 kubelet[64794]: I1129 11:13:14.468137   64794 kubelet.go:1853] SyncLoop (DELETE, "api"): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)"
Nov 29 11:13:14 worker-2 kubelet[64794]: E1129 11:13:14.711891   64794 kuberuntime_manager.go:840] PodSandboxStatus of sandbox "0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af" for pod "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)" error: rpc error: code = Unknown desc = failed to get task status for sandbox container "0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af": process id 0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af not found: not found
Nov 29 11:13:14 worker-2 kubelet[64794]: E1129 11:13:14.711933   64794 generic.go:241] PLEG: Ignoring events for pod nginx-56ccc998dd-nnsvj/default: rpc error: code = Unknown desc = failed to get task status for sandbox container "0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af": process id 0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af not found: not found
Nov 29 11:13:15 worker-2 kubelet[64794]: I1129 11:13:15.788179   64794 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerDied", Data:"d00709dfb00ed5ac99dcd092978e44fc018f44cca5229307c37d11c1a4fe3f07"}
Nov 29 11:13:15 worker-2 kubelet[64794]: I1129 11:13:15.788221   64794 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerDied", Data:"0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af"}
Nov 29 11:46:45 worker-2 kubelet[42337]: I1129 11:46:45.384411   42337 kubelet.go:1837] SyncLoop (ADD, "api"): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0), kubernetes-dashboard-7486b894c6-2xmd5_kube-system(e55ca22c-d416-11e7-9c45-000d3a25e3c0), busybox3-69d4f5b66c-2lcs6_default(adb05024-d412-11e7-9c45-000d3a25e3c0), kube-dns-7797cb8758-zblzt_kube-system(e925cbec-d40b-11e7-9c45-000d3a25e3c0), busybox7-797cc644bc-n5sv2_default(b7135a8f-d412-11e7-9c45-000d3a25e3c0)"
Nov 29 11:46:45 worker-2 kubelet[42337]: I1129 11:46:45.387169   42337 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerDied", Data:"d00709dfb00ed5ac99dcd092978e44fc018f44cca5229307c37d11c1a4fe3f07"}
Nov 29 11:46:45 worker-2 kubelet[42337]: I1129 11:46:45.387245   42337 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerDied", Data:"0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af"}

cat /etc/systemd/system/kubelet.service

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=cri-containerd.service
Requires=cri-containerd.service

[Service]
ExecStart=/usr/local/bin/kubelet \
  --allow-privileged=true \
  --anonymous-auth=false \
  --authorization-mode=Webhook \
  --client-ca-file=/var/lib/kubernetes/ca.pem \
  --cluster-dns=10.32.0.10 \
  --cluster-domain=cluster.local \
  --container-runtime=remote \
  --container-runtime-endpoint=unix:///var/run/cri-containerd.sock \
  --image-pull-progress-deadline=2m \
  --kubeconfig=/var/lib/kubelet/kubeconfig \
  --network-plugin=cni \
  --pod-cidr=10.200.2.0/24 \
  --register-node=true \
  --require-kubeconfig \
  --runtime-request-timeout=15m \
  --tls-cert-file=/var/lib/kubelet/worker-2.pem \
  --tls-private-key-file=/var/lib/kubelet/worker-2-key.pem \
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

@JordyBottelier

@JoseFMP Use kubectl to request the YAML for the namespace; it might have finalizers that are holding up the process.

@JoseFMP

JoseFMP commented Jul 9, 2020

@JordyBottelier Thank you.

No finalizers. Still stuck in Terminating.

@JordyBottelier

@JoseFMP Here is a script to kill it off entirely (effectively nuke it). Simply save it and run ./script_name <your_namespace>:

#!/bin/bash

set -eo pipefail

die() { echo "$*" 1>&2 ; exit 1; }

need() {
	which "$1" &>/dev/null || die "Binary '$1' is missing but required"
}

# checking pre-reqs

need "jq"
need "curl"
need "kubectl"

PROJECT="$1"
shift

test -n "$PROJECT" || die "Missing arguments: kill-ns <namespace>"

kubectl proxy &>/dev/null &
PROXY_PID=$!
killproxy () {
	kill $PROXY_PID
}
trap killproxy EXIT

sleep 1 # give the proxy a second

kubectl get namespace "$PROJECT" -o json | jq 'del(.spec.finalizers[] | select("kubernetes"))' | curl -s -k -H "Content-Type: application/json" -X PUT -o /dev/null --data-binary @- http://localhost:8001/api/v1/namespaces/$PROJECT/finalize && echo "Killed namespace: $PROJECT"

@peppy

peppy commented Aug 17, 2020

I've also seemingly run into this, with multiple pods stuck in terminating, including one pod which is no longer visible anywhere in my infrastructure but still running as a ghost (it serves requests and I can see requests being served even with a deployment scale of zero).

I have zero visibility into and no control over this pod. How am I supposed to troubleshoot a situation like this without forcefully shutting down all the nodes?

@donbowman

I've also seemingly run into this, with multiple pods stuck in terminating, including one pod which is no longer visible anywhere in my infrastructure but still running as a ghost (it serves requests and I can see requests being served even with a deployment scale of zero).

I have zero visibility nor control over this pod and ask how I am supposed to troubleshoot a situation like this without shutting down all nodes forcefully?

You'll have to access Docker on the node.
You can use my dink (https://github.com/Agilicus/dink), which will bring up a pod with a shell with Docker access, or ssh to the node.
docker ps -a
docker stop ####

Good luck.

@peppy

peppy commented Aug 18, 2020

Thanks for the direction.

I was eventually able to solve this, but I'm still a bit puzzled as to how it could happen (for me the pod was completely invisible). As it was in production, things were a bit hectic and I wasn't able to perform diagnostics, but if it happens again hopefully I can file a better bug report.

@sciffer

sciffer commented Oct 22, 2020

Seeing a similar symptom: pods stuck in terminating (interestingly, they all have an exec-type probe for readiness/liveness). Looking at the logs I can see: kubelet[1445]: I1022 10:26:32.203865 1445 prober.go:124] Readiness probe for "test-service-74c4664d8d-58c96_default(822c3c3d-082a-4dc9-943c-19f04544713e):test-service" failed (failure): OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown. This message repeats forever, and changing the exec probe to tcpSocket seems to allow the pod to terminate (based on one test; I will follow up on it). The pod has one container "Running" but not "Ready", and the logs of that "Running" container show the service as stopped.
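
For anyone who wants to try the same workaround, a sketch of swapping the exec readiness probe for a tcpSocket one with kubectl patch (the deployment name is inferred from the pod name above; the container index and port 8080 are assumptions):

kubectl patch deployment test-service --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe/exec"},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/tcpSocket", "value": {"port": 8080}}
]'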

@discanto

This happens on containerd 1.4.0 when node load is high and vm.max_map_count is set to a higher value than the default: the containerd-shim doesn't drain the stdout FIFO and blocks waiting for it to be drained, while dockerd fails to get the event/acknowledgement from containerd that the processes are gone.

@jingxu97 jingxu97 added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Nov 22, 2020
@jingxu97
Contributor

@discanto thanks for sharing this information. Is the problem being fixed or tracked?

@Random-Liu

@jingxu97
Contributor

This bug has been open for more than 3 years. Pods stuck on terminating can be caused by a variety of reasons. When reporting your case, it would be very helpful to post some of the kubelet logs so we can see why the pods are stuck.

@jicki

jicki commented Dec 23, 2020

@jingxu97

not a mountpoint

UnmountVolume.TearDown failed for volume

because: directory not empty

Dec 23 02:32:43 dn005 kubelet[27247]: W1223 02:32:43.224034   27247 mount_helper_common.go:65] Warning: "/var/lib/kubelet/pods/537d3332-69bf-4e83-b1b9-21aa84dae641/volumes/kubernetes.io~glusterfs/pvc-c4bf8313-e983-458d-a5ed-59e455fd9c69" is not a mountpoint, deleting
Dec 23 02:32:43 dn005 kubelet[27247]: E1223 02:32:43.224288   27247 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/glusterfs/537d3332-69bf-4e83-b1b9-21aa84dae641-pvc-c4bf8313-e983-458d-a5ed-59e455fd9c69 podName:537d3332-69bf-4e83-b1b9-21aa84dae641 nodeName:}" failed. No retries permitted until 2020-12-23 02:34:45.224234987 +0000 UTC m=+6716758.844223277 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"user-data\" (UniqueName: \"kubernetes.io/glusterfs/537d3332-69bf-4e83-b1b9-21aa84dae641-pvc-c4bf8313-e983-458d-a5ed-59e455fd9c69\") pod \"537d3332-69bf-4e83-b1b9-21aa84dae641\" (UID: \"537d3332-69bf-4e83-b1b9-21aa84dae641\") : remove /var/lib/kubelet/pods/537d3332-69bf-4e83-b1b9-21aa84dae641/volumes/kubernetes.io~glusterfs/pvc-c4bf8313-e983-458d-a5ed-59e455fd9c69: directory not empty"


@JoseFMP

JoseFMP commented Jan 7, 2021

In the meantime I upgraded to 1.18.10, and it seems I have not seen these issues again.

@ehashman
Member

Echoing @jingxu97, there are a lot of different issues being discussed in this thread. There are many possible reasons why a Pod could get stuck in terminating. We know this is a common issue with many possible root causes! :)

If you run into this issue, please ensure you file a new bug with a detailed report, including a full dump of the relevant pod YAMLs and kubelet logs. This information is necessary to debug these issues.

I am going to close this particular issue because it dates back to 1.7 and its scope is not actionable.
/close

As an FYI we had a fix just hit the master branch recently for fixing a race condition where pods created and deleted rapidly would get stuck in Terminating: #98424

The node team is letting this bake for a bit and ensuring tests are stable; I'm not sure it'll get backported as it's a large change.

@k8s-ci-robot
Contributor

@ehashman: Closing this issue.

In response to this:

Echoing @jingxu97, there are a lot of different issues being discussed in this thread. There are many possible reasons why a Pod could get stuck in terminating. We know this is a common issue with many possible root causes! :)

If you run into this issue, please ensure you file a new bug with a detailed report, including a full dump of the relevant pod YAMLs and kubelet logs. This information is necessary to debug these issues.

I am going to close this particular issue because it dates back to 1.7 and its scope is not actionable.
/close

As an FYI we had a fix just hit the master branch recently for fixing a race condition where pods created and deleted rapidly would get stuck in Terminating: #98424

The node team is letting this bake for a bit and ensuring tests are stable; I'm not sure it'll get backported as it's a large change.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@AkshayGets

I faced a similar issue on Azure Kubernetes. While forcefully deleting the pods would solve it, we wanted a permanent solution. Looking at the kubelet logs, I saw the error below:

Mar 3 19:01:43 pod_container_deletor.go:77] Container "adba604b69b6ec18da13f9323dfa47bd5373b4934419f83f8aa206f7d61a64be" not found in pod's containers

This log suggested that the container was deleted but the kubelet process somehow wasn't able to release that lock. So, as the next step, we reimaged all the nodes that had pods stuck in the terminating state. This solved the issue: all the terminating pods got deleted, and hopefully we will never see this issue again.

@fergaldonlon

Just to add to the possible causes and for the benefit of those whose google search takes them here.
I have a cluster on AWS EKS where namespaces and pods are created/terminated several times a day. Today for the first time I saw a problem where a bunch of pods got stuck in a terminating state for several hours.

I believe the cause of this was a bad underlying node:
➜ ~ kubectl get nodes -A
NAME STATUS ROLES AGE VERSION
ip-10-x-x-x.ec2.internal NotReady 4h54m v1.16.13-eks-ec92d4

(a standard AWS instance hardware/unresponsive thing)

So what I think happened is that the pods on that node got into a state whereby they could not be terminated. Terminations of pods on good nodes were unaffected.
To fix it, I drained the pods off the bad node, and as this is in an ASG I was able to just terminate the instance and spin up a new one.
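
For reference, a rough sketch of that sequence (the node name is from the output above; the instance ID is a placeholder, and the drain flags match kubectl 1.16):

# Evict everything from the bad node
kubectl drain ip-10-x-x-x.ec2.internal --ignore-daemonsets --delete-local-data --force
# Terminate the instance so the ASG replaces it
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0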

@Omniscience619

Just to add to the possible causes and for the benefit of those whose google search takes them here. I have a cluster on AWS EKS where namespaces and pods are created/terminated several times a day. Today for the first time I saw a problem where a bunch of pods got stuck in a terminating state for several hours.

I believe the cause of this was a bad underlying node: ➜ ~ kubectl get nodes -A NAME STATUS ROLES AGE VERSION ip-10-x-x-x.ec2.internal NotReady 4h54m v1.16.13-eks-ec92d4

(a standard AWS instance hardware/unresponsive thing)

So what I think happened is that the pods on that node got into a state whereby they could not be terminated. Terminations of pods on good nodes were unaffected. To fix I drained the pods off the bad node and as this in an ASG I was able to just terminate the instance and spin up a new one.

I've been seeing this behaviour on drains started by the AWS Node Termination Handler on a 100% spot cluster. I'm not sure if the nodes get interrupted faster than they can drain themselves, or if it's something else. Luckily, the replacement pods do get started on healthier nodes, even though the dead ones are stuck in Terminating.

Manually deleting the dead nodes works.
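
For completeness, removing a dead node object looks like this (the node name is a placeholder):

kubectl get nodes
kubectl delete node ip-10-x-x-x.ec2.internal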

@chenhong0129

#108041

@pnjihia

pnjihia commented Apr 2, 2022

I was only able to get rid of the "stuck in terminating" pods by deleting the finalizers:
kubectl patch -n mynamespace pod mypod -p '{"metadata":{"finalizers":null}}'
The kubectl delete pod mypod --force --grace-period=0 didn't work for me

@jinleileiking

briantopping added a commit to gardener-attic/vsphere-csi-driver that referenced this issue Mar 7, 2023
…ted on goes down. In `controller.ControllerUnpublishVolume()`, when a volume has been unmounted and we cannot find the node that the volume should be mounted on, the volume is in fact unmounted and should be reported as such without error.

This works around a bug in kubernetes/kubernetes#51835 which [appears to remain unresolved and administratively closed](kubernetes/kubernetes#51835 (comment)).

Resolves kubernetes-sigs#15

Signed-off-by: Brian Topping <brian.topping@sap.com>