Pods stuck on terminating #51835
Comments
Usually volume and network cleanup consume most of the time during termination. Can you find out in which phase your pod is stuck? Volume cleanup, for example?
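A quick way to narrow this down (pod name and namespace below are placeholders) is to check the pod's events and the kubelet logs on the node:

```
# Look at the Events section for unmount or network-teardown errors
kubectl describe pod <pod-name> -n <namespace>

# On the node running the pod, grep the kubelet logs for the pod name
sudo journalctl -u kubelet | grep "<pod-name>"
```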
Correct. They are always suspect. @igorleao You can try
Hi @resouer and @dixudx
As you can see, this cluster has Calico for CNI.
Is there a better way to find out which phase a pod is stuck in?
@igorleao Is this reproducible? Or is it just not that stable, happening only occasionally? I've run into such errors before; just want to make sure.
@dixudx it happens several times a day for a certain cluster. Other clusters created with the same version of kops and Kubernetes, in the same week, work just fine.
@igorleao As the log shows, the volume manager failed to remove the secret directory because the device is busy.
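If you want to see what is actually keeping the secret directory busy, one option (the pod UID and volume name are placeholders) is to inspect it from the node:

```
# Show processes holding the secret volume mount open
sudo fuser -vm /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/<volume-name>

# List all mounts belonging to the stuck pod
mount | grep "<pod-uid>"
```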
@igorleao how do you run kubelet? In a container? If so, can you please post your systemd unit or Docker config for kubelet? We see similar behaviour. We run kubelet as a container and the problem was partially mitigated by mounting
@stormltf Can you please post your kubelet container configuration?
@stormltf you're running kubelet in a container and don't use. For the stuck pods, can you please do the following on the node where the pod is running:
Please also do the same for a freshly created pod. I expect to see some
@stormltf did you restart kubelet after the first two pods were created?
@stormltf You can try to make
/sig storage
For some this might help. We are running kubelet in a Docker container with
Some more details: this does not properly solve the problem, as for every bind mount you'll get 3 mounts inside the kubelet container (2 of them parasitic). But at least shared mounts allow them all to be unmounted in one shot. CoreOS does not have this problem, because it uses rkt and not Docker for the kubelet container. In our case kubelet runs in Docker and every mount inside the kubelet container gets propagated into
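For reference, the mitigation described in this comment amounts to bind-mounting /var/lib/kubelet into the kubelet container with shared propagation. A rough sketch of the relevant Docker flag (the image name and kubelet arguments are placeholders; the rest of the invocation is setup-specific):

```
# Only the volume flag that matters here is shown.
docker run \
  --name kubelet \
  -v /var/lib/kubelet:/var/lib/kubelet:shared \
  <kubelet-image> <kubelet-args>
```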
I have the same issue with Kubernetes 1.8.1 on Azure: after a deployment is changed and new pods have been started, the old pods are stuck in Terminating.
I have the same issue on Kubernetes 1.8.2 on IBM Cloud. After new pods are started, the old pods are stuck in Terminating. kubectl version I have used
If the root cause is still the same (improperly propagated mounts), then this is a distribution-specific bug imo. Please describe how you run kubelet in IBM Cloud: a systemd unit? Does it have
It is run with the --containerized flag set to false.
OK, I need more info. Please see my comment above #51835 (comment), and also please show the contents of. In particular, if kubelet runs in Docker, I want to see all bind mounts.
Today I encountered an issue that may be the same as the one described, where we had pods on one of our customer systems getting stuck in the Terminating state for several days. We were also seeing the "Error: UnmountVolume.TearDown failed for volume" errors with "device or resource busy" repeated for each of the stuck pods. In our case, it appears to be an issue with Docker on RHEL/CentOS 7.4 based systems covered in this moby issue: moby/moby#22260 and this moby PR: https://github.com/moby/moby/pull/34886/files. For us, once we set the sysctl option fs.may_detach_mounts=1, all our Terminating pods cleaned up within a couple of minutes.
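For anyone applying the same workaround, the sysctl can be set on each affected node roughly like this (persisting it so it survives a reboot):

```
# Apply immediately
sudo sysctl -w fs.may_detach_mounts=1

# Persist across reboots
echo "fs.may_detach_mounts = 1" | sudo tee /etc/sysctl.d/99-may-detach-mounts.conf
sudo sysctl --system
```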
I'm also facing this problem: Pods got stuck in Terminating state on 1.8.3. Relevant kubelet logs from the node:
Kubelet is running as a systemd unit (not in a container) on Ubuntu 16.04. Volumes spec from the pod:
UPD: I faced this problem on 1.6.6 before as well.
Experiencing the same on Azure.
kubectl version
describe pod nginx-56ccc998dd-nnsvj
sudo journalctl -u kubelet | grep "nginx-56ccc998dd-nnsvj"
cat /etc/systemd/system/kubelet.service
@JoseFMP use kubectl to request the YAML from the namespace; it might have finalizers that are holding up the process.
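For example, something along these lines (pod name and namespace are placeholders):

```
# Dump the stuck pod and look for a metadata.finalizers field
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 "finalizers:"

# The namespace itself can also carry finalizers
kubectl get namespace <namespace> -o yaml | grep -A 5 "finalizers:"
```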
@JordyBottelier Thank you. No finalizers. Still stuck.
@JoseFMP here is a script to kill it off entirely (effectively nuke it). Simply save it and run ./script_name <your_namespace>:
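(The script itself is not reproduced in this excerpt; a minimal sketch of a force-delete script along those lines, not the original, might look like the following.)

```
#!/usr/bin/env bash
# Rough sketch only, not the script posted above: force-delete every pod in
# the given namespace, bypassing graceful termination. Use with care.
NAMESPACE="$1"
for pod in $(kubectl get pods -n "$NAMESPACE" -o name); do
  kubectl delete "$pod" -n "$NAMESPACE" --grace-period=0 --force
done
```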
I've also seemingly run into this, with multiple pods stuck in Terminating, including one pod which is no longer visible anywhere in my infrastructure but is still running as a ghost (it serves requests, and I can see requests being served even with a deployment scale of zero). I have no visibility into or control over this pod. How am I supposed to troubleshoot a situation like this without forcefully shutting down all nodes?
You'll have to access Docker on the node. Good luck.
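Concretely, that means something like the following on the node itself (the grep pattern and container ID are placeholders):

```
# Find the ghost container on the node
docker ps | grep <pod-name>

# Force-remove it; kubelet should then be able to finish cleanup
docker rm -f <container-id>
```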
Thanks for the direction. I was eventually able to solve this, but I'm still a bit puzzled as to how it could happen (for me the pod was completely invisible). As it was in production, things were a bit hectic and I wasn't able to perform diagnostics, but if it happens again hopefully I can make a better bug report.
Seeing a similar symptom: pods stuck in Terminating (interestingly, they all have an exec-type probe for readiness/liveness). Looking at the logs I can see: kubelet[1445]: I1022 10:26:32.203865 1445 prober.go:124] Readiness probe for "test-service-74c4664d8d-58c96_default(822c3c3d-082a-4dc9-943c-19f04544713e):test-service" failed (failure): OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown. This message repeats itself forever, and changing the exec probe to tcpSocket seems to allow the pod to terminate (based on a test; will follow up on it). The pod seems to have one of the containers "Running" but not "Ready", and the logs for the "Running" container show as if the service stopped.
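To illustrate the probe change described above (the deployment name, container index, and port are assumptions, not taken from the actual manifest), the exec probe can be swapped for a tcpSocket probe with a JSON patch:

```
kubectl patch deployment test-service --type=json -p '[
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe/exec"},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/tcpSocket", "value": {"port": 8080}}
]'
```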
This happens on containerd 1.4.0 when node load is high and vm.max_map_count is set to a higher value than the default: the containerd-shim doesn't drain the stdout FIFO and blocks waiting for it to be drained, while dockerd fails to get the event/acknowledgement from containerd that the processes are gone.
@discanto thanks for sharing this information. Is the problem being fixed or tracked?
This bug has been open for more than 3 years. Pods stuck on terminating can be caused by a variety of reasons. When reporting your case, it would be very helpful to post some of the kubelet logs to see where the pods got stuck.
not a mountpoint
UnmountVolume.TearDown failed for volume because: directory not empty
In the meantime I upgraded to
Echoing @jingxu97, there are a lot of different issues being discussed in this thread. There are many possible reasons why a Pod could get stuck in terminating. We know this is a common issue with many possible root causes! :) If you run into this issue, please ensure you file a new bug with a detailed report, including a full dump of the relevant pod YAMLs and kubelet logs. This information is necessary to debug these issues. I am going to close this particular issue because it dates back to 1.7 and its scope is not actionable. As an FYI, a fix recently hit the master branch for a race condition where pods created and deleted rapidly would get stuck in Terminating: #98424 The node team is letting this bake for a bit and ensuring tests are stable; I'm not sure it'll get backported as it's a large change.
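If you are about to file such a report, the material asked for above can be gathered roughly like this (pod name and namespace are placeholders):

```
# Full YAML of the stuck pod
kubectl get pod <pod-name> -n <namespace> -o yaml > stuck-pod.yaml

# Kubelet logs from the node the pod was scheduled on
sudo journalctl -u kubelet --no-pager > kubelet.log
```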
@ehashman: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I faced a similar issue on Azure Kubernetes. While forcefully deleting the pods would solve this, we wanted a permanent solution. While looking at the kubelet logs, we saw the error below: Mar 3 19:01:43 pod_container_deletor.go:77] Container "adba604b69b6ec18da13f9323dfa47bd5373b4934419f83f8aa206f7d61a64be" not found in pod's containers. This log suggested that the container was deleted but the kubelet process somehow wasn't able to release that lock. So, as the next step, we reimaged all the nodes which had pods stuck in the Terminating state. This solved the issue: all the terminating pods got deleted and hopefully we will never see this issue again.
Just to add to the possible causes, and for the benefit of those whose Google search takes them here: I believe the cause of this was a bad underlying node (a standard AWS instance hardware/unresponsive thing). So what I think happened is that the pods on that node got into a state whereby they could not be terminated. Terminations of pods on good nodes were unaffected.
I've been seeing such behaviour on drains started by the AWS Node Termination Handler on a 100% on-spot cluster. I'm not sure if the nodes get interrupted quicker than they can drain themselves, or if it's something else. Luckily, the replacement pods do get started on healthier nodes, even though the dead ones are stuck in Terminating. Manually deleting the dead nodes works.
I was only able to get rid of the "stuck in terminating" pods by deleting the finalizers:
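(The exact command used is not captured above; a common way to clear finalizers, if that is what is needed, is a merge patch like this, with the pod name and namespace as placeholders.)

```
kubectl patch pod <pod-name> -n <namespace> --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```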
…ted on goes down. In `controller.ControllerUnpublishVolume()`, when a volume has been unmounted and we cannot find the node that the volume should be mounted on, the volume is in fact unmounted and should be reported as such without error. This works around a bug in kubernetes/kubernetes#51835 which [appears to remain unresolved and administratively closed](kubernetes/kubernetes#51835 (comment)). Resolves kubernetes-sigs#15. Signed-off-by: Brian Topping <brian.topping@sap.com>
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
Pods stuck on terminating for a long time
What you expected to happen:
Pods get terminated
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Kubernetes pods stuck as Terminating for a few hours after getting deleted.
Logs:
kubectl describe pod my-pod-3854038851-r1hc3
sudo journalctl -u kubelet | grep "my-pod"
sudo journalctl -u docker | grep "docker-id-for-my-pod"
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3", GitCommit:"2c2fe6e8278a5db2d15a013987b53968c743f2a1", GitTreeState:"clean", BuildDate:"2017-08-03T15:13:53Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration:
AWS
OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Kernel (e.g. uname -a):
Linux ip-172-16-30-204 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Kops
Others:
Docker version 1.12.6, build 78d1802
@kubernetes/sig-aws @kubernetes/sig-scheduling