Pods stuck in Terminating state due to process not being killed #9834
Comments
From the logs and following a bit more of what the shim does, it looks like the shim thought it had actually killed the correct PID; it even prints an exit code of 143, which would've been a SIGTERM exit. Buuuut, the respective process was definitely still running. I'm puzzled. In another instance of the same issue, containerd did not print an exit code, but again the correct PID and a …
Here's some more information. This time, we didn't get the umount issues (which I believe are only a follow-up problem of the "kill not being effective" issue anyway), but I got the chance to manually run …
I've just seen another occurrence of containerd reporting an exit code while the respective process was still running.
Here's the stacktrace of the sandbox in such a case. The 68min goroutines are when the kill signal was sent. Superficially, this looks to me as if the termination signal was indeed received but the sandbox is hung while shutting down.
The sandbox has one alive process:
runsc kill sends SIGTERM by default. This signal can be handled by a process. You can try to send SIGKILL.
@avagin thanks for chiming in. After your comments, I ran a few more tests around SIGTERM handling to see if there's any regression here, but no: even if I handle SIGTERM and then ignore it in the respective container, eventually (after the …
With a bit more digging, I think what I'm looking at is that the container has been successfully removed, since its processes are gone (including its state file, as mentioned by the logs above), but the actual processes of the container are still running via the sandbox. They are holding onto the container's rootfs. The delete is then retried over and over by containerd; we keep alerting on a missing state file because we've already successfully deleted the container, and the only contention left is that the processes running under the sandbox keep a hold of the rootfs. Sorry for the probably very imprecise explanation. I'm building a mental model as I go here 😅
@manninglucas any chance that the two rather recent commits 3ab01ae and 6a112c6 have anything to do with this?
@markusthoemmes could you reproduce the issue and show the output of `ls -l /proc/PID/fd` for the runsc processes?
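(For anyone following along, a hedged sketch of one way to gather that on the node; it simply walks every runsc process and lists its open FDs — not necessarily how it was collected in this thread.)

```bash
# Dump open file descriptors for every runsc process on the node
# (sandbox and gofer alike), together with its command line.
for pid in $(pgrep -f runsc); do
  echo "== PID $pid: $(tr '\0' ' ' < /proc/$pid/cmdline)"
  ls -l "/proc/$pid/fd"
done
```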
@avagin here we go. Sadly, as noted above, the "inner" container correctly stops, so there are no processes left that I could check there. All the python FDs on the sandbox are part of the rootfs of the container image. FWIW, both …
sandbox: 90022bca840ef2fac36200ae8f3511800e883f07558b146290c3b6a9822090b8
Thanks for the debugging output. The sandbox process is holding FDs to the destroyed container's rootfs. @avagin pointed out that the gofer client (pkg/sentry/fsimpl/gofer, which is part of the sandbox process) does not clean up the dentry tree when the filesystem is unmounted. Historically the reason for this was that all file descriptors were owned by the gofer server (part of the gofer process), and when a client disconnected, the server would clean up all file descriptors and resources for that client. Cleaning up the dentry tree was therefore not required, so we avoided it. However, with directfs the gofer client owns the required file descriptors for each dentry, so failing to clean up cached dentries in the tree causes this issue. I can send a fix.
To confirm this theory, could you configure runsc with …
@ayushr2 this is running with …
Ah, my bad, I missed that. Even with directfs=false, the gofer donates a host file descriptor for regular files to the sandbox (so that the sandbox can read/write the FD directly without making RPCs). I suspect that's what is going on; all the open FDs shown above seem to be for regular files.
Can you check if this issue reproduces with …
Sadly, I don't have a good reproducer for this, so I can't easily confirm. I'll have to get a new version into the pipeline and let it sit for a bit to see if it happens again. What doesn't quite check out to me (but might be totally down to my relative freshness with gVisor in K8s) is that under the … Does that still sound consistent with your theory?
Sorry if I am re-iterating what has already been discussed above; how are you observing that the containerized python process is still running? AFAIK, all processes of all containers in a sandbox run in the sentry context. The sentry allocates a "Task Goroutine" for each application task. So observing these containerized processes from the host should make it look like the sandbox process is running the application code. The sentry does not create host sub-processes for application processes. So how do you observe that a certain python process in the to-be-deleted container is still running from the host? Or are you using application logs to confirm?
The sentry imposes a "dentry cache", hence the sentry can be holding the FDs even if the python application has closed them. To confirm whether the python process is really the one holding these FDs, you can run …
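(Not necessarily the command being suggested above — just a hedged sketch of this kind of host-side check, assuming `lsof` is installed on the node and using a placeholder for the container's bundle/rootfs path:)

```bash
# Show which host processes hold files under the container's rootfs open.
# The rootfs path below is a placeholder; take the real one from
# `crictl inspect <container-id>` or the containerd state directory.
ROOTFS=/run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id>/rootfs
lsof +D "$ROOTFS"
```

If only the runsc sandbox process shows up as holding files under the rootfs, that matches the explanation above: the container's python tasks are goroutines inside the sentry, not host processes.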
Here's some more info on the container above. Note how the container is in STOPPED state (and its runsc and runsc-gofer processes are gone) but the sandbox with all its subprocesses is still running. Anecdotally (not sure if this helps narrow this down or is something else entirely), I've also seen hangs where the container's runsc and runsc-gofer processes were not gone and containerd wasn't hanging on the umount, but was instead hanging sending KILL signals with the processes not budging. The mix of these is what led me to pick the title of the issue 😅
Sandbox: 90022bca840ef2fac36200ae8f3511800e883f07558b146290c3b6a9822090b8
To not clutter things too much, here's the state of the "anecdotal" container I was talking about above: https://gist.github.com/markusthoemmes/7ebf064b44b1a182f552fbe1cc1b9150. I've tried to gather as much info as I could think of on that one as well.
Just to clarify, the …
@avagin and I had a chat; some updates:
Thanks @ayushr2 and @avagin for taking a look, it's greatly appreciated! Would the new hypothesis also explain the pod being stuck as mentioned in #9834 (comment)? It seems like we're having both of these symptoms at the same time, so they might be correlated or might be completely different 😅.
Otherwise we will leak these control FDs and the sandbox will continue holding on to them. Note that the dentry cache is not destroyed on Release(). This can prevent the container filesystems from being unmounted in a multi-container scenario due to EBUSY errors.
Updates #9834
Investigated-by: Andrei Vagin <avagin@google.com>
PiperOrigin-RevId: 596770126
Today I compared the shutdown behavior of "normally behaving" pods and hanging pods to poke at this some more. Interestingly, the line that's missing in the hanging case is "shim disconnected", and sure enough, there's one leaked shim on the respective node! Does this support the cause you've been speculating about above, or does this potentially point at a containerd issue?
EDIT: Killing the shim in this case does not unwedge the situation.
normal pod
hanging pod
Aaaaand one more datapoint in my quest to find the nugget of info that I'm looking for: the goroutine dump of the "leaked" shim.
goroutine dump of leaked shim
Sadly, I'm not able to see anything fishy in here. I'm saying this shim leaked because it doesn't have any child processes. That in itself might be me just being wrong.
I've made some progress on this: I was able to reproduce this sortakinda reliably (not reliably on my dev clusters yet) by force-exceeding the ephemeral-storage limit in the respective pod. The …
We haven't adjusted the overlay2 setting, so we should be running with …
Edit: I can cautiously confirm that using …
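(For anyone wanting to experiment with the overlay2 setting: a hedged sketch of how runsc flags can be passed through the containerd shim's runsc.toml, assuming the shim is configured with `ConfigPath = "/etc/containerd/runsc.toml"` as in the gVisor containerd setup docs. The particular flag values are illustrative, not the ones used in this thread.)

```bash
# Write a runsc.toml that containerd-shim-runsc-v1 passes to runsc,
# then restart containerd so new sandboxes pick it up.
cat > /etc/containerd/runsc.toml <<'EOF'
[runsc_config]
  overlay2 = "none"
  debug = "true"
  debug-log = "/var/log/runsc/%ID%/"
EOF
systemctl restart containerd
```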
Interesting, thanks for the investigation! We have internal tests which test exceeding the ephemeral storage limits in GKE with …
Thanks @ayushr2. Here we go: I've gotten it to reproduce with the below PodSpec. It's been a little finicky, so I left seemingly unrelated parts in it as well. FWIW, I've tested this on the latest gVisor release as well, just in case; same behavior there. So as a reminder: …

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: repro
spec:
  restartPolicy: Always
  containers:
  - name: repro
    image: shlinkio/shlink@sha256:c70cf1b37087581cfcb7963d74d6c13fbee8555a7b10aa4af0493e70ade41202
    env:
    - name: INITIAL_API_KEY
      value: foobar
    - name: DEFAULT_DOMAIN
      value: foo.bar
    resources:
      limits:
        cpu: "1"
        ephemeral-storage: 4G
        memory: 2Gi
      requests:
        cpu: 200m
        ephemeral-storage: 400M
        memory: "858993459"
```

You can trip the issue by running …
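(The exact command isn't preserved above; a hedged guess at a trigger that exceeds the pod's 4G ephemeral-storage limit would be something like:)

```bash
# Fill the container's writable layer past the 4G ephemeral-storage limit,
# prompting the kubelet to evict the pod; "repro" is the pod name from the spec above.
kubectl exec repro -- dd if=/dev/zero of=/tmp/fill bs=1M count=5000
```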
Sometimes this works fine, leaving the pod in an … For some reason that is still beyond me, it happens much more consistently when doing the same thing to the pods of the following deployment:

```yaml
apiVersion: v1
data:
  DEFAULT_DOMAIN: Zm9vLmJhcg==
  GEOLITE_LICENSE_KEY: ""
  INITIAL_API_KEY: Zm9vYmFy
kind: Secret
metadata:
  name: shlinkio-shlink-secrets
type: Opaque
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shlinkio-shlink
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 730
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app-component: shlinkio-shlink
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        env-var-hash: 1c3af241bb8862b3fc57e28a31797c39
      creationTimestamp: null
      labels:
        app-component: shlinkio-shlink
    spec:
      automountServiceAccountToken: false
      containers:
      - env:
        - name: CNB_PROCESS_TYPE
          value: web
        - name: KUBERNETES_PORT
        - name: KUBERNETES_PORT_443_TCP
        - name: KUBERNETES_PORT_443_TCP_ADDR
        - name: KUBERNETES_PORT_443_TCP_PORT
        - name: KUBERNETES_PORT_443_TCP_PROTO
        - name: KUBERNETES_SERVICE_HOST
        - name: KUBERNETES_SERVICE_PORT
        - name: KUBERNETES_SERVICE_PORT_HTTPS
        - name: PORT
          value: "8080"
        envFrom:
        - secretRef:
            name: shlinkio-shlink-secrets
        image: shlinkio/shlink@sha256:c70cf1b37087581cfcb7963d74d6c13fbee8555a7b10aa4af0493e70ade41202
        imagePullPolicy: IfNotPresent
        name: shlinkio-shlink
        ports:
        - containerPort: 8080
          name: http-8080
          protocol: TCP
        readinessProbe:
          failureThreshold: 9
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "1"
            ephemeral-storage: 4G
            memory: 2Gi
          requests:
            cpu: 200m
            ephemeral-storage: 400M
            memory: "858993459"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      enableServiceLinks: false
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app-component: shlinkio-shlink
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
```
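(A hedged sketch of how one might watch for the failure mode after triggering eviction in one of the deployment's pods; these are standard kubectl/crictl invocations, not the exact commands used in this thread.)

```bash
# Watch the pods: in the bad case an evicted/deleted pod stays in Terminating indefinitely.
kubectl get pods -l app-component=shlinkio-shlink -w

# On the node, check whether the sandbox is still alive even though the
# container is reported as stopped (matching the states described above).
crictl pods --name shlinkio-shlink
crictl ps -a --name shlinkio-shlink
ps aux | grep runsc
```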
I have been trying to use the following GKE Reproducer:
I have only been able to repro the …
Let me try the deployment...
No luck with the deployment OR using the latest runsc from master.
@markusthoemmes Could you check the stuck container's rootfs with … This file is deleted when the container is destroyed. From your logs above, you can see that the sandbox is still holding an FD to this file:
It shows that the file is deleted (which means …). Meanwhile, can you try reproducing with …
Actually, the other FDs shown in this comment are also from the container rootfs. They don't have the … The filestore FD is closed when the sentry overlayfs mount (and hence the tmpfs upper layer) is unmounted inside the sentry. Given this FD is still open, it means that the rootfs mount is not being unmounted inside the sentry (hence leaking all these host FDs). So I don't think …
Oh, I missed this edit. So …
The presence of the …
The fact that the filestore file is marked …
But apparently this issue does not happen with …
@markusthoemmes While reproducing this comment, were you running any other containers in the pod (apart from the pause container)? Or was the stuck container the only one? I am trying to understand if the other FDs to files like …
I can confirm that …
This is very anecdotal. The reproducer is so unreliable that I just can't tell for sure, sadly. I've been trying to get it reproduced with full debug logging enabled to get us closer...
It was the only container in the pod.
For some reason, my reproducer isn't reproducing for me anymore either. 😢
FWIW, we keep seeing this with our server on Google Cloud Run. Each request executes runsc and this seems to accumulate exe processes until the limit is reached. Each subsequent request then fails.
We're in the process of rolling out …
I've tried …
@zpavlinovic are you seeing the exact same issues with failures to unmount the respective container's rootfs?
I am seeing the same issues I've been seeing before. They manifest as extra processes that just keep piling up. I am not sure if the underlying issue is …
So sadly, we're still seeing containers getting stuck even with …
sandbox log
container log
It is worth noting that the metrics of those hanging containers spike to 100% CPU and stay there. This happens before the container is even asked to shut down, suggesting to me that something goes wrong and breaks before that happens, and that trying to terminate the pod just surfaces the symptom more clearly.
We've had another instance of the process reporting 100% CPU load consistently after a certain point, and we know that the process literally did nothing (apart from listening on a socket, hosting an HTTP server... the usual stuff). To me, this increasingly looks like something getting tripped in gVisor at runtime rather than a problem exclusively at deletion time, which makes it even more concerning and harder to detect reliably. @ayushr2 does this ring any bells to you in terms of what the underlying issue could be?
Nope, nothing stands out immediately. Hard to investigate without a reproducer. I have a few suggestions:
I fear I can't enable profiling, as we only somewhat reliably see this at prod scale. I'll see what I can do on that front. FWIW, it seems like the CPU usage comes from the …
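(One low-overhead data point that doesn't require profiling to be enabled, assuming node access: a goroutine stack dump of the hot sandbox via `runsc debug --stacks`. The `--root` path below is a guess at the usual containerd location and may differ on your nodes.)

```bash
# Dump the sentry's goroutine stacks for the busy sandbox; repeating this a few
# times can show which goroutines are spinning while CPU sits at 100%.
runsc --root /run/containerd/runsc/k8s.io debug --stacks <container-id>
```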
FWIW, I don't know if overlay2=none had an impact here or not, but since doing that, all hanging containers have also shown the symptoms of #10000.
Now that #10000 is resolved, could you try removing …
@ayushr2 that's precisely what we were thinking as well, and we're going to do that. I'll let you know and/or close this issue if we don't see this come back.
@ayushr2 I finally got around to doing this. Sadly, we still see pods hanging on deletion. The error symptoms seem to have remained identical.
Hi @markusthoemmes, Sorry for the delay. I read through this issue again.
Is the error still …? If that's the case, it appears that the sandbox is not releasing the rootfs … We also know that … (see lines 83 to 85 at commit 87d8df3):
Note that …
Could you pass …
Also, since this (the stuck container) is the only container in the pod (apart from the pause container), when you hit this state, can you just delete the sandbox itself? Or is that hung too?
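(A hedged sketch of what "deleting the sandbox itself" could look like with crictl on the node; the pod name is a placeholder.)

```bash
# Try to stop and remove the pod sandbox directly, bypassing the kubelet.
POD_ID=$(crictl pods --name <pod-name> -q)
crictl stopp "$POD_ID" && crictl rmp "$POD_ID"
```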
Yes, we're running with …
Switching …
Description
We're seeing pods getting stuck in a Terminating state permanently. Any commands done via `crictl` or `ctr` fail in the same way as the containerd and kubelet logs suggest (see below). It's only if we manually kill the sandbox PID (as surfaced by `crictl inspect` and `crictl inspectp`) that the deletion process unwedges.

I think what's happening is: … (confirmed by running `lsof` onto the rootfs and following the hierarchy of the using PID back up to 1101026, the PID of the container/sandbox).

I'm a bit at a loss for further debugging to be honest, and I don't know if gVisor or containerd is at fault here either. It's happening in a live system fairly regularly but we don't have a reproducer yet either. I'd appreciate any thoughts on further information gathering to pinpoint exactly what is going on.
Pod state
container state in crictl
Note that CONTAINER_RUNNING here is inconsistent with STOPPED in containerd state.
pod state in crictl
container state in ctr
pod state in ctr
Logs grepping for the pod
Logs grepping for the container
Steps to reproduce
None yet, sadly.
runsc version
docker version (if using docker)
No response
uname
6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
No response
runsc debug logs (if available)
No response