failed to garbage collect required amount of images. Wanted to free 473842483 bytes, but freed 0 bytes
#71869
Comments
/sig gcp |
I just upgraded my master version and nodes to 1.11.3-gke.18 to see if that would be any help, but I'm still seeing the exact same thing. |
FWIW "Boot disk size in GB (per node)" was set to the minimum, 10 Gb. |
@samuela any update on the issue ? I see the same problem. |
@hgokavarapuz No update as far as I've heard. Def seems like a serious issue for GKE. |
@samuela I saw this issue on AWS but was able to work around it by using a different AMI. I still have to check what difference in the AMI causes it to happen. |
@hgokavarapuz Interesting... maybe this has something to do with the node OS/setup then. |
Have to debug more to figure out what exactly causes this issue, though.
|
@hgokavarapuz check the kubelet logs for clues |
I was able to fix mine. It was an issue with the AMI I was using, which had the /var folder mounted on an EBS volume of restricted size, causing the problem with Docker container creation. It was not directly obvious from the logs, but checking the disk space and other things made it clear. |
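For anyone debugging a similar setup, a quick way to check whether the image filesystem sits on an undersized or unexpected volume is to compare mounts and free space on the node. A minimal sketch, assuming SSH access and a Docker-based node (paths differ for containerd):

# which devices back the root, /var, and the Docker image directory, and how full they are
df -h / /var /var/lib/docker
# how much of that space is held by images, containers, and volumes
docker system df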
@hgokavarapuz Are you sure that this actually fixes the problem and doesn't just require more image downloads for the bug to occur? In my case this was happening within the GKE allowed disk sizes, so I'd say there's definitely still some sort of bug in GKE here at least. It would also be good to have some sort of official position on the minimum disk size required in order to run kubernetes on a node without getting this error. Otherwise it's not clear exactly how large the volumes must be in order to be within spec for running kubernetes. |
@samuela I haven't tried GKE, but that was the issue on AWS with some of the AMIs. Maybe there is an issue with GKE too. |
We're hitting something similar on GKE v1.11.5-gke.4. There seems to be some issue with GC not keeping up, as seen by the following events:
Scanning the kubelet logs, I see the following entries:
It seems like something is preventing the GC from reclaiming the storage fast enough. The node looks like it eventually recovers, but some pods get evicted in the process. |
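To watch this happen on a node, something like the following can surface the relevant events and kubelet log entries. A sketch, assuming a systemd-managed kubelet; NODE_NAME is a placeholder:

# disk-pressure and eviction events involving the node
kubectl get events --all-namespaces --field-selector involvedObject.name=NODE_NAME
# image GC activity, run on the node itself
journalctl -u kubelet | grep -E 'image_gc_manager|ImageGCFailed'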
I am encountering the same issue. I deployed the stack with kops on AWS, and my k8s version is 1.11.6. The problem is that I have application downtime once per week when the disk pressure happens. |
Same issue here. I extended the EBS volumes thinking that would fix it. |
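Worth noting that growing an EBS volume alone is not always enough: the partition and filesystem usually have to be expanded as well, or the node keeps reporting the old size. A sketch for a typical node with an ext4 root on an NVMe device; the device and partition names are assumptions, so check lsblk first:

lsblk                           # confirm the device and partition layout
sudo growpart /dev/nvme0n1 1    # grow partition 1 into the resized volume
sudo resize2fs /dev/nvme0n1p1   # grow the ext4 filesystem (use xfs_growfs for XFS)
df -h /                         # verify the new size is visible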
I faced a similar issue, but on AKS. When we scale the cluster down and I SSH into one of the remaining nodes, I can see plenty of old images still lying around, which is crazy, as I also see plenty of the errors below: |
@samuela: There are no sig labels on this issue. Please add a sig label by either:
Note: Method 1 will trigger an email to the group. See the group list. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I am hitting this on OpenStack using v1.11.10. The node is completely out of disk space, and the kubelet logs are now a loop of:
|
The issue for me was caused by a container taking up a lot of disk space in a short amount of time. This happened on multiple nodes. The container was evicted (every pod on the node was), but the disk space was not reclaimed by the kubelet. I had to clean it up manually. |
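The exact cleanup command was lost from the comment above, but a typical manual cleanup when the kubelet's GC is stuck looks something like this. A sketch assuming SSH access to the node; use the Docker variant on Docker-based clusters and the crictl variant on containerd (crictl rmi --prune needs a reasonably recent crictl):

# Docker runtime: remove stopped containers, then unused images
docker container prune -f
docker image prune -af
# containerd runtime: remove unused images via the CRI
crictl rmi --prune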
This is happening for me too. I have 8 nodes in an EKS cluster, and for some reason only one node is having this GC issue. This has happened twice, and the below steps are what I've done to fix the issue. Does anyone know of a better / supported method for doing this? https://kubernetes.io/docs/tasks/administer-cluster/cluster-management/#maintenance-on-a-node
|
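For reference, the drain workflow from the linked page boils down to a few commands. A sketch with NODE_NAME as a placeholder; older kubectl versions use --delete-local-data instead of --delete-emptydir-data:

kubectl cordon NODE_NAME       # stop new pods from landing on the node
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data
# ...clean up or replace the node here...
kubectl uncordon NODE_NAME     # mark the node schedulable again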
Faced the same problem.
|
Thank you very much. It worked for me. |
I just had the same issue on a customer's RKE2 cluster:

I0607 16:27:03.708243 7302 image_gc_manager.go:310] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=89 highThreshold=85 amountToFree=4305076224 lowThreshold=80
E0607 16:27:03.710093 7302 kubelet.go:1347] "Image garbage collection failed multiple times in a row" err="failed to garbage collect required amount of images. Wanted to free 4305076224 bytes, but freed 0 bytes"

The actual problem was caused by an "underlying" full /var filesystem: stale Longhorn data was still sitting in /var/lib/longhorn on the root filesystem, hidden underneath the volumes mounted on top of it. To actually resolve this without downtime or any impact on the data read/write operations in the currently mounted volumes, I used a bind mount to reach and delete the hidden files:

mkdir /mnt/temp-root
mount --bind /var /mnt/temp-root
ls -la /mnt/temp-root/lib/longhorn
rm -rf /mnt/temp-root/lib/longhorn/*
umount /mnt/temp-root
rmdir /mnt/temp-root

Source and explanation of a "bind mount": https://unix.stackexchange.com/questions/198590/what-is-a-bind-mount

Thanks, @andrecp, for the hint regarding "dropped mounts" (#71869 (comment))!

Regards, |
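A quick way to tell whether files are hiding underneath a mount point like this is to compare what the live paths report with what a bind mount of the parent filesystem sees. A sketch reusing the temporary bind mount and the Longhorn path from the comment above:

mkdir -p /mnt/temp-root
mount --bind /var /mnt/temp-root
du -sh /var/lib/longhorn            # what the live mounts show
du -sh /mnt/temp-root/lib/longhorn  # what is actually on the underlying filesystem
umount /mnt/temp-root

If the second number is large while the live path looks nearly empty, stale files are being shadowed by the mount.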
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
@SergeyKanzhelev is the issue still up for grabs? I'm a new contributor and though I might need help in getting this implemented, I would still like to give this a shot if available. |
Feel free to take it! I took this ages ago, but wasn't knowledgeable enough about how this worked / Go in general. |
/assign |
I'm having some garbage collection and disk space issues too (I run microk8s). I've changed the settings for the image GC thresholds. It sounds like a not-ideal default garbage collection would be better than none! I've also had this problem with Docker in the past, just keeping endless ephemeral disks after updating a container. |
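For context, the kubelet's image GC thresholds are tunable, either as (deprecated) kubelet flags or as imageGCHighThresholdPercent / imageGCLowThresholdPercent in a KubeletConfiguration file. A sketch of the flag form; the values are illustrative, not recommendations (the defaults are 85 and 80):

# start image GC when the image filesystem passes 80% usage, and free space down to 70%
kubelet --image-gc-high-threshold=80 --image-gc-low-threshold=70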
/assign |
I had the same problem, and it was caused by the bad idea of mounting a GCS bucket with FUSE as storage for k3s. Unmounting the bucket and restarting the node solved the problem for me. I used gcsfuse to mount the bucket; it's a really useful tool, but not for high r/w demands. It's maybe ideal for backups, isolated or managed, but not as block storage. Regards from Chile. |
/assign |
/remove-lifecycle stale |
/assign |
/unassign |
What happened: I've been seeing a number of evictions recently that appear to be due to disk pressure:
Taking a look at kubectl get events, I see these warnings:

Digging a bit deeper:
There's actually remarkably little here. This message doesn't say anything about why ImageGC was initiated or why it was unable to recover more space.
What you expected to happen: Image GC to work correctly, or at least fail to schedule pods onto nodes that do not have sufficient disk space.
How to reproduce it (as minimally and precisely as possible): Run and stop as many pods as possible on a node in order to encourage disk pressure. Then observe these errors.
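A rough way to force this condition on a Docker-based test node, assuming direct access to it; the image tags below are arbitrary large public images, chosen only to fill the image filesystem quickly:

# pull a series of large images until the image filesystem crosses the GC threshold
for tag in 2.0.0 2.1.0 2.2.0 2.3.0; do
  docker pull tensorflow/tensorflow:$tag
done
df -h /var/lib/docker    # watch usage climb past the kubelet's high threshold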
Anything else we need to know?: n/a
Environment:
- Kubernetes version (use kubectl version):
- Kernel (e.g. uname -a): Darwin D-10-19-169-80.dhcp4.washington.edu 18.0.0 Darwin Kernel Version 18.0.0: Wed Aug 22 20:13:40 PDT 2018; root:xnu-4903.201.2~1/RELEASE_X86_64 x86_64
/kind bug