Pods do not start on upgrade from 1.24.8-rancher1-1 to 1.24.9-rancher1-1 #3160
Update: I then "reversed" the upgrade steps: I tried a full node upgrade first, then a k8s upgrade. Big mistake. The node upgrade stopped all running pods on the cluster. My current node provisioning script runs a full apt upgrade BEFORE installing Docker using the install-docker script, and it seems that running an apt update/upgrade on the nodes after the fact puts an unsupported version of Docker on the node. Which means one of my clusters is in a precarious position, running Docker 23 with 1.24.9-rancher1-1. I'm really not sure why this is as difficult as it is. Either I am doing something incredibly wrong or there is a lack of stability in these updates. For now, my safest bet is to provision new nodes before I do any upgrades of any kind. Some guidance on this would be much appreciated.
I'm staying on Docker 20.10.22 until Rancher supports Docker 23. Issue requesting support for Docker 23: rancher/rancher#40417
@sbonds I honestly didn't mean to upgrade to 23. I seem to have a lot of issues when replacing nodes, similar to the above. There's no telling whether a small Kubernetes upgrade will cause my pods to completely stop or not, and I can't tell if a restart is required or not.
It looks like I have similar issues: my containers were all in an error state, and rebooting the VMs helped and let everything work again. But I would like to upgrade without having to reboot my VMs; because we run Longhorn, it takes some time before the whole cluster is up and running again.
Having the same issue and also running Docker 23. I was attempting to upgrade from 1.24.8 to 1.25 and encountered the same error.
Same issue with v1.24.10-rancher4-1; any kubelet restart causes errors in the kubelet log.
@priitr-ent have you tried restarting the entire machine? What we ended up doing was running the upgrade, at which point this message popped up and failed the upgrade. Then we rebooted all of our cluster machines and triggered the upgrade again.
Yeah, that also remediates the error.
We've been hitting this issue as well, either when upgrading from 1.23.x to 1.24.x or from 1.24.x to 1.24.x.
@priitr-ent @electrical If possible, could you post
@kinarashah https://gist.github.com/priitr-ent/8c0129d92ef081bea4403518640a32ec
I see the same issue on a test cluster running on Azure VMs, on the Rancher cluster itself (running on vSphere), and on a downstream cluster of it when I upgrade to Kubernetes 1.26.6. If I go back to version 1.24.8 the issue disappears. I always get this when the kubelet container restarts on Kubernetes 1.26.6.
@kinarashah any update after supplying the logs?
Same for me (of course 3280 too).

- RKE: 1.4.6 or 1.4.8
- Ubuntu: 22.04.2
- Kernel: Linux 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Client: Docker Engine - Community
- Server: Docker Engine - Community

Workaround that works: switch back to cgroupv1 via the kernel parameter systemd.unified_cgroup_hierarchy=0. This is surprising, since SUSE lists Ubuntu with this Docker version as functional in the support matrix for RKE 1.4.6.
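For anyone wanting to try the same workaround, this is roughly how that kernel parameter is set persistently on Ubuntu via GRUB (a sketch, assuming the default GRUB setup; adjust to your own boot configuration):

```sh
# Append the parameter to the kernel command line (default GRUB_CMDLINE_LINUX assumed).
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 systemd.unified_cgroup_hierarchy=0"/' /etc/default/grub

# Regenerate the GRUB config and reboot so the node comes back up on cgroup v1.
sudo update-grub
sudo reboot
```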
My basic ways to reproduce this have not led to success. If there is anyone who can reproduce this with a stock cloud image, can you please share which cloud and which image so I can use it? I have used Ubuntu 22.04 on AWS and Debian 11 on DigitalOcean without success. As not everyone is hitting this issue, there needs to be some specific software version/configuration/deployment present that causes this issue. If you don't have a stock cloud image to reproduce, please share the following outputs:
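Something along these lines, collected on an affected node with standard Docker and OS tooling, covers the usual version and configuration details (a sketch; adapt to what is actually being requested):

```sh
# Container runtime version and configuration (docker info includes the cgroup driver and version in use)
docker version
docker info

# OS release and kernel
cat /etc/os-release
uname -r
```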
If you can only reproduce it on your own infra, maybe you can add verbose logging to the kubelet.
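One way to bump kubelet verbosity in an RKE-managed cluster is through the kubelet's extra_args in cluster.yml (a sketch; the verbosity level is an arbitrary choice, and the heredoc is only used to keep the example self-contained; in practice you would edit cluster.yml directly):

```sh
# Hypothetical cluster.yml fragment raising kubelet log verbosity.
cat <<'EOF' >> cluster.yml
services:
  kubelet:
    extra_args:
      v: "6"   # kubelet log verbosity; higher values log more detail
EOF

# Apply the change, then watch the kubelet container logs on the node.
rke up --config cluster.yml
docker logs kubelet --tail 200
```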
Thanks for the offer, @superseb. I'm using a self-built image based on https://github.com/David-VTUK/Rancher-Packer. Here are the version numbers you requested:
What's really odd is that I see this in the Rancher provisioning log:
I don't see any traffic between the VM that hosts the Rancher docker-compose environment and the control/etcd node. By the way, I also posted on the Rancher Users Slack instance before finding this issue:
@TomyLobo if still available, can you upload the full Rancher (provisioning) logs? It can't take any action on the node if it can't connect, so I suspect there are more logs from before that actually did something. In the Slack message you mention that there was a network issue during the upgrade; did it only happen during that time, or can you reproduce it from scratch?
I've been encountering this issue as well for all upgrades when using Rancher 2.7.6. However, there's a caveat: this only seems to be a problem with clusters of more than 3-6 nodes. I've created and destroyed a vSphere-type cluster multiple times: the initial build always succeeds without issue, but when upgrading, some part of that process invariably breaks it. I can see that the Kubernetes-related slices are removed from the node OS (they do not exist in /run/systemd/transient), but a restart of the kubelet container is enough to fix it. Unfortunately, this can happen multiple times to a node during the upgrade process, especially when it fails and retries at a later time once the node is healthy again. What's worse is that the nodes aren't affected every time; only about 80% of them are. From what I can see, only one thing avoids this issue: using the cgroupfs driver instead of the systemd one. I can replicate this issue 100% of the time with a 3-control/3-etcd/6-worker cluster on Ubuntu 22.04, no matter which version of Docker is used.
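For reference, if the cgroup driver is picked up from Docker's configuration (which is how RKE typically determines it), switching a node from systemd to cgroupfs would look roughly like this (a sketch, assuming no other exec-opts are already configured in daemon.json; note that restarting Docker restarts the containers on the node):

```sh
# Set Docker's cgroup driver to cgroupfs (this overwrites daemon.json; merge by hand if it already has content).
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF

# Restart Docker so the new driver takes effect, then verify.
sudo systemctl restart docker
docker info | grep -i "cgroup driver"
```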
This happened to me with a 3-node and a 4-node cluster, so it's definitely also happening with smaller clusters.
I might have mentioned this before: the 2nd cluster upgrade was not during a network issue. About logs:
This also happens if I just change anything in the cluster config, say the vSphere CSI password.
Any logs relating to this issue help, as there is no solid lead yet on what is causing this. If you can reproduce it so reliably, can you also get the logs in debug or even trace (watch for sensitive info being logged in trace)? Did you see the SSH tunneling error again? Also, please include the output of
I have just hit this for the first time. I upgraded our DEV and TEST clusters yesterday from 1.24.10 to 1.26.8. Both clusters have 8 worker nodes, 6 of them Ubuntu 20.04 and 2 Ubuntu 22.04. The syslog on the 22.04 nodes shows many rows like this.
I saw the same issue in our cluster a while ago. After some investigation we traced the problem to the change from cgroupv1 to cgroupv2, which came with the change from Debian 10 to 11 if I remember correctly --> https://kubernetes.io/docs/concepts/architecture/cgroups/. I was able to change the script (not persistently) like in this PR, and then the issue was gone. Additionally, I ran into some related issues where systemd (cgroupv2) was not able to create the cgroups right after the first boot; it was necessary to reboot the VM first. In that case, no container is able to run, not even kubelet.
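For anyone unsure which cgroup version their nodes ended up on after a distribution upgrade, a quick check with standard Linux and Docker tooling:

```sh
# cgroup2fs means the node is on cgroup v2 (unified hierarchy); tmpfs means cgroup v1.
stat -fc %T /sys/fs/cgroup/

# Docker reports both the cgroup driver and the cgroup version it is using.
docker info | grep -iE "cgroup (driver|version)"
```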
@bananflugan @kgrando Any information you have on reproducing helps here, as we cannot reproduce it from just using Ubuntu 22.04. @bananflugan Are the Ubuntu 22.04 nodes using cgroupsv2? And do you have any additional info on the way the nodes were added/updated? Was it a new image with Ubuntu 22.04, or was it updated from Ubuntu 20.04? Was it updated while in the cluster? Did it include a Docker update as well? From what version to what version? Was the node rebooted after the update(s)? @kgrando A restart of all pods is different from pods not starting; are you seeing pods not starting after the upgrade?
@superseb Sorry, it was a while ago and I no longer have all the details in mind. We definitely had the situation where only the kubelet was running and none of the other pods were, because of this cgroup error.
My 22.04 nodes were added using the stock Ubuntu Server ISO. The only non-default packages installed on them are Docker, realmd (for joining to AD), and zabbix-agent. Except for the k8s upgrade, nothing else was changed on any host. When the upgrade was done I noticed that every container on the 22.04 hosts was not working, while the rest, running 20.04, were fine. I then rebooted the 22.04 servers and all was fine after that. The Docker version I'm running in all clusters at the moment is 23.0.6.
My nodes are Ubuntu 22.04 as well, with no relevant extras either.
Hi, any news on this? We are facing this problem on our clusters too, with kubernetes_version: v1.25.13-rancher1-1. If you need logs, I can provide logs collected with the log collector script. From the kubelet log:

Kind regards
The only way to fix this problem for us (for the moment) was to reboot the affected node. The problem occurs during the rke run on all nodes where kubelet has to be restarted; the etcd, controlplane, and worker roles are all affected. The rke run on the downstream cluster gets stuck at the kubelet healthcheck on the affected node.
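When an rke run is stuck at the kubelet healthcheck, it can help to look at the node directly before deciding to reboot. A minimal sketch, assuming the standard RKE setup where the kubelet runs as a Docker container named kubelet and uses its default healthz port:

```sh
# Is the kubelet container running, and what is it logging?
docker ps --filter name=kubelet
docker logs kubelet --tail 100

# Query the kubelet's default healthz endpoint (localhost:10248).
curl -s http://127.0.0.1:10248/healthz; echo
```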
@steffeneichler Could you upload the logs? You could also DM me on Slack or email them to kinara.shah@suse.com if that's preferable. I want to compare the before and after kubelet process args. I haven't been able to reproduce this myself, so I don't have a good root cause for this issue yet. It seems like there are 2 issues here: a kubelet restart is restarting all user pods (which I do see), but I don't see the kubelet error for those pods. If anyone else has a sample workload YAML + cluster YAML they're running, that'd be helpful as well. Versions I have checked:
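For the before/after kubelet process args comparison being asked for here, the arguments can be captured from the container on each node before and after the upgrade (a sketch, assuming the standard RKE setup where the kubelet runs as a Docker container named kubelet):

```sh
# Dump the kubelet container's entrypoint and arguments before the upgrade.
docker inspect kubelet --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}' > kubelet-args-before.txt

# After the upgrade (or after the kubelet container has been recreated), capture again and diff.
docker inspect kubelet --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}' > kubelet-args-after.txt
diff kubelet-args-before.txt kubelet-args-after.txt
```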
I've seen this as well with RKE 1.4.6 updating 1.23.7 -> 1.25.9, with both dockerd 23.0.6 and 24.0.9. rke will ultimately fail with
at which point rebooting the nodes seems to be the fix. Rerunning
until the host is rebooted.
Last week we updated 7 clusters from k8s 1.25.13 to 1.26.11. Ubuntu 22.04 is installed on all of these clusters. "Unfortunately" the problem didn't occur again.
@steffeneichler Thank you for trying! @shalomjacob has also been trying to reproduce it but couldn't. I'll look at the diff between v1.4.x and v1.5.x to see if something stands out. Rancher embeds the RKE version, so it'll depend on the Rancher version. Are you using Rancher v2.8.x? v2.8.x versions use RKE v1.5.x.
@jhoblitt Any chance you collected logs or kubelet container args? Are you able to reproduce this consistently? If so, any logs you collected would be helpful.
@kinarashah I have a cluster update scheduled for Monday. I can collect any logs you're interested in then.
Which CNI are you using, Cilium or Calico?
@jhoblitt I got the same error as you, but it happens on some nodes, not all.
Try it if your file cannot be found.
@huv95 Thanks for sharing what worked for you. To clarify, there was no path for
@kinarashah No when error occurs. |
@huv95 Thank you for explaining the behavior, appreciate it. Yeah, I've been looking into the issue you linked and kubelet's code around when it updates cgroups, but I haven't reproduced the error yet; I will try killing the container and see if it works.
@kinarashah Or an easier way
@huv95 The version you're upgrading to (
Could you try upgrading to the latest RKE and try upgrading to
I will take a look into the upstream code more if you still reproduce the issue, but I suspect this should reduce/fix the pods restarting issue.
@kinarashah I just upgraded from v1.25.12 to v1.26.14, using rke:1.5.7.
We just got hit by this bug, except for us it happened weeks after updating Rancher from
The only thing I could find in the Rancher control plane logs was the following:
@tunaman
RKE version:
1.4.2
Docker version: (`docker version`, `docker info` preferred)

Operating system and kernel: (`cat /etc/os-release`, `uname -r` preferred)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Hyper-V Ubuntu 22.04 nodes
cluster.yml file:
Steps to Reproduce:
I continue to see issues in trying to upgrade clusters, even small jumps.
Example:
I have two clusters running `1.24.8-rancher1-1` and I attempted to upgrade to `1.24.9-rancher1-1` using RKE 1.4.2. The `rke up` command runs successfully, but every pod basically goes into an error state with errors such as this:

Results:
The only success I have had in the past is to provision new nodes and then move them, which I usually do in multiple `rke up` commands. If I sneak a k8s upgrade in when I'm provisioning new nodes and then get rid of the old ones, the cordon/drain process seems to restart everything... This is really getting worrisome, as I'm unable to perform a simple upgrade of the cluster.