During the update from 1.8 to 1.9 the operator started crashing #14222
@traceroute42 It seems like you are using Rook to connect to an external Ceph cluster (RHCS). I'm wondering why you have the mds/osd pods running in the k8s env; those daemons should just remain on the external Ceph cluster side and not run in the k8s Rook cluster.
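A quick way to check which Ceph daemon pods are actually running on the Kubernetes side (a sketch, assuming the default app labels Rook applies and the rook-ceph / rook-ceph-external namespaces used later in this thread):
kubectl -n rook-ceph get pods -l 'app in (rook-ceph-mon,rook-ceph-osd,rook-ceph-mds)'
kubectl -n rook-ceph-external get pods -l 'app in (rook-ceph-mon,rook-ceph-osd,rook-ceph-mds)'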
@parth-gr The K8s ceph cluster has And external
So do you have 2 CephClusters? Can you show the different CephClusters running?
And the external Ceph cluster on HDD was updated successfully, but the internal cluster on NVMe is blocked?
kg cephclusters.ceph.rook.io -n rook-ceph
and with the operator running, after a restart and before the crash:
kg cephclusters.ceph.rook.io -n rook-ceph-external
but I tried a few times and it threw an error too
At the moment we're updating rook-ceph. We did the update from 1.7 -> 1.8 and it went flawlessly, and we're trying 1.8 -> 1.9 because it still supports Ceph version 15.2. Then we're going to update the Ceph cluster to 16, but after updating rook-ceph from 1.8 -> 1.9 the operator is crashing, so we haven't updated Ceph from 15 -> 16 on either cluster.
There are some crash conditions; I suggest restarting the Rook operator pod and then sharing the operator logs. Maybe there was network latency during the updates. PS: Also provide the output of:
kubectl get secrets rook-csi-rbd-provisioner -n rook-ceph-external
kubectl get cm -n rook-ceph
Logs from the operator at INFO and DEBUG level:
Can you delete the rook-operator pod?
I see the rook-csi-rbd-provisioner secret exists, but in the reconcile some context got cancelled somehow and it's stuck there.
These logs above are from directly after deleting the operator pod.
So, in conclusion, the internal cluster is failing
Looks like the context is nil and is still passed here; we might need to improve this
And the external cluster
Looks like a network issue
The logs are full of
Do you mean the logs after an operator reboot? Then after it crashes without deleting it? Application pods in the cluster can access both the internal and external cluster storage.
It failed here
I did disable monitoring on the external cluster.
Logs after deleting the operator, and after the first crash without deleting it:
I am out of ideas on this.
@travisn do you have any idea?
The crash is happening in the goroutine that runs the mon health check. As a workaround, try disabling the health check for mon failover. See this topic. This should do it:
healthCheck:
  daemonHealth:
    mon:
      disabled: true
Then, once you can continue upgrading to a newer version of Rook, you can re-enable the health checks. If you're still seeing an issue on a newer version, we can look into a fix.
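For context, a minimal sketch of where that block sits in the CephCluster CR (assuming the default rook-ceph name and namespace; the same setting would also go in the external cluster's CR):
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # ...existing spec fields unchanged...
  healthCheck:
    daemonHealth:
      mon:
        disabled: true
Or as a one-off patch (cluster name and namespace assumed, adjust to your setup):
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"healthCheck":{"daemonHealth":{"mon":{"disabled":true}}}}}'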
I have some general suggestions that might also help, if they apply.
If you aren't on the highest .z version (e.g., vX.y.z) of Kubernetes, you might try updating k8s. I recall a while back there was a k8s issue that affected configmaps that could result in
Also, be sure you are following the upgrade guides carefully when doing the upgrades. Some of the upgrades require manual steps, and we have seen users miss them by accident. In particular, make sure to update
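For example (assuming the step in question is updating the common resources and CRDs, which the v1.9 upgrade guide lists as a manual step run from the release's deploy/examples directory):
git clone --single-branch --depth=1 --branch v1.9.13 https://github.com/rook/rook.git
cd rook/deploy/examples
kubectl apply -f common.yaml -f crds.yaml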
I did that on both the internal and external clusters.
internal:
external:
but it's still crashing
We're on 1.26.9 right now. I see it's possible to update to 1.26.15-1.1, so I might try.
Edit: after updating the master nodes from 1.26.9 to 1.26.15-1.1, it's still crashing.
Looks like the same problem,
and there seems to be no option to disable telemetry.
Strange,
@travisn can we suggest updating to 1.10?
It requires updating the Ceph cluster from 15.2 to at least 16.x.
This would require an update of both the external and internal clusters; can that even be done if the rook-operator is crashing?
@traceroute42 Any luck or any more clues? I'm not sure what will help here. I wonder whether you could upgrade Ceph if you first downgraded Rook back to 1.8. Downgrades aren't tested so I would hesitate, but if the operator is failing anyway it might be worth trying.
During the update, the operator started crashing. The rook-ceph-crashcollector deployments were updated to rook-version=v1.9.13, while the osd, mon, and mds deployments remained at rook-version=v1.8.10.
The API server returns the issues posted in the log below.
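One way to see which daemons have picked up the new version (a sketch, assuming the default rook_cluster=rook-ceph label that Rook puts on its deployments):
kubectl -n rook-ceph get deployments -l rook_cluster=rook-ceph \
  -o jsonpath='{range .items[*]}{.metadata.name}{" rook-version="}{.metadata.labels.rook-version}{"\n"}{end}'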
Is this a bug report or feature request?
Deviation from expected behavior:
Expected behavior:
Operator starts normally
How to reproduce it (minimal and precise):
Follow the steps from https://rook.io/docs/rook/v1.9/ceph-upgrade.html#csi-version to upgrade from 1.8 to 1.9 (simple install).
With the env flag added to the operator:
ROOK_DISABLE_ADMISSION_CONTROLLER=true
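(A sketch of how that flag is typically set, as an env entry on the rook-ceph-operator container in operator.yaml; names follow the stock manifests:)
env:
- name: ROOK_DISABLE_ADMISSION_CONTROLLER
  value: "true"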
File(s) to submit:
cluster.txt
Logs to submit:
There are also logs from the API server with timeouts when getting resources, but rbac
Cluster Status to submit:
Inside the rook-ceph-tools pod:
but from the command line:
kubectl rook-ceph ceph status
Error: . failed to run command. unable to upgrade connection: container not found ("rook-ceph-operator")%!(EXTRA string=failed to get rook version)
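If the kubectl rook-ceph plugin can't reach the operator, running the status check directly in the toolbox usually still works (assuming the default toolbox deployment name):
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status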
Environment:
Rook version (rook version inside a Rook pod): rook: v1.9.13 / go: go1.17.13
Ceph version (ceph -v): the operator pod says ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable), but the daemons are still on 15.2
Kubernetes version (kubectl version): 1.26