Longhorn v0.6.0 Upgrade: Workaround for recovering from a rollback failure in Rancher
Please make a backup of the volumes if possible before proceeding.
Due to a Longhorn bug, some users failed to upgrade Longhorn to version v0.6.0: https://github.com/longhorn/longhorn/issues/754
They then tried to roll back to v0.5.0 via the Rancher UI. Unfortunately, this triggered another Rancher/Helm bug, and the app got stuck in a rollback failure state: https://github.com/longhorn/longhorn/issues/755
Here we document the workaround to help users recover from a rollback failure of the Longhorn app.
- The Longhorn app is stuck in the upgrade/rollback from v0.6.0 to v0.5.0.
- One of the following error messages is shown on the app detail page in the Rancher UI:
Failed to install app longhorn-system. Error: UPGRADE FAILED: timed out waiting for the condition
or
Failed to install app longhorn-system. Error: UPGRADE FAILED: transport is closing
Delete all workloads of the Longhorn system in the app's detail page, including longhorn-driver-deployer, longhorn-manager, longhorn-post-upgrade, and longhorn-ui. DO NOT DELETE OTHER PODS.
- This step prevents the subsequent upgrade from getting stuck.
- This deletion is safe for the data in Longhorn as long as CRD objects and old engine/replica pods from v0.5.0 remain intact.
If you prefer kubectl commands to the Rancher UI, you can use the following commands to clean up the workloads:
kubectl -n longhorn-system delete daemonset longhorn-manager
kubectl -n longhorn-system delete deployment longhorn-driver-deployer longhorn-ui
kubectl -n longhorn-system delete job longhorn-post-upgrade
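The cleanup commands above can be wrapped in a small script so they are safe to rerun. This is only a sketch: the `NS` and `DRY_RUN` variables and the function names are illustrative and not part of Longhorn or Rancher, and `--ignore-not-found` is added on the assumption that some of the workloads may already be gone after a failed rollback.

```shell
#!/bin/sh
# Sketch: delete only the four Longhorn system workloads named in this step.
# NS and DRY_RUN are illustrative variables, not Longhorn or Rancher settings.
NS="${NS:-longhorn-system}"

run() {
  # With DRY_RUN=1, print the command instead of executing it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

delete_longhorn_workloads() {
  # --ignore-not-found keeps a rerun from failing on objects already deleted.
  run kubectl -n "$NS" delete daemonset longhorn-manager --ignore-not-found
  run kubectl -n "$NS" delete deployment longhorn-driver-deployer longhorn-ui --ignore-not-found
  run kubectl -n "$NS" delete job longhorn-post-upgrade --ignore-not-found
}
```

Setting `DRY_RUN=1` before calling `delete_longhorn_workloads` prints the three commands without touching the cluster, which makes it easy to confirm that no other pods are targeted.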
Delete all ConfigMaps named longhorn-system.v<version number> in the namespace longhorn-system, e.g. longhorn-system.v2.
- These ConfigMaps are the release histories of Longhorn, recorded by Helm.
- Do not remove ConfigMaps that lack the longhorn-system.v prefix.
These ConfigMaps can be deleted via the Rancher UI or kubectl commands.
If you prefer kubectl commands to the Rancher UI, you can use the following commands to find and then delete all related ConfigMaps.
kubectl -n longhorn-system get cm
kubectl -n longhorn-system delete cm <longhorn-system.vxxx>
kubectl patch -p '{"metadata":{"finalizers": null}}' crd instancemanagers.longhorn.rancher.io
kubectl delete crd instancemanagers.longhorn.rancher.io
kubectl -n longhorn-system delete cm longhorn-default-setting
- You can check this doc for the details.
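The ConfigMap lookup and deletion described earlier can be sketched as a filter plus a loop, so that only release-history entries are touched. This sketch assumes the version suffix is purely numeric (as in longhorn-system.v2); the function names and the `NS` variable are hypothetical.

```shell
#!/bin/sh
# Sketch: select and delete only Helm release-history ConfigMaps
# (longhorn-system.v<number>). NS is an illustrative variable.
NS="${NS:-longhorn-system}"

filter_release_configmaps() {
  # Reads resource names such as "configmap/longhorn-system.v2" on stdin and
  # keeps only those named longhorn-system.v followed by digits (assumption).
  grep -E '^configmap/longhorn-system\.v[0-9]+$'
}

delete_release_configmaps() {
  # "-o name" prints one "configmap/<name>" entry per line.
  kubectl -n "$NS" get cm -o name | filter_release_configmaps |
  while read -r cm; do
    echo "deleting $cm"
    kubectl -n "$NS" delete "$cm"
  done
}
```

Running `kubectl -n longhorn-system get cm -o name | filter_release_configmaps` on its own lists what would be deleted; ConfigMaps without the longhorn-system.v prefix never pass the filter.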
Use the Rancher App page to upgrade Longhorn to the latest version. The error message in the Rancher UI will disappear.
- Since there is no release history now, Helm will apply install rather than upgrade, which avoids the panic caused by the force upgrade function.
Check the image version of the Longhorn system workloads. If it is incorrect, go back to step 1 and redo the whole workaround.
- An incorrect image version means Helm has somehow messed up the current release. In that case, delete those workloads and the related release history to let Helm reinstall the whole app.
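The image check above can be done from the command line as well. The following is a minimal sketch, assuming the Longhorn system workloads are the DaemonSet and Deployments in the longhorn-system namespace; the function names and the `NS` variable are hypothetical, and the expected tag is whatever version was selected in the Rancher App page.

```shell
#!/bin/sh
# Sketch: list workload images and flag those that lack the expected tag.
# NS is an illustrative variable.
NS="${NS:-longhorn-system}"

list_workload_images() {
  # Prints "<workload name><TAB><image(s)>" for each DaemonSet and Deployment.
  kubectl -n "$NS" get daemonset,deployment \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'
}

find_mismatched() {
  # $1: expected tag, for example ":<version>". Reads the tab-separated lines
  # from list_workload_images and prints the names of workloads whose image
  # string does not contain that tag.
  awk -F '\t' -v tag="$1" 'index($2, tag) == 0 { print $1 }'
}
```

Piping `list_workload_images` into `find_mismatched ':<expected tag>'` prints nothing when every workload runs the expected version; any name it prints indicates that the workaround should be repeated from step 1.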