
Longhorn v0.6.0 Upgrade: Workaround for recovering from a rollback failure in Rancher

Sheng Yang edited this page Oct 8, 2019 · 19 revisions

Note

Please make a backup of the volumes if possible before proceeding.

Background

Due to a Longhorn bug, some users failed to upgrade Longhorn to version v0.6.0: https://github.com/longhorn/longhorn/issues/754

They then tried to roll back to v0.5.0 via the Rancher UI. Unfortunately, this triggered another Rancher/Helm bug, and the app got stuck in a failed rollback state: https://github.com/longhorn/longhorn/issues/755

Here we document the workaround to help users recover from a rollback failure of the Longhorn app.

This workaround applies only if both of the following hold:

  1. The Longhorn app is stuck in an upgrade/rollback from v0.6.0 to v0.5.0.
  2. One of the following error messages is shown on the app detail page in the Rancher UI:

Failed to install app longhorn-system. Error: UPGRADE FAILED: timed out waiting for the condition

or

Failed to install app longhorn-system. Error: UPGRADE FAILED: transport is closing

Steps

1. Delete all workloads

Delete all workloads of the Longhorn system on the app's detail page, including longhorn-driver-deployer, longhorn-manager, longhorn-post-upgrade, and longhorn-ui. DO NOT DELETE OTHER PODS.

  • This step prevents the subsequent upgrade from getting stuck.
  • This deletion is safe for the data in Longhorn as long as CRD objects and old engine/replica pods from v0.5.0 remain intact.

If you prefer kubectl to the Rancher UI, you can use the following commands to clean up the workloads:

kubectl -n longhorn-system delete daemonset longhorn-manager
kubectl -n longhorn-system delete deployment longhorn-driver-deployer longhorn-ui
kubectl -n longhorn-system delete job longhorn-post-upgrade
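The safety note above depends on the CRD objects and the old engine/replica pods staying in place. A minimal sketch of how that could be checked before and after the deletion follows. The function name show_remaining_longhorn_resources is hypothetical, and the longhorn.rancher.io API group is assumed from the instancemanagers CRD name used elsewhere on this page.

```shell
# Hypothetical helper: list what should still exist after step 1.
# Assumption: Longhorn CRDs belong to the longhorn.rancher.io API group.
show_remaining_longhorn_resources() {
  # CRDs registered in the cluster whose name contains the Longhorn group.
  # "|| true" keeps the function going when grep finds no match.
  kubectl get crd -o name | grep 'longhorn\.rancher\.io' || true
  # Pods still present in the namespace; the old engine/replica pods
  # from v0.5.0 should appear in this list.
  kubectl -n longhorn-system get pods
}
```

Running show_remaining_longhorn_resources once before and once after the deletion, then comparing the two outputs, should show only the four deleted workloads' pods disappearing.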

2. Delete release histories of Helm

Delete all ConfigMaps named longhorn-system.v<version number> in the longhorn-system namespace, e.g. longhorn-system.v2.

  • These are the release histories of Longhorn, recorded by Helm.
  • Do not remove ConfigMaps that lack the longhorn-system.v prefix.

These ConfigMaps can be deleted via the Rancher UI or with kubectl. With kubectl, list the ConfigMaps first, then delete each one that matches:

kubectl -n longhorn-system get cm
kubectl -n longhorn-system delete cm <longhorn-system.vxxx>
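When there are many release histories, the two commands above can be combined. The following is only a sketch: the function names are hypothetical, and it assumes every release history ConfigMap is named exactly longhorn-system.v followed by digits, as in the longhorn-system.v2 example. Without the --delete argument it only prints what it would remove.

```shell
# Keep only names of the form configmap/longhorn-system.v<digits>,
# the shape `kubectl get cm -o name` prints for each ConfigMap.
filter_release_configmaps() {
  # "|| true" so that an empty result is not treated as an error.
  grep -E '^configmap/longhorn-system\.v[0-9]+$' || true
}

# Hypothetical wrapper: print the matching ConfigMaps, and delete them
# only when called with --delete.
delete_release_configmaps() {
  names="$(kubectl -n longhorn-system get cm -o name | filter_release_configmaps)"
  if [ -z "$names" ]; then
    echo "no release history ConfigMaps found"
    return 0
  fi
  echo "$names"
  if [ "${1:-}" = "--delete" ]; then
    # Word splitting on $names is intended: one argument per ConfigMap.
    # shellcheck disable=SC2086
    kubectl -n longhorn-system delete $names
  fi
}
```

Call delete_release_configmaps first to review the list, then delete_release_configmaps --delete once the list contains nothing but release histories.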

3. Clean up the resources introduced by the failed v0.6.0 upgrade

kubectl patch -p '{"metadata":{"finalizers": null}}' crd instancemanagers.longhorn.rancher.io
kubectl delete crd instancemanagers.longhorn.rancher.io
kubectl -n longhorn-system delete cm longhorn-default-setting
  • You can check this doc for the details.
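To confirm that the cleanup took effect before moving on, a check along these lines may help. It is a sketch with a hypothetical function name. Note that it treats any failure of kubectl get as "resource is gone", so a cluster connectivity problem would also be reported as a successful cleanup.

```shell
# Hypothetical check: report whether the resources removed in step 3 are gone.
# Caveat: a kubectl error of any kind is read as "not found".
verify_v060_cleanup() {
  status=0
  if kubectl get crd instancemanagers.longhorn.rancher.io >/dev/null 2>&1; then
    echo "CRD instancemanagers.longhorn.rancher.io still exists"
    status=1
  fi
  if kubectl -n longhorn-system get cm longhorn-default-setting >/dev/null 2>&1; then
    echo "ConfigMap longhorn-default-setting still exists"
    status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "step 3 cleanup looks complete"
  fi
  return "$status"
}
```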

4. Upgrade to version v0.6.2

Use the Rancher App page to upgrade Longhorn to v0.6.2. Once the upgrade succeeds, the error message in the Rancher UI will disappear.

5. Check the image version

Check the image version of the Longhorn system workloads. If it is incorrect, go back to step 1 and redo the whole workaround.

  • An incorrect image version means Helm has somehow corrupted the current release. In that case, those workloads and the related release history must be deleted so that Helm reinstalls the whole app.
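The image check can also be done with kubectl. The sketch below uses hypothetical function names and assumes that each Longhorn DaemonSet and Deployment runs a single container whose image tag equals the app version (for example v0.6.2); verify that assumption against your own deployment.

```shell
# Print "<kind>/<name><TAB><image>" for each Longhorn DaemonSet and Deployment.
list_longhorn_images() {
  kubectl -n longhorn-system get daemonset,deployment \
    -o jsonpath='{range .items[*]}{.kind}{"/"}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'
}

# Read that listing on stdin and print every line whose image field does
# not end in ":<expected tag>"; return non-zero if any such line exists.
find_wrong_image_tags() {
  expected_tag="$1"
  awk -F '\t' -v tag=":$expected_tag" '
    substr($2, length($2) - length(tag) + 1) != tag { print; bad = 1 }
    END { exit bad + 0 }
  '
}
```

For example, list_longhorn_images | find_wrong_image_tags v0.6.2 prints nothing and returns success when every workload carries the expected tag. Any line it prints names a workload to inspect before deciding whether to return to step 1.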

6. Verify Longhorn works as normal.
