
Helm upgrade with --wait is waiting even when all resources are ready #2426

Closed
eduardobaitello opened this issue May 10, 2017 · 54 comments

@eduardobaitello
Contributor

eduardobaitello commented May 10, 2017

I just installed a new release, then upgraded the image tag for one of my Deployments inside a chart.
After that, I executed helm upgrade mychart myrelease --wait --timeout 9999. The Tiller logs detected changes only in the Deployment described above, creating a new pod and ReplicaSet with the new image. Then the upgrade process gets stuck "waiting for resources", even though all my pods are ready and running:

[user@hostname ~]$ kubectl get po -n mynamespace
NAME                                         READY     STATUS    RESTARTS   AGE
XXXXX-2061248956-1k0xc                       1/1       Running   0          14m
XXXXXXXX-3687332959-bjb5b                    1/1       Running   0          14m
new-created-pod-2470432599-38k9j             1/1       Running   0          1m
XXXXXX-1811469710-3pm6k                      1/1       Running   0          14m
XXX-2438109-qv4m1                            1/1       Running   0          14m

2017/05/10 18:24:01 wait.go:47: beginning wait for resources with timeout of 2h46m39s

So I figured out that the problem is that a new ReplicaSet was created, but Helm also seems to be waiting for the old one to have all of its resources ready.
I manually deleted old-replica-set-320589364 and the upgrade completed (a sketch of that workaround follows the listing below).

[user@hostname ~]$ kubectl get rs -n mynamespace
NAME                                   DESIRED   CURRENT   READY     AGE
XXXXX-2061248956                       1         1         1         21m
XXXXXXXX-3687332959                    1         1         1         21m
new-created-replica-set-2470432599     1         1         1         7m
old-replica-set-320589364              0         0         0         21m
XXXXXX-1811469710                      1         1         1         21m
XXX-2438109                            1         1         1         21m
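
A hedged sketch of that manual workaround (resource names are placeholders taken from the listing above; only delete ReplicaSets that are scaled to zero and belong to the affected Deployment):

  # identify the stale ReplicaSet (DESIRED/CURRENT/READY all 0)
  kubectl get rs -n mynamespace
  # delete it so the --wait check can complete
  kubectl delete rs old-replica-set-320589364 -n mynamespace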

I'm using Kubernetes v1.6.0 with Helm v2.4.1

@eduardobaitello eduardobaitello changed the title Helm upgrade with --wait is waiting even when all pods are ready Helm upgrade with --wait is waiting even when all resources are ready May 10, 2017
@blakebarnett

Seeing this also, k8s 1.6.2, helm 2.4.1

@thomastaylor312
Contributor

@eduardobaitello Thank you for all of the details. I'll look to see what is happening

@thomastaylor312
Contributor

OK, I duplicated this and am working on a fix.

thomastaylor312 added a commit to thomastaylor312/helm that referenced this issue May 11, 2017
The current methodology generated its own RS slice instead of using
a helper method that uses a `clientset`. This caused some issues where
`FindNewReplicaSet` always returned `nil`. This switches the method and
removes some unneeded API calls and code.

Closes helm#2426
@thomastaylor312
Contributor

@eduardobaitello Can you give #2430 a try if you have some time and see if it solves the issue for you?

@eduardobaitello
Contributor Author

eduardobaitello commented May 11, 2017

@thomastaylor312, thanks for the quick help!
I don't have expertise in Go, so I'll need to wait for a patch release with the fix :(
Thanks again!

adamreese pushed a commit that referenced this issue May 17, 2017
The current methodology generated its own RS slice instead of using
a helper method that uses a `clientset`. This caused some issues where
`FindNewReplicaSet` always returned `nil`. This switches the method and
removes some unneeded API calls and code.

Closes #2426
@chancez

chancez commented May 18, 2017

I'm still having problems with this in 2.4.2. Even if I've changed nothing in my release, Tiller ends up waiting out the full timeout. No pods are unready and nothing has even changed, so everything is already ready, but the wait doesn't seem to detect this.

@eduardobaitello
Contributor Author

@chancez check the maxUnavailable value of your Deployments. The --wait flag now uses this value to check the "resources ready" condition (take a look at the Notes).
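
One hedged way to inspect that value (deployment and namespace names here are placeholders):

  kubectl get deploy mydeployment -n mynamespace \
    -o jsonpath='{.spec.strategy.rollingUpdate.maxUnavailable}'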

@donaldguy

Can confirm that after upgrading to 2.4.2 I am also having a similar problem (but on a Deployment with strategy: { type: Recreate }).

@donaldguy

In my case downgrading to 2.4.1 fixes my issue, so #2430 is directly implicated

@donaldguy

Admittedly, my issue may be due to the lack of kubernetes/kubernetes#41740 (unlike the others in this thread, I am obviously not on 1.6 yet).

@chancez

chancez commented May 23, 2017

I only tried 2.3.2 (works) and 2.4.2 (doesn't work), so I'll also give 2.4.1 a shot to help confirm it's something in 2.4.2 (rather than the much larger 2.3.2 -> 2.4.2 set of changes).

@thomastaylor312
Contributor

I am going to reopen this one since multiple people are having issues. If you have a chart you can share (or one of the main stable charts) where this issue happens, please let me know, because I tried several charts and had trouble duplicating this.

@tyrannasaurusbanks
Contributor

FWIW, I thought I'd hit this bug, but upon inspection it was because I was using an external service. I proposed a fix for external services here: #2497

Thought I'd mention it in case anyone viewing this is using external services too.

@chancez

chancez commented May 25, 2017

I tested with 2.4.1; it has the same issue as 2.4.2, so it's definitely something in 2.4.x, but I'm not sure where.

@chancez

chancez commented May 25, 2017

I'd also like to note that I'm not using external services; my deployments are all at their desired replica count, no pods are unready, etc.

@eduardobaitello
Contributor Author

@chancez can you please post the helm command you are using for the upgrade/install, and also the Tiller logs of the failed release? When the release is about to time out (but you are sure it's ready), capture the output of kubectl get all -n yournamespace.
For me, this issue was solved after @thomastaylor312's commit, but I'd be glad to help find the problem.

@chancez

chancez commented May 25, 2017

The Tiller logs are here: https://gist.github.com/chancez/2d632496799632298efa0ccf9fa70f9d, but I don't have the output of the resources at the time. I can assure you that they were in an unchanged state, since I'm testing the helm upgrade without having made any changes, as the logs indicate. I'll try to get another Helm run with 2.4.2 as well as the kubectl output.

@jagregory

FWIW, I'm seeing this with helm install --wait too. v2.4.2

flynnduism pushed a commit to flynnduism/helm that referenced this issue May 28, 2017
The current methodology generated its own RS slice instead of using
a helper method that uses a `clientset`. This caused some issues where
`FindNewReplicaSet` always returned `nil`. This switches the method and
removes some unneeded API calls and code.

Closes helm#2426
@thomastaylor312
Contributor

OK, I narrowed down where this is happening. I could not duplicate this on 1.6.4 (on Minikube), but I could duplicate the issue on 1.5.3 (on Minikube). The problem is that the lookup for the new ReplicaSet returns nil because this line doesn't return a controller for the deployment.

@adamreese I went as far down the stack as I understand; do you have any idea what could be causing this?

@whereisaaron
Contributor

Thanks @thomastaylor312! Hopefully you're on the trail. Did that code differ between Helm 2.3.x and 2.4.x? We didn't see this issue with 2.3.x with everything else being identical.

In case it is relevant, I am also using CoreOS clusters (kube-aws), like @baracoder. The cluster it failed in was k8s 1.5.x.

@thomastaylor312
Contributor

@whereisaaron We made a change between 2.3 and 2.4 to be smarter about --wait and Deployment objects. We already found and patched one bug with a similar symptom, so we'll be working to figure this out.

@seh
Contributor

seh commented Jun 29, 2017

If the ReplicaSet either lacks an owner reference or has it set improperly, here's a place to look for telltale signs of failure: (*DeploymentController).getReplicaSetsForDeployment.

@thomastaylor312
Contributor

Is there something different about how that works in k8s 1.5?

@seh
Contributor

seh commented Jun 29, 2017

According to the VCS, that area of the code has seen some action over the last eight months. Issue kubernetes/kubernetes#33845 and PR kubernetes/kubernetes#35676 sound relevant, and the latter looks like it didn't make it in until version 1.6.

@whereisaaron
Contributor

Hi @thomastaylor312, I just upgraded to Helm 2.5.1 on a k8s 1.5 cluster and can confirm this problem still exists. A helm upgrade --install waits long after all the resources are ready, until the timeout is reached, and then reports the deployment as failed even though the deployment was successful.

Client: &version.Version{SemVer:"v2.5.1", GitCommit:"7cf31e8d9a026287041bae077b09165be247ae66", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.5.1", GitCommit:"7cf31e8d9a026287041bae077b09165be247ae66", GitTreeState:"clean"}

Helm 2.3 was the last version where --wait worked for k8s 1.5 clusters. I test after each Helm upgrade, and --wait has never worked since.

+ helm upgrade --install --wait --timeout 600 ...
Error: UPGRADE FAILED: timed out waiting for the condition

@erichaase

Upgrading to a 1.7 k8s cluster resolved this issue for me!

@boosh

boosh commented Dec 1, 2017

I'm seeing this with helm 2.7.2 on k8s 1.8.3-gke.0.

In my dev environment with replicas: 1 it works fine. In staging with replicas: 2, Helm hangs for the full duration of the timeout, yet all pods are running. The Deployment uses the following strategy:

  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate

The only other thing to note is that this chart runs a batch Job as a post-install hook. Even when that Job finishes, Helm still hangs. Perhaps the problem is that the Job's pod has a status of Completed but is never marked Ready?

NAME                                                     READY     STATUS      RESTARTS   AGE
curteous-content-content-1512129000-d27zn                0/1       Completed   0          1m

@thomastaylor312
Contributor

@boosh Do both of your pods show up as ready while it is hanging?

@boosh

boosh commented Dec 4, 2017

@thomastaylor312 Yes

@alexppg

alexppg commented Apr 10, 2018

Hi, is there a solution to this in a 1.5.* cluster?

@whereisaaron
Contributor

@alexppg no, and even on the latest version I find it problematic/unreliable. I just don't use the --wait option any more. Better to deploy and then script your own wait-for-ready step.
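
A hedged sketch of such a wait step (release, chart, deployment, and namespace names are placeholders; assumes the workload is a Deployment):

  helm upgrade --install myrelease ./mychart
  kubectl rollout status deployment/mydeployment -n mynamespace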

@alexppg

alexppg commented Apr 11, 2018

Shouldn't there be an open issue about this, then? Either this one or another.

PS: thanks, I'll do that. @whereisaaron

@MalhotraVijay

MalhotraVijay commented Sep 19, 2018

We are also getting this issue with Helm v2.10.0 on Kubernetes v1.8.2.

Client: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}

We have an internal chart that we upgrade via the CI/CD pipeline using the following command, and the upgrade process keeps on waiting.

helm upgrade <chart_release> <chart_path> --debug --wait

We then have to kill the process, and the chart release history shows PENDING_UPGRADE for some unknown amount of time before it either says DEPLOYED (on some builds) or Upgrade "<chart_release>" failed: timed out waiting for the condition.

Is this solved, or are we missing something?

@Brightside56

I have the same issue with install/upgrade and --wait.

@KIVagant

KIVagant commented Oct 5, 2018

Same here.

Client: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}
# helm upgrade ... --wait 900 ... shows:
Error: UPGRADE FAILED: timed out waiting for the condition

...
# helm status shows
LAST DEPLOYED: Fri Oct  5 14:04:23 2018
NAMESPACE: default
STATUS: FAILED

But all resources were created.

@KIVagant

KIVagant commented Oct 5, 2018

It looks like the workaround suggested by @tcolgate also works for this issue. I just fixed a FAILED release using helm rollback myproject 16, where 16 was the FAILED revision.

#3208 (comment)
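
A hedged sketch of that workaround (release name and revision number are placeholders):

  helm history myproject      # list revisions and find the one marked FAILED (here, 16)
  helm rollback myproject 16  # roll back so the release leaves the FAILED state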

@KIVagant

KIVagant commented Oct 8, 2018

It looks like in my case there was a real problem. One of the pods was stuck in the Pending state with the error: 0/6 nodes are available: 2 node(s) had taints that the pod didn't tolerate, 4 Insufficient cpu.

I added one more node and now it works!

Why did it look like this bug when it wasn't? Because by the time I checked the CI result, Kubernetes had already scheduled the pod that was still Pending at the moment of the helm upgrade.

@smileusd

DaemonSets have the same problem.

@lm-xingxing

Use --debug only (without --wait) to see which resource is pending; it may differ from what the chart defines, and that can be what causes the wait to hang.
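
For example (a hedged sketch; release, chart, and namespace names are placeholders):

  helm upgrade myrelease ./mychart --debug
  kubectl get all -n mynamespace    # look for resources that never become ready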

@Emixam23-FCMS

Hey guys, I'm still having the issue, any ideas? Currently using the Helm 3.2.0 image.

@piratf

piratf commented Nov 5, 2021

Same issue. All pods are running, but helm upgrade is still waiting.

version.BuildInfo{Version:"v3.3.4", GitCommit:"a61ce5633af99708171414353ed49547cf05013d", GitTreeState:"clean", GoVersion:"go1.14.9"}

@zhilyaev

zhilyaev commented Nov 9, 2021

Same problem here.

@castaneai

Hi. I hit the same problem with minikube.
The cause was that a type: LoadBalancer Service did not have an external IP (kubernetes/minikube#4113).
I found the cause because the --debug flag gave me the following log:

  wait.go:48: [debug] beginning wait for 29 resources with timeout of 5m0s
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
...
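
For reference, a hedged sketch of one way to give such a Service an external IP on minikube (assumes minikube's tunnel feature; see the linked minikube issue):

  minikube tunnel                                     # run in a separate terminal
  kubectl get svc -n agones-system agones-allocator   # EXTERNAL-IP should now be populated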
