
Helm upgrade with --wait is waiting even when all resources are ready #2426

Closed
eduardobaitello opened this issue May 10, 2017 · 54 comments

@eduardobaitello
Contributor

eduardobaitello commented May 10, 2017

I just installed a new release, then upgraded the image tag for one of my Deployments inside a chart.
After that, I executed helm upgrade mychart myrelease --wait --timeout 9999. The Tiller logs detected changes only in the Deployment described above, creating a new pod and ReplicaSet with the new image. Then the upgrade process gets stuck "waiting for resources", even though all my pods are ready and running:

[user@hostname ~]$ kubectl get po -n mynamespace
NAME                                         READY     STATUS    RESTARTS   AGE
XXXXX-2061248956-1k0xc                       1/1       Running   0          14m
XXXXXXXX-3687332959-bjb5b                    1/1       Running   0          14m
new-created-pod-2470432599-38k9j             1/1       Running   0          1m
XXXXXX-1811469710-3pm6k                      1/1       Running   0          14m
XXX-2438109-qv4m1                            1/1       Running   0          14m

2017/05/10 18:24:01 wait.go:47: beginning wait for resources with timeout of 2h46m39s

So I figured out that the problem is that a new ReplicaSet was created, but Helm also seems to be waiting for the old one to have all of its resources ready.
I manually deleted old-replica-set-320589364 and the upgrade completed (a sketch of that workaround follows the listing below).

[user@hostname ~]$ kubectl get rs -n mynamespace
NAME                                   DESIRED   CURRENT   READY     AGE
XXXXX-2061248956                       1         1         1         21m
XXXXXXXX-3687332959                    1         1         1         21m
new-created-replica-set-2470432599     1         1         1         7m
old-replica-set-320589364              0         0         0         21m
XXXXXX-1811469710                      1         1         1         21m
XXX-2438109                            1         1         1         21m
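
A hedged sketch of that manual workaround (resource names are placeholders taken from the listing above; only delete ReplicaSets that are scaled to zero and belong to the affected Deployment):

  # identify the stale ReplicaSet (DESIRED/CURRENT/READY all 0)
  kubectl get rs -n mynamespace
  # delete it so the --wait check can complete
  kubectl delete rs old-replica-set-320589364 -n mynamespace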

I'm using Kubernetes v1.6.0 with Helm v2.4.1

@eduardobaitello eduardobaitello changed the title Helm upgrade with --wait is waiting even when all pods are ready Helm upgrade with --wait is waiting even when all resources are ready May 10, 2017
@blakebarnett

Seeing this also, k8s 1.6.2, helm 2.4.1

@thomastaylor312
Contributor

@eduardobaitello Thank you for all of the details. I'll look to see what is happening

@thomastaylor312
Contributor

OK, I duplicated this and am working on a fix.

thomastaylor312 added a commit to thomastaylor312/helm that referenced this issue May 11, 2017
The current methodology generated its own RS slice instead of using
a helper method that uses a `clientset`. This caused some issues where
`FindNewReplicaSet` always returned `nil`. This switches the method and
removes some unneeded API calls and code.

Closes helm#2426
@thomastaylor312
Contributor

@eduardobaitello Can you give #2430 a try if you have some time and see if it solves the issue for you?

@eduardobaitello
Contributor Author

eduardobaitello commented May 11, 2017

@thomastaylor312, thanks for the quick help!
I don't have expertise in Go, so I'll need to wait for a patch release with the fix :(
Thanks again!

adamreese pushed a commit that referenced this issue May 17, 2017
The current methodology generated its own RS slice instead of using
a helper method that uses a `clientset`. This caused some issues where
`FindNewReplicaSet` always returned `nil`. This switches the method and
removes some unneeded API calls and code.

Closes #2426
@chancez

chancez commented May 18, 2017

I'm still having problems with this in 2.4.2. Even if I've changed nothing in my release, Tiller ends up waiting out the full timeout. No pods are unready and nothing has even changed, so everything is already ready, but the wait doesn't seem to detect this.

@eduardobaitello
Contributor Author

@chancez check the maxUnavailable value of your Deployments. The --wait flag now uses this value to check the "resources ready" condition (take a look at the Notes).
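
One hedged way to inspect that value (deployment and namespace names here are placeholders):

  kubectl get deploy mydeployment -n mynamespace \
    -o jsonpath='{.spec.strategy.rollingUpdate.maxUnavailable}'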

@donaldguy

Can confirm that after upgrading to 2.4.2 I am also having a similar problem (but on a Deployment with strategy: { type: Recreate }).

@donaldguy

In my case downgrading to 2.4.1 fixes my issue, so #2430 is directly implicated

@donaldguy

Admittedly, my issue may be due to the lack of kubernetes/kubernetes#41740 (unlike the others in this thread, I am obviously not on 1.6 yet).

@chancez

chancez commented May 23, 2017

I only tried 2.3.2 (works) and 2.4.2 (doesn't work), so I'll also give 2.4.1 a shot to help confirm it's something in 2.4.2 (rather than the much larger 2.3.2 -> 2.4.2 set of changes).

@thomastaylor312
Contributor

I am going to reopen this one since multiple people are having issues. If you have a chart you can share (or one of the main stable charts) where this issue happens, please let me know, because I tried several charts and had trouble duplicating this.

@tyrannasaurusbanks
Contributor

FWIW, I thought I'd hit this bug, but upon inspection it was because I was using an external service. I proposed a fix for external services here: #2497

Thought I'd mention it in case anyone viewing this is using external services too.

@chancez

chancez commented May 25, 2017

I tested with 2.4.1; it has the same issue as 2.4.2, so it's definitely something in 2.4.x, but I'm not sure where.

@chancez

chancez commented May 25, 2017

I'd also like to note that I'm not using external services; my deployments are all at their desired replica count, no pods are unready, etc.

@eduardobaitello
Contributor Author

@chancez can you please post the helm command you are using for the upgrade/install, and also the Tiller logs of the failed release? When the release is about to time out (but you are sure it's ready), capture the output of kubectl get all -n yournamespace.
For me, this issue was solved after @thomastaylor312's commit, but I'd be glad to help find the problem.

@chancez

chancez commented May 25, 2017

The Tiller logs are here: https://gist.github.com/chancez/2d632496799632298efa0ccf9fa70f9d, but I don't have the output of the resources at the time. I can assure you that they were in an unchanged state, since I'm testing the helm upgrade without having made any changes, as the logs indicate. I'll try to get another Helm run with 2.4.2 as well as the kubectl output.

@jagregory

FWIW, I'm seeing this with helm install --wait too. v2.4.2

flynnduism pushed a commit to flynnduism/helm that referenced this issue May 28, 2017
The current methodology generated its own RS slice instead of using
a helper method that uses a `clientset`. This caused some issues where
`FindNewReplicaSet` always returned `nil`. This switches the method and
removes some unneeded API calls and code.

Closes helm#2426
@thomastaylor312
Contributor

OK, I narrowed down where this is happening. I could not duplicate this on 1.6.4 (on Minikube), but I could duplicate the issue on 1.5.3 (on Minikube). The problem is that the lookup for the new ReplicaSet returns nil because this line doesn't return a controller for the deployment.

@adamreese I went as far down the stack as I understand; do you have any idea what could be causing this?

@whereisaaron
Contributor

Thanks @thomastaylor312! Hopefully you're on the trail. Did that code differ between Helm 2.3.x and 2.4.x? We didn't see this issue with 2.3.x with everything else being identical.

In case it is relevant, I am also using CoreOS clusters (kube-aws), like @baracoder. The cluster it failed in was k8s 1.5.x.

@thomastaylor312
Contributor

@whereisaaron We made a change between 2.3 and 2.4 to be smarter about --wait and Deployment objects. We already found and patched one bug with a similar symptom, so we'll be working to figure this out.

@seh
Contributor

seh commented Jun 29, 2017

If the ReplicaSet either lacks an owner reference or has it set improperly, here's a place to look for telltale signs of failure: (*DeploymentController).getReplicaSetsForDeployment.

@thomastaylor312
Contributor

Is there something different about how that works in k8s 1.5?

@seh
Contributor

seh commented Jun 29, 2017

According to the VCS, that area of the code has seen some action over the last eight months. Issue kubernetes/kubernetes#33845 and PR kubernetes/kubernetes#35676 sound relevant, and the latter looks like it didn't make it in until version 1.6.

@whereisaaron
Contributor

Hi @thomastaylor312, I just upgraded to Helm 2.5.1 on a k8s 1.5 cluster and can confirm this problem still exists. A helm upgrade --install waits long after all the resources are ready, until the timeout is reached, and then reports the deployment as failed even though the deployment was successful.

Client: &version.Version{SemVer:"v2.5.1", GitCommit:"7cf31e8d9a026287041bae077b09165be247ae66", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.5.1", GitCommit:"7cf31e8d9a026287041bae077b09165be247ae66", GitTreeState:"clean"}

Helm 2.3 was the last version where --wait worked for k8s 1.5 clusters. I test after each Helm upgrade, and --wait has never worked since.

+ helm upgrade --install --wait --timeout 600 ...
Error: UPGRADE FAILED: timed out waiting for the condition

@erichaase

Upgrading to a 1.7 k8s cluster resolved this issue for me!

@boosh

boosh commented Dec 1, 2017

I'm seeing this with helm 2.7.2 on k8s 1.8.3-gke.0.

In my dev environment with replicas: 1 it works fine. In staging with replicas: 2, Helm hangs for the full duration of the timeout, yet all pods are running. The Deployment uses the following strategy:

  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate

The only other thing to note is that this chart runs a batch Job as a post-install hook. Even when that Job finishes, Helm still hangs. Perhaps the problem is that the Job's pod has a status of Completed but is never marked Ready?

NAME                                                     READY     STATUS      RESTARTS   AGE
curteous-content-content-1512129000-d27zn                0/1       Completed   0          1m

@thomastaylor312
Contributor

@boosh Do both of your pods show up as ready while it is hanging?

@boosh

boosh commented Dec 4, 2017

@thomastaylor312 Yes

@alexppg

alexppg commented Apr 10, 2018

Hi, is there a solution to this in a 1.5.* cluster?

@whereisaaron
Contributor

@alexppg no, and even on the latest version I find it problematic/unreliable. I just don't use the --wait option any more. Better to deploy and then script your own wait-for-ready step.
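
A hedged sketch of such a wait step (release, chart, deployment, and namespace names are placeholders; assumes the workload is a Deployment):

  helm upgrade --install myrelease ./mychart
  kubectl rollout status deployment/mydeployment -n mynamespace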

@alexppg

alexppg commented Apr 11, 2018

Shouldn't there be an open issue about this, then? Either this one or another.

PS: thanks, I'll do that. @whereisaaron

@MalhotraVijay

MalhotraVijay commented Sep 19, 2018

We are also getting this issue with Helm v2.10.0 on Kubernetes v1.8.2.

Client: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}

We have an internal chart that we upgrade via the CI/CD pipeline using the following command, and the upgrade process keeps on waiting.

helm upgrade <chart_release> <chart_path> --debug --wait

We then have to kill the process, and the chart release history shows PENDING_UPGRADE for some unknown amount of time before it either says DEPLOYED (on some builds) or Upgrade "<chart_release>" failed: timed out waiting for the condition.

Is this solved, or are we missing something?

@Brightside56

I have the same issue with install/upgrade and --wait.

@KIVagant

KIVagant commented Oct 5, 2018

Same here.

Client: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}
# helm upgrade ... --wait 900 ... shows:
Error: UPGRADE FAILED: timed out waiting for the condition

...
# helm status shows
LAST DEPLOYED: Fri Oct  5 14:04:23 2018
NAMESPACE: default
STATUS: FAILED

But all resources were created.

@KIVagant

KIVagant commented Oct 5, 2018

It looks like the workaround suggested by @tcolgate also works for this issue. I just fixed a FAILED release using helm rollback myproject 16, where 16 was the FAILED revision.

#3208 (comment)
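
A hedged sketch of that workaround (release name and revision number are placeholders):

  helm history myproject      # list revisions and find the one marked FAILED (here, 16)
  helm rollback myproject 16  # roll back so the release leaves the FAILED state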

@KIVagant

KIVagant commented Oct 8, 2018

It looks like in my case there was a real problem. One of the pods was stuck in the Pending state with the error: 0/6 nodes are available: 2 node(s) had taints that the pod didn't tolerate, 4 Insufficient cpu.

I added one more node and now it works!

Why did it look like this bug when it wasn't? Because by the time I checked the CI result, Kubernetes had already scheduled the pod that was still Pending at the moment of the helm upgrade.

@smileusd

DaemonSets have the same problem.

@lm-xingxing

Use --debug only (without --wait) to see which resource is pending; it may differ from what the chart defines, and that can be what causes the wait to hang.
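
For example (a hedged sketch; release, chart, and namespace names are placeholders):

  helm upgrade myrelease ./mychart --debug
  kubectl get all -n mynamespace    # look for resources that never become ready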

@Emixam23-FCMS

Hey guys, I'm still having the issue, any ideas? Currently using the Helm 3.2.0 image.

@piratf

piratf commented Nov 5, 2021

Same issue. All pods are running, but helm upgrade is still waiting.

version.BuildInfo{Version:"v3.3.4", GitCommit:"a61ce5633af99708171414353ed49547cf05013d", GitTreeState:"clean", GoVersion:"go1.14.9"}

@zhilyaev

zhilyaev commented Nov 9, 2021

Same problem here.

@castaneai

Hi. I hit the same problem with minikube.
The cause was that a type: LoadBalancer Service did not have an external IP (kubernetes/minikube#4113).
I found the cause because the --debug flag gave me the following log:

  wait.go:48: [debug] beginning wait for 29 resources with timeout of 5m0s
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
  ready.go:258: [debug] Service does not have load balancer ingress IP address: agones-system/agones-allocator
...
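
For reference, a hedged sketch of one way to give such a Service an external IP on minikube (assumes minikube's tunnel feature; see the linked minikube issue):

  minikube tunnel                                     # run in a separate terminal
  kubectl get svc -n agones-system agones-allocator   # EXTERNAL-IP should now be populated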
