Helm Installs Don't Honor Timeout #2025

Closed · pluttrell opened this issue Feb 24, 2017 · 41 comments

@pluttrell commented Feb 24, 2017

I have tried to install Spinnaker via Helm many times. Most attempts fail, a lot of them with this error:

$ helm install stable/spinnaker --namespace spinnaker --wait --timeout 1500
E0224 13:38:51.978584   19536 portforward.go:175] lost connection to pod
Error: transport is closing

Note that they fail in just a couple of minutes, not the 25 minutes specified. Shouldn't they be honoring the specified timeout?

@pluttrell changed the title from “Helm Installs of Spinnaker Consistently Fail with "lost connection to pod"” to “Helm Installs Don't Honor Timeout” on Feb 24, 2017
@technosophos (Member)

I wonder if the port forwarder also needs to be given the timeout window. @adamreese do tunnels close if idle for some amount of time?

@adamreese (Member)

@technosophos Tunnels will remain open until closed.
@pluttrell Did the pod restart? I wonder if the connection dropped because the pod was killed.
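A quick way to check, assuming Tiller was installed with the default labels in kube-system (adjust the namespace/labels otherwise):

# A non-zero RESTARTS count would suggest the pod was killed mid-operation.
kubectl -n kube-system get pods -l app=helm,name=tiller
# Recent kills or evictions also show up in the namespace events:
kubectl -n kube-system get events --field-selector involvedObject.kind=Pod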

@pluttrell (Author)

No, I am not seeing the pods restart.

@jarovo commented Mar 27, 2017

I have the same problem without using Helm at all, just issuing this to connect to OpenShift:
oc -n podify port-forward postgresql-1-hal3g
It seems to work for at least about an hour, then it starts failing.

In /etc/origin/node/node-config.yaml I tried setting
streamingConnectionIdleTimeout: 0
and restarted the node, but this didn't help.

@technosophos (Member)

So this may be an upstream issue rather than a Helm-specific one. Can we get a list of Kubernetes versions/distributions that this is showing up in?

@a-chernykh (Contributor)

Having the same problem, helm upgrade --wait --timeout 600 times out after 5 minutes. I see the following:

+ helm upgrade --install --wait --set minReplicas=1 --set maxReplicas=16 --set env=stage --values values.json --timeout 600 --debug --version 0.0.16 --recreate-pods merlin jiff/docker-compose
[debug] Created tunnel using local port: '38247'

[debug] SERVER: "localhost:38247"

[debug] Fetched jiff/docker-compose to /root/.helm/cache/archive/docker-compose-0.0.16.tgz

E0810 20:32:00.022184     135 portforward.go:178] lost connection to pod
Error: UPGRADE FAILED: transport is closing

Helm version: 2.5.1
Kubernetes version: 1.6.2

@tjquinno commented Oct 2, 2017

As @bacongobbler suggested, I'm summarizing my experience with the same problem here.

helm 2.6.1 client and server
k8s 1.7.4 (same behavior with 1.7.0 also)

When I run helm delete --purge xxx, helm reports

portforward.go:178] lost connection to pod
Error: transport is closing

Yet the delete seems to have succeeded. All the k8s resources defined by the chart are cleaned up as expected. The tiller log looks like this:

[storage] 2017/09/28 14:14:19 getting release history for "xxx"
[tiller] 2017/09/28 14:14:19 uninstall: Deleting xxx
[tiller] 2017/09/28 14:14:19 executing 0 pre-delete hooks for xxx
[tiller] 2017/09/28 14:14:19 hooks complete for pre-delete xxx
[storage] 2017/09/28 14:14:19 updating release "xxx.v1"
 (many lines of "Starting delete for yyy" and "Using reaper for deleting yyy" omitted here)
[tiller] 2017/09/28 14:16:42 executing 0 post-delete hooks for xxx
[tiller] 2017/09/28 14:16:42 hooks complete for post-delete xxx
[tiller] 2017/09/28 14:16:42 purge requested for xxx
[storage] 2017/09/28 14:16:42 deleting release "xxx.v1"

That's the end of the log. It contains no errors or exceptions.

k8s reports no restarts of the tiller pod.

The elapsed time, from outside and from the log, is about 2m 30s, well under the default timeout value for the delete operation.

Also, on #2983 @bacongobbler asked "Can you check on the load balancer fronting your kubernetes master API ...?" To my knowledge there is no LB in front of the master in this dev cluster, but I'll double-check; if there is, is there anything specific I should be looking for?

@RiaanLab commented Jan 5, 2018

Same issue on k8s 1.6.7 when installing coreos/prometheus-operator

Command is:
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring --timeout 60000 --wait

Error is:
E0106 00:25:51.908673 18541 portforward.go:178] lost connection to pod
Error: transport is closing

@ljnelson

Per @technosophos' request:

Can we get a list of Kubernetes versions/distributions that this is showing up in?

Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.5-2+76a8892212cc28", GitCommit:"76a8892212cc28b8c628867f41f73ce8b755685e", GitTreeState:"clean", BuildDate:"2018-01-02T16:40:36Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Helm 2.7.2 but I don't think that matters.

Relevant line in Kubernetes is here: https://github.com/kubernetes/kubernetes/blob/v1.8.3/staging/src/k8s.io/client-go/tools/portforward/portforward.go#L177-L178

I guess we need to figure out what's firing something on that close chan.

@ljnelson commented Jan 20, 2018

I may be way off, but ultimately the connection in question is created here: https://github.com/kubernetes/client-go/blob/master/tools/portforward/portforward.go#L138

It is an httpstream.Connection, as can be seen here: https://github.com/kubernetes/client-go/blob/master/tools/portforward/portforward.go#L46

That interface specifies a SetIdleTimeout function here: https://github.com/kubernetes/apimachinery/blob/master/pkg/util/httpstream/httpstream.go#L80

It is probably implemented here: https://github.com/kubernetes/apimachinery/blob/master/pkg/util/httpstream/spdy/connection.go#L141-L145

I'm wondering if the genuine bug we're seeing here is that the idle timeout is left unset, so somewhere deep in Docker's spdystream package something is forcibly closing the connection. It appears that you could set the idle timeout on the Docker spdystream connection, but I don't think anyone does.

At the moment there's no way to put an idle timeout on that connection in between its creation and usage in forward().

Am I on the right track? This issue will prevent helm chart hierarchies above a certain size from being installed, I think.

@a-chernykh (Contributor)

I think this kind of problem is inevitable with long-running HTTP requests. It's normal for load balancers to kill idle connections. For example, if kube-apiserver is sitting behind an ELB, consider increasing its idle_timeout.
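If that is the setup, the idle timeout on a classic ELB can be raised with the AWS CLI; a minimal sketch, with the load balancer name as a placeholder:

# Raise the ELB idle timeout to 10 minutes so long-running watches aren't cut off.
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-k8s-api-elb \
  --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":600}}"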

I think a proper solution to this problem is to switch to polling, i.e. wait a few minutes for resources to be ready, close the connection, and set up a timer to poll tiller periodically. The disadvantage of this approach is that helm would have to establish a new connection to tiller every few seconds, but that's certainly better than timing out.
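Until Helm does something like that internally, the same idea can be approximated from the client side by skipping --wait and polling readiness with kubectl instead; a rough sketch, reusing the release and chart names from my command above:

# Apply the release without holding a long-lived --wait connection open...
helm upgrade --install merlin jiff/docker-compose
# ...then poll for readiness in short watches instead of one long one.
until kubectl rollout status deployment/merlin --timeout=30s; do
  echo "not ready yet, polling again..."
  sleep 15
done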

@helgi (Contributor) commented Mar 9, 2018

This was fixed in 2.8.0, and 2.8.2 is coming out with an improved version of that (the transport-closing fixes).

@bacongobbler (Member)

It might've been fixed by proxy, but I don't think this was directly fixed. Same error, different area of the code.

@helgi (Contributor) commented Mar 9, 2018

@bacongobbler oh hmm, which part? When I was running into transport-closed issues it was with portforward as well, since AWS NAT Gateways were closing it (#3182); that's why I am curious which area it is.

@bacongobbler (Member)

We fixed helm init --wait timeout issues, but not specifically helm install --timeout.

install --timeout uses the helm.InstallTimeout option, whereas init --wait uses the helm.TillerConnectionTimeout option.

@bacongobbler (Member)

They're sorta confusing, but essentially (see the flag examples after this list):

  • TillerConnectionTimeout specifies the timeout duration to connect to tiller
  • InstallTimeout specifies the timeout to wait for resources to become ready during helm install
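In CLI terms (flag names as they appear elsewhere in this thread; the 600-second values are just placeholders):

# Wait up to 600s for the Helm client to reach tiller (TillerConnectionTimeout):
helm init --wait --tiller-connection-timeout 600
# Wait up to 600s for the chart's resources to become ready (InstallTimeout):
helm install stable/spinnaker --wait --timeout 600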

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@EddFigueiredo

I'm having all kinds of timeouts, both on tiller and on install (transport is closing as well), using helm 2.9.0. It's really annoying; there is an Azure LB involved, but I already set its idle timeout to 10 minutes.

Even using --wait --timeout 600 and --tiller-connection-timeout 600 doesn't seem to fix the problem.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@uxon123 commented Aug 23, 2018

Still encountering the same problem with install --timeout (although I think the timeout is much shorter than 5 minutes; it seems to be a non-configurable 1 or 2 minutes).
Is nobody looking at this?

@uxon123 commented Aug 24, 2018

/remove-lifecycle rotten

@bacongobbler (Member) commented Aug 24, 2018

As far as I understand, nobody in the community is currently looking into this particular issue. If you determine what the underlying issue is, we'd appreciate a patch!

@badloop commented Dec 14, 2018

From what I can tell, the timeout is only not honored when attempting to do things remotely. If I log into one of my Kubernetes nodes and issue the commands from there, it will sit there until the operation completes fully, whereas if I try to do it remotely it times out fairly quickly. This lends credence to the theory that the issue lies with whatever load balancer is in front of Kubernetes.
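One way to test that theory without Helm in the picture is to open an idle port-forward from both locations and see which one drops; a rough sketch, using the tiller deployment as a convenient long-lived target:

# Open a tunnel and leave it idle. Run this once from a workstation (through the
# load balancer) and once from a node. If only the remote run prints
# "lost connection to pod", the drop is happening in front of the API server.
kubectl -n kube-system port-forward deployment/tiller-deploy 44134:44134 &
sleep 600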

@jamstar commented Dec 21, 2018

I hit this as well; it's kinda annoying since we have a big deployment.

@andrvin commented Jun 12, 2019

Still getting this error when trying to upgrade a helm release with the --wait option.

helm upgrade release-name app --wait
E0612 15:28:57.991943   12415 portforward.go:233] lost connection to pod
helm version
Client: &version.Version{SemVer:"v2.14.1", GitCommit:"5270352a09c7e8b6e8c9593002a73535276507c0", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.1", GitCommit:"5270352a09c7e8b6e8c9593002a73535276507c0", GitTreeState:"clean"}

The only workaround that I found is to add your tiller service address via the --host option:

helm upgrade release-name app --wait --host=10.104.4.82:44134
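Presumably this works because --host makes the client dial Tiller's gRPC endpoint directly instead of going through the kubectl port-forward tunnel that produces the portforward.go errors above. A minimal sketch of the same workaround using the in-cluster service name, assuming the default tiller-deploy service in kube-system (only reachable from inside the cluster network):

# Confirm the Tiller service and port, then point the client at it directly,
# bypassing the port-forward tunnel entirely:
kubectl -n kube-system get svc tiller-deploy
helm upgrade release-name app --wait --host=tiller-deploy.kube-system.svc.cluster.local:44134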

@gvenka008c commented Jul 16, 2019

We are also seeing a similar issue when we do helm install/upgrade/delete/ls. Any thoughts?

For example:

# helm upgrade  -i app app/ncs --version 10.4.3318-2 --namespace ns 
E0716 17:44:58.671861   11327 portforward.go:178] lost connection to pod
Error: UPGRADE FAILED: transport is closing

# helm version
Client: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}

@bacongobbler (Member)

@gvenka008c have you had a look at #2025 (comment)?

@gvenka008c

@bacongobbler Yes, checked that. We have one node where helm is installed, and tiller runs on our k8s cluster nodes (3 nodes). We do all helm installs from this server, which talks to the k8s cluster. Not sure if our proxy is blocking the traffic or timing out. Let me check on that. Thanks.

@torubylist

> @bacongobbler Yes, checked that. We have one node where helm is installed, and tiller runs on our k8s cluster nodes (3 nodes). We do all helm installs from this server, which talks to the k8s cluster. Not sure if our proxy is blocking the traffic or timing out. Let me check on that. Thanks.

Any progress?

@kotamahesh

What is the update?

@monotek commented Nov 13, 2019

try with helm 3.0 ;-)

@bacongobbler (Member)

I wouldn't count on upgrading to 3.0 being the fix. If @kotamahesh and @torubylist are experiencing the same issues as @badloop described back in 2018 (#2025 (comment)), then the issue isn't with Helm but with the load balancer fronting the Kubernetes API server, which is closing the long-running connection too early.

It's worth giving it a shot, at the very least.

@bacongobbler (Member) commented Nov 13, 2019

@torubylist and @kotamahesh, if you wouldn't mind sharing your experiences, that would be more helpful. That way we can help try to diagnose the issue you are seeing and direct you towards a potential solution.

@kotamahesh

@bacongobbler, thanks for the response. I don't think upgrading Helm will resolve the issue; I also suspect this was due to connectivity issues between the Kube API and the LB, as our lab has some network issues. I will get back to you if I can reproduce the issue after the network problem is resolved.

@kotamahesh

Hi @bacongobbler, after the network issues were resolved, I am unable to reproduce the issue.
What you said was right: "it's with the load balancer fronting the Kubernetes API server that's closing the long-running connection too early."

@mKlaris commented Jan 31, 2020

@bacongobbler Hello, we had the same issue. We are using OpenStack on-premises. I had this problem only on a Kubernetes multi-master deployment, where kube-apiserver is deployed behind an LB (Octavia). Our resolution was to increase the timeout_client_data and timeout_member_data HAProxy parameters. The default is 5000.

openstack loadbalancer listener set --timeout-client-data 500000 <ID>
openstack loadbalancer listener set --timeout-member-data 500000 <ID>

@stanislav-zaprudskiy commented Apr 3, 2020

Running into a similar issue with AWS EKS and helm (2.16.1) upgrade with --timeout set to 604800 (7d). I have a Job running as a post-install/pre-upgrade hook, which requires a few hours to complete, but Helm reports the deployment as failed after about an hour:

INSTALL FAILED
PURGING CHART
Error: Failed to deploy release-name
Successfully purged a chart!
Error: Failed to deploy release-name

With the following accompanying tiller log:

[tiller] 2020/04/02 12:47:11 warning: Release release-name post-install chart-name/templates/helm-hook-job.yaml could not complete: Failed to deploy release-name

As per the hook configuration, the job isn't deleted by Helm; in Kubernetes it actually continues to run and completes after a couple of hours.

I tend to think this is caused by what @andreychernih mentioned above in #2025 (comment), but how could I verify that it is actually the AWS EKS API server that terminates the "watch" operation for the job and fails the helm deployment?

@technosophos (Member)

You could start a pod inside of your cluster, install the Helm client there, and run the deployment entirely inside of the cluster. That would only rule in/out some things (like whether a load balancer in the middle was terminating the connection), but it is at least a good debugging step that should provide some useful information.
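A rough sketch of that debugging step, assuming the community alpine/helm image and the default tiller-deploy service; adjust the image tag, namespace, release name, and chart path for your setup:

# Start a throwaway pod with the Helm v2 client and open a shell in it:
kubectl run helm-debug -it --rm --restart=Never --image=alpine/helm:2.16.1 --command -- sh
# Inside the pod, talk to Tiller over the cluster network (no port-forward, no external LB):
helm ls --host=tiller-deploy.kube-system.svc.cluster.local:44134
helm upgrade release-name ./chart --wait --timeout 604800 --host=tiller-deploy.kube-system.svc.cluster.local:44134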

@lsoica commented May 4, 2020

> Still getting this error when trying to upgrade a helm release with the --wait option.
>
> helm upgrade release-name app --wait
> E0612 15:28:57.991943   12415 portforward.go:233] lost connection to pod
> helm version
> Client: &version.Version{SemVer:"v2.14.1", GitCommit:"5270352a09c7e8b6e8c9593002a73535276507c0", GitTreeState:"clean"}
> Server: &version.Version{SemVer:"v2.14.1", GitCommit:"5270352a09c7e8b6e8c9593002a73535276507c0", GitTreeState:"clean"}
>
> The only workaround that I found is to add your tiller service address via the --host option:
>
> helm upgrade release-name app --wait --host=10.104.4.82:44134

@andrvin Any clue on why specifying --host would work around the problem?

@rajatjindal (Contributor)

Hi @technosophos

This issue might be similar to what is reported in this issue: kubernetes/kubernetes#67817

We were running into this issue as well, and fixed it with this PR: proofpoint#16

I would be more than happy to submit the PR here if that looks OK.

Thanks

@bacongobbler (Member)

This issue should (finally) be fixed with #8507, which will become available in Helm 3.4.0. Let us know if that does not fix the issue present here. Thanks!

leonk added a commit to ministryofjustice/prisoner-content-hub-backend that referenced this issue Aug 23, 2021