Helm Installs Don't Honor Timeout #2025

Closed · pluttrell opened this issue Feb 24, 2017 · 41 comments

@pluttrell commented Feb 24, 2017

I have tried to install Spinnaker via Helm many times. Most attempts fail, a lot of them with this error:

$ helm install stable/spinnaker --namespace spinnaker --wait --timeout 1500
E0224 13:38:51.978584   19536 portforward.go:175] lost connection to pod
Error: transport is closing

Note that they fail in just a couple of minutes, not the 25 minutes specified. Shouldn't they be honoring the specified timeout?

@pluttrell changed the title from “Helm Installs of Spinnaker Consistently Fail with "lost connection to pod"” to “Helm Installs Don't Honor Timeout” on Feb 24, 2017
@technosophos (Member)

I wonder if the port forwarder also needs to be given the timeout window. @adamreese do tunnels close if idle for some amount of time?

@adamreese (Member)

@technosophos Tunnels will remain open until closed.
@pluttrell Did the pod restart? I wonder if the connection dropped because the pod was killed.
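A quick way to check, assuming Tiller was installed with the default labels in kube-system (adjust the namespace/labels otherwise):

# A non-zero RESTARTS count would suggest the pod was killed mid-operation.
kubectl -n kube-system get pods -l app=helm,name=tiller
# Recent kills or evictions also show up in the namespace events:
kubectl -n kube-system get events --field-selector involvedObject.kind=Pod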

@pluttrell (Author)

No, I am not seeing the pods restart.

@jarovo commented Mar 27, 2017

I have the same problem without using Helm at all, just issuing this to connect to OpenShift:
oc -n podify port-forward postgresql-1-hal3g
It seems to work for at least about an hour, then it starts failing.

In /etc/origin/node/node-config.yaml I tried setting
streamingConnectionIdleTimeout: 0
and restarted the node, but this didn't help.

@technosophos (Member)

So this may be an upstream issue rather than a Helm-specific one. Can we get a list of Kubernetes versions/distributions that this is showing up in?

@a-chernykh (Contributor)

Having the same problem, helm upgrade --wait --timeout 600 times out after 5 minutes. I see the following:

+ helm upgrade --install --wait --set minReplicas=1 --set maxReplicas=16 --set env=stage --values values.json --timeout 600 --debug --version 0.0.16 --recreate-pods merlin jiff/docker-compose
[debug] Created tunnel using local port: '38247'

[debug] SERVER: "localhost:38247"

[debug] Fetched jiff/docker-compose to /root/.helm/cache/archive/docker-compose-0.0.16.tgz

E0810 20:32:00.022184     135 portforward.go:178] lost connection to pod
Error: UPGRADE FAILED: transport is closing

Helm version: 2.5.1
Kubernetes version: 1.6.2

@tjquinno commented Oct 2, 2017

As @bacongobbler suggested, I'm summarizing my experience with the same problem here.

helm 2.6.1 client and server
k8s 1.7.4 (same behavior with 1.7.0 also)

When I run helm delete --purge xxx, helm reports

portforward.go:178] lost connection to pod
Error: transport is closing

Yet the delete seems to have succeeded. All the k8s resources defined by the chart are cleaned up as expected. The tiller log looks like this:

[storage] 2017/09/28 14:14:19 getting release history for "xxx"
[tiller] 2017/09/28 14:14:19 uninstall: Deleting xxx
[tiller] 2017/09/28 14:14:19 executing 0 pre-delete hooks for xxx
[tiller] 2017/09/28 14:14:19 hooks complete for pre-delete xxx
[storage] 2017/09/28 14:14:19 updating release "xxx.v1"
 (many lines of "Starting delete for yyy" and "Using reaper for deleting yyy" omitted here)
[tiller] 2017/09/28 14:16:42 executing 0 post-delete hooks for xxx
[tiller] 2017/09/28 14:16:42 hooks complete for post-delete xxx
[tiller] 2017/09/28 14:16:42 purge requested for xxx
[storage] 2017/09/28 14:16:42 deleting release "xxx.v1"

That's the end of the log. It contains no errors or exceptions.

k8s reports no restarts of the tiller pod.

The elapsed time, from outside and from the log, is about 2m 30s, well under the default timeout value for the delete operation.

Also, on #2983 @bacongobbler asked "Can you check on the load balancer fronting your kubernetes master API ...?" To my knowledge there is no LB in front of the master in this dev cluster, but I'll double-check; if there is, is there anything specific I should be looking for?

@RiaanLab commented Jan 5, 2018

Same issue on k8s 1.6.7 when installing coreos/prometheus-operator

Command is:
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring --timeout 60000 --wait

Error is:
E0106 00:25:51.908673 18541 portforward.go:178] lost connection to pod
Error: transport is closing

@ljnelson

Per @technosophos' request:

Can we get a list of Kubernetes versions/distributions that this is showing up in?

Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.5-2+76a8892212cc28", GitCommit:"76a8892212cc28b8c628867f41f73ce8b755685e", GitTreeState:"clean", BuildDate:"2018-01-02T16:40:36Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Helm 2.7.2 but I don't think that matters.

Relevant line in Kubernetes is here: https://github.com/kubernetes/kubernetes/blob/v1.8.3/staging/src/k8s.io/client-go/tools/portforward/portforward.go#L177-L178

I guess we need to figure out what's firing something on that close chan.

@ljnelson commented Jan 20, 2018

I may be way off, but ultimately the connection in question is created here: https://github.com/kubernetes/client-go/blob/master/tools/portforward/portforward.go#L138

It is an httpstream.Connection, as can be seen here: https://github.com/kubernetes/client-go/blob/master/tools/portforward/portforward.go#L46

That interface specifies a SetIdleTimeout function here: https://github.com/kubernetes/apimachinery/blob/master/pkg/util/httpstream/httpstream.go#L80

It is probably implemented here: https://github.com/kubernetes/apimachinery/blob/master/pkg/util/httpstream/spdy/connection.go#L141-L145

I'm wondering if the genuine bug we're seeing here is that the idle timeout is left unset, so somewhere deep in Docker's spdystream package something is forcibly closing the connection. It appears that you could set the idle timeout on the Docker spdystream connection, but I don't think anyone does.

At the moment there's no way to put an idle timeout on that connection in between its creation and usage in forward().

Am I on the right track? This issue will prevent helm chart hierarchies above a certain size from being installed, I think.

@a-chernykh (Contributor)

I think this kind of problem is inevitable with long-running HTTP requests. It's normal for load balancers to kill idle connections. For example, if kube-apiserver is sitting behind an ELB, consider increasing its idle_timeout.
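If that is the setup, the idle timeout on a classic ELB can be raised with the AWS CLI; a minimal sketch, with the load balancer name as a placeholder:

# Raise the ELB idle timeout to 10 minutes so long-running watches aren't cut off.
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-k8s-api-elb \
  --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":600}}"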

I think a proper solution to this problem is to switch to polling, i.e. wait a few minutes for resources to be ready, close the connection, and set up a timer to poll tiller periodically. The disadvantage of this approach is that helm would have to establish a new connection to tiller every few seconds, but that's certainly better than timing out.
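Until Helm does something like that internally, the same idea can be approximated from the client side by skipping --wait and polling readiness with kubectl instead; a rough sketch, reusing the release and chart names from my command above:

# Apply the release without holding a long-lived --wait connection open...
helm upgrade --install merlin jiff/docker-compose
# ...then poll for readiness in short watches instead of one long one.
until kubectl rollout status deployment/merlin --timeout=30s; do
  echo "not ready yet, polling again..."
  sleep 15
done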

@helgi (Contributor) commented Mar 9, 2018

This was fixed in 2.8.0, and 2.8.2 is coming out with an improved version of that (the transport-closing fixes).

@bacongobbler (Member)

It might've been fixed by proxy, but I don't think this was directly fixed. Same error, different area of the code.

@helgi (Contributor) commented Mar 9, 2018

@bacongobbler oh hmm, which part? When I was running into transport-closed issues it was with portforward as well, since AWS NAT Gateways were closing it (#3182); that's why I am curious which area it is.

@bacongobbler (Member)

We fixed helm init --wait timeout issues, but not specifically helm install --timeout.

install --timeout uses the helm.InstallTimeout option, whereas init --wait uses the helm.TillerConnectionTimeout option.

@bacongobbler (Member)

They're sorta confusing, but essentially (see the flag examples after this list):

  • TillerConnectionTimeout specifies the timeout duration to connect to tiller
  • InstallTimeout specifies the timeout to wait for resources to become ready during helm install
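In CLI terms (flag names as they appear elsewhere in this thread; the 600-second values are just placeholders):

# Wait up to 600s for the Helm client to reach tiller (TillerConnectionTimeout):
helm init --wait --tiller-connection-timeout 600
# Wait up to 600s for the chart's resources to become ready (InstallTimeout):
helm install stable/spinnaker --wait --timeout 600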

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@EddFigueiredo

I'm having all kinds of timeouts, both on tiller and on install (transport is closing as well), using helm 2.9.0. It's really annoying; there is an Azure LB involved, but I already set its idle timeout to 10 minutes.

Even using --wait --timeout 600 and --tiller-connection-timeout 600 doesn't seem to fix the problem.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@uxon123 commented Aug 23, 2018

Still encountering the same problem with install --timeout (although I think the timeout is much shorter than 5 minutes; it seems to be a non-configurable 1 or 2 minutes).
Is nobody looking at this?

@uxon123 commented Aug 24, 2018

/remove-lifecycle rotten

@bacongobbler (Member) commented Aug 24, 2018

As far as I understand, nobody in the community is currently looking into this particular issue. If you determine what the underlying issue is, we'd appreciate a patch!

@badloop commented Dec 14, 2018

From what I can tell, the timeout is only not honored when attempting to do things remotely. If I log into one of my Kubernetes nodes and issue the commands from there, it will sit there until the operation completes fully, whereas if I try to do it remotely it times out fairly quickly. This lends credence to the theory that the issue lies with whatever load balancer is in front of Kubernetes.
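One way to test that theory without Helm in the picture is to open an idle port-forward from both locations and see which one drops; a rough sketch, using the tiller deployment as a convenient long-lived target:

# Open a tunnel and leave it idle. Run this once from a workstation (through the
# load balancer) and once from a node. If only the remote run prints
# "lost connection to pod", the drop is happening in front of the API server.
kubectl -n kube-system port-forward deployment/tiller-deploy 44134:44134 &
sleep 600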

@jamstar commented Dec 21, 2018

I hit this as well; it's kinda annoying since we have a big deployment.

@andrvin commented Jun 12, 2019

Still getting this error when trying to upgrade a helm release with the --wait option.

helm upgrade release-name app --wait
E0612 15:28:57.991943   12415 portforward.go:233] lost connection to pod
helm version
Client: &version.Version{SemVer:"v2.14.1", GitCommit:"5270352a09c7e8b6e8c9593002a73535276507c0", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.1", GitCommit:"5270352a09c7e8b6e8c9593002a73535276507c0", GitTreeState:"clean"}

The only workaround that I found is to add your tiller service address via the --host option:

helm upgrade release-name app --wait --host=10.104.4.82:44134
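Presumably this works because --host makes the client dial Tiller's gRPC endpoint directly instead of going through the kubectl port-forward tunnel that produces the portforward.go errors above. A minimal sketch of the same workaround using the in-cluster service name, assuming the default tiller-deploy service in kube-system (only reachable from inside the cluster network):

# Confirm the Tiller service and port, then point the client at it directly,
# bypassing the port-forward tunnel entirely:
kubectl -n kube-system get svc tiller-deploy
helm upgrade release-name app --wait --host=tiller-deploy.kube-system.svc.cluster.local:44134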

@gvenka008c commented Jul 16, 2019

We are also seeing a similar issue when we do helm install/upgrade/delete/ls. Any thoughts?

For example:

# helm upgrade  -i app app/ncs --version 10.4.3318-2 --namespace ns 
E0716 17:44:58.671861   11327 portforward.go:178] lost connection to pod
Error: UPGRADE FAILED: transport is closing

# helm version
Client: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}

@bacongobbler (Member)

@gvenka008c have you had a look at #2025 (comment)?

@gvenka008c

@bacongobbler Yes, checked that. We have one node where helm is installed, and tiller runs on our k8s cluster nodes (3 nodes). We do all helm installs from this server, which talks to the k8s cluster. Not sure if our proxy is blocking the traffic or timing out. Let me check on that. Thanks.

@torubylist

> @bacongobbler Yes, checked that. We have one node where helm is installed, and tiller runs on our k8s cluster nodes (3 nodes). We do all helm installs from this server, which talks to the k8s cluster. Not sure if our proxy is blocking the traffic or timing out. Let me check on that. Thanks.

Any progress?

@kotamahesh

What is the update?

@monotek commented Nov 13, 2019

try with helm 3.0 ;-)

@bacongobbler (Member)

I wouldn't count on upgrading to 3.0 being the fix. If @kotamahesh and @torubylist are experiencing the same issues as @badloop described back in 2018 (#2025 (comment)), then the issue isn't with Helm but with the load balancer fronting the Kubernetes API server, which is closing the long-running connection too early.

It's worth giving it a shot, at the very least.

@bacongobbler (Member) commented Nov 13, 2019

@torubylist and @kotamahesh, if you wouldn't mind sharing your experiences, that would be more helpful. That way we can help try to diagnose the issue you are seeing and direct you towards a potential solution.

@kotamahesh

@bacongobbler, thanks for the response. I don't think upgrading Helm will resolve the issue; I also suspect this was due to connectivity issues between the Kube API and the LB, as our lab has some network issues. I will get back to you if I can reproduce the issue after the network problem is resolved.

@kotamahesh

Hi @bacongobbler, after the network issues were resolved, I am unable to reproduce the issue.
What you said was right: "it's with the load balancer fronting the Kubernetes API server that's closing the long-running connection too early."

@mKlaris commented Jan 31, 2020

@bacongobbler Hello, we had the same issue. We are using OpenStack on-premises. I had this problem only on a Kubernetes multi-master deployment, where kube-apiserver is deployed behind an LB (Octavia). Our resolution was to increase the timeout_client_data and timeout_member_data HAProxy parameters. The default is 5000.

openstack loadbalancer listener set --timeout-client-data 500000 <ID>
openstack loadbalancer listener set --timeout-member-data 500000 <ID>

@stanislav-zaprudskiy commented Apr 3, 2020

Running into a similar issue with AWS EKS and helm (2.16.1) upgrade with --timeout set to 604800 (7d). I have a Job running as a post-install/pre-upgrade hook, which requires a few hours to complete, but Helm reports the deployment as failed after about an hour:

INSTALL FAILED
PURGING CHART
Error: Failed to deploy release-name
Successfully purged a chart!
Error: Failed to deploy release-name

With the following accompanying tiller log:

[tiller] 2020/04/02 12:47:11 warning: Release release-name post-install chart-name/templates/helm-hook-job.yaml could not complete: Failed to deploy release-name

As per the hook configuration, the job isn't deleted by Helm; in Kubernetes it actually continues to run and completes after a couple of hours.

I tend to think this is caused by what @andreychernih mentioned above in #2025 (comment), but how could I verify that it is actually the AWS EKS API server that terminates the "watch" operation for the job and fails the helm deployment?

@technosophos (Member)

You could start a pod inside of your cluster, install the Helm client there, and run the deployment entirely inside of the cluster. That would only rule in/out some things (like whether a load balancer in the middle was terminating the connection), but it is at least a good debugging step that should provide some useful information.
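A rough sketch of that debugging step, assuming the community alpine/helm image and the default tiller-deploy service; adjust the image tag, namespace, release name, and chart path for your setup:

# Start a throwaway pod with the Helm v2 client and open a shell in it:
kubectl run helm-debug -it --rm --restart=Never --image=alpine/helm:2.16.1 --command -- sh
# Inside the pod, talk to Tiller over the cluster network (no port-forward, no external LB):
helm ls --host=tiller-deploy.kube-system.svc.cluster.local:44134
helm upgrade release-name ./chart --wait --timeout 604800 --host=tiller-deploy.kube-system.svc.cluster.local:44134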

@lsoica commented May 4, 2020

> Still getting this error when trying to upgrade a helm release with the --wait option.
>
> helm upgrade release-name app --wait
> E0612 15:28:57.991943   12415 portforward.go:233] lost connection to pod
> helm version
> Client: &version.Version{SemVer:"v2.14.1", GitCommit:"5270352a09c7e8b6e8c9593002a73535276507c0", GitTreeState:"clean"}
> Server: &version.Version{SemVer:"v2.14.1", GitCommit:"5270352a09c7e8b6e8c9593002a73535276507c0", GitTreeState:"clean"}
>
> The only workaround that I found is to add your tiller service address via the --host option:
>
> helm upgrade release-name app --wait --host=10.104.4.82:44134

@andrvin Any clue on why specifying --host would work around the problem?

@rajatjindal (Contributor)

Hi @technosophos

This issue might be similar to what is reported in this issue: kubernetes/kubernetes#67817

We were running into this issue as well, and fixed it with this PR: proofpoint#16

I would be more than happy to submit the PR here if that looks OK.

Thanks

@bacongobbler (Member)

This issue should (finally) be fixed with #8507, which will become available in Helm 3.4.0. Let us know if that does not fix the issue present here. Thanks!

leonk added a commit to ministryofjustice/prisoner-content-hub-backend that referenced this issue Aug 23, 2021