Number of open connections spikes when retryable errors occur #2087
Comments
I talked to Tomas about this and I think I understand the current functionality - one potential improvement would be to add a backoff to the retries.
Thanks for opening this @dtuck9. Did you see it happening in any 8.x versions of the JS client, even if it was less often?
Unfortunately, given the scale of ESS, I am unable to identify or diagnose this issue on a widespread basis unless there's an incident or customer ticket. I would assume that it's happening in 8.x, and based on having seen a similar issue originating from Beats user agents, I'd also assume similar logic exists in the Go client.
Going to pick this up next week for investigation/remediation. From a Slack convo with @dgieselaar and some Kibana folks:
I believe that last detail, that the request is cancelled, may be the critical point. Especially if it's always happening at 30 seconds, since that is the default request timeout set by the client. Both 7.x and 8.x versions of the client only retry by default for:
Those last two are the primary suspects. I need to dig into what conditions cause them.
This is true, but not by accident: retrying on timeout is intentional behavior. It wouldn't be too hard to add an option to disable it. What I can also do is, as suggested above, add an incremental backoff to the retries. Right now the client retries immediately, so adding an exponential backoff with a little jitter could ease some pain. I'd feel comfortable enough introducing that as the default behavior without considering it a breaking change. I'll start implementing the backoff, and discuss the default "retry on timeout" behavior with internal stakeholders (Kibana core, Cloud SRE, etc.).
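The backoff described above could look roughly like the sketch below. This is not the actual elastic-transport-js implementation; the function name, base delay, and cap are illustrative assumptions.

```javascript
// Sketch of exponential backoff with "full jitter" between retries.
// NOT the client's actual code; names and defaults are hypothetical.
function retryDelayMs(attempt, baseMs = 100, capMs = 30000) {
  // Exponential growth: base * 2^attempt, capped so the wait stays bounded.
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick a random delay in [0, exp) so many clients don't
  // retry in lockstep and hammer an already-struggling cluster.
  return Math.random() * exp;
}

// A retry loop would sleep for retryDelayMs(attempt) between attempts
// instead of retrying immediately.
```

The jitter matters as much as the exponential growth here: without it, thousands of clients that failed at the same moment would all retry at the same moments too.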
The general consensus from the Kibana core and SRE folks who were involved in the conversation is that it's not a breaking change to turn off retry-on-timeout by default. elastic/elastic-transport-js#100 turns it off, with a new option to re-enable it. I also still need to add the retry backoff.
🐛 Bug Report
The retry logic appears to open connections without explicitly closing them, and consequently exhausts resources.
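A toy model of that failure mode, under the assumption stated in this report: each timed-out request is retried on a fresh connection while the old socket is left open. The function and numbers are illustrative, not the client's actual connection handling.

```javascript
// Toy model: every retry opens a new socket. If the old socket is not
// closed before retrying, sockets accumulate per retry cycle; if it is
// closed, each client keeps a single open connection.
// Purely illustrative; not the client's real connection handling.
function openConnections(clients, retriesPerRequest, closeOnRetry) {
  const socketsPerClient = closeOnRetry ? 1 : 1 + retriesPerRequest;
  return clients * socketsPerClient;
}

console.log(openConnections(2000, 3, false)); // 8000 - sockets pile up
console.log(openConnections(2000, 3, true));  // 2000 - stays flat
```

With errors that keep recurring (as during the incident below), the leaked sockets from successive cycles compound far beyond this single-cycle model.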
Example
We had an incident in which an ESS customer cluster reached its high disk watermark, and the cluster began returning 408s and 429s, as well as 5XX as the cluster became overloaded and unable to respond. We have also seen this occur with auth errors (api key not found) via beats agents.
The distinct number of clients hovered around 2,000 on average before and during the incident, so the customer does not appear to have added clients that would explain the sudden increase in the number of connections.
With the ~2000 clients, the requests reached the Proxy-imposed 5000 per-Proxy connection limit across all 62 Proxies in the region, so roughly 310,000 open connections from 2000 clients.
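The stated limits multiply out as described:

```javascript
// Per-proxy connection limit times the number of proxies in the region.
const perProxyLimit = 5000;
const proxies = 62;
console.log(perProxyLimit * proxies); // 310000
```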
The graph below shows the distinct number of clients (green), concurrent connections per Proxy (blue), and the request errors as defined by `status_code >= 400` (pink, at 1/1000 scale): https://platform-logging.kb.eastus.azure.elastic-cloud.com/app/r/s/e46ep
The top 2 user agents with the most open connections are:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Connections should be closed to avoid a build-up of open connections over time.
Your Environment
@elastic/elasticsearch
version: >=7.0.0