Server becomes unresponsive because max_connections is reached during read #1282
Comments
Hi @ozcelgozde, thank you for reporting this issue. Let us investigate.
Hi @jenrryyou, please help look into why there are so many brpc connections open?
Hi @ozcelgozde, our developers are not sure where the high number of RPC connections comes from. We don't think it comes from brpc. To monitor the number of
A few samples:

```
2024.03.08 15:22:21.173052 [ 7038 ] {} CnchLock: acquire 1 locks in 2 ms
```

As you can see, it blocks everything :)
Hi @ozcelgozde, thank you for your cooperation. There should be no interaction between workers and servers via HTTP. I wonder if the HTTP connections come from liveness checks or metrics scraping like Prometheus.
And may I know where you see that it is set to 16?
And could you tell me how you determined that the large number of connections is HTTP? I think the method you used may have a mistake.
mentioned in #1153
I'm not sure it's directly related, but I was able to make the server go into an unresponsive state by running queries from clickhouse-client. I'm also trying to find what causes this :) The metric I looked at for HTTP is cnch_current_metrics_http_connection.
I traced the 16 to contrib/poco/Foundation/src/ThreadPool.cpp. I couldn't find anywhere we set the max capacity, and the default is 16, so I assumed that's what is in effect.
I did not find this metric name in the ByConity source code. I wonder whether this metric derives from
yes, I'm using S3
S3 should not be related here, because the S3 client sends HTTP requests from workers only. If you encounter the issue again, could you help to execute the command
Here are some of the commands I use with netstat to quickly count the number of HTTP connections.
My HTTP port is 18685, and you can see there are 2 sockets because I connect to the server from my local machine.
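The exact commands were lost in this copy of the thread; a minimal sketch of the kind of counting described, assuming the HTTP port 18685 mentioned above:

```shell
# Count ESTABLISHED sockets on the HTTP port (18685, per this thread).
# These are the classic net-tools flags; on newer systems `ss -tan`
# is a drop-in substitute for `netstat -ant`.
http_count=$(netstat -ant 2>/dev/null | awk '$4 ~ /:18685$/ && $6 == "ESTABLISHED"' | wc -l)
echo "ESTABLISHED HTTP sockets: $http_count"
```

Grouping the same output by peer host (e.g. piping through `sort | uniq -c | sort -rn`) gives a per-host breakdown like the one shown later in this thread.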
I was finally able to reproduce it reliably when I send a lot of queries over clickhouse-client at once. I wrote a simple script that sends 500 simultaneous queries repeatedly, up to 4000 queries. This, as expected, makes the transaction and TCP socket counts spike. netstat also shows a lot of TCP connections. A few errors I can see from the read and server replicas are:
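The script itself isn't shown in the thread; a rough sketch of such a load generator, assuming clickhouse-client is on PATH and `SELECT 1` stands in for the real queries:

```shell
#!/bin/sh
# Hypothetical reproduction: 8 batches of 500 concurrent queries (~4000 total).
# Host, query, and batch sizes are assumptions, not the reporter's exact script.
run_batch() {
  i=1
  while [ "$i" -le 500 ]; do
    clickhouse-client --host "${CH_HOST:-localhost}" --query "SELECT 1" >/dev/null &
    i=$((i + 1))
  done
  wait  # let the whole batch finish before firing the next one
}
# Uncomment to run all 8 batches:
# b=1; while [ "$b" -le 8 ]; do run_batch; b=$((b + 1)); done
```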
Is there a way to know my simultaneous query limit over TCP?
We have settings to limit the number of queries, like
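The concrete setting names were cut off here. In upstream ClickHouse (which ByConity derives from), the server-level knob is `max_concurrent_queries`, with a per-user variant `max_concurrent_queries_for_user`; whether ByConity exposes these identically is an assumption, not something confirmed in this thread:

```
<!-- ClickHouse-style server config sketch; ByConity support is assumed -->
<max_concurrent_queries>200</max_concurrent_queries>
```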
A lot of HTTP connections come from HTTP insert. We have ingestion over HTTP, which at peak times causes the same effect, I would think.
I see. I've tried sending many queries to reproduce this and also hit the resource issue where threads can't be created. We are going to merge some MRs to improve the QPS, but I'm not sure it will change much. So I think you may have to try to squash multiple HTTP requests into one to reduce the QPS into the system.
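One way to squash many small inserts into a single request, sketched with curl against the standard ClickHouse HTTP port (8123); the host, table name, and port here are placeholders, not details from this thread:

```shell
# Send many rows in ONE HTTP INSERT instead of one request per row.
# Reads TSV rows from stdin; the endpoint and table name are hypothetical.
batch_insert() {
  curl -sS "http://${CH_SERVER:-localhost}:8123/?query=INSERT%20INTO%20events%20FORMAT%20TSV" \
       --data-binary @-
}
# usage: printf '1\talpha\n2\tbeta\n' | batch_insert
```

Accumulating rows client-side and flushing them in one request trades a little latency for far fewer connections and query slots on the server.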
Thank you @dmthuc for your help and interest!
Question
When I run a couple of SELECT queries that try to find rare terms on a column indexed by tokenbf, the server readiness probe starts to fail on Kubernetes because a lot of sockets are opened on the server side for RPC calls. The server also becomes unresponsive to all TCP calls.
After a few minutes, they mostly recover. Breakdown of the sockets for server-0, for example:

```
1318 byconity-vw-vw-def
 697 10-64-116-211.byco
 180 ip-10-64-66-224.u
 142 ip-10-64-231-57.u
  77 ip-10-64-166-98.u
  73 s3-us-west-2-r-w.
  34 ip-10-64-231-57.us
  28 ip-10-64-66-224.us
  22 ip-10-64-166-98.us
```
So a lot of sockets are opened to the read workers on the servers. Is there a way to limit this to keep the server functional? What causes new sockets to be opened for the read workers?