
Server becomes unresponsive because max_connections is reached during read #1282

Closed

ozcelgozde opened this issue Mar 5, 2024 · 21 comments

Labels: performance (The issue caused by the load is too high for the capacity of the system), question (Further information is requested)
@ozcelgozde
Contributor

Question

When I run a couple of SELECT queries that try to find rare terms on a column indexed by tokenbf, the server readiness probe starts to fail on Kubernetes because a lot of sockets are opened on the server side for RPC calls. The server also becomes unresponsive to all TCP calls.

[Screenshot 2024-03-05 at 20:18:30]

After a few minutes, they mostly recover. Here is the socket breakdown for server-0, for example:
1318 byconity-vw-vw-def
697 10-64-116-211.byco
180 ip-10-64-66-224.u
142 ip-10-64-231-57.u
77 ip-10-64-166-98.u
73 s3-us-west-2-r-w.
34 ip-10-64-231-57.us
28 ip-10-64-66-224.us
22 ip-10-64-166-98.us
So there are a lot of sockets opened to the read workers on the servers. Is there a way to limit this so the server stays functional? What causes new sockets to be opened to the read workers?

@ozcelgozde added the question label on Mar 5, 2024
@dmthuc
Collaborator

dmthuc commented Mar 6, 2024

Hi @ozcelgozde, thank you for reporting this issue. Let us investigate.

@dmthuc
Collaborator

dmthuc commented Mar 6, 2024

Hi @jenrryyou, please help us look into why there are so many brpc connections open.

@dmthuc
Collaborator

dmthuc commented Mar 6, 2024

Hi @ozcelgozde, our developers are not sure where this high number of RPC connections comes from. We think it does not come from brpc. To monitor rpc_socket_count, you can scrape it from http://pod_ip:rpc_port/vars, or from http://pod_ip:rpc_port/brpc_metrics for Prometheus. Can you make a graph and check whether the high number of RPC connections is really caused by brpc?
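A rough way to watch that counter over time is to poll the endpoint from a small script, for example (a sketch only; the pod IP, RPC port, and the exact "name : value" text format of /vars are assumptions you may need to adapt):

import re
import time
import urllib.request

POD_IP = "10.64.116.211"   # placeholder: your server pod IP
RPC_PORT = 8124            # placeholder: your rpc_port

def rpc_socket_count():
    # /vars (or /brpc_metrics for Prometheus format) exposes the brpc counters
    url = f"http://{POD_IP}:{RPC_PORT}/vars"
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode()
    m = re.search(r"rpc_socket_count\s*:\s*(\d+)", text)
    return int(m.group(1)) if m else None

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), rpc_socket_count())
        time.sleep(10)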

@ozcelgozde
Contributor Author

Actually, what you said looks correct: the problem seems to be coming from the TCP connections and not the RPC connections.
[Screenshot 2024-03-06 at 20:35:01]
Let me dig deeper.

@ozcelgozde
Contributor Author

I notice a lot of HTTP requests:
[Screenshot 2024-03-08 at 18:48:17]

This causes Poco to throw "Got exception while starting thread for connection. No thread available". From what I can trace through the code, this limit is set to 16 by default, and I couldn't find a way to change it. Do you have an idea?

@ozcelgozde
Contributor Author

A few samples:
2024.03.08 15:22:21.065071 [ 11817 ] {} Catalog: Finish set commit time for txn 448242930570231913, elapsed 8 ms.
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'

2024.03.08 15:22:21.173052 [ 7038 ] {} CnchLock: acquire 1 locks in 2 ms
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'

As you can see, it blocks everything :)

@dmthuc
Collaborator

dmthuc commented Mar 11, 2024

Hi @ozcelgozde, thank you for your cooperation. There should be no interaction between workers and servers via HTTP. I wonder if the HTTP connections come from liveness checks or from metrics scraping like Prometheus.

@dmthuc
Collaborator

dmthuc commented Mar 11, 2024

And may I know where you saw that it is set to 16?

@dmthuc
Collaborator

dmthuc commented Mar 11, 2024

And could you tell us how you found out that the large number of connections is HTTP? The method you used may have a mistake.

@kevinthfang added the bug label on Mar 11, 2024
@dmthuc
Collaborator

dmthuc commented Mar 11, 2024

mentioned in #1153

@ozcelgozde
Contributor Author

I'm not sure it's directly related, but I was able to make the server go into an unresponsive state by running queries from clickhouse-client. I'm also trying to find what causes this :) The metric I looked at for HTTP is cnch_current_metrics_http_connection.

@ozcelgozde
Contributor Author

I traced the 16 to contrib/poco/Foundation/src/ThreadPool.cpp:

PooledThread* ThreadPool::getThread()
{
	FastMutex::ScopedLock lock(_mutex);

	if (++_age == 32)
		housekeep();

	PooledThread* pThread = 0;
	for (ThreadVec::iterator it = _threads.begin(); !pThread && it != _threads.end(); ++it)
	{
		if ((*it)->idle())
			pThread = *it;
	}
	if (!pThread)
	{
		if (_threads.size() < _maxCapacity)
		{
			pThread = createThread();
			try
			{
				pThread->start();
				_threads.push_back(pThread);
			} catch (...)
			{
				delete pThread;
				throw;
			}
		}
		else
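			// The pool has already grown to _maxCapacity, so the connection is
			// rejected with "No thread available". Poco's default ThreadPool
			// constructor caps maxCapacity at 16; a server needs to pass a larger
			// capacity when constructing the pool (in ClickHouse-derived servers
			// this is typically sized from the max_connections setting) to avoid
			// hitting this limit.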
			throw NoThreadAvailableException();
	}
	pThread->activate();
	return pThread;
}

I couldn't find anywhere that we set the max capacity, and the default is 16, so I assumed that's the limit.

@dmthuc
Collaborator

dmthuc commented Mar 12, 2024

cnch_current_metrics_http_connection

I did not find this metric name in the ByConity source code. I wonder whether it derives from ProfileEvents::CreatedHTTPConnections. By the way, is your cluster using S3 storage? If so, I think the number of HTTP connections may be related to S3 storage.

@ozcelgozde
Contributor Author

Yes, I'm using S3.

@dmthuc
Collaborator

dmthuc commented Mar 14, 2024

S3 should not be related here, because the S3 client sends HTTP requests from the workers only. If you encounter the issue again, could you run the netstat command on the server to see the source of each connection and its corresponding port? That should be the way to identify whether the connections are HTTP or not. Maybe you can post the output of the netstat command here so we can analyse it together.

@dmthuc
Collaborator

dmthuc commented Mar 14, 2024

Here is a command I use with netstat to quickly count the number of HTTP connections:

$ netstat --tcp | grep "ESTABLISHED" | grep "18685"
tcp6       0      0 localhost:18685         localhost:31556         ESTABLISHED
tcp6       0      0 localhost:31556         localhost:18685         ESTABLISHED

My HTTP port is 18685, and you can see there are 2 sockets because I connected to the server locally.
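If it helps, here is a rough sketch (not ByConity-specific; adjust the port numbers to your config) that groups ESTABLISHED sockets by local port, so you can quickly see whether the connections land on the HTTP port, the native TCP port, or the RPC port:

import collections
import subprocess

# netstat data lines look like: tcp  0  0  local_addr:port  remote_addr:port  STATE
out = subprocess.run(["netstat", "--tcp", "-n"],
                     capture_output=True, text=True).stdout

counts = collections.Counter()
for line in out.splitlines():
    parts = line.split()
    if len(parts) >= 6 and parts[0].startswith("tcp") and parts[5] == "ESTABLISHED":
        local_port = parts[3].rsplit(":", 1)[-1]
        counts[local_port] += 1

for port, n in counts.most_common():
    print(f"local port {port}: {n} connections")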

@ozcelgozde
Contributor Author

I was finally able to reproduce it reliably by sending a lot of queries over clickhouse-client at once. I wrote a simple script that sends 500 simultaneous queries repeatedly, up to 4000 queries (see the sketch at the end of this comment). This, as expected, makes the transaction and TCP socket counts spike. netstat also shows a lot of TCP connections. A few errors I can see on the read and server replicas are:

Failed to query, err: couldn't query clickhouse, query id: 30c83f8c-e23b-11ee-892c-02c7f6dbdbda, got err: "code: 2010, message: Query [30c83f8c-e23b-11ee-892c-02c7f6dbdbda] failed with RootCause: SegmentId: 0, ErrorCode:2010, Message: Fail to call DB.Protos.RegistryService.registry, error code: 2001, msg: [E2001][10.170.161.98:123456789]Create stream ExchangeDataKey[448383008663142595_1_0_18446744073709551615] for query 30c83f8c-e23b-11ee-892c-02c7f6dbdbda failed by exception: Code: 2010, e.displayText() = DB::Exception: Interrput accept for ExchangeDataKey[448383008663142595_1_0_18446744073709551615] SQLSTATE: HY000 (version 21.8.7.1); \n AdditionalErrors: SegmentId: 1, ErrorCode:2010, Message: Worker host:10.170.146.66:8124, exception:Code: 2010, e.displayText() = DB::Exception: Fail to call DB.Protos.RegistryService.registry, error code: 1008, msg: [E1008]Reached timeout=10000ms @10.170.161.98:8124 SQLSTATE: HY000 (version 21.8.7.1)  SQLSTATE: HY000"
Failed to query, err: couldn't query clickhouse, query id: 30c81674-e23b-11ee-8918-02c7f6dbdbda, got err: "read: read tcp 10.255.20.155:59968->10.170.185.174:9000: i/o timeout"
Failed to query, err: couldn't query clickhouse, query id: a55e6764-e23a-11ee-b252-02c7f6dbdbda, got err: "read: EOF"

Is there a way to know what my simultaneous query limit over TCP is?
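For reference, the script was along these lines (a rough sketch rather than the exact script; the host and query are placeholders):

import subprocess
from concurrent.futures import ThreadPoolExecutor

HOST = "server-0.byconity"  # placeholder
QUERY = "SELECT count() FROM my_table WHERE hasToken(message, 'rare_term')"  # placeholder

def run_query(i):
    # each call opens its own clickhouse-client connection, like the real test
    proc = subprocess.run(
        ["clickhouse-client", "--host", HOST, "--query", QUERY],
        capture_output=True, text=True,
    )
    return proc.returncode

# 500 concurrent queries at a time, 4000 in total
with ThreadPoolExecutor(max_workers=500) as pool:
    results = list(pool.map(run_query, range(4000)))

print("failed queries:", sum(1 for rc in results if rc != 0))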

@dmthuc
Collaborator

dmthuc commented Mar 15, 2024

We have settings to limit the number of queries, such as max_concurrent_queries_for_all_users, but the capacity depends on the implementation. However, the issue of sending too many queries is different from the issue you described at the start; we are still interested in where so many HTTP connections come from.

@ozcelgozde
Contributor Author

A lot of the HTTP connections come from HTTP inserts. We have ingestion over HTTP, which at peak times would cause the same effect, I would think.

@dmthuc
Collaborator

dmthuc commented Mar 18, 2024

I see. I've tried sending many queries to reproduce this, and I also encountered the resource issue where threads can't be created. We are going to merge some MRs to improve the QPS, but I'm not sure it will change much. So I think you may have to try squashing multiple HTTP requests into one to reduce the QPS going into the system (a sketch of that follows).
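For example, instead of sending one HTTP request per row (or per small batch), the rows can be buffered on the client and sent as a single INSERT over the HTTP interface. A rough sketch, assuming the ClickHouse-compatible HTTP port 8123 and a placeholder table name:

import json
import urllib.request

HOST = "server-0.byconity"  # placeholder

def insert_batch(rows):
    # one HTTP request carrying all buffered rows as JSONEachRow
    body = "\n".join(json.dumps(r) for r in rows).encode()
    url = (f"http://{HOST}:8123/"
           "?query=INSERT%20INTO%20my_table%20FORMAT%20JSONEachRow")
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        resp.read()

# e.g. buffer 10 000 rows in memory, then send them in one request
rows = [{"ts": "2024-03-18 00:00:00", "msg": f"event {i}"} for i in range(10_000)]
insert_batch(rows)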

@dmthuc added the performance label and removed the bug label on Mar 18, 2024
@ozcelgozde
Contributor Author

Thank you @dmthuc for your help and interest!
