
Server becomes unresponsive because max_connections is reached during read #1282

Closed

ozcelgozde opened this issue Mar 5, 2024 · 21 comments

Labels: performance (The issue caused by the load is too high for the capacity of the system), question (Further information is requested)
@ozcelgozde
Contributor

Question

When I run a couple of SELECT queries that try to find rare terms on a column indexed by tokenbf, the server readiness probe starts to fail on Kubernetes because a lot of sockets are opened on the server side for RPC calls. The server also becomes unresponsive to all TCP calls.

[Screenshot 2024-03-05 at 20:18:30]

After a few minutes, they mostly recover. Here is the socket breakdown for server-0, for example:
1318 byconity-vw-vw-def
697 10-64-116-211.byco
180 ip-10-64-66-224.u
142 ip-10-64-231-57.u
77 ip-10-64-166-98.u
73 s3-us-west-2-r-w.
34 ip-10-64-231-57.us
28 ip-10-64-66-224.us
22 ip-10-64-166-98.us
So there are a lot of sockets opened to the read workers on the servers. Is there a way to limit this so the server stays functional? What causes new sockets to be opened to the read workers?

@ozcelgozde added the question label on Mar 5, 2024
@dmthuc
Collaborator

dmthuc commented Mar 6, 2024

Hi @ozcelgozde, thank you for reporting this issue. Let us investigate.

@dmthuc
Collaborator

dmthuc commented Mar 6, 2024

Hi @jenrryyou, please help us look into why there are so many brpc connections open.

@dmthuc
Collaborator

dmthuc commented Mar 6, 2024

Hi @ozcelgozde, our developers are not sure where this high number of RPC connections comes from. We think it does not come from brpc. To monitor rpc_socket_count, you can scrape it from http://pod_ip:rpc_port/vars, or from http://pod_ip:rpc_port/brpc_metrics for Prometheus. Can you make a graph and check whether the high number of RPC connections is really caused by brpc?
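A rough way to watch that counter over time is to poll the endpoint from a small script, for example (a sketch only; the pod IP, RPC port, and the exact "name : value" text format of /vars are assumptions you may need to adapt):

import re
import time
import urllib.request

POD_IP = "10.64.116.211"   # placeholder: your server pod IP
RPC_PORT = 8124            # placeholder: your rpc_port

def rpc_socket_count():
    # /vars (or /brpc_metrics for Prometheus format) exposes the brpc counters
    url = f"http://{POD_IP}:{RPC_PORT}/vars"
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode()
    m = re.search(r"rpc_socket_count\s*:\s*(\d+)", text)
    return int(m.group(1)) if m else None

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), rpc_socket_count())
        time.sleep(10)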

@ozcelgozde
Contributor Author

Actually, what you said looks correct: the problem seems to be coming from the TCP connections and not the RPC connections.
[Screenshot 2024-03-06 at 20:35:01]
Let me dig deeper.

@ozcelgozde
Contributor Author

I notice a lot of HTTP requests:
[Screenshot 2024-03-08 at 18:48:17]

This causes Poco to throw "Got exception while starting thread for connection. No thread available". From what I can trace through the code, this limit is set to 16 by default, and I couldn't find a way to change it. Do you have an idea?

@ozcelgozde
Contributor Author

A few samples:
2024.03.08 15:22:21.065071 [ 11817 ] {} Catalog: Finish set commit time for txn 448242930570231913, elapsed 8 ms.
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'

2024.03.08 15:22:21.173052 [ 7038 ] {} CnchLock: acquire 1 locks in 2 ms
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'

As you can see, it blocks everything :)

@dmthuc
Collaborator

dmthuc commented Mar 11, 2024

Hi @ozcelgozde, thank you for your cooperation. There should be no interaction between workers and servers via HTTP. I wonder if the HTTP connections come from liveness checks or from metrics scraping like Prometheus.

@dmthuc
Collaborator

dmthuc commented Mar 11, 2024

And may I know where you saw that it is set to 16?

@dmthuc
Collaborator

dmthuc commented Mar 11, 2024

And could you tell us how you found out that the large number of connections is HTTP? The method you used may have a mistake.

@kevinthfang added the bug label on Mar 11, 2024
@dmthuc
Collaborator

dmthuc commented Mar 11, 2024

mentioned in #1153

@ozcelgozde
Contributor Author

I'm not sure it's directly related, but I was able to make the server go into an unresponsive state by running queries from clickhouse-client. I'm also trying to find what causes this :) The metric I looked at for HTTP is cnch_current_metrics_http_connection.

@ozcelgozde
Contributor Author

I traced the 16 to contrib/poco/Foundation/src/ThreadPool.cpp:

PooledThread* ThreadPool::getThread()
{
	FastMutex::ScopedLock lock(_mutex);

	if (++_age == 32)
		housekeep();

	PooledThread* pThread = 0;
	for (ThreadVec::iterator it = _threads.begin(); !pThread && it != _threads.end(); ++it)
	{
		if ((*it)->idle())
			pThread = *it;
	}
	if (!pThread)
	{
		if (_threads.size() < _maxCapacity)
		{
			pThread = createThread();
			try
			{
				pThread->start();
				_threads.push_back(pThread);
			} catch (...)
			{
				delete pThread;
				throw;
			}
		}
		else
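			// The pool has already grown to _maxCapacity, so the connection is
			// rejected with "No thread available". Poco's default ThreadPool
			// constructor caps maxCapacity at 16; a server needs to pass a larger
			// capacity when constructing the pool (in ClickHouse-derived servers
			// this is typically sized from the max_connections setting) to avoid
			// hitting this limit.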
			throw NoThreadAvailableException();
	}
	pThread->activate();
	return pThread;
}

I couldn't find anywhere that we set the max capacity, and the default is 16, so I assumed that's the limit.

@dmthuc
Collaborator

dmthuc commented Mar 12, 2024

cnch_current_metrics_http_connection

I did not find this metric name in the ByConity source code. I wonder whether it derives from ProfileEvents::CreatedHTTPConnections. By the way, is your cluster using S3 storage? If so, I think the number of HTTP connections may be related to S3 storage.

@ozcelgozde
Contributor Author

Yes, I'm using S3.

@dmthuc
Collaborator

dmthuc commented Mar 14, 2024

S3 should not be related here, because the S3 client sends HTTP requests from the workers only. If you encounter the issue again, could you run the netstat command on the server to see the source of each connection and its corresponding port? That should be the way to identify whether the connections are HTTP or not. Maybe you can post the output of the netstat command here so we can analyse it together.

@dmthuc
Collaborator

dmthuc commented Mar 14, 2024

Here is a command I use with netstat to quickly count the number of HTTP connections:

$ netstat --tcp | grep "ESTABLISHED" | grep "18685"
tcp6       0      0 localhost:18685         localhost:31556         ESTABLISHED
tcp6       0      0 localhost:31556         localhost:18685         ESTABLISHED

My HTTP port is 18685, and you can see there are 2 sockets because I connected to the server locally.
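If it helps, here is a rough sketch (not ByConity-specific; adjust the port numbers to your config) that groups ESTABLISHED sockets by local port, so you can quickly see whether the connections land on the HTTP port, the native TCP port, or the RPC port:

import collections
import subprocess

# netstat data lines look like: tcp  0  0  local_addr:port  remote_addr:port  STATE
out = subprocess.run(["netstat", "--tcp", "-n"],
                     capture_output=True, text=True).stdout

counts = collections.Counter()
for line in out.splitlines():
    parts = line.split()
    if len(parts) >= 6 and parts[0].startswith("tcp") and parts[5] == "ESTABLISHED":
        local_port = parts[3].rsplit(":", 1)[-1]
        counts[local_port] += 1

for port, n in counts.most_common():
    print(f"local port {port}: {n} connections")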

@ozcelgozde
Contributor Author

I was finally able to reproduce it reliably by sending a lot of queries over clickhouse-client at once. I wrote a simple script that sends 500 simultaneous queries repeatedly, up to 4000 queries (see the sketch at the end of this comment). This, as expected, makes the transaction and TCP socket counts spike. netstat also shows a lot of TCP connections. A few errors I can see on the read and server replicas are:

Failed to query, err: couldn't query clickhouse, query id: 30c83f8c-e23b-11ee-892c-02c7f6dbdbda, got err: "code: 2010, message: Query [30c83f8c-e23b-11ee-892c-02c7f6dbdbda] failed with RootCause: SegmentId: 0, ErrorCode:2010, Message: Fail to call DB.Protos.RegistryService.registry, error code: 2001, msg: [E2001][10.170.161.98:123456789]Create stream ExchangeDataKey[448383008663142595_1_0_18446744073709551615] for query 30c83f8c-e23b-11ee-892c-02c7f6dbdbda failed by exception: Code: 2010, e.displayText() = DB::Exception: Interrput accept for ExchangeDataKey[448383008663142595_1_0_18446744073709551615] SQLSTATE: HY000 (version 21.8.7.1); \n AdditionalErrors: SegmentId: 1, ErrorCode:2010, Message: Worker host:10.170.146.66:8124, exception:Code: 2010, e.displayText() = DB::Exception: Fail to call DB.Protos.RegistryService.registry, error code: 1008, msg: [E1008]Reached timeout=10000ms @10.170.161.98:8124 SQLSTATE: HY000 (version 21.8.7.1)  SQLSTATE: HY000"
Failed to query, err: couldn't query clickhouse, query id: 30c81674-e23b-11ee-8918-02c7f6dbdbda, got err: "read: read tcp 10.255.20.155:59968->10.170.185.174:9000: i/o timeout"
Failed to query, err: couldn't query clickhouse, query id: a55e6764-e23a-11ee-b252-02c7f6dbdbda, got err: "read: EOF"

Is there a way to know what my simultaneous query limit over TCP is?
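For reference, the script was along these lines (a rough sketch rather than the exact script; the host and query are placeholders):

import subprocess
from concurrent.futures import ThreadPoolExecutor

HOST = "server-0.byconity"  # placeholder
QUERY = "SELECT count() FROM my_table WHERE hasToken(message, 'rare_term')"  # placeholder

def run_query(i):
    # each call opens its own clickhouse-client connection, like the real test
    proc = subprocess.run(
        ["clickhouse-client", "--host", HOST, "--query", QUERY],
        capture_output=True, text=True,
    )
    return proc.returncode

# 500 concurrent queries at a time, 4000 in total
with ThreadPoolExecutor(max_workers=500) as pool:
    results = list(pool.map(run_query, range(4000)))

print("failed queries:", sum(1 for rc in results if rc != 0))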

@dmthuc
Collaborator

dmthuc commented Mar 15, 2024

We have settings to limit the number of queries, such as max_concurrent_queries_for_all_users, but the capacity depends on the implementation. However, the issue of sending too many queries is different from the issue you described at the start; we are still interested in where so many HTTP connections come from.

@ozcelgozde
Contributor Author

A lot of the HTTP connections come from HTTP inserts. We have ingestion over HTTP, which at peak times would cause the same effect, I would think.

@dmthuc
Collaborator

dmthuc commented Mar 18, 2024

I see. I've tried sending many queries to reproduce this, and I also encountered the resource issue where threads can't be created. We are going to merge some MRs to improve the QPS, but I'm not sure it will change much. So I think you may have to try squashing multiple HTTP requests into one to reduce the QPS going into the system (a sketch of that follows).
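For example, instead of sending one HTTP request per row (or per small batch), the rows can be buffered on the client and sent as a single INSERT over the HTTP interface. A rough sketch, assuming the ClickHouse-compatible HTTP port 8123 and a placeholder table name:

import json
import urllib.request

HOST = "server-0.byconity"  # placeholder

def insert_batch(rows):
    # one HTTP request carrying all buffered rows as JSONEachRow
    body = "\n".join(json.dumps(r) for r in rows).encode()
    url = (f"http://{HOST}:8123/"
           "?query=INSERT%20INTO%20my_table%20FORMAT%20JSONEachRow")
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        resp.read()

# e.g. buffer 10 000 rows in memory, then send them in one request
rows = [{"ts": "2024-03-18 00:00:00", "msg": f"event {i}"} for i in range(10_000)]
insert_batch(rows)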

@dmthuc added the performance label and removed the bug label on Mar 18, 2024
@ozcelgozde
Contributor Author

Thank you @dmthuc for your help and interest!
