
byconity server getting stuck when multiple servers are used #1153

Closed
dogauzuncukoglu opened this issue Feb 5, 2024 · 7 comments
Labels: bug (Something isn't working), duplicate (This issue or pull request already exists)

dogauzuncukoglu (Contributor) commented Feb 5, 2024

Bug Report

Briefly describe the bug

We have observed this bug when using multiple servers. One server stops responding and goes into a deadlock-like state. When this happens, requests fail intermittently. The server eventually logs the errors below.

2024.01.30 10:26:03.569045 [ 1867 ] {} <Debug> TCPHandler: Done processing connection.
2024.01.30 10:43:26.899189 [ 1852 ] {} <Debug> MetaChecker: Start to run metadata synchronization task.
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'
(the last line is repeated many times)

Here are all the logs from the server when the error happened:
byconity_server_error.log

Roughly an hour later we observed the Got exception while starting thread for connection. Error code: 0, message: 'No thread available' logs shown above, and the server started working again.

The result you expected

If a server becomes unavailable for any reason, some tables are affected, which means that multiple servers do not make the system HA; they only limit the impact to the specific tasks the failing server was handling. We found the setting server_write_ha, but we are not sure whether setting it to true would solve our issue, or where we should set it when deploying to Kubernetes via Helm.

How to Reproduce

Sending a large number of insert requests in parallel over HTTP to the servers seems to put them into this state.

Version

cc4e467

dogauzuncukoglu added the bug (Something isn't working) label Feb 5, 2024
nudles (Collaborator) commented Feb 6, 2024

@dmthuc does server_write_ha help for this case?

dmthuc (Collaborator) commented Feb 7, 2024

Hi @dogauzuncukoglu, thank you for reporting the issue. The setting server_write_ha allows inserts to happen on a non-host server, and the setting enable_write_non_host_server controls whether a write request is redirected to the host server or not. We can use these settings to allow writes on a non-host server, but I don't think that would solve your problem. To solve it, we need to detect which server is unable to accept write queries and avoid sending write requests to that server. In my opinion this can be implemented with a proper readiness probe in Kubernetes that detects when a server is not ready to receive insert requests; then, when servers for insert requests are discovered via Kubernetes DNS, the server that is not ready will not show up.
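
Editor's note: a minimal sketch of how these two settings might be applied, assuming they can be toggled at the session or query level (the exact scope is not stated in this thread, so please check the ByConity settings reference; the values below are illustrative):

-- assumption: both settings are exposed as query/session-level settings
SET server_write_ha = 1;               -- per the note above: allow the insert to happen on a non-host server
SET enable_write_non_host_server = 1;  -- per the note above: controls redirecting the write to the host server

-- or per statement (hypothetical usage):
INSERT INTO ed.test_table (timestamp, test_keys, test_values)
SETTINGS enable_write_non_host_server = 1
VALUES ('2023-09-24 06:07:40.954', ['environment'], ['staging']);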

dmthuc (Collaborator) commented Feb 7, 2024

I think the solution can be improved further after discussing it with my colleagues, but most of them are on leave now. @Andygogo15, you can take a look when you have time.

dmthuc (Collaborator) commented Feb 7, 2024

Hi @dogauzuncukoglu, I think the correct solution for your case is to send insert requests directly to the worker. It will reduce the load on the server. When sending an insert request to the worker you need to use the setting prefer_cnch_catalog. Please refer to the example in https://github.com/ByConity/ByConity/blob/master/tests/queries/4_cnch_stateless_no_tenant/50010_direct_insert.sh
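
Editor's note: a minimal sketch of such a direct insert, using the reporter's example table and insert statement from later in this thread. It assumes the statement is sent to a write worker's endpoint instead of the server and that prefer_cnch_catalog is accepted as a query-level SETTINGS clause; see the linked test for the exact usage:

-- sent to a write worker, not the server
INSERT INTO ed.test_table (timestamp, test_keys, test_values)
SETTINGS prefer_cnch_catalog = 1
VALUES ('2023-09-24 06:07:40.954', ['environment', 'non-mat-columns'], ['staging', 'columnsnsns']);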

dogauzuncukoglu (Contributor, Author) commented Feb 7, 2024

@dmthuc thank you very much for the informative answer. It helps a lot.

To give more context about the issue, we were previously running into an issue with materialized views documented here: #774

To work around that issue we were manually selecting the materialized columns in the materialized views. This works when we send the insert request to the server, but it gives an error when the same request is sent directly to the write worker.

500 Internal Server Error, Error: Code: 44, e.displayText() = DB::Exception: Cannot insert column k8s.namespace.name, because it is MATERIALIZED column.

For context, this is roughly the minimal setup:

CREATE TABLE ed.test_table
(
    `timestamp` DateTime64(3) CODEC(Delta(8), ZSTD(1)),
    `test_keys` Array(LowCardinality(String)) CODEC(ZSTD(1)),
    `test_values` Array(String) CODEC(ZSTD(1)),
    `test.column` LowCardinality(String) MATERIALIZED test_values[indexOf(test_keys, 'environment')] CODEC(LZ4),
    `id` UUID DEFAULT generateUUIDv4()
)
ENGINE = CnchMergeTree
ORDER BY timestamp
SETTINGS storage_policy = 'cnch_default_s3', index_granularity = 8192


CREATE TABLE ed.test_table_samp_10
(
    `timestamp` DateTime64(3) CODEC(Delta(8), ZSTD(1)),
    `test_keys` Array(LowCardinality(String)) CODEC(ZSTD(1)),
    `test_values` Array(String) CODEC(ZSTD(1)),
    `test.column` LowCardinality(String) MATERIALIZED test_values[indexOf(test_keys, 'environment')] CODEC(LZ4),
    `id` UUID DEFAULT generateUUIDv4()
)
ENGINE = CnchMergeTree
ORDER BY timestamp
SETTINGS storage_policy = 'cnch_default_s3', index_granularity = 8192


CREATE MATERIALIZED VIEW ed.test_mv TO ed.test_table_samp_10 AS WITH exp2(64) - 1 AS MAX_UINT64 SELECT * FROM ed.test_table WHERE cityHash64(id) < (MAX_UINT64 / 10)

Assume the materialized view ed.test_mv uses SELECT timestamp, test_keys, test_values, test.column, id instead of SELECT * (see the sketch after this paragraph).

In this case, if you send the write request to the server it works, but if you send it directly to the write worker it gives the error I pasted above.
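
Editor's note: for reference, the explicit-column variant of the view described above would look roughly like this (a sketch derived from the DDL in this comment):

CREATE MATERIALIZED VIEW ed.test_mv TO ed.test_table_samp_10 AS
WITH exp2(64) - 1 AS MAX_UINT64
SELECT timestamp, test_keys, test_values, `test.column`, id
FROM ed.test_table
WHERE cityHash64(id) < (MAX_UINT64 / 10)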

Example insert request

insert into ed.test_table (timestamp, test_keys, test_values) VALUES ('2023-09-24 06:07:40.954',['environment','non-mat-columns'], ['staging', 'columnsnsns']);

dmthuc (Collaborator) commented Feb 7, 2024

Hi @dogauzuncukoglu, ByConity servers maintain the Topology, which is an array of working servers. When an insert request is sent to a server, it calculates, based on the target table of the insert, the host server for that table and forwards the insert query to that host server. The host server is one of the servers in the Topology. The Topology is maintained by periodically scanning via Kubernetes DNS, so when a server is marked not ready by Kubernetes, it is removed from the Topology. So I think a solution for your case is that you don't need to change any settings; you just need to write a suitable readiness probe. For example, the readiness probe could query the system table system.processes and count the number of insert queries currently running on the server. If this count is above a threshold, the readiness probe returns false. That way we can temporarily remove a server with a high number of insert requests from the Topology.
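
Editor's note: a minimal sketch of the query such a readiness probe might run against the server it is checking; the filter and the threshold are illustrative assumptions, not values from this thread:

-- count INSERT queries currently running on this server;
-- the probe script would fail readiness when the count exceeds a chosen threshold (e.g. 50)
SELECT count() AS running_inserts
FROM system.processes
WHERE query ILIKE 'INSERT%';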

dogauzuncukoglu (Contributor, Author) commented

@dmthuc thanks for the suggestion. It makes sense to me; let me try it and see what happens.
