
byconity server getting stuck when multiple servers are used #1153

Closed
dogauzuncukoglu opened this issue Feb 5, 2024 · 7 comments
Labels: bug (Something isn't working), duplicate (This issue or pull request already exists)

dogauzuncukoglu (Contributor) commented Feb 5, 2024

Bug Report

Briefly describe the bug

We have observed this bug when using multiple servers. One server stops responding and goes into a deadlock-like state. When this happens, requests fail intermittently. The server eventually logs the errors below.

2024.01.30 10:26:03.569045 [ 1867 ] {} <Debug> TCPHandler: Done processing connection.
2024.01.30 10:43:26.899189 [ 1852 ] {} <Debug> MetaChecker: Start to run metadata synchronization task.
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'
(the last line is repeated many times)

Here are all the logs from the server when the error happened:
byconity_server_error.log

Roughly an hour later we observed the Got exception while starting thread for connection. Error code: 0, message: 'No thread available' logs shown above, and the server started working again.

The result you expected

If a server becomes unavailable for any reason, some tables are affected, which means that multiple servers do not make the system HA; they only limit the impact to the specific tasks the failing server was handling. We found the setting server_write_ha, but we are not sure whether setting it to true would solve our issue, or where we should set it when deploying to Kubernetes via Helm.

How to Reproduce

Sending a large number of insert requests in parallel over HTTP to the servers seems to put them into this state.

Version

cc4e467

dogauzuncukoglu added the bug (Something isn't working) label Feb 5, 2024
nudles (Collaborator) commented Feb 6, 2024

@dmthuc does server_write_ha help for this case?

dmthuc (Collaborator) commented Feb 7, 2024

Hi @dogauzuncukoglu, thank you for reporting the issue. The setting server_write_ha allows inserts to happen on a non-host server, and the setting enable_write_non_host_server controls whether a write request is redirected to the host server or not. We can use these settings to allow writes on a non-host server, but I don't think that would solve your problem. To solve it, we need to detect which server is unable to accept write queries and avoid sending write requests to that server. In my opinion this can be implemented with a proper readiness probe in Kubernetes that detects when a server is not ready to receive insert requests; then, when servers for insert requests are discovered via Kubernetes DNS, the server that is not ready will not show up.
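
Editor's note: a minimal sketch of how these two settings might be applied, assuming they can be toggled at the session or query level (the exact scope is not stated in this thread, so please check the ByConity settings reference; the values below are illustrative):

-- assumption: both settings are exposed as query/session-level settings
SET server_write_ha = 1;               -- per the note above: allow the insert to happen on a non-host server
SET enable_write_non_host_server = 1;  -- per the note above: controls redirecting the write to the host server

-- or per statement (hypothetical usage):
INSERT INTO ed.test_table (timestamp, test_keys, test_values)
SETTINGS enable_write_non_host_server = 1
VALUES ('2023-09-24 06:07:40.954', ['environment'], ['staging']);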

dmthuc (Collaborator) commented Feb 7, 2024

I think the solution can be improved further after discussing it with my colleagues, but most of them are on leave now. @Andygogo15, you can take a look when you have time.

dmthuc (Collaborator) commented Feb 7, 2024

Hi @dogauzuncukoglu, I think the correct solution for your case is to send insert requests directly to the worker. It will reduce the load on the server. When sending an insert request to the worker you need to use the setting prefer_cnch_catalog. Please refer to the example in https://github.com/ByConity/ByConity/blob/master/tests/queries/4_cnch_stateless_no_tenant/50010_direct_insert.sh
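
Editor's note: a minimal sketch of such a direct insert, using the reporter's example table and insert statement from later in this thread. It assumes the statement is sent to a write worker's endpoint instead of the server and that prefer_cnch_catalog is accepted as a query-level SETTINGS clause; see the linked test for the exact usage:

-- sent to a write worker, not the server
INSERT INTO ed.test_table (timestamp, test_keys, test_values)
SETTINGS prefer_cnch_catalog = 1
VALUES ('2023-09-24 06:07:40.954', ['environment', 'non-mat-columns'], ['staging', 'columnsnsns']);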

dogauzuncukoglu (Contributor, Author) commented Feb 7, 2024

@dmthuc thank you very much for the informative answer. It helps a lot.

To give more context about the issue, we were previously running into an issue with materialized views documented here: #774

To work around that issue we were manually selecting the materialized columns in the materialized views. This works when we send the insert request to the server, but it gives an error when the same request is sent directly to the write worker.

500 Internal Server Error, Error: Code: 44, e.displayText() = DB::Exception: Cannot insert column k8s.namespace.name, because it is MATERIALIZED column.

For context, this is roughly the minimal setup:

CREATE TABLE ed.test_table
(
    `timestamp` DateTime64(3) CODEC(Delta(8), ZSTD(1)),
    `test_keys` Array(LowCardinality(String)) CODEC(ZSTD(1)),
    `test_values` Array(String) CODEC(ZSTD(1)),
    `test.column` LowCardinality(String) MATERIALIZED test_values[indexOf(test_keys, 'environment')] CODEC(LZ4),
    `id` UUID DEFAULT generateUUIDv4()
)
ENGINE = CnchMergeTree
ORDER BY timestamp
SETTINGS storage_policy = 'cnch_default_s3', index_granularity = 8192


CREATE TABLE ed.test_table_samp_10
(
    `timestamp` DateTime64(3) CODEC(Delta(8), ZSTD(1)),
    `test_keys` Array(LowCardinality(String)) CODEC(ZSTD(1)),
    `test_values` Array(String) CODEC(ZSTD(1)),
    `test.column` LowCardinality(String) MATERIALIZED test_values[indexOf(test_keys, 'environment')] CODEC(LZ4),
    `id` UUID DEFAULT generateUUIDv4()
)
ENGINE = CnchMergeTree
ORDER BY timestamp
SETTINGS storage_policy = 'cnch_default_s3', index_granularity = 8192


CREATE MATERIALIZED VIEW ed.test_mv TO ed.test_table_samp_10 AS WITH exp2(64) - 1 AS MAX_UINT64 SELECT * FROM ed.test_table WHERE cityHash64(id) < (MAX_UINT64 / 10)

Assume the materialized view ed.test_mv uses SELECT timestamp, test_keys, test_values, test.column, id instead of SELECT * (see the sketch after this paragraph).

In this case, if you send the write request to the server it works, but if you send it directly to the write worker it gives the error I pasted above.
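
Editor's note: for reference, the explicit-column variant of the view described above would look roughly like this (a sketch derived from the DDL in this comment):

CREATE MATERIALIZED VIEW ed.test_mv TO ed.test_table_samp_10 AS
WITH exp2(64) - 1 AS MAX_UINT64
SELECT timestamp, test_keys, test_values, `test.column`, id
FROM ed.test_table
WHERE cityHash64(id) < (MAX_UINT64 / 10)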

Example insert request

insert into ed.test_table (timestamp, test_keys, test_values) VALUES ('2023-09-24 06:07:40.954',['environment','non-mat-columns'], ['staging', 'columnsnsns']);

dmthuc (Collaborator) commented Feb 7, 2024

Hi @dogauzuncukoglu, ByConity servers maintain the Topology, which is an array of working servers. When an insert request is sent to a server, it calculates, based on the target table of the insert, the host server for that table and forwards the insert query to that host server. The host server is one of the servers in the Topology. The Topology is maintained by periodically scanning via Kubernetes DNS, so when a server is marked not ready by Kubernetes, it is removed from the Topology. So I think a solution for your case is that you don't need to change any settings; you just need to write a suitable readiness probe. For example, the readiness probe could query the system table system.processes and count the number of insert queries currently running on the server. If this count is above a threshold, the readiness probe returns false. That way we can temporarily remove a server with a high number of insert requests from the Topology.
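
Editor's note: a minimal sketch of the query such a readiness probe might run against the server it is checking; the filter and the threshold are illustrative assumptions, not values from this thread:

-- count INSERT queries currently running on this server;
-- the probe script would fail readiness when the count exceeds a chosen threshold (e.g. 50)
SELECT count() AS running_inserts
FROM system.processes
WHERE query ILIKE 'INSERT%';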

dogauzuncukoglu (Contributor, Author) commented

@dmthuc thanks for the suggestion. It makes sense to me; let me try it and see what happens.
