Retry query on another node if execution status != 200 #115

Open
wlp7s0 opened this issue Apr 23, 2021 · 4 comments

Comments

@wlp7s0

wlp7s0 commented Apr 23, 2021

Hello, I'm currently testing chproxy in a test environment and I have a question about query execution.
Let's say I have 2 nodes with replication and 4 ZooKeeper instances, with 1 chproxy balancing read/write queries between the two nodes.
I also have a stream of data going from dozens of servers to chproxy.
I have configured a health check that selects a specific path in the replicated table, to make sure that both nodes have the table and the database itself.
However, in my test environment I removed access to ZooKeeper from one of the nodes, which rendered the database on that node read-only, and the health check SELECT didn't mark the node as faulty. At the same time, all INSERT requests to the read-only node failed with error code 500, and all failed INSERT requests were lost.
Using /metrics I can see that chproxy can check the query execution status, but I can't see any way to execute the failed query on another node if the response status from the node was not 200, or maybe to store the failed queries for manual recovery.
Am I missing something?
Thanks!
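
For readers hitting the same situation: a plain SELECT still succeeds on a read-only replica, so a health check based on it will not notice the problem. One option is to query `system.replicas`, whose `is_readonly` column flags replicated tables that are in read-only mode. The Go sketch below is only an illustration under assumptions: it talks to the ClickHouse HTTP interface directly, `readOnlyReplicas` and `healthCheckURL` are hypothetical helper names, and the address `http://localhost:8123` is a placeholder. It is not part of chproxy.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"strings"
	"time"
)

// readOnlyQuery counts replicated tables that are currently read-only on a node.
const readOnlyQuery = "SELECT count() FROM system.replicas WHERE is_readonly"

// healthCheckURL builds the HTTP-interface URL that runs readOnlyQuery on a node.
func healthCheckURL(base string) string {
	return strings.TrimRight(base, "/") + "/?query=" + url.QueryEscape(readOnlyQuery)
}

// parseCount parses the plain-text body returned for a single count() value.
func parseCount(body string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(body))
}

// readOnlyReplicas asks one node how many of its replicated tables are read-only.
func readOnlyReplicas(client *http.Client, base string) (int, error) {
	resp, err := client.Get(healthCheckURL(base))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("status %d: %s", resp.StatusCode, strings.TrimSpace(string(body)))
	}
	return parseCount(string(body))
}

func main() {
	client := &http.Client{Timeout: 3 * time.Second}
	// Placeholder address: replace with the node under test.
	n, err := readOnlyReplicas(client, "http://localhost:8123")
	if err != nil {
		fmt.Println("node unhealthy:", err)
		return
	}
	fmt.Println("read-only replicated tables:", n)
}
```

A node that reports a non-zero count would be a candidate for exclusion from INSERT traffic, whichever component performs the check.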

@gontarzpawel
Contributor

Hello @wlp7s0, I'll try to reproduce it.
I'd advise you to add a retry strategy on the client side and to put a message bus in front of your insertion services, so that you are resilient to ClickHouse downtime.
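
The client-side retry suggested here could look roughly like the following Go sketch. It is only an illustration under assumptions: `retry` and `insert` are hypothetical helpers, the endpoint `http://localhost:9090` and the table `events` are placeholders, and any response other than 200 is treated as a failure. When all attempts fail, the caller still holds the batch and can store it for manual recovery.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// errNotOK marks a response whose status was not 200.
var errNotOK = errors.New("non-200 response")

// retry calls op up to attempts times (attempts must be >= 1),
// doubling the wait between consecutive tries.
func retry(attempts int, backoff time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(backoff)
			backoff *= 2
		}
		if err = op(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// insert posts one batch of rows and fails on any status other than 200.
func insert(client *http.Client, endpoint string, rows []byte) error {
	resp, err := client.Post(endpoint, "text/plain", bytes.NewReader(rows))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%w: status %d", errNotOK, resp.StatusCode)
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	// Placeholder proxy address and table name.
	endpoint := "http://localhost:9090/?query=" + url.QueryEscape("INSERT INTO events FORMAT TSV")
	rows := []byte("1\tfoo\n")

	err := retry(3, 500*time.Millisecond, func() error {
		return insert(client, endpoint, rows)
	})
	if err != nil {
		// The batch is still in memory here; persist it instead of dropping it.
		fmt.Println("insert failed, keeping batch for recovery:", err)
	}
}
```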

@gontarzpawel
Contributor

Hi @wlp7s0,

I performed the following test scenario:

  • set up a ClickHouse cluster consisting of 4 nodes
  • chproxy targets that cluster; all 4 nodes are marked as healthy
  • manually kill one node
  • chproxy correctly marked the killed node as unhealthy
  • chproxy excluded it from the list of available nodes

I failed to reproduce the scenario you described. Could you please provide steps to reproduce it?

@ranjbaryshahab

ranjbaryshahab commented Jan 5, 2023

Hello @gontarzpawel
What about another scenario, such as status code 404?
For example, I have 3 nodes and 2 tables [A, B].
Table A is replicated and exists on all nodes; table B is not replicated and exists on only one node.
When I execute "select * from B", I sometimes get the exception: Table B doesn't exist. (UNKNOWN_TABLE)
Is there any way to make chproxy retry on other nodes when a table doesn't exist?
Also, I changed this line

if rw.StatusCode() == http.StatusBadGateway {

to "if rw.StatusCode() != http.StatusOK"
but it still doesn't work.

@mga-chka
Collaborator

mga-chka commented Jan 7, 2023

IMHO, in this situation you should fix your ClickHouse config or rewrite your query to specify the server that contains table B using the remote syntax: https://clickhouse.com/docs/en/sql-reference/table-functions/remote/

Regarding retries, we looked at the error codes returned by ClickHouse and decided to retry only when it makes sense (i.e. when a retry can make the failed query work). If we allowed a retry on 404, every time someone made a mistake the query would be retried even though it cannot succeed, which would slow down the query response time.
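
The reasoning above, retrying only when another node could plausibly succeed, can be sketched as a small status classifier in Go. This is an illustrative policy and not chproxy's actual code: `retryable` is a hypothetical function, 502 comes from the condition quoted earlier in this thread, and treating 503 and 504 the same way is an assumption made for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryable reports whether sending the same query to another node
// could plausibly succeed. Errors caused by the query itself, such as
// a missing table (404), fail the same way everywhere and are not retried.
func retryable(status int) bool {
	switch status {
	case http.StatusBadGateway, // node unreachable behind the proxy
		http.StatusServiceUnavailable, // assumption: node temporarily unavailable
		http.StatusGatewayTimeout: // assumption: node did not answer in time
		return true
	default:
		return false
	}
}

func main() {
	for _, status := range []int{
		http.StatusOK,
		http.StatusNotFound,
		http.StatusInternalServerError,
		http.StatusBadGateway,
	} {
		fmt.Printf("status %d retryable: %v\n", status, retryable(status))
	}
}
```

Under this policy a 500 from a read-only replica is not retried either; whether it should be depends on how reliably such failures can be told apart from genuine query errors.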
