
Single-replica shard restart updates remote_servers and leads to undesirable behaviour #1363

Open
cpg314 opened this issue Mar 5, 2024 · 2 comments


cpg314 commented Mar 5, 2024

In a deployment with multiple shards and a single replica each, I seem to have observed the following:

  1. One of the shards has its pod restarting (in my case because ClickHouse segfaults).
  2. The operator changes the remote_servers configuration to remove that server. [*]
  3. Until the shard is re-added, INSERTs on distributed tables will distribute the data across shards differently (the sharding expression is taken modulo the number of remaining shards; see the sketch after this list), which can be undesirable when trying to enforce shard-locality of data for distributed joins. In the worst case, this can lead to local queries unexpectedly giving different results than distributed ones.
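
For illustration, the operator-generated remote_servers file looks roughly like the following (a sketch only; the cluster and host names are placeholders, not my actual deployment). If the second shard is dropped from this list, a Distributed table evaluates its sharding expression modulo 2 instead of modulo 3, so new rows land on different shards than before:

```xml
<!-- /etc/clickhouse-server/config.d/chop-generated-remote_servers.xml (sketch) -->
<yandex>
    <remote_servers>
        <my-cluster>
            <shard>  <!-- shard 0 -->
                <replica><host>chi-my-chi-my-cluster-0-0</host><port>9000</port></replica>
            </shard>
            <shard>  <!-- shard 1: dropped while its pod is not ready -->
                <replica><host>chi-my-chi-my-cluster-1-0</host><port>9000</port></replica>
            </shard>
            <shard>  <!-- shard 2 -->
                <replica><host>chi-my-chi-my-cluster-2-0</host><port>9000</port></replica>
            </shard>
        </my-cluster>
    </remote_servers>
</yandex>
```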

I have described the issue in more detail in ClickHouse/ClickHouse#60219, where people suggested this was more likely an issue with the operator.

The workaround I described there is to override the remote_servers configuration so that the shards are not removed. Instead of the data being inserted into the "wrong" shard, the insert fails.
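
Concretely, the override can be expressed along these lines (a sketch only: the CHI name, cluster name, host names and the use of spec.configuration.files are illustrative, not my exact manifest). The statically defined cluster is then referenced by the Distributed tables, so a restarting shard stays in the list and inserts targeting it fail instead of being rerouted:

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "my-chi"
spec:
  configuration:
    files:
      # Extra config file mounted next to the operator-generated ones.
      config.d/static_remote_servers.xml: |
        <yandex>
          <remote_servers>
            <my-cluster-static>
              <shard>
                <replica><host>chi-my-chi-my-cluster-0-0</host><port>9000</port></replica>
              </shard>
              <shard>
                <replica><host>chi-my-chi-my-cluster-1-0</host><port>9000</port></replica>
              </shard>
              <shard>
                <replica><host>chi-my-chi-my-cluster-2-0</host><port>9000</port></replica>
              </shard>
            </my-cluster-static>
          </remote_servers>
        </yandex>
```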

[*] I have not been able to observe the /etc/clickhouse-server/config.d/chop-generated-remote_servers.xml configuration changing when I manually force a pod to restart by killing the process, but the fact that the workaround seems to solve the issue hints that this is what happens.

Is the operator indeed removing servers whose replicas all have pods in a non-ready state?
If so, it would probably be a good idea to make this behavior optional: as long as the pod exists (even if it is currently restarting), the shard should not be removed from remote_servers.

@alex-zaitsev
Member

@cpg314, this was done intentionally. If a pod is being recreated (which should not be caused by ClickHouse restarts, though), it is not resolvable in DNS. ClickHouse fails to work in this case: distributed queries fail, even startup may fail, and skip_unavailable_shards does not help.

On the other hand, we are currently using Kubernetes Services for the cluster configuration, so it is probably not as visible anymore. Adding an option is a good idea, thanks.


cpg314 commented Mar 10, 2024

Thanks for the details!
Yes, in my use case, where I want to guarantee that data ends up on the right shard for distributed JOINs, having queries fail is better than having them succeed on the remaining shards. In my client-side handling, I just retry these queries until the shard is back online.
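
For completeness, that client-side handling is essentially a retry loop along these lines (a minimal Python sketch; run_insert stands for whatever actually performs the INSERT with your ClickHouse client, it is not a real API):

```python
import time

def insert_with_retry(run_insert, max_attempts=30, delay_s=10.0):
    """Retry an INSERT until the target shard is reachable again.

    run_insert is any callable that performs the INSERT and raises on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_insert()
        except Exception as exc:  # narrow this to your client's exception type
            if attempt == max_attempts:
                raise
            print(f"INSERT failed ({exc}); retrying in {delay_s}s ({attempt}/{max_attempts})")
            time.sleep(delay_s)
```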
