TiKV should delay request snapshot if it exceeds receiving snapshot limit #15972
Hi @overvenus, is this open to contribution?
@zzl-7 Yes, TiKV is always open to contributions if the issue is not assigned yet.
/cc @hbisheng |
ti-chi-bot added a commit that referenced this issue on May 20, 2024:
ref #15972

This commit starts to introduce a snapshot precheck mechanism to reduce unnecessary snapshot drops and generations. The mechanism functions as follows: before a leader sends a snapshot to a follower, it first sends a precheck message. This message serves as a preliminary inquiry to the follower, seeking confirmation of its readiness to receive a snapshot. Upon receiving the message, the follower consults its concurrency limiter and returns a response to the leader. The leader will only proceed to generate the snapshot after the precheck is passed.

A passed precheck means the leader has reserved a spot on the follower, so the subsequent snapshot send should succeed. The reservation has a TTL, and the leader is supposed to complete the snapshot generation and transmission within the TTL timeframe.

Note that this commit implements the concurrency limiter without actually using it. A follow-up commit will update the snapshot sending process and make use of this concurrency limiter.

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Co-authored-by: Neil Shen <overvenus@gmail.com>
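The TTL-based reservation described above can be illustrated with a small standalone sketch. This is not TiKV's actual code: the `SnapRecvLimiter` type and `try_reserve` method are hypothetical names, and the real limiter lives in TiKV's snapshot transport layer.

```rust
// A minimal sketch of a receiver-side snapshot concurrency limiter with
// TTL-based reservations; all names here are illustrative assumptions.
use std::time::{Duration, Instant};

struct SnapRecvLimiter {
    limit: usize,
    ttl: Duration,
    // Expiry time of each outstanding reservation.
    reservations: Vec<Instant>,
}

impl SnapRecvLimiter {
    fn new(limit: usize, ttl: Duration) -> Self {
        SnapRecvLimiter { limit, ttl, reservations: Vec::new() }
    }

    /// Called when a precheck message arrives. Returns true (and reserves
    /// a slot) if the follower can accept another snapshot.
    fn try_reserve(&mut self, now: Instant) -> bool {
        // Drop reservations whose TTL has elapsed: the leader failed to
        // send the snapshot in time, so the slot is reclaimed.
        self.reservations.retain(|&expires| expires > now);
        if self.reservations.len() < self.limit {
            self.reservations.push(now + self.ttl);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut limiter = SnapRecvLimiter::new(2, Duration::from_secs(60));
    let now = Instant::now();
    assert!(limiter.try_reserve(now));
    assert!(limiter.try_reserve(now));
    // A third precheck is rejected until a reservation expires or is used.
    assert!(!limiter.try_reserve(now));
}
```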
ti-chi-bot added a commit that referenced this issue on May 23, 2024:
ref #15972

An issue with the previous implementation of the concurrency limiter is that its APIs are not idempotent. If a single region leader keeps sending precheck requests to the same receiver (this may happen if the precheck response is somehow lost on the network), it would consume all reservations, thereby blocking all other snapshots.

This commit addresses this problem by updating the concurrency limiter to deduplicate requests based on `region_id`. If the limiter already has a valid reservation for a region, calling try_recv() again will not allocate a new one.

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
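A minimal sketch of the deduplication idea, keying reservations by `region_id` so that a retried precheck reuses the existing slot instead of allocating a new one. The types are illustrative stand-ins; only the `try_recv()` method name mirrors the commit message.

```rust
// Sketch: an idempotent limiter keyed by region_id, under the assumption
// that a repeated precheck should refresh, not duplicate, a reservation.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct SnapRecvLimiter {
    limit: usize,
    ttl: Duration,
    reservations: HashMap<u64, Instant>, // region_id -> expiry
}

impl SnapRecvLimiter {
    fn try_recv(&mut self, region_id: u64, now: Instant) -> bool {
        // Reclaim expired reservations first.
        self.reservations.retain(|_, expires| *expires > now);
        if let Some(expires) = self.reservations.get_mut(&region_id) {
            // Idempotent path: an existing valid reservation is refreshed,
            // so retried prechecks from one region cannot exhaust all slots.
            *expires = now + self.ttl;
            return true;
        }
        if self.reservations.len() < self.limit {
            self.reservations.insert(region_id, now + self.ttl);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut limiter = SnapRecvLimiter {
        limit: 1,
        ttl: Duration::from_secs(60),
        reservations: HashMap::new(),
    };
    let now = Instant::now();
    assert!(limiter.try_recv(1, now));
    // A retry from region 1 (e.g. after a lost response) still succeeds
    // and consumes no extra slot.
    assert!(limiter.try_recv(1, now));
    // A different region is rejected because the single slot is taken.
    assert!(!limiter.try_recv(2, now));
}
```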
ti-chi-bot added a commit that referenced this issue on Jun 12, 2024:
…17019)

close #15972

This commit implements a snapshot precheck process that leverages the snapshot concurrency limiter introduced in #17015. The leader only proceeds to generate the snapshot after it receives a precheck-success message from the follower. The precheck request and response are sent via the raft `ExtraMessage`, for which two new message types are added in kvproto.

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
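The leader-side flow can be sketched as a small state machine. The enum below stands in for the two new kvproto `ExtraMessage` types; their actual names and fields are not given in the commit message, so everything here is an assumption for illustration.

```rust
// Hedged sketch of the leader's precheck-before-generate flow.
#[derive(Debug)]
enum SnapPrecheck {
    Request { region_id: u64 },
    Response { region_id: u64, passed: bool },
}

#[derive(Debug, PartialEq)]
enum LeaderSnapState {
    Idle,
    PrecheckSent,
    Generating,
}

struct LeaderSnapFsm {
    region_id: u64,
    state: LeaderSnapState,
}

impl LeaderSnapFsm {
    /// Instead of generating the snapshot immediately, the leader first
    /// emits a precheck request to the follower.
    fn start(&mut self) -> SnapPrecheck {
        self.state = LeaderSnapState::PrecheckSent;
        SnapPrecheck::Request { region_id: self.region_id }
    }

    /// Only a passed precheck moves the leader to snapshot generation;
    /// a rejection leaves it idle to retry later, avoiding a snapshot
    /// that the follower would have dropped anyway.
    fn on_response(&mut self, msg: SnapPrecheck) {
        if let SnapPrecheck::Response { region_id, passed } = msg {
            if region_id == self.region_id && self.state == LeaderSnapState::PrecheckSent {
                self.state = if passed {
                    LeaderSnapState::Generating
                } else {
                    LeaderSnapState::Idle
                };
            }
        }
    }
}

fn main() {
    let mut fsm = LeaderSnapFsm { region_id: 1, state: LeaderSnapState::Idle };
    let _req = fsm.start();
    fsm.on_response(SnapPrecheck::Response { region_id: 1, passed: true });
    assert_eq!(fsm.state, LeaderSnapState::Generating);
}
```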
Bug Report
From the logs below, we found that leaders fail to send snapshots because their receive ends have reached their snapshot task limits.
Snapshots generated by leaders can consume a lot of disk space, and may be wasted if the leaders compact their logs before the snapshots are sent.
It would be better to let followers delay requesting snapshots if they find the snapshot limit has been reached, as sketched below.
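A rough sketch of what the proposed follower-side behavior could look like: before requesting a snapshot, the follower checks the local receive limit and backs off when it is saturated. The names and the fixed backoff policy are assumptions for illustration, not TiKV's implementation.

```rust
// Sketch: delay snapshot requests while the receive limit is saturated,
// instead of letting the leader generate a snapshot that will be dropped.
use std::time::{Duration, Instant};

struct SnapshotRequester {
    recv_limit: usize,
    next_allowed: Instant,
    backoff: Duration, // hypothetical fixed backoff between retries
}

impl SnapshotRequester {
    /// Returns true if a snapshot request should be sent now; otherwise
    /// the request is delayed until a later tick.
    fn should_request(&mut self, active_recv_tasks: usize, now: Instant) -> bool {
        if now < self.next_allowed {
            return false; // still backing off from a previous rejection
        }
        if active_recv_tasks >= self.recv_limit {
            // The store cannot accept another snapshot right now: delay.
            self.next_allowed = now + self.backoff;
            return false;
        }
        true
    }
}

fn main() {
    let now = Instant::now();
    let mut requester = SnapshotRequester {
        recv_limit: 4,
        next_allowed: now,
        backoff: Duration::from_secs(5),
    };
    assert!(!requester.should_request(4, now)); // limit reached: delay
    assert!(requester.should_request(2, now + Duration::from_secs(6)));
}
```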
TiKV Logs
tikv-20 logged many snapshot send failures.
tikv-21 had reached its snapshot receive limit.
What version of TiKV are you using?
v7.4.0
Steps to reproduce
Scale in, scale out, or other scenarios that involve a lot of region balancing.
What did you expect?
No "too many recving snapshot tasks" logs.
What happened?
Lots of "too many recving snapshot tasks" logs and wasted snapshots.