Massive unnecessary Raft log rollback causes spike latency. #5881

Open · UndefinedSy opened this issue May 11, 2024 · 1 comment
Labels: affects/none · severity/none · type/bug

Comments

@UndefinedSy

Please check the FAQ documentation before raising an issue

Describe the bug (required)
I've encountered a scenario with spike latency lasting for seconds: graphd's logs show that requests to a certain storaged host time out. Looking into that storaged's logs, I found that many partitions (possibly all the partitions on which this host is a follower) have encountered a RaftLog rollback.

However, there are no logs indicating leader re-election or leader change, so the rollback should not be caused by a log inconsistency.

Your Environments (required)
A private branch forked from the master branch a while ago, but the related code looks the same as on the master branch.

How To Reproduce (required)
No reliable way to reproduce it; it happens occasionally.
In my case, it happens when the storaged is under heavy load from a write pressure test.

Expected behavior
It should not trigger massive RaftLog rollbacks and make the storaged unresponsive for seconds.

Additional context
I've taken a look at the RaftPart implementation and have some thoughts about the issue.

  1. There are massive numbers of follower Raft parts doing rollback, and the leaders of these parts have not changed, so this is likely not caused by a log inconsistency.
  2. From reading the source code, my understanding is that the following case can cause the rollback:
     1. A previous AppendLog was sent to the follower with LogEntries [100, 103], commitId 99.
     2. Due to the network or something else, the response was lost or did not return to the leader in time; as a result, the leader's `last_log_id_sent` is still 99.
     3. The next time, the leader sends AppendLog to the follower with LogEntries [100, 110], commitId 99.
     4. At this point, although [100, 103] in the local WAL file is consistent, a rollback (rollbackToLog) will still be triggered.

The corresponding code is shown in the screenshot below:
[screenshot: the corresponding rollback code]
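To spell out the suspected behaviour in code, here is a minimal, hypothetical C++ sketch; the names (`FollowerWal`, `appendLogsCurrent`) are made up for illustration and do not match the real RaftPart / FileBasedWal code:

```cpp
#include <cstdint>
#include <iostream>
#include <map>

// Hypothetical, simplified model of a follower's WAL, for illustration only.
struct FollowerWal {
    std::map<int64_t, int64_t> termOf;  // logId -> term of each persisted entry

    int64_t lastLogId() const {
        return termOf.empty() ? 0 : termOf.rbegin()->first;
    }
};

// Suspected current behaviour: whenever the leader's last_log_id_sent is behind
// the follower's own last log id, the follower rolls the WAL back to
// last_log_id_sent before appending, even if the overlapping entries
// [lastLogIdSent + 1, lastLogId] are identical to what the leader re-sends.
void appendLogsCurrent(FollowerWal& wal, int64_t lastLogIdSent,
                       int64_t firstNewLogId, int64_t lastNewLogId, int64_t term) {
    if (lastLogIdSent < wal.lastLogId()) {
        std::cout << "rollbackToLog(" << lastLogIdSent << ")  // expensive WAL truncation\n";
        wal.termOf.erase(wal.termOf.upper_bound(lastLogIdSent), wal.termOf.end());
    }
    for (int64_t id = firstNewLogId; id <= lastNewLogId; ++id) {
        wal.termOf[id] = term;  // (re)write the entries into the WAL
    }
}
```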

If this is the case, a simple solution might be:
[screenshot: the proposed change]
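Continuing the same hypothetical model, the guard could look roughly like this: before rolling back, check whether the overlapping entries already in the local WAL carry the same term as the incoming ones; if they do, skip the rollback and only append the missing suffix:

```cpp
// Proposed guard, in the same hypothetical model: if the overlapping entries
// already on disk have the same term as the incoming ones, Raft's Log Matching
// property implies they are identical, so the rollback can be skipped.
void appendLogsPatched(FollowerWal& wal, int64_t lastLogIdSent,
                       int64_t firstNewLogId, int64_t lastNewLogId, int64_t term) {
    const int64_t localLast = wal.lastLogId();
    if (lastLogIdSent < localLast) {
        bool overlapMatches = true;
        for (int64_t id = lastLogIdSent + 1; id <= localLast && id <= lastNewLogId; ++id) {
            auto it = wal.termOf.find(id);
            if (it == wal.termOf.end() || it->second != term) {
                overlapMatches = false;
                break;
            }
        }
        if (overlapMatches) {
            // Fast path: keep the overlapping entries, append only what is new.
            if (firstNewLogId <= localLast) {
                firstNewLogId = localLast + 1;
            }
        } else {
            // Real divergence: fall back to the rollback path.
            std::cout << "rollbackToLog(" << lastLogIdSent << ")\n";
            wal.termOf.erase(wal.termOf.upper_bound(lastLogIdSent), wal.termOf.end());
        }
    }
    for (int64_t id = firstNewLogId; id <= lastNewLogId; ++id) {
        wal.termOf[id] = term;
    }
}
```

Note that this sketch assumes all incoming entries in a batch share one term; the real code would have to compare the term per overlapping entry.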

@UndefinedSy (Author)

I think this also leads to a parallel rollback issue when restarting a crashed storage instance, an issue we've observed in our production environment. Here's the sequence of events:

  1. The crashed instance has some logs that have been written to the WAL but have not been committed.
  2. Upon recovery, the crashed instance receives a new appendLog request with some overlapping log entries.
  3. It then performs a rollback operation, which takes a lot of CPU time and eventually causes more problems (a small driver below replays this timeline).
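To make the difference concrete, here is a small driver, still using the hypothetical model sketched above, replaying the reported timeline; the restart case has the same shape:

```cpp
// Replaying the reported timeline: the follower already holds [100, 103]
// (e.g. the AppendLog response was lost, or the instance restarted with
// uncommitted WAL entries), and the leader re-sends [100, 110] with
// last_log_id_sent = 99.
int main() {
    FollowerWal wal;
    for (int64_t id = 1; id <= 103; ++id) {
        wal.termOf[id] = 5;
    }

    FollowerWal current = wal;
    FollowerWal patched = wal;

    appendLogsCurrent(current, /*lastLogIdSent=*/99, /*firstNewLogId=*/100,
                      /*lastNewLogId=*/110, /*term=*/5);
    // prints "rollbackToLog(99)" although [100, 103] was already consistent

    appendLogsPatched(patched, 99, 100, 110, 5);
    // no rollback: only [104, 110] is appended

    return 0;
}
```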
