Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of order call to state_machine::create_snapshot() when manually triggering a snapshot #478

Open
adotsch opened this issue Nov 13, 2023 · 4 comments

Comments

@adotsch
Copy link

adotsch commented Nov 13, 2023

Hi,

I only create manual snapshots in my application calling raft_server::create_snapshot().
I have seen state_machine::create_snapshot() called with a snapshot index that is smaller(!) than the last committed index, i.e. when the state machine was already ahead of the point it should have created a snapshot for.
I guess this is a bug.
I have my workarounds but it would be nice to have this fixed so that I can clean up my code.

Regards,
Andras

@greensky00
Copy link
Contributor

Hi @adotsch,
Thanks for bringing this issue.

Creating snapshot is done by a separate background commit thread according to the current "state machine's commit index", so the snapshot index can be lagging behind the last commit index of log store. State machine is always catching up the Raft's commit index.

Nonetheless, the current behavior is obviously a bug. If there is more recent snapshot that is manually created, auto creation should skip any log index smaller than that. I will provide the fix for it soon.

@adotsch
Copy link
Author

adotsch commented Dec 8, 2023

Correct me if I am wrong, but I think you just added an extra check to ignore manual create_snapshot() calls when inconvenient, i.e. create_snapshot() will work when we are lucky, but won't be reliable.
Could you rather trigger an event in the commit thread to create a snapshot correctly and reliably or just use a new/existing mutex to avoid the race condition?

@greensky00
Copy link
Contributor

@adotsch
create_snapshot() is a best effort API and does not guarantee the success of creation, as multi-thread racing is not the only cause of failure. Also we can't use a mutex as snapshot creation is asynchronous task. Basically applications need to do retry upon the result of the API call.

If what you want is to schedule the snapshot creation on next earliest available log index, then I'd rather add a new API, for instance schedule_snapshot_creation(). But still, being a universal library, below case may happen:

  • At log index 100, the commit thread initiates snapshot creation. It is running.
  • At log index 101, schedule_snapshot_creation() is invoked by the user. But it is postponed as the previous creation is in progress.
  • ... The previous creation takes a few hours to finish ...
  • At log index 99999, the previous creation is done. Now the commit thread initiates the postponed snapshot creation request, on log index 99999.

Please let me know if it is ok for your case.

@adotsch
Copy link
Author

adotsch commented Dec 13, 2023

Thanks for your response!
This is exactly what I do, I retry creating the snapshot if it failed. It just doesn't feel right.
Yes, I think there should be a schedule_snapshot_creation() API that creates a snapshot at the earliest possible index. I don't think create_snapshot() in it's current form is a very good API, it requires a lot of logic at the user's end to get it working due to it's (best effort) nature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants