Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow shutdown during network snapshot transfer #588

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mauri870
Copy link

@mauri870 mauri870 commented Mar 7, 2024

With the current implementation a network transfer of the snapshot can block for a long time depending on the size of the snapshot. During this time, if a shutdown is initiated it will block until the network transfer is done.

With this commit it will spawn the network transfer in a separate goroutine, and in a select we can check for either shutdown signal or error/success of the snapshot transfer.

Fixes #585

@mauri870 mauri870 requested a review from a team as a code owner March 7, 2024 13:51
@mauri870 mauri870 requested review from jmurret and removed request for a team March 7, 2024 13:51
@hashicorp-cla
Copy link

hashicorp-cla commented Mar 7, 2024

CLA assistant check
All committers have signed the CLA.

@mauri870 mauri870 force-pushed the hotfix/shutdown-during-snapshot-transfer branch from 9e01890 to 1017c47 Compare March 7, 2024 13:53
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @mauri870 !

One possible concern I see here is that I'm not sure that we document that SnapshotSink implementations have to be thread safe... Until now they've never been used concurrently IIRC (but if you know of a place we do then please correct me).

So it's subtle but it could cause issues if the outer goroutine calls sink.Cancel on a shutdown concurrently with one of the sink.Cancel or sink.Close calls. (Or maybe even just the implicit calls to Write that io.Copy will be making.

It would be possible to reduce the number of Sink methods that need to be concurrency safe to just the Write call (i.e. keep all the error checking and calls to Cancel and Close in the main routine here. But it's still imposing a new assumption on Sink's that it's safe to call Cancel or Close concurrently with Write 🤔.

We could check the current implementations in this library and our other tools to see if that would be an issue for our stuff, but even if not, I still wonder if we might break other users of this library in subtle ways?

To be clear it's an edge case - would only happen during raft shutdown. But in some programs you might shutdown raft without intending to exit the program so a race bug that causes invalid state or a panic is still not great.

Not decided if this is a deal breaker for this approach or not but curious to hear your thoughts?

Perhaps we could have a different optional sink interface and only take this approach if the sink "opts in" to being safe for concurrent writes and closes? Then all our build in sinks could implement that (with any changes needed to make that true). And most users would get the fix "for free" while those who implemented custom sinks wouldn't have a subtle new race bug silently creep in on an upgrade?

@mauri870
Copy link
Author

Thanks for the feedback @banks. I ended up moving the Close and Cancel out of the goroutine, so they can't be called concurrently.

@banks
Copy link
Member

banks commented Mar 11, 2024

@mauri870 Thanks!

I happened to be looking at file_snapshot.go on Friday and I'm pretty sure our current implementation in this repo is still not safe to have Write and Close called concurrently - it's writing to a bufio.Writer but that is not go-routine safe between Write and Flush calls and Close would cause the buffer to be flushed etc.

I think we'll need a more comprehensive plan for how we can avoid changing the (implicit) interface assumptions that the snapshot sink must allow concurrent writes and close/cancel calls before we can feel OK about merging this. Sorry that's a pain for what seemed like a simple fix but I'm pretty sure we'd introduce new race bugs and possibly panics into our software and other users of this library with the PR as it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow network transfer of snapshot to be cancelled during shutdown
3 participants