Improve reliability of restore #4627

Closed
5 tasks done
MichaelEischer opened this issue Jan 7, 2024 · 1 comment
Labels
category: restore type: bug type: tracking tracks and sums up other issues on a specific topic

MichaelEischer commented Jan 7, 2024

What should restic do differently? Which functionality do you think we should add?

There are a few corner cases that can currently cause restore to fail. Judging from https://forum.restic.net/t/errors-restoring-with-restic-on-windows-server-s3/6943 and https://forum.restic.net/t/restic-restore-failing-on-large-data-from-s3-with-error-an-existing-connection-was-forcibly-closed-by-remote-host/7062, an individual blob that takes a long time to process can cause the network connection used by StreamPack to be closed unexpectedly.

The simplest "fix" would be to modify StreamPack such that it downloads the whole pack file first and only starts processing it afterwards. However, that would lead to memory usage problems when larger pack files are used. Thus, we have to resort to the following set of fixes:

  • Improve restorer error reporting #4624 already ensures that a retry in StreamPack does not reprocess already downloaded blobs, as that would just trigger the same problem again.
  • A comprehensive fix also requires implementing Set timeouts for backend connections #4193 and giving the retries more time than the currently used 15 minutes. The latter part becomes unnecessary once StreamPack is changed to request only a size-limited chunk of the pack file and to download that chunk in full immediately.
  • Finally, Rework repository.StreamPacks & better restorer error handling #4605 changes StreamPack such that if streaming the whole pack file fails, it falls back to retrieving each requested blob individually. With the changes listed above that is likely not necessary, but it can be useful nevertheless.
  • Rework backend retries #4784: retries should be able to conceal a network connection that is interrupted for a few minutes, ideally without endlessly delaying the shutdown of restic if the lock file cleanup fails.
  • Improve reliability of large restores #4626 mostly sidesteps the timeout problem by separately downloading frequently referenced blobs, which take a long time to write during restore. From a conceptual viewpoint this workaround has the problem that StreamPack fails to isolate its caller from the repository/backend implementation details.

MichaelEischer (Member, Author) commented

All PRs are merged by now.
