Improve reliability of restore #4627

MichaelEischer · 2024-01-07T13:47:04Z

What should restic do differently? Which functionality do you think we should add?

There are a few corner cases that currently can cause restore to fail. Judging from https://forum.restic.net/t/errors-restoring-with-restic-on-windows-server-s3/6943 and https://forum.restic.net/t/restic-restore-failing-on-large-data-from-s3-with-error-an-existing-connection-was-forcibly-closed-by-remote-host/7062 , an individual blob that takes a long time to process can cause the network connection used by StreamPack to be closed unexpectedly.

The simplest "fix" would be to modify StreamPack such that it just downloads the whole pack file first and only starts processing it afterwards. However, that would lead to memory usage problems when larger pack files are used. Thus, we have to resort to the following bunch of fixes:

- Improve restorer error reporting #4624 already ensures that a retry in StreamPack does not reprocess already downloaded blobs, as that would just trigger the same problem again.
- A comprehensive fix also requires implementing Set timeouts for backend connections #4193 ~~and giving the retries more time than the currently used 15 minutes~~. The latter part is no longer relevant: changing StreamPack to only request a size-limited chunk of the pack file and download it completely right away removes the need for longer retries.
- Rework repository.StreamPacks & better restorer error handling #4605 changes StreamPack such that if streaming the whole pack file fails, it falls back to individually retrieving each requested blob. With the previous changes that is likely not necessary, but it can be useful nevertheless.
- Rework backend retries #4784: retries should be able to conceal a network connection that is interrupted for a few minutes, ideally without endlessly delaying the shutdown of restic if the lock file cleanup fails.
- Improve reliability of large restores #4626 mostly sidesteps the timeout problem by separately downloading frequently referenced blobs, which take a long time to write during restore. From a conceptual viewpoint this workaround has the problem that StreamPack fails to isolate its caller from the repository/backend implementation details.


MichaelEischer · 2024-05-24T18:52:07Z

All PRs are merged by now.

MichaelEischer added the labels type: bug, category: restore, and type: tracking (tracks and sums up other issues on a specific topic) on Jan 7, 2024

This was referenced Apr 21, 2024

Set timeouts for backend connections #4193

Closed

backend: enforce backend HTTP requests make progress #4778

Closed

This was referenced Apr 29, 2024

Rework backend retries #4784

Merged

backend: enforce that backend HTTP requests make progress #4792

Merged

MichaelEischer mentioned this issue May 9, 2024

Retry loading of corrupted data from backend / cache #4800

Merged


MichaelEischer closed this as completed May 24, 2024
