Rework backend retries #4784

MichaelEischer · 2024-04-29T20:04:03Z

What does this PR change? What problem does it solve?

Most restic commands are long running and would have to start over in case of a failed backend operation. Thus, retries are necessary to work reliably even if some requests fail.

From what I'm aware of, there are two main uses cases for retries in restic:

retry a few failed operations. Failed operations are rare.
tolerate a network connection that's interrupted for a few minutes

The PR makes several changes:

Increase the initial retry interval (1s) and also the multiplier (2x). This still allows the first few retries to happen within a few seconds. All later retries are primarily useful to tolerate long lasting problems, thus a long interval is perfectly fine. This reduces the overall number of retries.
Only limit the retry duration instead of the number of retries. The values in the PR result in approximately 21 retries within the 15 minutes retry period. The internal concurrency of restic is bounded. Therefore, request retries reduce the number of requests per time sent to the backend. This should help overloaded backend recover.
Always retry at least once. In some cases the first request already takes longer than 15 minutes. In this case, previously no retry happened.
Wait at most 1 minute to unlock the repository if the user interrupted the current operation. This limits the maximum unlocking delay when using restic interactively, while giving background operations more time to clean up their lock file.

Was the change previously discussed in an issue or on the forum?

Replaces #2515.
Mentioned in #4627.

Checklist

I have read the contribution guidelines.
I have enabled maintainer edits.
I have added tests for all code changes.
I have added documentation for relevant changes (in the manual).
There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
I have run gofmt on the code in all commits.
All commit messages are formatted in the same style as the other commits in the repo.
I'm done! This pull request is ready for review.

This simplifies finding the request in the log output that cause an operation to fail.

Depending on how long an operation takes to fail, the total retry duration can currently vary between 1.5 and 15 minutes. In particular for temporarily interrupted network connections, the former timeout is too short. Thus always use a limit of 15 minutes.

Previously, if an operation failed after 15 minutes, then it would never be retried. This means that large backend requests are more unreliable than smaller ones.

Retries in restic try to solve two main problems: - retry a temporarily failed operation - tolerate temporary network interruptions The first problem only requires a few retries, whereas the last one benefits primarily from spreading the requests over a longer duration. Increasing the default multiplier and the initial interval works for both cases. The first few retries only take a few seconds, while later retries quickly reach the maximum interval of one minute. This ensures that the total number of retries issued by restic will remain at around 21 retries for a 15 minute period. As the concurrency in restic is bounded, retries drastically reduce the number of requests sent to a backend. This helps to prevent overloading the backend.

The toplevel context in restic only canceled if the user interrupts a restic operation. If the network connection has failed this can require waiting the full retry duration of 15 minutes which is a bad user experience for interactive usage. Thus limit the delay to one minute in this case.

MichaelEischer

LGTM

MichaelEischer mentioned this pull request May 1, 2024

Improve reliability of restore #4627

Closed

5 tasks

This was referenced May 9, 2024

Retry loading of corrupted data from backend / cache #4800

Merged

Do not retry permanent backend failures #4803

Merged

MichaelEischer force-pushed the rework-backend-retries branch 2 times, most recently from 22fe701 to a927b29 Compare May 20, 2024 18:37

MichaelEischer added 7 commits May 24, 2024 20:24

retry: explicitly log failed requests

a3633ca

This simplifies finding the request in the log output that cause an operation to fail.

retry: ensure that there's always at least one retry

512cd6e

Previously, if an operation failed after 15 minutes, then it would never be retried. This means that large backend requests are more unreliable than smaller ones.

add changelog for longer retries

723247c

backend/retry: feature flag new retry behavior

e4a4808

MichaelEischer force-pushed the rework-backend-retries branch from a927b29 to e4a4808 Compare May 24, 2024 18:24

MichaelEischer commented May 24, 2024

View reviewed changes

MichaelEischer merged commit 16ef4d5 into restic:master May 24, 2024
13 checks passed

MichaelEischer deleted the rework-backend-retries branch May 24, 2024 18:29

This was referenced May 25, 2024

restic gives up to fast on rate-limited S3 backends that use HTTP 503 to slow clients down #4018

Closed

mount: support fuse-t on macOS #4825

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework backend retries #4784

Rework backend retries #4784

MichaelEischer commented Apr 29, 2024 •

edited

MichaelEischer left a comment

Rework backend retries #4784

Rework backend retries #4784

Conversation

MichaelEischer commented Apr 29, 2024 • edited

What does this PR change? What problem does it solve?

Was the change previously discussed in an issue or on the forum?

Checklist

MichaelEischer left a comment

Choose a reason for hiding this comment

MichaelEischer commented Apr 29, 2024 •

edited