Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework backend retries #4784

Merged
merged 7 commits into from
May 24, 2024

Conversation

MichaelEischer
Copy link
Member

@MichaelEischer MichaelEischer commented Apr 29, 2024

What does this PR change? What problem does it solve?

Most restic commands are long running and would have to start over in case of a failed backend operation. Thus, retries are necessary to work reliably even if some requests fail.

From what I'm aware of, there are two main uses cases for retries in restic:

  • retry a few failed operations. Failed operations are rare.
  • tolerate a network connection that's interrupted for a few minutes

The PR makes several changes:

  • Increase the initial retry interval (1s) and also the multiplier (2x). This still allows the first few retries to happen within a few seconds. All later retries are primarily useful to tolerate long lasting problems, thus a long interval is perfectly fine. This reduces the overall number of retries.
  • Only limit the retry duration instead of the number of retries. The values in the PR result in approximately 21 retries within the 15 minutes retry period. The internal concurrency of restic is bounded. Therefore, request retries reduce the number of requests per time sent to the backend. This should help overloaded backend recover.
  • Always retry at least once. In some cases the first request already takes longer than 15 minutes. In this case, previously no retry happened.
  • Wait at most 1 minute to unlock the repository if the user interrupted the current operation. This limits the maximum unlocking delay when using restic interactively, while giving background operations more time to clean up their lock file.

Was the change previously discussed in an issue or on the forum?

Replaces #2515.
Mentioned in #4627.

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

This simplifies finding the request in the log output that cause an
operation to fail.
Depending on how long an operation takes to fail, the total retry
duration can currently vary between 1.5 and 15 minutes. In particular
for temporarily interrupted network connections, the former timeout is
too short. Thus always use a limit of 15 minutes.
Previously, if an operation failed after 15 minutes, then it would never
be retried. This means that large backend requests are more unreliable
than smaller ones.
Retries in restic try to solve two main problems:
- retry a temporarily failed operation
- tolerate temporary network interruptions

The first problem only requires a few retries, whereas the last one benefits
primarily from spreading the requests over a longer duration.

Increasing the default multiplier and the initial interval works for
both cases. The first few retries only take a few seconds, while later
retries quickly reach the maximum interval of one minute. This ensures
that the total number of retries issued by restic will remain at around
21 retries for a 15 minute period. As the concurrency in restic is
bounded, retries drastically reduce the number of requests sent to a
backend. This helps to prevent overloading the backend.
The toplevel context in restic only canceled if the user interrupts a
restic operation. If the network connection has failed this can require
waiting the full retry duration of 15 minutes which is a bad user
experience for interactive usage. Thus limit the delay to one minute in
this case.
Copy link
Member Author

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MichaelEischer MichaelEischer merged commit 16ef4d5 into restic:master May 24, 2024
13 checks passed
@MichaelEischer MichaelEischer deleted the rework-backend-retries branch May 24, 2024 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

1 participant