Feature Request: Re-Read on checksum-fail (check --data-read) #4774

Open
demlak opened this issue Apr 20, 2024 · 2 comments

demlak commented Apr 20, 2024

Output of restic version

restic 0.16.4 compiled with go1.21.6 on windows/amd64

What should restic do differently? Which functionality do you think we should add?

Re-read a single file during check --read-data when an error is detected.

What are you trying to do? What problem would this solve?

Several possible issues can lead to a faulty result when hashing files.

For example: the chance of hitting a bit flip in non-ECC RAM grows with the number of pack files.

My case:
My 900 GB repo with nearly 200k packs is stored on a storage share at my hoster. For redundancy, I rsync it from time to time to an old local PC with non-ECC RAM. After that, I run check --read-data locally, for obvious reasons =)
Sometimes I get an error about a wrong checksum of a blob.

But:

  • Manually checking the file, for example via sha256sum, shows no problem.
  • So I run restic check --read-data again, and the check exits with an error on another file..
  • or it exits without any problem.

This shows me that the problem is not on the disk but somewhere else.
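
The manual check can be scripted for a whole local copy. This is only a minimal sketch, assuming that each pack file is named after the SHA-256 hash of its content (which is why sha256sum works as a check here) and that the packs sit somewhere below one directory. The function names are made up for illustration and are not part of restic.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large packs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(data_dir: Path) -> list[Path]:
    """Return files whose content hash differs from their file name.

    Assumption: every regular file below data_dir is a pack file
    named after the SHA-256 of its content.
    """
    return [
        p
        for p in sorted(data_dir.rglob("*"))
        if p.is_file() and sha256_of_file(p) != p.name
    ]
```

Calling `find_mismatches(Path("/path/to/repo/data"))` (the path is a placeholder) lists the packs that look wrong in this particular read. A pack that shows up in one run but not in the next points away from the disk, as described above.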

I think it would be more convenient if restic re-read the files on which it got an error, maybe 1 or 2 times, just to dig deeper into the problem.
Depending on the result of the re-read:

  • if it is the same error, it is clear that the file is corrupt on the disk.
  • if it is a different error on the same file on different re-reads, it is also clear that there is a problem (but well.. I don't know which one =) )
  • if the error is gone on a second scan, everything seems to be fine and the whole --read-data run can continue
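
The three cases above can be sketched roughly like this. It is not how restic works internally, just an illustration of the idea: `read` is a hypothetical callable that returns the file content on every call, and the names `Verdict` and `check_with_rereads` are made up.

```python
import hashlib
from enum import Enum
from typing import Callable


class Verdict(Enum):
    OK = "ok"                  # first read already matched
    TRANSIENT = "transient"    # error disappeared on a re-read
    PERSISTENT = "persistent"  # same wrong hash on every read
    UNSTABLE = "unstable"      # wrong hash changed between reads


def check_with_rereads(
    read: Callable[[], bytes], expected_hex: str, rereads: int = 2
) -> Verdict:
    """Hash the content, and re-read up to `rereads` times on a mismatch."""
    first = hashlib.sha256(read()).hexdigest()
    if first == expected_hex:
        return Verdict.OK
    seen = {first}
    for _ in range(rereads):
        again = hashlib.sha256(read()).hexdigest()
        if again == expected_hex:
            return Verdict.TRANSIENT
        seen.add(again)
    # One distinct wrong hash suggests on-disk corruption;
    # several distinct wrong hashes suggest an unstable system.
    return Verdict.PERSISTENT if len(seen) == 1 else Verdict.UNSTABLE
```

With `rereads=2`, a file is read at most three times, so the extra cost only shows up for files that already failed once.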

On small repos, this would not help much.. but on large ones like mine, which takes about 4 hours to scan, it would be a lot more convenient.

One could now say: change your faulty hardware! Yes.. but.. if this is a non-ECC RAM problem, it probably affects many others, too. Non-ECC RAM is not that rare on the planet =)
It would also speed up the error analysis.

On top of that, maybe it would be nice to have control over this feature via the command line, for example: disabling it, the number of re-reads, exit/continue on error, etc.

Did restic help you today? Did it make you happy in any way?

Yeah.. restic helps me several times a week! Besides that, it helped me today to get in touch with you, cheers! <3

olhotak commented Apr 24, 2024

Since I upgraded from restic 0.14.0 to 0.16.4, I get many "Pack ID does not match" errors on restic check --read-data (around 100 on a 1 TB repo on a local disk). sha256sum always gives the correct checksum for those packs. When I run restic check --read-data a second time, I get errors on different packs, never the same pack twice. Obviously this could be my hardware, but:

  • It didn't happen with 0.14.0.
  • I ran memtest86+ and it found no issues.
  • There's no persistent corruption, so backing up seems to be working fine. The problem only shows up with check. If the hardware were bad, I would expect some of the packs to get corrupted while they were being written to disk.
  • No other symptoms of bad hardware (crashes, corrupted files or filesystems).

Some way to make check read the packs more carefully would be welcome.

MichaelEischer (Member) commented

I've rebuilt the handling of transient errors in #4800. For now that PR lets check reread pack files that caused an error and report whether that error disappeared.

> One could now say: change your faulty hardware! yes.. but.. if this is a non-ECC-RAM problem, it probably affects many others, too. non-ECC-RAM is not that rare on the planet =)

At least on my non-ECC hardware, I've never seen restic report a bitflip. So I'm relatively sure that your error rate is at the upper end of that of a typical system (excluding those with such a high error rate that the system becomes unusable).

> On-top, maybe it is a nice idea to have control over this feature via commandline.. for example disable, counts of re-reads, exit/continue on error, etc..

I doubt that most of these options would be useful, except maybe an "ignore-retried-errors" option that does not let check fail if it had to re-read a few pack files. The general assumption is that bitflips are infrequent enough that they don't totally corrupt restic's memory, so a single retry should hopefully always be enough. If not, then the system is probably too unstable to be of much use.
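
To illustrate what such an option would change, here is a rough sketch of the exit decision. Everything in it is hypothetical: the function name, the parameter names, and the convention that 0 means success and 1 means failure are assumptions for illustration, not an existing restic flag or its actual behavior.

```python
def check_exit_code(
    persistent_errors: int,
    retried_errors: int,
    ignore_retried_errors: bool = False,
) -> int:
    """Return 0 for success and 1 for failure (assumed convention).

    persistent_errors: packs that still failed after the re-read
    retried_errors:    packs that failed once but passed on the re-read
    """
    if persistent_errors > 0:
        return 1  # real damage always fails the check
    if retried_errors > 0 and not ignore_retried_errors:
        return 1  # transient errors fail the check unless ignored
    return 0
```

In this sketch the option only affects runs where every error disappeared on the re-read; errors that persist always make the check fail.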

> Since I upgraded from restic 0.14.0 to 0.16.4, I get many Pack ID does not match errors on restic check --read-data (around 100 on a 1 TB repo on a local disk).

Very little has changed in the code of the check command between restic 0.14.0 and 0.16.4. If this were a widespread problem, then (hopefully) someone else would also have reported this exact problem since the release of restic 0.16.4.
