
Read error due to assumed corrupt block with all-zero nonce #532

Open
EinBur opened this issue Dec 11, 2020 · 19 comments


EinBur commented Dec 11, 2020

I've got a rather big gocryptfs mount (a bit over 8 TB of data). Reading a few recently added files (a few GB each), I get Eingabe-/Ausgabefehler (input/output error) when accessing a small part (I guess: exactly one block) inside them.

Running gocryptfs -fsck /mnt/vault_cipherdir I get this output (names shortened, numbers original):

doRead 71020: corrupt block #29792: all-zero nonce
fsck: error reading file "foo/bar1/abc" (inum 71020): 5=input/output error
doRead 71017: corrupt block #2328: all-zero nonce
fsck: error reading file "foo/bar2/def" (inum 71017): 5=input/output error
doRead 71012: corrupt block #4968: all-zero nonce
fsck: error reading file "foo/bar3/ghi" (inum 71012): 5=input/output error
doRead 71009: corrupt block #2480: all-zero nonce
fsck: error reading file "foo/bar4/jkl" (inum 71009): 5=input/output error
doRead 71006: corrupt block #3920: all-zero nonce
fsck: error reading file "foo/bar5/mno" (inum 71006): 5=input/output error
doRead 71001: corrupt block #127896: all-zero nonce
fsck: error reading file "foo/bar6/pqr" (inum 71001): 5=input/output error
doRead 71003: corrupt block #40976: all-zero nonce
fsck: error reading file "foo/bar7/stu" (inum 71003): 5=input/output error

Things I already did:

  • Checked the cipherdir: everything looks fine. It's a CIFS/SMB3 mount of a rather new NAS with rather new HDDs in a RAID5 configuration, with good RAID health and good HDD SMART values, so I'm confident the underlying storage is fine.
  • hexdump -n 512 -s 0 foo/bar2/def looks good (as expected)
  • hexdump -n 512 -s $((1024*4*2328)) foo/bar2/def looks bad and gives the Eingabe-/Ausgabefehler (as expected). In /var/log/syslog it creates two entries: gocryptfs[6457]: doRead 71017: corrupt block #2328: all-zero nonce
  • hexdump -n 512 -s 0 /mnt/vault_cipherdir/<EncryptedPathToSameFile> looks good (as expected)
  • hexdump -n 512 -s $((1024*4*2328)) /mnt/vault_cipherdir/<EncryptedPathToSameFile> also looks good (as hoped for, since it shows that the underlying system is working correctly) and prints a bunch of "random" bytes
  • Copied the same original file (with a new file name) into the same directory (foo/bar2/def) again. No error messages.
  • Reading the full copy (md5sum foo/bar2/def2) I get an Eingabe-/Ausgabefehler for this file as well (/var/log/syslog this time says twice: gocryptfs[6457]: doRead 71074: corrupt block #88480: all-zero nonce)

What I find curious: all the inums are just above 71000. Coincidence?

Is this information enough for you to fix the issue - or do you need more information from me?

gocryptfs --version gives gocryptfs 1.7.1; go-fuse 0.0~git20190214.58dcd77; 2019-12-26 go1.13.5 linux/amd64


EinBur commented Dec 11, 2020

Running gocryptfs -d -fusedebug and calling hexdump -n 512 -s $((1024*4*2328)) foo/bar2/def, I get this in the syslog:

...
gocryptfs[86540]: go-fuse: rx 12: OPEN i6 {O_RDONLY,0x8000} 
gocryptfs[86540]: go-fuse: tx 12:     OK, {Fh 2 }
gocryptfs[86540]: go-fuse: rx 14: FLUSH i6 {Fh 2} 
gocryptfs[86540]: go-fuse: tx 14:     OK
gocryptfs[86540]: go-fuse: rx 16: READ i6 {Fh 2 [9535488 +8192)  L 0 RDONLY,0x8000} 
gocryptfs[86540]: ino71017: FUSE Read: offset=9535488 length=8192
gocryptfs[86540]: doRead: off=9535488 len=8192 -> off=9610002 len=8256 skip=0
gocryptfs[86540]: ReadAt offset=9610002 bytes (2328 blocks), want=8256, got=8256
gocryptfs[86540]: doRead 71017: corrupt block #2328: all-zero nonce
gocryptfs[86540]: go-fuse: tx 16:     5=input/output error
gocryptfs[86540]: go-fuse: rx 18: READ i6 {Fh 2 [9535488 +4096)  L 0 RDONLY,0x8000} 
gocryptfs[86540]: ino71017: FUSE Read: offset=9535488 length=4096
gocryptfs[86540]: doRead: off=9535488 len=4096 -> off=9610002 len=4128 skip=0
gocryptfs[86540]: ReadAt offset=9610002 bytes (2328 blocks), want=4128, got=4128
gocryptfs[86540]: doRead 71017: corrupt block #2328: all-zero nonce
gocryptfs[86540]: go-fuse: tx 18:     5=input/output error
gocryptfs[86540]: go-fuse: rx 20: FLUSH i6 {Fh 2} 
gocryptfs[86540]: go-fuse: tx 20:     OK
gocryptfs[86540]: go-fuse: rx 22: RELEASE i6 {Fh 2 0x8000  L0} 
gocryptfs[86540]: go-fuse: tx 22:     OK


rfjakob commented Dec 12, 2020

Hi, nice analysis so far! But I think you have to adjust the offset in the encrypted file due to the per-block overhead (https://github.com/rfjakob/gocryptfs/blob/master/Documentation/file-format.md).

Alternatively, can you run gocryptfs-xray on the encrypted file?
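To make the overhead concrete, here is a minimal sketch of the mapping, assuming the default on-disk format described in that document (18-byte file header; each 4096-byte plaintext block stored as a 4128-byte ciphertext block: 16-byte IV + payload + 16-byte GCM tag):

```python
# Plaintext-to-ciphertext offset mapping, assuming gocryptfs defaults
# per Documentation/file-format.md: 18-byte file header, then one
# 4128-byte ciphertext block (16-byte IV + 4096-byte payload + 16-byte
# GCM tag) per 4096-byte plaintext block.
HEADER_LEN = 18
PLAIN_BS = 4096
CIPHER_BS = PLAIN_BS + 16 + 16  # 4128

def cipher_offset(plain_offset: int) -> int:
    """Ciphertext offset of the block containing plain_offset."""
    block = plain_offset // PLAIN_BS
    return HEADER_LEN + block * CIPHER_BS

# The failing read starts at plaintext offset 1024*4*2328 = 9535488,
# i.e. block 2328, which sits at ciphertext offset 9610002 -- matching
# the "doRead: off=9535488 ... -> off=9610002" line in the debug log.
print(cipher_offset(1024 * 4 * 2328))  # 9610002
```

So the earlier ciphertext hexdump at `-s $((1024*4*2328))` landed 74514 bytes before the block it was meant to inspect.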


EinBur commented Dec 12, 2020

gocryptfs-xray foo/bar2/def shows:

...
Block 2327: IV: b238a1e0e224183d2fb1bdaa3be73326, Tag: 34cc404791dd7533b96ce8f4ebcba888, Offset: 9605874 Len: 4128
Block 2328: IV: 00000000000000000000000000000000, Tag: 929834954efb97b3635f8aae42da3a09, Offset: 9610002 Len: 4128
...

and gocryptfs-xray foo/bar2/def | grep 00000000000000000000000000000000 shows that only block 2328 is affected.
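For readers without gocryptfs-xray at hand, the same all-zero-IV check can be sketched in a few lines of Python (an illustration only, not gocryptfs code; it assumes the default layout of an 18-byte header followed by 4128-byte blocks whose first 16 bytes are the IV):

```python
# Rough stand-in for `gocryptfs-xray <file> | grep 000...`: scan a
# gocryptfs ciphertext file for blocks whose 16-byte IV is all zeros.
# Layout assumed: 18-byte header, then 4128-byte ciphertext blocks.
HEADER_LEN, CIPHER_BS, IV_LEN = 18, 4128, 16

def find_zero_iv_blocks(path):
    """Return the indices of ciphertext blocks whose IV is all zeros."""
    bad = []
    with open(path, "rb") as f:
        f.seek(HEADER_LEN)
        block = 0
        while True:
            data = f.read(CIPHER_BS)
            if not data:
                break
            if data[:IV_LEN] == b"\x00" * IV_LEN:
                bad.append(block)
            block += 1
    return bad
```

On the affected ciphertext file this should report block 2328 only, matching the xray output above.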


rfjakob commented Dec 12, 2020

OK, can you run

hexdump -n 4128 -s $((18 + 4128*2328)) /mnt/vault_cipherdir/<EncryptedPathToSameFile>

to see how much data is zeroed out? My guess is that it is more than the nonce.

But in any case, the data in this block is gone.


rfjakob commented Dec 12, 2020

PS: You could also try to decrypt the file using

https://github.com/slackner/gocryptfs-inspect

It would be interesting to see what it thinks about block 2328.


EinBur commented Dec 12, 2020

hexdump gives:

092a312 0000 0000 0000 0000 0000 0000 0000 0000
*
092aff2 0000 0000 0000 0000 0000 0000 0000 ca7f
092b002 eeaa 2cdd 1752 4e6a f6c9 00ba f2a1 5f63
092b012 cce4 00fd 7761 6a72 0492 7a6c 088a 33e0
... (no more obvious zeros till the end)

=> From 0x92a312 to 0x92afff it's only zeros.

Running hexdump -n 4128 -s $((4128*2328)) <file> I get

092a300 381a cc34 4740 dd91 3375 6cb9 f4e8 cbeb
092a310 88a8 0000 0000 0000 0000 0000 0000 0000
092a320 0000 0000 0000 0000 0000 0000 0000 0000
*
092b000 ca7f eeaa 2cdd 1752 4e6a f6c9 00ba f2a1
092b010 5f63 cce4 00fd 7761 6a72 0492 7a6c 088a
...

so the zeros start at 0x92a312.
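A small arithmetic cross-check (assuming the 18-byte header and 4128-byte blocks from the file-format doc) shows that this zero run starts exactly at block 2328's on-disk position, i.e. at its 16-byte IV:

```python
# 0x92a312 is exactly where block 2328 starts on disk (header + 2328
# full ciphertext blocks), and the first non-zero byte is at 0x92b000.
# The zero run is 3310 bytes, so it is neither a whole 4128-byte
# gocryptfs block nor a whole 4096-byte filesystem block.
assert 0x92a312 == 18 + 4128 * 2328 == 9610002
print(0x92b000 - 0x92a312)  # 3310
```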


EinBur commented Dec 12, 2020

python3 gocryptfs.py --config /mnt/vault_cipherdir/gocryptfs.conf /mnt/vault_cipherdir/<EncryptedPathToSameFile> > ~/decrypted.file results in

Traceback (most recent call last):
  File "gocryptfs.py", line 168, in <module>
    block = fp.read(4096)
  File "gocryptfs.py", line 141, in read
    if not self.fill_buffer():
  File "gocryptfs.py", line 121, in fill_buffer
    self.remaining = io.BytesIO(self.decrypt_block(block))
  File "gocryptfs.py", line 101, in decrypt_gcm_block
    return aes.decrypt_and_verify(ciphertext, tag)
  File "/usr/lib/python3/dist-packages/Cryptodome/Cipher/_mode_gcm.py", line 504, in decrypt_and_verify
    self.verify(received_mac_tag)
  File "/usr/lib/python3/dist-packages/Cryptodome/Cipher/_mode_gcm.py", line 456, in verify
    raise ValueError("MAC check failed")
ValueError: MAC check failed

and then ls -l ~/decrypted.file shows that the decrypted file has a size of 9535488 bytes.
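The size of the partial output is itself consistent: a quick check (assuming 4096-byte plaintext blocks) shows the script decrypted everything up to, and aborted exactly at, the corrupt block:

```python
# 9535488 bytes = 2328 complete 4096-byte plaintext blocks, so
# gocryptfs-inspect wrote blocks 0..2327 and stopped at block 2328,
# where the GCM MAC check failed.
decrypted_size = 9535488
assert decrypted_size % 4096 == 0
print(decrypted_size // 4096)  # 2328
```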


EinBur commented Dec 12, 2020

Using gocryptfs -d -fusedebug again and writing one byte into the affected block

printf '\x31' | dd of='foo/bar2/def' bs=1 seek=$((1024*4*2328)) count=1 conv=notrunc 

also gives the input/output error, as I guess it has to read the full block to be able to modify the one byte.

So running instead

printf '\x31' | dd of='foo/bar2/def' bs=$((1024*4)) seek=$((1024*4*2328)) count=1 conv=notrunc 

the result from dd looks fine (1 byte copied). And the syslog is also looking fine:

...
gocryptfs[29785]: go-fuse: rx 40: OPEN i6 {RDWR,0x8000} 
gocryptfs[29785]: go-fuse: tx 40:     OK, {Fh 2 }
gocryptfs[29785]: go-fuse: rx 42: FLUSH i6 {Fh 2} 
gocryptfs[29785]: go-fuse: tx 42:     OK
gocryptfs[29785]: go-fuse: rx 44: IOCTL i6 *fuse._IoctlIn: &{{72 39 44 6 {{1000 1000} 29854} 0} 2 0 2150657282 140726798105120 0 48} 
gocryptfs[29785]: go-fuse: tx 44:     38=function not implemented
gocryptfs[29785]: go-fuse: rx 46: GETXATTR i6 {sz 0} ["security.capability"] 20b
gocryptfs[29785]: go-fuse: tx 46:     95=operation not supported
gocryptfs[29785]: go-fuse: rx 48: WRITE i6  1b
gocryptfs[29785]: ino71017: FUSE Write: offset=39057358848 length=1
gocryptfs[29785]: doRead: off=39057358848 len=4096 -> off=39362494482 len=4128 skip=0
gocryptfs[29785]: ReadAt offset=39362494482 bytes (9535488 blocks), want=4128, got=33
gocryptfs[29785]: len(oldData)=1 len(blockData)=1
gocryptfs[29785]: ino71017: Writing 1 bytes to block #9535488
gocryptfs[29785]: go-fuse: tx 48:     OK
gocryptfs[29785]: go-fuse: rx 50: FLUSH i6 {Fh 2} 
gocryptfs[29785]: go-fuse: tx 50:     OK
gocryptfs[29785]: go-fuse: rx 52: RELEASE i6 {Fh 2 RDWR,0x8000  L0} 
gocryptfs[29785]: go-fuse: tx 52:     OK

Reading that just-written byte back with dd if='foo/bar2/def' bs=1 skip=$((1024*4*2328)) count=1 conv=notrunc | hexdump results in the same well-known input/output error.


rfjakob commented Dec 12, 2020

This

printf '\x31' | dd of='foo/bar2/def' bs=$((1024*4)) seek=$((1024*4*2328)) count=1 conv=notrunc 

worked because you seeked to 1024*4 x 1024*4*2328 bytes
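In other words (a quick check; dd counts seek= in units of the output block size bs=):

```python
# With bs=4096, seek=$((1024*4*2328)) skips bs * seek bytes, which is
# why the syslog above shows "FUSE Write: offset=39057358848" instead
# of the intended plaintext offset 9535488.
bs = 1024 * 4
seek = 1024 * 4 * 2328
print(bs * seek)  # 39057358848
```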


EinBur commented Dec 15, 2020

OK, I tried again:

$ dd if=/dev/urandom of='foo/bar2/def' bs=$((1024*4)) seek=2328 count=1 conv=notrunc

It reported 4096 bytes copied successfully. The syslog says:

gocryptfs[89943]: go-fuse: rx 638: OPEN i289 {RDWR,0x8000} 
gocryptfs[89943]: go-fuse: tx 638:     OK, {Fh 3 }
gocryptfs[89943]: go-fuse: rx 640: FLUSH i289 {Fh 3} 
gocryptfs[89943]: go-fuse: tx 640:     OK
gocryptfs[89943]: go-fuse: rx 642: IOCTL i289 *fuse._IoctlIn: &{{72 39 642 289 {{1000 1000} 90907} 0} 3 0 2150657282 140729113025424 0 48} 
gocryptfs[89943]: go-fuse: tx 642:     38=function not implemented
gocryptfs[89943]: go-fuse: rx 644: GETXATTR i289 {sz 0} ["security.capability"] 20b
gocryptfs[89943]: go-fuse: tx 644:     95=operation not supported
gocryptfs[89943]: go-fuse: rx 646: WRITE i289  4096b
gocryptfs[89943]: ino71017: FUSE Write: offset=9535488 length=4096
gocryptfs[89943]: ino71017: Writing 4096 bytes to block #2328
gocryptfs[89943]: go-fuse: tx 646:     OK
gocryptfs[89943]: go-fuse: rx 648: FLUSH i289 {Fh 3} 
gocryptfs[89943]: go-fuse: tx 648:     OK
gocryptfs[89943]: go-fuse: rx 650: RELEASE i289 {Fh 3 RDWR,0x8000  L0} 
gocryptfs[89943]: go-fuse: tx 650:     OK

Reading it back with dd if='foo/bar2/def' bs=1 skip=$((1024*4*2328)) count=$((1024*4)) conv=notrunc | hexdump seems to work fine now. There were no warnings or error messages, and the hexdump output looks fine (random) to me.

And gocryptfs-xray <crypted file> shows:

...
Block 2327: IV: b238a1e0e224183d2fb1bdaa3be73326, Tag: 34cc404791dd7533b96ce8f4ebcba888, Offset: 9605874 Len: 4128
Block 2328: IV: b62838972c36354c25749e9c9c691774, Tag: 5f0afedd9c4fb3e87a8b26c999c841b3, Offset: 9610002 Len: 4128
...

So it seems that the overwrite fixed the IV.

This leads to the questions:

  • Why was the IV corrupted in the first place?
  • Why is it only happening now (and repeatedly) and not during the first ~8 TB that were copied into this vault?

I verified that only a handful of files are affected, and that those files were all added recently (though I don't know whether a good file was in between), by creating an md5sum for each of them, i.e. I made sure that all blocks were read.
I also checked that it's still the identical official Ubuntu package gocryptfs:amd64 (1.7-2), as it hasn't been updated since this vault was created.


rfjakob commented Dec 20, 2020

First non-zero byte is at offset 9613312 bytes = 4096 * 2347.

I think your NAS lost one 4096-byte block, from 4096 * 2346 = 9609216 to 4096 * 2347 = 9613312.

However, we should see two corrupt gocryptfs blocks in this case, as one zeroed-out 4096-byte NAS block will most likely touch two 4128-byte gocryptfs blocks. We are still missing something here.
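This can be checked numerically (a sketch, again assuming the 18-byte header and 4128-byte blocks from the file-format doc): a zeroed-out 4096-byte NAS block starting at 9609216 would indeed intersect gocryptfs blocks 2327 and 2328, yet the hexdump above shows block 2327 intact, with the zeros starting exactly at block 2328's IV (offset 9610002):

```python
# Which gocryptfs ciphertext blocks does a byte range intersect?
# Layout assumed: 18-byte header, then 4128-byte ciphertext blocks.
HEADER_LEN, CIPHER_BS = 18, 4128

def gocryptfs_blocks_touched(start: int, length: int) -> list:
    """Indices of ciphertext blocks overlapping [start, start+length)."""
    first = (start - HEADER_LEN) // CIPHER_BS
    last = (start + length - 1 - HEADER_LEN) // CIPHER_BS
    return list(range(first, last + 1))

# The presumed lost NAS block 2346 (bytes 9609216..9613312):
print(gocryptfs_blocks_touched(4096 * 2346, 4096))  # [2327, 2328]
```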

I suggest taking a big file (1 GB from urandom? But NOT all-zero!), copying it many times unencrypted to the NAS, dropping caches [1], and then checking the md5sums of all the files.

Then try again via gocryptfs?

[1]: echo 3 | sudo tee /proc/sys/vm/drop_caches


EinBur commented Dec 20, 2020

I've now created a random filled 1 GB and a 10 GB file locally.

  • Copying 1 GB to the NAS share directly: no problem
  • Copying the 1 GB file 10x (with individual names, of course) to the NAS share directly: no problem
  • Copying 10 GB to the NAS share directly: no problem
  • Copying the 10 GB file 10x (with individual names, of course) to the NAS share directly: no problem
  • Doing all of the same to the encrypted vault: no problem
  • Repeating the copy of the affected file that had failed repeatedly last time: still no problem

So today everything is working as expected. This is unfortunate, as it makes this issue sporadic. :(

Deleting all the additional files I've copied the last few days for testing and trying again: still no problem

But, an interesting observation: the inode numbers don't seem to get reused. The latest copies created inodes 71334 and 71336, but I'd have expected them to be just above 71020 or so again, as I didn't fill that vault any further after I discovered the problems (apart from the test copies, of course).
Immediately deleting the latest file and copying it again created inode 71337.

All copying and deleting is done in the terminal, so I'm sure that no desktop environment trash can or similar is interfering with these tests.


EinBur commented Jan 26, 2021

Now I retried again. Copying the 1 GB random file to 20 separate files in the encrypted vault worked fine. Then 10x a 10 GB random file also worked.
So I deleted those files and copied live data: 3 files of 4 to 7 GB with quite compressed content, i.e. it should also look "random".
Two worked; one was corrupt again (same failure pattern: one IV was zeroed out).

Copying exactly the same three files to the same path on the NAS mount (/mnt/vault_cipherdir/<EncryptedPath>/), so that gocryptfs is bypassed although the operation looks similar from the NAS's point of view, worked fine. md5sum showed exactly the same hash (of course using drop_caches beforehand).

Then copying the affected file into the vault again, in the same path but with a different name, shows that the second copy is also corrupt, though at a different position:

gocryptfs-xray /mnt/vault_cipherdir/<EncryptedPath>/<gibberish1> | grep 000000000
Block 2952: IV: 00000000000000000000000000000000, Tag: b4f170e77267a758823f33ac585dfb86, Offset: 12185874 Len: 4128
gocryptfs-xray /mnt/vault_cipherdir/<EncryptedPath>/<gibberish2> | grep 000000000
Block 28288: IV: 00000000000000000000000000000000, Tag: 9d60cb37a41d1c490541b52d87d601d4, Offset: 116772882 Len: 4128


EinBur commented Feb 18, 2021

As I still have the problem, I wonder what to do to get it solved.

Something that I have discovered is that dmesg tells me stuff like

[38072.282176] CIFS VFS: \\<NAS> No task to wake, unknown frame received! NumMids 3
[38072.282179] 00000000: 424d53fe 00000040 00000000 00000012  .SMB@...........
[38072.282180] 00000010: 00000001 00000000 ffffffff ffffffff  ................
[38072.282181] 00000020: 00000000 00000000 00000000 00000000  ................
[38072.282181] 00000030: 00000000 00000000 00000000 00000000  ................

with sometimes NumMids 2 and sometimes NumMids 3.

Also, I wonder whether this issue might be related to #483?


EinBur commented Feb 18, 2021

One more discovery: when I use rsync instead of cp, I can successfully copy files to the vault.
So it's not an issue with the file content; I reckon something sporadic is interfering during the copying.


rfjakob commented May 3, 2021

Hmm, rsync and cp probably use different block sizes.

By the way, what's the NAS model, and do you know which Samba version it runs? Also, can you show which mount options were negotiated for the mount (from the output of mount)?


EinBur commented May 6, 2021

The NAS is a Synology DS920+. Samba on the NAS is version 4.4.16.

Mounting happens via fstab with the parameters cifs vers=3,noauto,rw,x-systemd.automount,uid=1000,credentials=...
The systemd log shows nothing interesting about the mount.

For a few months now I've been using only rsync to fill the vault further, and I check all copied files by creating an md5sum. No checksum creation has failed (in the past, with cp, that was enough to detect a corrupt file, since all blocks are read), and for quite a few files I checked the hash against one from the source drive. All checks were successful.


rfjakob commented Aug 30, 2021

Possibly the same issue, without gocryptfs involved, on a Synology DS218:

https://ptrchm.com/blog/how-i-nearly-lost-all-my-photos/


rfjakob commented Dec 30, 2022

Interesting. restic sees a similar problem on CIFS: restic/restic#2659 (comment):

The pack file ended up with a few 64kb ranges which just contain null bytes instead of the actual data.
