Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pool reports ZFS-8000-8A after zfs send but then clears #16147

Open
gaykitty opened this issue May 1, 2024 · 4 comments
Open

Pool reports ZFS-8000-8A after zfs send but then clears #16147

gaykitty opened this issue May 1, 2024 · 4 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@gaykitty
Copy link

gaykitty commented May 1, 2024

System information

Type Version/Name
Distribution Name NixOS
Distribution Version 23.11
Kernel Version 6.7.10
Architecture x84_64
OpenZFS Version 2.2.3-1

Describe the problem you're observing

When trying to send particular snapshots (as determined by syncoid), the sending pool will start reporting ZFS-8000-8A, with an empty file list. But then there error clears on it own after 5-30 seconds and multiple runs of zpool status -v. The output of zfs send is also invalid.

Describe how to reproduce the problem

Happens every time, don't know how it got in this state.

Include any warning/errors/backtraces from the system logs

zfs status output after zfs send:

  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:14 with 0 errors on Tue Apr 30 21:01:52 2024
config:

	NAME                          STATE     READ WRITE CKSUM
	data                          ONLINE       0     0     0
	 mirror-0                    ONLINE       0     0     0
	   scsi-0HC_Volume_30264525  ONLINE       0     0     0
	   scsi-0HC_Volume_31118548  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

zfs status output after 5-30 seconds and multiple runs of zpool status -v:

  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 00:03:14 with 0 errors on Tue Apr 30 21:01:52 2024
config:

	NAME                          STATE     READ WRITE CKSUM
	data                          ONLINE       0     0     0
	 mirror-0                    ONLINE       0     0     0
	   scsi-0HC_Volume_30264525  ONLINE       0     0     0
	   scsi-0HC_Volume_31118548  ONLINE       0     0     0

errors: No known data errors

Output of zfs send -I 'data/postgres@autosnap_2024-04-30_00:00:02_daily' 'data/postgres@autosnap_2024-05-01_01:00:13_hourly' | zstreamdump:

BEGIN record
	hdrtype = 2
	features = 0
	magic = 2f5bacbac
	creation_time = 0
	type = 0
	flags = 0x0
	toguid = 0
	fromguid = 0
	toname = data/postgres@autosnap_2024-05-01_01:00:13_hourly
	payloadlen = 0
END checksum = 4e3d8c4e3/13074827678/2629d7a733f5/34974e936a868
BEGIN record
	hdrtype = 1
	features = 4
	magic = 2f5bacbac
	creation_time = 66304295
	type = 2
	flags = 0xc
	toguid = 952f620822cc13b3
	fromguid = 18d0c7688a7b650d
	toname = data/postgres@autosnap_2024-04-30_01:00:05_hourly
	payloadlen = 0
warning: cannot send 'data/postgres@autosnap_2024-04-30_01:00:05_hourly': Input/output error
BEGIN record
	hdrtype = 1
	features = 4
	magic = 2f5bacbac
	creation_time = 663050a2
	type = 2
	flags = 0xc
	toguid = 8e478c81e645c5c2
	fromguid = 952f620822cc13b3
	toname = data/postgres@autosnap_2024-04-30_02:00:02_hourly
	payloadlen = 0
invalid checksum
Incorrect checksum in record header.
Expected checksum = f9770daf2/9c3131f152a/35dc04be10236/d38c2d9e9ccd71
SUMMARY:
	Total DRR_BEGIN records = 3 (0 bytes)
	Total DRR_END records = 1 (0 bytes)
	Total DRR_OBJECT records = 0 (0 bytes)
	Total DRR_FREEOBJECTS records = 0 (0 bytes)
	Total DRR_WRITE records = 0 (0 bytes)
	Total DRR_WRITE_BYREF records = 0 (0 bytes)
	Total DRR_WRITE_EMBEDDED records = 0 (0 bytes)
	Total DRR_FREE records = 0 (0 bytes)
	Total DRR_SPILL records = 0 (0 bytes)
	Total records = 4
	Total payload size = 0 (0x0)
	Total header overhead = 1248 (0x4e0)
	Total stream length = 1560 (0x618)
@gaykitty gaykitty added the Type: Defect Incorrect behavior (e.g. crash, hang) label May 1, 2024
@rincebrain
Copy link
Contributor

Can you share the output of zfs get all on data/postgres, otherwise it's going to be rather difficult to try and reproduce what's going on here.

@gaykitty
Copy link
Author

gaykitty commented May 1, 2024

@rincebrain here ya go:

NAME           PROPERTY              VALUE                            SOURCE
data/postgres  type                  filesystem                       -
data/postgres  creation              Mon Apr 10 23:15 2023            -
data/postgres  used                  221M                             -
data/postgres  available             25.5G                            -
data/postgres  referenced            117M                             -
data/postgres  compressratio         1.71x                            -
data/postgres  mounted               yes                              -
data/postgres  quota                 none                             default
data/postgres  reservation           none                             default
data/postgres  recordsize            8K                               local
data/postgres  mountpoint            /var/lib/postgresql/             local
data/postgres  sharenfs              off                              default
data/postgres  checksum              on                               default
data/postgres  compression           on                               default
data/postgres  atime                 on                               default
data/postgres  devices               on                               default
data/postgres  exec                  on                               default
data/postgres  setuid                on                               default
data/postgres  readonly              off                              default
data/postgres  zoned                 off                              default
data/postgres  snapdir               hidden                           default
data/postgres  aclmode               discard                          default
data/postgres  aclinherit            restricted                       default
data/postgres  createtxg             174                              -
data/postgres  canmount              on                               local
data/postgres  xattr                 on                               default
data/postgres  copies                1                                default
data/postgres  version               5                                -
data/postgres  utf8only              off                              -
data/postgres  normalization         none                             -
data/postgres  casesensitivity       sensitive                        -
data/postgres  vscan                 off                              default
data/postgres  nbmand                off                              default
data/postgres  sharesmb              off                              default
data/postgres  refquota              none                             default
data/postgres  refreservation        none                             default
data/postgres  guid                  970173609522556085               -
data/postgres  primarycache          metadata                         local
data/postgres  secondarycache        all                              default
data/postgres  usedbysnapshots       103M                             -
data/postgres  usedbydataset         117M                             -
data/postgres  usedbychildren        0B                               -
data/postgres  usedbyrefreservation  0B                               -
data/postgres  logbias               throughput                       local
data/postgres  objsetid              274                              -
data/postgres  dedup                 off                              default
data/postgres  mlslabel              none                             default
data/postgres  sync                  standard                         default
data/postgres  dnodesize             legacy                           default
data/postgres  refcompressratio      2.02x                            -
data/postgres  written               1.64M                            -
data/postgres  logicalused           351M                             -
data/postgres  logicalreferenced     225M                             -
data/postgres  volmode               default                          default
data/postgres  filesystem_limit      none                             default
data/postgres  snapshot_limit        none                             default
data/postgres  filesystem_count      none                             default
data/postgres  snapshot_count        none                             default
data/postgres  snapdev               hidden                           default
data/postgres  acltype               off                              default
data/postgres  context               none                             default
data/postgres  fscontext             none                             default
data/postgres  defcontext            none                             default
data/postgres  rootcontext           none                             default
data/postgres  relatime              on                               default
data/postgres  redundant_metadata    all                              default
data/postgres  overlay               on                               default
data/postgres  encryption            aes-256-gcm                      -
data/postgres  keylocation           file:///var/secrets/zfs.key      local
data/postgres  keyformat             raw                              -
data/postgres  pbkdf2iters           0                                default
data/postgres  encryptionroot        data/postgres                    -
data/postgres  keystatus             available                        -
data/postgres  special_small_blocks  0                                default
data/postgres  snapshots_changed     Wed May  1 15:00:06 2024         -
data/postgres  nixos:shutdown-time   Mon Mar 18 12:20:20 AM EDT 2024  inherited from data

@rincebrain
Copy link
Contributor

rincebrain commented May 1, 2024

My assumption would be that something is erroring on decryption during send, since it's not using -w, and somehow because it's transient it's vanishing almost immediately, but the error is still breaking your send stream.

zpool events -v might know what it was that it burped on, though probably not why. (My guess would be something strange like it's hitting some metadata object where DVA[0] fails to decrypt but DVA[1] does and somehow native encryption is screwing up error handling so trying DVA[1] doesn't stop it from returning an error? Not sure, I've not seen this reproduce locally, but people have occasionally reported strange hiccups from native encryption that made me wonder if something strange was going on even after 2163cde.)

@gaykitty
Copy link
Author

gaykitty commented May 1, 2024

I was just poking at my system and noticed that a NixOS update had downgraded the kernel version to 6.6.29, but I didn't notice and hadn't rebooted. So I rebooted by my system. After the reboot my pool showed errors like before. After 2 scrubs the errors cleared and doing the zfs send from before no longer produces the bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

2 participants