
Zpool shows permanent errors but does not point to any files #16158

Open
develroo opened this issue May 3, 2024 · 17 comments
Labels
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@develroo

develroo commented May 3, 2024

Distribution Name Debian
Distribution Version Testing
Kernel Version 6.6.15-amd64
Architecture amd64
OpenZFS Version

zfs-2.2.3-1
zfs-kmod-2.2.3-1

So I did a regular monthly scrub and it reported permanent errors. I checked the disks' SMART status and found one whose failed-write count was climbing over the threshold. It had not failed completely yet, so I failed the device out of the pool, replaced it with a new disk, and re-ran the scrub. The errors persisted. Running zpool clear did nothing.
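
For reference, the sequence I ran was roughly the following (device names are placeholders, reconstructed from memory):

zpool status -v mediapool-z1                           # scrub results showed the permanent errors
smartctl -a /dev/sdX                                   # SMART showed one disk with write failures over the threshold
zpool offline mediapool-z1 ata-OLD_DISK                # fail the suspect device out of the pool
zpool replace mediapool-z1 ata-OLD_DISK ata-NEW_DISK   # resilver onto the new disk
zpool scrub mediapool-z1                               # re-run the scrub: errors persisted
zpool clear mediapool-z1                               # did nothing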

zpool status -v
  pool: mediapool-z1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 12:37:16 with 3 errors on Thu May  2 21:45:49 2024
config:

	NAME                                          STATE     READ WRITE CKSUM
	mediapool-z1                                  ONLINE       0     0     0
	  raidz1-0                                    ONLINE       0     0     0
	    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N5VUN75D  ONLINE       0     0     0
	    ata-ST3000VN007-2AH16M_ZGY7N50P           ONLINE       0     0     0
	    sdc                                       ONLINE       0     0     0
	    sdd                                       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x931>:<0x348c>

What is going on here, and why can I not reset the error flag if there are indeed no files failing?

Any thoughts?

@develroo develroo added the Type: Defect Incorrect behavior (e.g. crash, hang) label May 3, 2024
@neurotensin

Hi,

I think I have some more data. I suspect the errors are caused by creating and removing snapshots when there is contention. On my system I use zfs-autosnap to create snapshots and syncoid to copy them to target systems (syncoid does not create snapshots itself).

Sometimes syncoid fails because it cannot read the snapshot:

warning: cannot send 'rootvol_nvme######@zfs-auto-snap_frequent-2024-05-03-0600': Input/output error

If I remove the snapshot manually (zfs destroy), syncoid will then run correctly.

I have put a semaphore lock on zfs-autosnap and syncoid (using a file in /tmp/ as a test) in the hope of preventing contention, but clearly something else is going on here. It happens frequently, but the errors can still be cleared with a double scrub.
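
The lock itself is nothing fancy; roughly this, with paths, schedules and job names being illustrative rather than my exact setup:

#!/bin/sh
# zfs-with-lock.sh - wrapper so snapshot creation and syncoid runs never overlap.
# flock(1) blocks until /tmp/zfs-snapshot.lock is free, then runs the wrapped command.
exec flock /tmp/zfs-snapshot.lock "$@"

# crontab entries using the wrapper:
*/15 * * * *  /usr/local/bin/zfs-with-lock.sh zfs-autosnap
*/20 * * * *  /usr/local/bin/zfs-with-lock.sh syncoid tank/data backup@remote:tank/data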

Any ideas out there?

@develroo
Author

develroo commented May 9, 2024

Well, I have cleaned up snapshots, yes, but I never saw an error while doing it. As things stand now, everything seems fine. I opened every file I could just to verify it would open, but still could not trace an error.

Perhaps I did not make myself clear: this persists through multiple scrubs, both before I changed out the failing disk (it had not failed yet, but its write-failure count was over the threshold) and afterwards. I had hoped that a scrub would clear it, but it does not, and neither does zpool clear.

It would help if the feedback given by the command were a little less obtuse. Are these genuine errors, or some metadata mismatch, or something else?

@aerusso
Contributor

aerusso commented May 10, 2024

@develroo Is this an encrypted dataset?

@develroo
Author

@develroo Is this an encrypted dataset?

No. It is a RAIDZ-1 running docker and a VM. In theory, the VM could have caused some sort of locking issue, but honestly the server has been running for many years now, and I have never had this problem before.

If there is any more debug info I can give, just let me know. Would zdb help me drill down to these errors? If so, how? Because this seems a bit like voodoo at this stage.
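
From what I can gather, the two hex numbers are meant to be the objset (dataset) ID and the object ID, so I am guessing something like this would map them back to a dataset and object, though I have not verified it:

zdb -d mediapool-z1 | grep 'ID 2353'       # 0x931 = 2353: find which dataset/snapshot has this objset ID
zdb -dddd mediapool-z1/<dataset> 13452     # 0x348c = 13452: dump that object (type, and path if it still exists)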

Thanks.

@neurotensin

neurotensin commented May 10, 2024

@aerusso for mine, yes, it is encrypted. I described the mechanism above - I suspect it is the interaction between zfs-autosnapshot and syncoid. Each is doing the right thing, but even though I put a lock in place (to prevent them both running simultaneously - I have just checked and expanded it and will report back...), there is clearly some break in atomic behaviour.

These are the system specs:
Distribution Name: Kubuntu
Distribution Version: 24.04 LTS
Kernel Version: 6.8.0-31-lowlatency
Architecture: x86_64
OpenZFS Version:
zfs-2.2.2-0ubuntu9
zfs-kmod-2.2.2-0ubuntu9

@neurotensin

Looking around, I suspect https://github.com/openzfs/zfs/issues/15474 is a related issue. The core problem, I suspect, is a race condition between zfs destroy and zfs send -I: the snapshot is removed between the time the zfs send identifies its range and the time the send actually starts.

In my case there is a lock on snapshot creation (all cron jobs create/remove a lock file in /tmp/), and syncoid (which issues the zfs send) does not run until the lock is given up. The intervals are offset by 5 minutes (15/30/45/60 vs 20/40/60). I should probably make them relatively prime to push out how often they coincide, but you get the idea...
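
Schematically, the race I mean looks like this (snapshot and pool names are made up):

zfs send -I tank/ds@snapA tank/ds@snapD | ssh backup zfs recv backuppool/ds   # the send picks its snapshot range here...
zfs destroy tank/ds@snapB                                                     # ...and if a destroy lands before the send reaches snapB, the send trips over the missing snapshot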

@develroo
Author

Looking around, I suspect https://github.com/openzfs/zfs/issues/15474 is a related issue. The core problem, I suspect, is a race condition between zfs destroy and zfs send -I: the snapshot is removed between the time the zfs send identifies its range and the time the send actually starts.

In my case there is a lock on snapshot creation (all cron jobs create/remove a lock file in /tmp/), and syncoid (which issues the zfs send) does not run until the lock is given up. The intervals are offset by 5 minutes (15/30/45/60 vs 20/40/60). I should probably make them relatively prime to push out how often they coincide, but you get the idea...

Well, that is a theory, I guess, though I was not sending snapshots anywhere. I did batch-delete snapshots, which took about a minute to complete, so in theory it could have been trying to take a snapshot then. Either way, because no actual files are involved, there seems to be no way to clear the error.

Is there any way to find out what the errors actually are and if necessary clear them?

@neurotensin

@aerusso Some more debugging: I have found a relevant zfs event...

May 11 2024 09:20:09.121469698 ereport.fs.zfs.authentication
class = "ereport.fs.zfs.authentication"
ena = 0xf9162f4003200c01
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0x99d944fa950e7d8b
(end detector)
pool = "tank"
pool_guid = 0x99d944fa950e7d8b
pool_state = 0x0
pool_context = 0x0
pool_failmode = "wait"
zio_objset = 0xbc74
zio_object = 0x0
zio_level = 0x0
zio_blkid = 0x1
time = 0x663f7089 0x73d7b02
eid = 0x2a3ff

The zio_objset matches the zero-length error (<0xbc74>:<0x0>) given in the status message. Is this the correct way to interpret it?
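
For what it is worth, this is how I cross-checked the ID (assuming I am reading the fields correctly):

zpool events -v | grep -A1 zio_objset   # pull the objset IDs out of the authentication ereports
printf '%d\n' 0xbc74                    # 48244 - decimal form of the objset ID
zdb -d tank | grep 'ID 48244'           # see which dataset or snapshot carries that objset ID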

@neurotensin

zpool events -v

@develroo
Author

develroo commented May 11, 2024

Hmm, well, I ran this:

 zdb -c mediapool-z1

Traversing all blocks to verify metadata checksums and verify nothing leaked ...

loading concrete vdev 0, metaslab 173 of 174 ...
8.35T completed ( 847MB/s) estimated time remaining: 0hr 00min 00sec          
	No leaks (block sum matches space maps exactly)

	bp count:              52531792
	ganged count:           1892148
	bp logical:       6697248716800      avg: 127489
	bp physical:      6646730882560      avg: 126527     compression:   1.01
	bp allocated:     9185567629312      avg: 174857     compression:   0.73
	bp deduped:                   0    ref>1:      0   deduplication:   1.00
	bp cloned:                    0    count:      0
	Normal class:     9185563574272     used: 77.26%
	Embedded log class        1695744     used:  0.00%

	additional, non-pointer bps of type 0:     141716
	Dittoed blocks on same vdev: 1337489

space map refcount mismatch: expected 225 != actual 189

zpool events -v just listed a whole lot of snapshots with no reference to the above numbers.

zpool events -v | grep 0x43ad

Anyone know how I can drill down to the errors in the first comment?

EDIT:

Hey.. I just noticed the numbers have changed?!

root@zfsforn:~# zpool status -v 
  pool: mediapool-z1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 12:48:20 with 3 errors on Fri May 10 09:21:10 2024
config:

	NAME                                          STATE     READ WRITE CKSUM
	mediapool-z1                                  ONLINE       0     0     0
	  raidz1-0                                    ONLINE       0     0     0
	    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N5VUN75D  ONLINE       0     0     6
	    ata-ST3000VN007-2AH16M_ZGY7N50P           ONLINE       0     0     6
	    sdc                                       ONLINE       0     0     6
	    sdd                                       ONLINE       0     0     6

errors: Permanent errors have been detected in the following files:

        <0x43ad>:<0x348c>

@neurotensin

zdb -c tank
zdb_blkptr_cb: Got error 52 reading <85099, 1204, 1, 0> -- skipping
zdb_blkptr_cb: Got error 52 reading <85099, 0, 1, 2> -- skipping
err == ENOENT (0x34 == 0x2)
ASSERT at module/zfs/dsl_dataset.c:383:load_zfeature()
Aborted (core dumped)

FYI - system is still accessible so the core dump was not in the kernel.

@develroo
Author

Yes... but I am asking about my pool. I don't think your issue is related to mine, because you have actual file handles to look at, whereas I do not.

@danielb2

Also experiencing the above.

errors: Permanent errors have been detected in the following files:

        <0x1de80>:<0x0>
        <0x143fc>:<0x0>
        <0x143ff>:<0x0>

I've been using syncoid to make backups with snapshots.

@develroo
Author

OK, I am adding another post here, because this is officially weird behaviour.

So I did another scrub yesterday and the errors were still there (though each time I scrubbed and checked, the numbers would change?). I then offlined and onlined the disks one at a time, letting each one resilver. Now the error has cleared, and I am back to a clean pool?

zpool status -v
  pool: mediapool-z1
 state: ONLINE
  scan: resilvered 56.7G in 00:45:19 with 0 errors on Sun May 26 12:30:35 2024
config:

	NAME                                          STATE     READ WRITE CKSUM
	mediapool-z1                                  ONLINE       0     0     0
	  raidz1-0                                    ONLINE       0     0     0
	    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N5VUN75D  ONLINE       0     0     0
	    ata-ST3000VN007-2AH16M_ZGY7N50P           ONLINE       0     0     0
	    sdc                                       ONLINE       0     0     0
	    sdd                                       ONLINE       0     0     0

errors: No known data errors

So I have no idea what happened or how I fixed it, really. But maybe this will help someone else?
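
For anyone who wants to repeat this, the per-disk sequence was roughly as follows (device name is a placeholder; I waited for each resilver to finish before moving on):

zpool offline mediapool-z1 ata-DISK    # take one member offline
zpool online mediapool-z1 ata-DISK     # bring it back; a resilver starts automatically
zpool status mediapool-z1              # wait until the resilver completes, then do the next disk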

@danielb2

danielb2 commented May 26, 2024

The fix for me was to initiate a scrub and then stop it with zpool scrub -s <pool>. A subsequent scrub did not show any errors anymore.
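
In full, what I did was roughly this (pool name is a placeholder):

zpool scrub tank       # start a scrub
zpool scrub -s tank    # stop it shortly afterwards
zpool scrub tank       # the next full scrub no longer listed the errors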

@develroo
Author

Hmm interesting.

@GregorKopka
Contributor

Maybe related to #16147?
