Zpool shows permanent errors but does not point to any files #16158
Hi, I think I have some more data. I believe the errors are caused by creating and removing snapshots under contention. On my system, zfs-autosnap creates snapshots and syncoid copies them to target systems (it does not create any). Sometimes syncoid fails because it cannot read a snapshot: warning: cannot send 'rootvol_nvme######@zfs-auto-snap_frequent-2024-05-03-0600': Input/output error. If I remove the snapshot manually (zfs destroy), syncoid then runs correctly. I have put a semaphore lock around zfs-autosnap and syncoid (using a file in /tmp/ as a test) in the hope of preventing contention, but clearly something else is going on here. It happens frequently, but can still be cleared with a double scrub. Any ideas out there?
Well, I have cleaned up snapshots, yes, but I never saw an error while doing it. As it is now, everything seems fine. I opened every file I could just to verify it would open, but still could not trace an error. Did I not make myself clear? This persists through multiple scrubs, both before I swapped out the failing disk (it had not failed yet, but its write-failure count was climbing past the threshold) and afterwards. I had hoped that a scrub would clear it, but it did not, and neither does zpool clear. It would help if the feedback from the command were a little less obtuse. Are these genuine errors, or some metadata mismatch?
@develroo Is this an encrypted dataset?
No. It is a RAIDZ-1 running docker and a VM. In theory, the VM could have caused some sort of locking issue, but honestly the server has been running for many years now, and I have never had this problem before. If there is any more debug info I can give, just let me know. Would zdb help me drill down into the underlying errors? In which case, how? This seems a bit like voodoo at this stage. Thanks.
@aerusso for mine, yes, it is encrypted. I described the mechanism above: I suspect it is the interaction between zfs-autosnapshot and syncoid. Each is doing the right thing, but even though I put a lock in place (to prevent them both running simultaneously; I just checked and expanded it, and will report back), there is clearly some break in atomic behavior. These are the system specs:
Looking around, I suspect https://github.com/openzfs/zfs/issues/15474 is a related issue. The core problem, I suspect, is a race between zfs destroy and a zfs send -I running at the same time: the snapshot is removed between the time zfs send identifies the range and the time the send actually starts. In my case, all cron jobs create/remove a lock file in /tmp/, and syncoid (which issues the zfs send) does not run until the lock is given up. The snapshot schedule is every 5 minutes (15/30/45/60) vs (20/40/60) for the send. I should probably make the intervals relatively prime to push out the repeat sequence, but you get the idea.
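For what it's worth, the /tmp semaphore approach described above can be sketched with flock(1), which makes the mutual exclusion atomic at the kernel level. This is a hypothetical sketch, not taken from the thread: the lock path is an assumption, and SNAP_CMD/SEND_CMD are placeholders standing in for the real zfs-auto-snapshot and syncoid invocations in the two cron jobs.

```shell
# Shared lock file used by both cron jobs; flock(1) creates it if missing.
LOCKFILE="${TMPDIR:-/tmp}/zfs-replication.lock"

# Placeholders for the real commands (assumptions, not from the thread).
SNAP_CMD=${SNAP_CMD:-"echo snapshot-done"}
SEND_CMD=${SEND_CMD:-"echo send-done"}

# -w 300: give up after 5 minutes instead of queueing jobs forever.
# In the snapshot cron job:
flock -w 300 "$LOCKFILE" sh -c "$SNAP_CMD"
# In the syncoid cron job:
flock -w 300 "$LOCKFILE" sh -c "$SEND_CMD"
```

Because both jobs contend on the same lock file, a destroy during snapshot pruning can no longer overlap the window between zfs send enumerating its range and actually starting, provided every job that touches the snapshots goes through the lock.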
Well, that is a theory, I guess, though I was not sending snapshots anywhere. I did batch-delete snapshots, which took about a minute to complete, so in theory a snapshot could have been taken during that window. Either way, because no actual files are involved, there seems to be no way to clear the error. Is there any way to find out what the errors actually are and, if necessary, clear them?
@aerusso some more debugging; I have found a relevant zfs event: May 11 2024 09:20:09.121469698 ereport.fs.zfs.authentication. The zio_objset matches the zero-length (<0xbc74>:<0x0>) error given in the message. Is this the correct way to interpret it?
Hmm, well, I ran zpool events -v.
Anyone know how I can drill down to the errors in the first comment? EDIT: Hey.. I just noticed the numbers have changed?!
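One way to drill down on a nameless <0xbc74>:<0x0> entry (a sketch, assuming a pool named tank): the first hex number is an objset id, which can be compared against the read-only objsetid dataset property. If the objset belonged to a since-destroyed snapshot or dataset, nothing will match, which would explain why no filename is ever shown.

```shell
# The first hex number in the <0xbc74>:<0x0> tuple is the objset id;
# convert it to decimal for comparison.
OBJSET=$((0xbc74))
echo "$OBJSET"   # decimal form of 0xbc74

# Match it against the read-only 'objsetid' property of each dataset.
# Guarded so the sketch is harmless on machines without zfs installed;
# the pool name 'tank' is an assumption.
if command -v zfs >/dev/null 2>&1; then
    zfs list -r -o name,objsetid tank | awk -v id="$OBJSET" '$2 == id'
fi
```

If the lookup returns nothing, the errored object most likely lived in a dataset or snapshot that no longer exists, consistent with the destroy-during-send race discussed earlier in the thread.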
zdb -c tank. FYI, the system is still accessible, so the core dump was not in the kernel.
Yes, but I am asking about my pool. I don't think your issue is related to mine, because you have actual file handles to look at, whereas I do not.
Also experiencing the above.
I've been using syncoid to make backups with snapshots.
OK, I am adding another post here, because this is officially weird behaviour. I did another scrub yesterday and the errors were still there (though each time I did a scrub and checked, the numbers would change?). So I off-lined and then on-lined the disks, one at a time, letting each resilver. Now the error has cleared, and I am back to a clean pool?
So I have no idea what happened or how I fixed it, really. But maybe this will help someone else?
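The offline/online sequence described above might be reconstructed roughly as follows. This is a sketch, not the commenter's actual script: the pool and device names are placeholders, and the script defaults to printing the commands (DRYRUN=1) rather than running them.

```shell
# DRYRUN=1 (the default here) only prints each command; unset it to
# execute on a real system.
DRYRUN=${DRYRUN-1}
run() { [ -n "$DRYRUN" ] && echo "+ $*" || "$@"; }

POOL=tank                      # placeholder pool name
for dev in sda sdb sdc; do     # placeholder device names
    run zpool offline "$POOL" "$dev"
    run zpool online "$POOL" "$dev"
    # Onlining a previously offlined disk triggers a resilver; wait for it
    # before touching the next disk (zpool wait needs OpenZFS >= 2.0).
    run zpool wait -t resilver "$POOL"
done
run zpool status -v "$POOL"
```

Doing one disk at a time matters on RAIDZ-1: offlining two disks at once would leave the pool without redundancy.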
The fix for me was to initiate a scrub and then stop it with |
Hmm, interesting.
Maybe related to #16147?
Distribution Name: Debian
Distribution Version: Testing
Kernel Version: 6.6.15-amd64
Architecture: amd64
OpenZFS Version: zfs-2.2.3-1, zfs-kmod-2.2.3-1
So I did a regular monthly scrub and it reported permanent errors. I checked the disks' SMART status and found one whose missed-write count was mounting over the threshold. It had not failed completely yet, so I failed the device, replaced it with a new disk, and re-ran the scrub. The errors persisted. Running
zpool clear
did nothing. What is going on here, and why can I not reset the error flag if there are indeed no failing files?
Any thoughts?
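A possible explanation for zpool clear appearing to do nothing: the persistent error log appears to retain entries from both the previous and the current scrub pass, so entries typically disappear only after two consecutive clean scrubs, which would match the "double scrub" observation elsewhere in the thread. A sketch, assuming a pool named tank; it defaults to printing the commands (DRYRUN=1) rather than running them.

```shell
# DRYRUN=1 (the default here) only prints each command; unset it to
# execute on a real system.
DRYRUN=${DRYRUN-1}
run() { [ -n "$DRYRUN" ] && echo "+ $*" || "$@"; }

run zpool scrub tank
run zpool wait -t scrub tank   # block until done (zpool wait needs OpenZFS >= 2.0)
run zpool scrub tank
run zpool wait -t scrub tank
run zpool status -v tank       # error list should be empty if both scrubs were clean
```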