ZFS Hangs After Snapshot Removal on Pool With Corruption #16145

Open
Ghan04 opened this issue Apr 29, 2024 · 3 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@Ghan04

Ghan04 commented Apr 29, 2024

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  22.04
Kernel Version        5.15.0-105
Architecture          x86_64
OpenZFS Version       2.1.5

Describe the problem you're observing

I have a mirror pool experiencing some data corruption. It appears to have been caused by a PCIe issue during writes that garbled a few blocks, so they can't be retrieved from either disk. Scrubs turn up only a handful of checksum errors.
The corrupted file is a qcow2 image for a KVM VM, so I thought I might be able to find the damage inside the guest, fix it there, and perhaps the unallocated block(s) would fall off in a later scrub and everything would be fine.
As part of this cleanup, I also removed some old snapshots from the dataset in question. After doing the above, I tried to run a scrub. That process hung. See the errors below.

Describe how to reproduce the problem

I don't know how to reproduce this. The assert line seems to point at a block whose offset is not aligned to the vdev's ashift, but I have no idea how that could have happened.

Include any warning/errors/backtraces from the system logs

VERIFY3(0 == P2PHASE(offset, 1ULL << vd->vdev_ashift)) failed (0 == 512)
PANIC at metaslab.c:5341:metaslab_free_concrete()
[143550.726760] INFO: task txg_sync:1402 blocked for more than 724 seconds.
[143550.726784]       Tainted: P           O      5.15.0-105-generic #115-Ubuntu
[143550.726799] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[143550.726813] task:txg_sync        state:D stack:    0 pid: 1402 ppid:     2 flags:0x00004000
[143550.726818] Call Trace:
[143550.726820]  <TASK>
[143550.726824]  __schedule+0x24e/0x590
[143550.726834]  schedule+0x69/0x110
[143550.726837]  spl_panic+0xe7/0xe9 [spl]
[143550.726848]  ? range_tree_stat_incr+0x2d/0x50 [zfs]
[143550.726946]  ? range_tree_add_impl+0x3cf/0x610 [zfs]
[143550.727023]  metaslab_free_concrete+0x226/0x270 [zfs]
[143550.727100]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
[143550.727170]  metaslab_free_impl+0xb3/0xf0 [zfs]
[143550.727253]  metaslab_free_dva+0x61/0x80 [zfs]
[143550.727324]  metaslab_free+0x114/0x1d0 [zfs]
[143550.727397]  zio_free_sync+0xf1/0x110 [zfs]
[143550.727503]  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
[143550.727587]  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
[143550.727659]  bpobj_iterate_blkptrs+0xf9/0x380 [zfs]
[143550.727728]  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
[143550.727800]  bpobj_iterate_impl+0x23b/0x390 [zfs]
[143550.727872]  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
[143550.727943]  bpobj_iterate+0x17/0x20 [zfs]
[143550.728010]  dsl_process_async_destroys+0x2d5/0x580 [zfs]
[143550.728082]  dsl_scan_sync+0x1ec/0x910 [zfs]
[143550.728154]  ? ddt_sync+0xa8/0xd0 [zfs]
[143550.728225]  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
[143550.728312]  spa_sync+0x2dc/0x5b0 [zfs]
[143550.728388]  txg_sync_thread+0x266/0x2f0 [zfs]
[143550.728480]  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
[143550.728560]  thread_generic_wrapper+0x64/0x80 [spl]
[143550.728569]  ? __thread_exit+0x20/0x20 [spl]
[143550.728575]  kthread+0x12a/0x150
[143550.728579]  ? set_kthread_struct+0x50/0x50
[143550.728582]  ret_from_fork+0x22/0x30
[143550.728587]  </TASK>

I've tried booting to rescue mode but the same thing happens when I attempt to import the pool. Is there any way to recover the data? It seems like the data on disk should still be good aside from this one place where it is encountering a mismatch in the ashift value, or is the entire vdev toast?

Ghan04 added the Type: Defect label Apr 29, 2024
@rincebrain
Contributor

Specifically, the error means "the offset we are trying to free is not a valid offset on this vdev" - e.g. if you had an ashift 12 (4k) vdev, and tried to free something that was 512b into it, you'd trip this.
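
As an illustration (not part of the original report): P2PHASE(offset, 1 << ashift) is simply the offset modulo the vdev's allocation size, so any non-zero remainder trips the VERIFY. With ashift=12 (4k) and a free landing 512 bytes into a block, shell arithmetic reproduces the number in the panic:

echo $(( (4096 + 512) & ((1 << 12) - 1) ))    # prints 512, matching "failed (0 == 512)" above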

You might be able to import the pool read-only (since a read-only import won't try to process the pending frees or reclaim space) and yank your data off that way.
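
A minimal sketch of what such a read-only import could look like, assuming the pool is named "tank" (a placeholder) and is currently exported; readonly=on keeps the pending frees from being processed, and -N imports without mounting so datasets can be mounted and copied off one at a time:

zpool import -o readonly=on -N tank
zfs mount tank/somedataset                          # hypothetical dataset name; mounts will be read-only
rsync -a /tank/somedataset/ /path/to/other/storage/ # placeholder destination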

@Ghan04
Author

Ghan04 commented Apr 29, 2024

I did try a read-only import but it threw the same error message and hung again. I just have the one vdev and it is 4k ashift. Would this be pointing at a single block's offset or is this something about the metadata of the entire vdev?

Edit: I've also tried

echo 1 >> /sys/module/zfs/parameters/zfs_recover

Followed by a read-only import with no luck. Same panic error.
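
For completeness, a hedged sketch of making that parameter persistent (Ubuntu paths assumed), since it only has an effect if it is already set when the import is attempted:

echo "options zfs zfs_recover=1" > /etc/modprobe.d/zfs-recover.conf
update-initramfs -u     # rebuild the initramfs so the option also applies to imports attempted at boot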

@rincebrain
Contributor

You might be able to use the zdb feature that's in git to emit send streams from pools too damaged to import normally, and get the datasets off that way.
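
A hedged sketch of what that might look like, assuming a zdb built from git (or a 2.2-era release) that has the -B/--backup option, which 2.1.5 does not; "tank", the objset ID, and the receive target are placeholders, and -e opens the exported pool without importing it:

zdb -e -d tank                              # list datasets and their objset IDs
zdb -e -B tank/54 > dataset54.zstream       # emit a send stream for the dataset with objset ID 54
zfs receive otherpool/recovered < dataset54.zstream   # replay it onto a healthy pool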

I believe it's the case where ZFS thinks it has an entry to free from a metaslab, then finds the metaslab is (e.g.) 4k aligned while the entry being freed sits 512b into it, or something like that. So it's a specific object, but one that's already pending removal in ZFS's bookkeeping.

You could try a read-only import at an older txg, using one of the older txgs still listed in the uberblocks; that might or might not fly. You could also use zdb to "simulate" the operation and see whether it's going to panic, rather than letting your kernel find out.
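
A rough sketch of that approach; the device path, pool name, and txg are placeholders, and the zpool import -T (rewind-to-txg) option exists in the source but is not documented in every release, so treat it as an assumption to verify first:

zdb -lu /dev/disk/by-id/DISK-part1                # dump vdev labels plus the uberblock ring with their txgs
zdb -e -t 1234567 -d tank                         # "simulate" by opening the pool at that txg in userspace
zpool import -o readonly=on -N -T 1234567 tank    # only if the zdb open looks sane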
