
zpool clear heroically dies restoring the pool, but doesn't save your previously queued IO #16175

Open
rincebrain opened this issue May 7, 2024 · 3 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@rincebrain
Contributor

System information

Type Version/Name
Distribution Name Debian
Distribution Version 11
Kernel Version 5.10.0-28-amd64
Architecture x86_64
OpenZFS Version 04bae5e

Describe the problem you're observing

If I have a USB hard drive that I created a pool on, using the name /dev/disk/by-id/mydrive-part1, pointing to /dev/sda1, unplug it, and replug it, ZFS will say it's SUSPENDED because the disk is gone.

Great. The device came back as sdb, since sda was still being held open when the disk reappeared. I zpool clear that device (by guid, since the device's name won't work, which is a separate complaint), and after one or two iterations of zpool clear pool or zpool clear pool guid it marks the pool and device ONLINE, hurray.

...except the clear command never returns, IO from before the hotplug is hung forever, the old device name is still present (e.g. something is holding references to it still), and the stacktrace is in zio_resume.

New writes to the pool appear to succeed, but old IO is stuck forever. zfs_ioc_clear is blocked on zio_resume, which is blocked on zio_wait, and the txg_sync thread and cp say they're blocked on:

[  605.497469] INFO: task txg_sync:3137 blocked for more than 241 seconds.
[  605.498342]       Tainted: P           OE     5.10.0-28-amd64 #1 Debian 5.10.209-2
[  605.499169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.499995] task:txg_sync        state:D stack:    0 pid: 3137 ppid:     2 flags:0x00004000
[  605.500846] Call Trace:
[  605.501692]  __schedule+0x282/0x870
[  605.502530]  schedule+0x46/0xb0
[  605.503371]  schedule_timeout+0x8b/0x150
[  605.504213]  ? __next_timer_interrupt+0x110/0x110
[  605.505049]  io_schedule_timeout+0x4c/0x80
[  605.505905]  __cv_timedwait_common+0x12f/0x170 [spl]
[  605.506761]  ? add_wait_queue_exclusive+0x70/0x70
[  605.507623]  __cv_timedwait_io+0x15/0x20 [spl]
[  605.508542]  zio_wait+0x13c/0x2d0 [zfs]
[  605.509455]  dsl_pool_sync+0xf5/0x530 [zfs]
[  605.510354]  spa_sync+0x55a/0x1000 [zfs]
[  605.511175]  ? mutex_lock+0xe/0x30
[  605.512041]  ? spa_txg_history_init_io+0x105/0x110 [zfs]
[  605.512917]  txg_sync_thread+0x212/0x3b0 [zfs]
[  605.513789]  ? txg_quiesce_thread+0x330/0x330 [zfs]
[  605.514607]  thread_generic_wrapper+0x6c/0x80 [spl]
[  605.515409]  ? spl_taskq_fini+0x80/0x80 [spl]
[  605.516199]  kthread+0x118/0x140
[  605.516954]  ? __kthread_bind_mask+0x60/0x60
[  605.517718]  ret_from_fork+0x1f/0x30
[  605.518492] INFO: task cp:3254 blocked for more than 241 seconds.
[  605.519266]       Tainted: P           OE     5.10.0-28-amd64 #1 Debian 5.10.209-2
[  605.520061] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.520870] task:cp              state:D stack:    0 pid: 3254 ppid:  3253 flags:0x00004004
[  605.521702] Call Trace:
[  605.522523]  __schedule+0x282/0x870
[  605.523339]  ? __schedule+0x28a/0x870
[  605.524151]  schedule+0x46/0xb0
[  605.524958]  io_schedule+0x42/0x70
[  605.525766]  cv_wait_common+0xac/0x130 [spl]
[  605.526561]  ? add_wait_queue_exclusive+0x70/0x70
[  605.527424]  txg_wait_synced_impl+0xcd/0x120 [zfs]
[  605.528289]  txg_wait_synced+0xc/0x40 [zfs]
[  605.529159]  dmu_tx_wait+0x209/0x400 [zfs]
[  605.530022]  dmu_tx_assign+0x16b/0x4e0 [zfs]
[  605.530880]  zfs_write+0x4be/0xd10 [zfs]
[  605.531687]  ? xas_load+0x5/0x80
[  605.532475]  ? current_time+0x2b/0xf0
[  605.533251]  ? touch_atime+0x33/0x140
[  605.534069]  zpl_iter_write+0x103/0x170 [zfs]
[  605.534839]  new_sync_write+0x119/0x1b0
[  605.535590]  vfs_write+0x1ce/0x260
[  605.536337]  ksys_write+0x5f/0xe0
[  605.537065]  do_syscall_64+0x30/0x80
[  605.537783]  entry_SYSCALL_64_after_hwframe+0x62/0xc7

So I assume it's some dance where nothing told the old zios that their reference to the disk they're trying to write to is now stale. (zpool reopen doesn't save you.)

Describe how to reproduce the problem

Above.
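For concreteness, the steps from the description above can be sketched as a shell transcript. The pool name, file paths, and vdev guid placeholder here are hypothetical, and two of the steps are physical unplug/replug actions, so this is a sketch of the sequence rather than something to run verbatim:

```shell
# Sketch of the reproduction from the description above. Pool name,
# paths, and <vdev-guid> are hypothetical placeholders.

# 1. Create a pool on the USB drive via its stable by-id name
#    (which currently resolves to /dev/sda1).
zpool create testpool /dev/disk/by-id/mydrive-part1

# 2. Start IO to the pool, then physically unplug the drive while
#    the writes are in flight.
cp /some/large/file /testpool/ &

# (physically unplug the drive here)
zpool status testpool          # pool now reports SUSPENDED

# 3. Replug the drive. Because the old /dev/sda is still held open,
#    the kernel assigns the disk a new name, e.g. /dev/sdb.

# 4. Clear the errors by guid, since clearing by device name fails.
#    After one or two iterations the pool and vdev show ONLINE...
zpool clear testpool
zpool clear testpool <vdev-guid>

# ...but the clear never returns, and the IO queued before the
# hotplug (the cp above) hangs forever in txg_wait_synced.
```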

Include any warning/errors/backtraces from the system logs

Above.

@rincebrain rincebrain added the Type: Defect Incorrect behavior (e.g. crash, hang) label May 7, 2024
@GregorKopka
Contributor

As a dirty hack: what about making zpool clear generally re-queue all in-flight ZIO for the pool?

@rincebrain
Contributor Author

rincebrain commented May 30, 2024

It turns out to be very messy and buggy to do that, when I tried it. (Also, I think some IOs aren't necessarily safe to requeue - you'd basically need to reissue the entire txg's worth of IOs to the device. e.g. if you had more than one queue's worth of IOs to issue, and the device went away while they were in the write cache, whether they actually landed depends on how "away" the device went...and you might no longer have them in queue to issue, since nominally it claimed they went out.)

It's also somewhat difficult to test, since Linux's behavior on hot-removing a device in use is not...always consistent, and you're not always at the same point in the IO pipeline when you do it.

@GregorKopka
Copy link
Contributor

GregorKopka commented Jun 1, 2024 via email
