
PANIC at zap.c:441:zap_create_leaf #16157

Closed
snajpa opened this issue May 3, 2024 · 18 comments · Fixed by #16204
Labels
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

snajpa (Contributor) commented May 3, 2024

System information

Type Version/Name
Distribution Name NixOS
Distribution Version custom
Kernel Version 6.8.8
Architecture x86-64
OpenZFS Version 8f1b7a6

Describe the problem you're observing

One of our nodes hit the following VERIFY3 assertion:

VERIFY3(NULL == dmu_buf_set_user(l->l_dbuf, &l->l_dbu)) failed (0000000000000000 == ffff91dfeb60e200)
PANIC at zap.c:441:zap_create_leaf()
Showing stack for process 2470499
CPU: 10 PID: 2470499 Comm: python3 Tainted: G           OE      6.8.8 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7525/0PYVT1, BIOS 2.5.6 10/06/2021
In memory cgroup /osctl/pool.tank/group.default/user.2739/ct.19681/user-owned/lxc.payload.19681/docker/3bd6008a375b00b30cb2fccbf7dcd2e32c081094425b4c8603c16fa09743d08a
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 spl_panic+0x100/0x120 [spl]
 zap_expand_leaf+0x4ec/0x520 [zfs]
 fzap_add_cd+0xf9/0x230 [zfs]
 fzap_add+0x47/0xd0 [zfs]
 zap_add_impl+0x15e/0x330 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? zap_add+0x5c/0xd0 [zfs]
 zfs_link_create+0x189/0x500 [zfs]
 zfs_create+0x75f/0xa00 [zfs]
 zpl_create+0xd0/0x1e0 [zfs]
 path_openat+0xe9c/0x1180
 do_filp_open+0xb3/0x160
 do_sys_openat2+0xab/0xe0
 __x64_sys_openat+0x6e/0xa0
 do_syscall_64+0xab/0x1b0
 entry_SYSCALL_64_after_hwframe+0x79/0x81
RIP: 0033:0x7f9df8ef9f01
Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d ea 26 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 93 00 00 00 48 8b 54 24 28 64 48 2b 14 25
RSP: 002b:00007ffe2a11f4d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000080241 RCX: 00007f9df8ef9f01
RDX: 0000000000080241 RSI: 00007f9df8a61d00 RDI: 00000000ffffff9c
RBP: 00007f9df8a61d00 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000202 R12: 00007f9df8e00fc0
R13: 00000000ffffffff R14: 00007f9df8acfaf0 R15: 0000000000000001
 </TASK>

Describe how to reproduce the problem

Don't know at the moment. Nothing obvious stands out (nowhere near dataset quota or full pool, no device failures, no memory shortage).

snajpa added the Type: Defect (Incorrect behavior, e.g. crash, hang) label May 3, 2024
snajpa (Contributor, Author) commented May 3, 2024

Perhaps related to #12366

amotin (Member) commented May 3, 2024

#16159 should fix #12366, but I don't know whether this problem is a consequence of it, though I guess it is possible.

amotin (Member) commented May 3, 2024

@alex-stetsenko I wonder if this issue can be related to #15888. Before zap_trunc() rolls back zap_freeblk, is there anything that frees the zap_leaf_t associated with deleted leaves before the dbuf eviction code gets to them? zap_shrink() does dmu_free_range(), but I wonder whether it frees the associated user data buffers. Maybe we should do it explicitly.

alex-stetsenko (Contributor) commented:

> @alex-stetsenko I wonder if this issue can be related to #15888. Before zap_trunc() rolls back zap_freeblk, is there anything that frees the zap_leaf_t associated with deleted leaves before the dbuf eviction code gets to them? zap_shrink() does dmu_free_range(), but I wonder whether it frees the associated user data buffers. Maybe we should do it explicitly.

It looks suspicious. I don't have answers at the moment; it needs investigation. Should we back out the zap trimming code until it's fixed?

amotin (Member) commented May 6, 2024

> It looks suspicious. I don't have answers at the moment; it needs investigation. Should we back out the zap trimming code until it's fixed?

I don't think it should cause any pool corruption, and it is only in the master branch, so I'd vote for just focusing on fixing it. If it can be fixed within, let's say, a week, I would not clutter the commit history with reverts.

snajpa (Contributor, Author) commented May 9, 2024

A fresh one:

VERIFY3(NULL == dmu_buf_set_user(l->l_dbuf, &l->l_dbu)) failed (0000000000000000 == ffff9a9575f48600)
PANIC at zap.c:441:zap_create_leaf()
Showing stack for process 1234530
CPU: 108 PID: 1234530 Comm: php-fpm7 Tainted: G      D    OE      6.8.8 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.0.3 01/15/2021
In memory cgroup /osctl/pool.tank/group.default/user.995/ct.20822/user-owned/lxc.payload.20822/lxc.payload.retroweb
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 spl_panic+0x100/0x120 [spl]
 zap_expand_leaf+0x4ec/0x520 [zfs]
 fzap_add_cd+0xf9/0x230 [zfs]
 fzap_add+0x47/0xd0 [zfs]
 zap_add_impl+0x15e/0x330 [zfs]
 zfs_link_create+0x189/0x500 [zfs]
 zfs_mkdir+0x535/0x710 [zfs]
 zpl_mkdir+0xcb/0x1d0 [zfs]
 vfs_mkdir+0x197/0x240
 do_mkdirat+0x14b/0x180
 __x64_sys_mkdir+0x49/0x70
 do_syscall_64+0xab/0x1b0
 entry_SYSCALL_64_after_hwframe+0x79/0x81
RIP: 0033:0x7f8ab576d0df

I can instrument the code with some zfs_dbgmsg calls if that would help, as long as it doesn't take down the workload. I can also revert the ZAP shrinking patches in our build, but it'll take time to confirm that the panics definitely disappear; it seems to be rather hard to hit. These two panics are from two different boxes, and so far it hasn't repeated. Uptime of the first one is now a nice round week :-D
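
A hypothetical, untested sketch of the kind of zfs_dbgmsg instrumentation meant here, placed in zap_create_leaf() just before the existing VERIFY (it only logs extra context; to keep the workload alive one would additionally have to handle the stale userdata rather than panic on it):

	dmu_buf_init_user(&l->l_dbu, zap_leaf_evict_sync, NULL, &l->l_dbuf);
	/* Log which block still carries a stale zap_leaf_t as dbuf userdata. */
	zap_leaf_t *stale = dmu_buf_get_user(l->l_dbuf);
	if (stale != NULL) {
		zfs_dbgmsg("zap_create_leaf: obj %llu blkid %llu already has "
		    "dbuf userdata %p",
		    (u_longlong_t)zap->zap_object,
		    (u_longlong_t)l->l_blkid, stale);
	}
	VERIFY3P(NULL, ==, dmu_buf_set_user(l->l_dbuf, &l->l_dbu));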

grwilson (Member) commented:

We have also hit this bug and are disabling zap shrinking to see if it works around the problem.

robn (Contributor) commented May 10, 2024

I don't know this code super well, but I've been reading it, and @amotin seems right.

> is there anything that frees the zap_leaf_t associated with deleted leaves before the dbuf eviction code gets to them? zap_shrink() does dmu_free_range(), but I wonder whether it frees the associated user data buffers.

zap_shrink():

		(void) dmu_free_range(zap->zap_objset, zap->zap_object,
		    sl_blkid << bs, 1 << bs, tx);
		zap_put_leaf(sl);

which is mostly just dmu_buf_rele(). If that was the last reference, then it's going to go into the dbuf cache, where it keeps its dbu for eviction.

Intuitively, it doesn't seem like it's going to be usable again after this point, so we could probably just clear the dbu and free the zap_leaf_t right here. Maybe it's as simple as:

diff --git module/zfs/zap.c module/zfs/zap.c
index 1b6b16fc6..33f529468 100644
--- module/zfs/zap.c
+++ module/zfs/zap.c
@@ -1660,7 +1660,9 @@ zap_shrink(zap_name_t *zn, zap_leaf_t *l, dmu_tx_t *tx)
 
 		(void) dmu_free_range(zap->zap_objset, zap->zap_object,
 		    sl_blkid << bs, 1 << bs, tx);
+		dmu_buf_remove_user(sl->l_dbuf, &sl->l_dbu);
 		zap_put_leaf(sl);
+		zap_leaf_evict_sync(sl);
 
 		zap_f_phys(zap)->zap_num_leafs--;
 

though I'd probably stick a refcount verify in the dbuf in there, just to be sure.

But maybe it's safer just to reuse the zap_leaf_t if there's still one because we haven't evicted it yet. More like:

diff --git module/zfs/zap.c module/zfs/zap.c
index 1b6b16fc6..48333b545 100644
--- module/zfs/zap.c
+++ module/zfs/zap.c
@@ -425,20 +425,30 @@ zap_leaf_evict_sync(void *dbu)
 static zap_leaf_t *
 zap_create_leaf(zap_t *zap, dmu_tx_t *tx)
 {
-	zap_leaf_t *l = kmem_zalloc(sizeof (zap_leaf_t), KM_SLEEP);
-
 	ASSERT(RW_WRITE_HELD(&zap->zap_rwlock));
 
-	rw_init(&l->l_rwlock, NULL, RW_NOLOCKDEP, NULL);
-	rw_enter(&l->l_rwlock, RW_WRITER);
-	l->l_blkid = zap_allocate_blocks(zap, 1);
-	l->l_dbuf = NULL;
+	uint64_t blkid = zap_allocate_blocks(zap, 1);
+	dmu_buf_t *db = NULL;
 
 	VERIFY0(dmu_buf_hold_by_dnode(zap->zap_dnode,
-	    l->l_blkid << FZAP_BLOCK_SHIFT(zap), NULL, &l->l_dbuf,
+	    blkid << FZAP_BLOCK_SHIFT(zap), NULL, &db,
 	    DMU_READ_NO_PREFETCH));
-	dmu_buf_init_user(&l->l_dbu, zap_leaf_evict_sync, NULL, &l->l_dbuf);
-	VERIFY3P(NULL, ==, dmu_buf_set_user(l->l_dbuf, &l->l_dbu));
+
+	zap_leaf_t *l = dmu_buf_get_user(db);
+	if (l == NULL) {
+		l = kmem_zalloc(sizeof (zap_leaf_t), KM_SLEEP);
+		l->l_blkid = blkid;
+		l->l_dbuf = db;
+		rw_init(&l->l_rwlock, NULL, RW_NOLOCKDEP, NULL);
+		dmu_buf_init_user(&l->l_dbu, zap_leaf_evict_sync, NULL,
+		    &l->l_dbuf);
+		dmu_buf_set_user(l->l_dbuf, &l->l_dbu);
+	} else {
+		ASSERT3U(l->l_blkid, ==, blkid);
+		ASSERT3P(l->l_dbuf, ==, db);
+	}
+
+	rw_enter(&l->l_rwlock, RW_WRITER);
 	dmu_buf_will_dirty(l->l_dbuf, tx);
 
 	zap_leaf_init(l, zap->zap_normflags != 0);

I don't have a repro or a good intuition about how to reproduce this directly, so I haven't tested the above diffs, and I might be way off. Feels not totally wrong though!

ahrens (Member) commented May 10, 2024

@robn those seem like good ideas. I was also wondering how important zap_trunc() is. It seems like it would be rare that it's the last blockid that's being freed, and I think the only benefit would be a reduced number of indirect blocks? If we didn't call it, then this wouldn't be a problem, because we'd always be allocating a never-used block.

amotin (Member) commented May 10, 2024

> I was also wondering how important zap_trunc() is. It seems like it would be rare that it's the last blockid that's being freed, and I think the only benefit would be a reduced number of indirect blocks? If we didn't call it, then this wouldn't be a problem, because we'd always be allocating a never-used block.

@ahrens Having an excessive number of holes makes prefetching more difficult. Also, it is not that we just don't shrink the dnode once, but that we would grow it indefinitely with each grow/shrink cycle, since the code is unable to fill the holes, only to append.
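
For context, zap_allocate_blocks() is essentially a bump allocator over zap_freeblk, which is why freed holes are never reused; a paraphrased sketch of the allocator (from memory, not an exact copy of zap.c):

	static uint64_t
	zap_allocate_blocks(zap_t *zap, int nblocks)
	{
		ASSERT(RW_WRITE_HELD(&zap->zap_rwlock));
		/* Hand out blocks from the end of the object, never from holes. */
		uint64_t newblk = zap_f_phys(zap)->zap_freeblk;
		zap_f_phys(zap)->zap_freeblk += nblocks;
		return (newblk);
	}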

alex-stetsenko (Contributor) commented:

At this moment, I would suggest adding a module parameter to disable zap_trunc(), so that zap_shrink() could work without zap_trunc() until the problem is fixed.
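
A hypothetical sketch of such a kill switch (illustrative only, and not what was eventually merged); the zap_trunc() call site and arguments are shown schematically:

	static int zap_trunc_enabled = 1;

	/* In zap_shrink(), gate the step that rolls back zap_freeblk: */
		if (zap_trunc_enabled)
			zap_trunc(zap, tx);	/* arguments shown schematically */

	ZFS_MODULE_PARAM(zfs, , zap_trunc_enabled, INT, ZMOD_RW,
		"Allow ZAP shrinking to roll back the allocation pointer");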

snajpa (Contributor, Author) commented May 11, 2024

So we have two crashes with one container. I'm trying to reach out to our member to see whether I can work with them to devise a clean reproducer. In the meantime we've deployed a revert of the ZAP shrinking patch.
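
A rough, untested sketch of the sort of workload that might exercise the shrink-then-regrow path (grow a directory ZAP past a few leaves, unlink most entries so leaf blocks can be freed and the ZAP may shrink, then create again so a recently freed block is reallocated); file counts and names are made up:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int
	main(void)
	{
		char name[64];

		/* Grow the directory ZAP well past a single leaf block. */
		for (int i = 0; i < 100000; i++) {
			(void) snprintf(name, sizeof (name), "f%07d", i);
			int fd = open(name, O_CREAT | O_WRONLY, 0644);
			if (fd >= 0)
				(void) close(fd);
		}
		/* Remove most entries so leaf blocks can be freed. */
		for (int i = 0; i < 99000; i++) {
			(void) snprintf(name, sizeof (name), "f%07d", i);
			(void) unlink(name);
		}
		/* Create again so new leaves land on recently freed blocks. */
		for (int i = 0; i < 100000; i++) {
			(void) snprintf(name, sizeof (name), "g%07d", i);
			int fd = open(name, O_CREAT | O_WRONLY, 0644);
			if (fd >= 0)
				(void) close(fd);
		}
		return (0);
	}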

amotin (Member) commented May 13, 2024

> At this moment, I would suggest adding a module parameter to disable zap_trunc(), so that zap_shrink() could work without zap_trunc() until the problem is fixed.

@alex-stetsenko @robn provided two possible solutions above, both of which look reasonable to me, and the first seems cleaner, not leaving a user buffer on a block we consider free. Do you see a problem with integrating one of them instead of introducing unneeded kill switches? I'd have expected a PR to be created about 4 days ago. Do you want one of us to create the PR?

alex-stetsenko (Contributor) commented May 14, 2024

> @robn provided two possible solutions above, both of which look reasonable to me, and the first seems cleaner, not leaving a user buffer on a block we consider free. Do you see a problem with integrating one of them instead of introducing unneeded kill switches? I'd have expected a PR to be created about 4 days ago. Do you want one of us to create the PR?

The first patch doesn't look correct.
The second one seems good. Though, I would like to have it tested by anyone who can reproduce the issue before creating a PR.

snajpa (Contributor, Author) commented May 14, 2024

@robn can you please post the second idea as a PR so it runs through the test suite? We could merge it into our staging right now, but that wouldn't tell us much, as the staging environment puts only pretty low pressure on the code. I'd say if it passes the test suite, let's merge it and then wait and see whether more reports pop up or not :)

(also if anyone could do a more comprehensive re-review of the codepaths that would be awesome, I'll be sure to send a few beers to everyone involved :D)

robn (Contributor) commented May 14, 2024

@snajpa Will do today.

@alex-stetsenko Can you elaborate on "doesn't look correct" please? (apart from the missing & heh, I mean, is the logic or concept wrong?)

alex-stetsenko (Contributor) commented:

> @alex-stetsenko Can you elaborate on "doesn't look correct" please? (apart from the missing & heh, I mean, is the logic or concept wrong?)

I don't think you can unconditionally destroy a leaf. As far as I can see, zap_cursor (fzap_cursor_retrieve()) can still reference it.

robn (Contributor) commented May 17, 2024

@alex-stetsenko mm, you might be right. That was partly why I said "maybe verify the refcount" there, mostly because I didn't want to think very hard for a drive-by.

In any case, the second seems like a safer thing to do in terms of the timeline - we know at that point nothing is using it, or we wouldn't be trying to create it, and we also know that the userdata being NULL or not can't race, because of the dance on db_mtx through holds and evicts.

So here's #16204; please take a look and let's see what falls out. @snajpa sorry about the delay!

behlendorf pushed a commit that referenced this issue May 25, 2024
If a shrink or truncate had recently freed a portion of the ZAP, the
dbuf could still be sitting on the dbuf cache waiting for eviction. If
it is then allocated for a new leaf before it can be evicted, the
zap_leaf_t is still attached as userdata, tripping the VERIFY.

Instead, just check for the userdata, and if we find it, reuse it.

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16157.
Closes #16204