
PANIC at zap.c:441:zap_create_leaf #16157

Closed
snajpa opened this issue May 3, 2024 · 18 comments · Fixed by #16204
Labels
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

snajpa (Contributor) commented May 3, 2024

System information

Type Version/Name
Distribution Name NixOS
Distribution Version custom
Kernel Version 6.8.8
Architecture x86-64
OpenZFS Version 8f1b7a6

Describe the problem you're observing

One of our nodes hit the following VERIFY3 assertion:

VERIFY3(NULL == dmu_buf_set_user(l->l_dbuf, &l->l_dbu)) failed (0000000000000000 == ffff91dfeb60e200)
PANIC at zap.c:441:zap_create_leaf()
Showing stack for process 2470499
CPU: 10 PID: 2470499 Comm: python3 Tainted: G           OE      6.8.8 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7525/0PYVT1, BIOS 2.5.6 10/06/2021
In memory cgroup /osctl/pool.tank/group.default/user.2739/ct.19681/user-owned/lxc.payload.19681/docker/3bd6008a375b00b30cb2fccbf7dcd2e32c081094425b4c8603c16fa09743d08a
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 spl_panic+0x100/0x120 [spl]
 zap_expand_leaf+0x4ec/0x520 [zfs]
 fzap_add_cd+0xf9/0x230 [zfs]
 fzap_add+0x47/0xd0 [zfs]
 zap_add_impl+0x15e/0x330 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? zap_add+0x5c/0xd0 [zfs]
 zfs_link_create+0x189/0x500 [zfs]
 zfs_create+0x75f/0xa00 [zfs]
 zpl_create+0xd0/0x1e0 [zfs]
 path_openat+0xe9c/0x1180
 do_filp_open+0xb3/0x160
 do_sys_openat2+0xab/0xe0
 __x64_sys_openat+0x6e/0xa0
 do_syscall_64+0xab/0x1b0
 entry_SYSCALL_64_after_hwframe+0x79/0x81
RIP: 0033:0x7f9df8ef9f01
Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d ea 26 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 93 00 00 00 48 8b 54 24 28 64 48 2b 14 25
RSP: 002b:00007ffe2a11f4d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000080241 RCX: 00007f9df8ef9f01
RDX: 0000000000080241 RSI: 00007f9df8a61d00 RDI: 00000000ffffff9c
RBP: 00007f9df8a61d00 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000202 R12: 00007f9df8e00fc0
R13: 00000000ffffffff R14: 00007f9df8acfaf0 R15: 0000000000000001
 </TASK>

Describe how to reproduce the problem

Don't know at the moment. Nothing obvious stands out (nowhere near dataset quota or full pool, no device failures, no memory shortage).

snajpa added the Type: Defect (Incorrect behavior, e.g. crash, hang) label May 3, 2024
snajpa (Contributor, Author) commented May 3, 2024

Perhaps related to #12366

amotin (Member) commented May 3, 2024

#16159 should fix #12366, but I don't know whether this problem is a consequence of it, though I guess it is possible.

amotin (Member) commented May 3, 2024

@alex-stetsenko I wonder if this issue can be related to #15888. Before zap_trunc() rolls back zap_freeblk, is there anything that frees the zap_leaf_t associated with deleted leaves before the dbuf eviction code gets to them? zap_shrink() does dmu_free_range(), but I wonder whether it frees the associated user data buffers. Maybe we should do it explicitly.

alex-stetsenko (Contributor) commented:

> @alex-stetsenko I wonder if this issue can be related to #15888. Before zap_trunc() rolls back zap_freeblk, is there anything that frees the zap_leaf_t associated with deleted leaves before the dbuf eviction code gets to them? zap_shrink() does dmu_free_range(), but I wonder whether it frees the associated user data buffers. Maybe we should do it explicitly.

It looks suspicious. I don't have answers at the moment; it needs investigation. Should we back out the zap trimming code until it's fixed?

amotin (Member) commented May 6, 2024

> It looks suspicious. I don't have answers at the moment; it needs investigation. Should we back out the zap trimming code until it's fixed?

I don't think it should cause any pool corruption, and it is only in the master branch, so I'd vote for just focusing on fixing it. If it can be fixed within, let's say, a week, I would not clutter the commit history with reverts.

snajpa (Contributor, Author) commented May 9, 2024

A fresh one:

VERIFY3(NULL == dmu_buf_set_user(l->l_dbuf, &l->l_dbu)) failed (0000000000000000 == ffff9a9575f48600)
PANIC at zap.c:441:zap_create_leaf()
Showing stack for process 1234530
CPU: 108 PID: 1234530 Comm: php-fpm7 Tainted: G      D    OE      6.8.8 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.0.3 01/15/2021
In memory cgroup /osctl/pool.tank/group.default/user.995/ct.20822/user-owned/lxc.payload.20822/lxc.payload.retroweb
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 spl_panic+0x100/0x120 [spl]
 zap_expand_leaf+0x4ec/0x520 [zfs]
 fzap_add_cd+0xf9/0x230 [zfs]
 fzap_add+0x47/0xd0 [zfs]
 zap_add_impl+0x15e/0x330 [zfs]
 zfs_link_create+0x189/0x500 [zfs]
 zfs_mkdir+0x535/0x710 [zfs]
 zpl_mkdir+0xcb/0x1d0 [zfs]
 vfs_mkdir+0x197/0x240
 do_mkdirat+0x14b/0x180
 __x64_sys_mkdir+0x49/0x70
 do_syscall_64+0xab/0x1b0
 entry_SYSCALL_64_after_hwframe+0x79/0x81
RIP: 0033:0x7f8ab576d0df

I can instrument the code with some zfs_dbgmsg calls if that would help, as long as it doesn't take down the workload. I can also revert the ZAP shrinking patches in our build, but it'll take time to confirm that the panics definitely disappear; it seems to be rather hard to hit. These two panics are from two different boxes, and so far it hasn't repeated. Uptime of the first one is now a nice round week :-D
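
A hypothetical, untested sketch of the kind of zfs_dbgmsg instrumentation meant here, placed in zap_create_leaf() just before the existing VERIFY (it only logs extra context; to keep the workload alive one would additionally have to handle the stale userdata rather than panic on it):

	dmu_buf_init_user(&l->l_dbu, zap_leaf_evict_sync, NULL, &l->l_dbuf);
	/* Log which block still carries a stale zap_leaf_t as dbuf userdata. */
	zap_leaf_t *stale = dmu_buf_get_user(l->l_dbuf);
	if (stale != NULL) {
		zfs_dbgmsg("zap_create_leaf: obj %llu blkid %llu already has "
		    "dbuf userdata %p",
		    (u_longlong_t)zap->zap_object,
		    (u_longlong_t)l->l_blkid, stale);
	}
	VERIFY3P(NULL, ==, dmu_buf_set_user(l->l_dbuf, &l->l_dbu));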

grwilson (Member) commented:

We have also hit this bug and are disabling zap shrinking to see if it works around the problem.

robn (Contributor) commented May 10, 2024

I don't know this code super well, but I've been reading it, and @amotin seems right.

> is there anything that frees the zap_leaf_t associated with deleted leaves before the dbuf eviction code gets to them? zap_shrink() does dmu_free_range(), but I wonder whether it frees the associated user data buffers.

zap_shrink():

		(void) dmu_free_range(zap->zap_objset, zap->zap_object,
		    sl_blkid << bs, 1 << bs, tx);
		zap_put_leaf(sl);

which is mostly just dmu_buf_rele(). If that was the last reference, then it's going to go into the dbuf cache, where it keeps its dbu for eviction.

Intuitively, it doesn't seem like it's going to be usable again after this point, so we could probably just clear the dbu and free the zap_leaf_t right here. Maybe it's as simple as:

diff --git module/zfs/zap.c module/zfs/zap.c
index 1b6b16fc6..33f529468 100644
--- module/zfs/zap.c
+++ module/zfs/zap.c
@@ -1660,7 +1660,9 @@ zap_shrink(zap_name_t *zn, zap_leaf_t *l, dmu_tx_t *tx)
 
 		(void) dmu_free_range(zap->zap_objset, zap->zap_object,
 		    sl_blkid << bs, 1 << bs, tx);
+		dmu_buf_remove_user(sl->l_dbuf, &sl->l_dbu);
 		zap_put_leaf(sl);
+		zap_leaf_evict_sync(sl);
 
 		zap_f_phys(zap)->zap_num_leafs--;
 

though I'd probably stick a refcount verify in the dbuf in there, just to be sure.

But maybe it's safer just to reuse the zap_leaf_t if there's still one because we haven't evicted it yet. More like:

diff --git module/zfs/zap.c module/zfs/zap.c
index 1b6b16fc6..48333b545 100644
--- module/zfs/zap.c
+++ module/zfs/zap.c
@@ -425,20 +425,30 @@ zap_leaf_evict_sync(void *dbu)
 static zap_leaf_t *
 zap_create_leaf(zap_t *zap, dmu_tx_t *tx)
 {
-	zap_leaf_t *l = kmem_zalloc(sizeof (zap_leaf_t), KM_SLEEP);
-
 	ASSERT(RW_WRITE_HELD(&zap->zap_rwlock));
 
-	rw_init(&l->l_rwlock, NULL, RW_NOLOCKDEP, NULL);
-	rw_enter(&l->l_rwlock, RW_WRITER);
-	l->l_blkid = zap_allocate_blocks(zap, 1);
-	l->l_dbuf = NULL;
+	uint64_t blkid = zap_allocate_blocks(zap, 1);
+	dmu_buf_t *db = NULL;
 
 	VERIFY0(dmu_buf_hold_by_dnode(zap->zap_dnode,
-	    l->l_blkid << FZAP_BLOCK_SHIFT(zap), NULL, &l->l_dbuf,
+	    blkid << FZAP_BLOCK_SHIFT(zap), NULL, &db,
 	    DMU_READ_NO_PREFETCH));
-	dmu_buf_init_user(&l->l_dbu, zap_leaf_evict_sync, NULL, &l->l_dbuf);
-	VERIFY3P(NULL, ==, dmu_buf_set_user(l->l_dbuf, &l->l_dbu));
+
+	zap_leaf_t *l = dmu_buf_get_user(db);
+	if (l == NULL) {
+		l = kmem_zalloc(sizeof (zap_leaf_t), KM_SLEEP);
+		l->l_blkid = blkid;
+		l->l_dbuf = db;
+		rw_init(&l->l_rwlock, NULL, RW_NOLOCKDEP, NULL);
+		dmu_buf_init_user(&l->l_dbu, zap_leaf_evict_sync, NULL,
+		    &l->l_dbuf);
+		dmu_buf_set_user(l->l_dbuf, &l->l_dbu);
+	} else {
+		ASSERT3U(l->l_blkid, ==, blkid);
+		ASSERT3P(l->l_dbuf, ==, db);
+	}
+
+	rw_enter(&l->l_rwlock, RW_WRITER);
 	dmu_buf_will_dirty(l->l_dbuf, tx);
 
 	zap_leaf_init(l, zap->zap_normflags != 0);

I don't have a repro or a good intuition about how to reproduce this directly, so I haven't tested the above diffs, and I might be way off. Feels not totally wrong though!

ahrens (Member) commented May 10, 2024

@robn those seem like good ideas. I was also wondering how important zap_trunc() is. It seems like it would be rare that it's the last blockid that's being freed, and I think the only benefit would be a reduced number of indirect blocks? If we didn't call it, then this wouldn't be a problem, because we'd always be allocating a never-used block.

amotin (Member) commented May 10, 2024

> I was also wondering how important zap_trunc() is. It seems like it would be rare that it's the last blockid that's being freed, and I think the only benefit would be a reduced number of indirect blocks? If we didn't call it, then this wouldn't be a problem, because we'd always be allocating a never-used block.

@ahrens Having an excessive number of holes makes prefetching more difficult. Also, it is not that we just don't shrink the dnode once, but that we would grow it indefinitely with each grow/shrink cycle, since the code is unable to fill the holes, only to append.
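
For context, zap_allocate_blocks() is essentially a bump allocator over zap_freeblk, which is why freed holes are never reused; a paraphrased sketch of the allocator (from memory, not an exact copy of zap.c):

	static uint64_t
	zap_allocate_blocks(zap_t *zap, int nblocks)
	{
		ASSERT(RW_WRITE_HELD(&zap->zap_rwlock));
		/* Hand out blocks from the end of the object, never from holes. */
		uint64_t newblk = zap_f_phys(zap)->zap_freeblk;
		zap_f_phys(zap)->zap_freeblk += nblocks;
		return (newblk);
	}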

alex-stetsenko (Contributor) commented:

At this moment, I would suggest adding a module parameter to disable zap_trunc(), so that zap_shrink() could work without zap_trunc() until the problem is fixed.
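
A hypothetical sketch of such a kill switch (illustrative only, and not what was eventually merged); the zap_trunc() call site and arguments are shown schematically:

	static int zap_trunc_enabled = 1;

	/* In zap_shrink(), gate the step that rolls back zap_freeblk: */
		if (zap_trunc_enabled)
			zap_trunc(zap, tx);	/* arguments shown schematically */

	ZFS_MODULE_PARAM(zfs, , zap_trunc_enabled, INT, ZMOD_RW,
		"Allow ZAP shrinking to roll back the allocation pointer");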

snajpa (Contributor, Author) commented May 11, 2024

So we have two crashes with one container. I'm trying to reach out to our member to see whether I can work with them to devise a clean reproducer. In the meantime we've deployed a revert of the ZAP shrinking patch.
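
A rough, untested sketch of the sort of workload that might exercise the shrink-then-regrow path (grow a directory ZAP past a few leaves, unlink most entries so leaf blocks can be freed and the ZAP may shrink, then create again so a recently freed block is reallocated); file counts and names are made up:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int
	main(void)
	{
		char name[64];

		/* Grow the directory ZAP well past a single leaf block. */
		for (int i = 0; i < 100000; i++) {
			(void) snprintf(name, sizeof (name), "f%07d", i);
			int fd = open(name, O_CREAT | O_WRONLY, 0644);
			if (fd >= 0)
				(void) close(fd);
		}
		/* Remove most entries so leaf blocks can be freed. */
		for (int i = 0; i < 99000; i++) {
			(void) snprintf(name, sizeof (name), "f%07d", i);
			(void) unlink(name);
		}
		/* Create again so new leaves land on recently freed blocks. */
		for (int i = 0; i < 100000; i++) {
			(void) snprintf(name, sizeof (name), "g%07d", i);
			int fd = open(name, O_CREAT | O_WRONLY, 0644);
			if (fd >= 0)
				(void) close(fd);
		}
		return (0);
	}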

amotin (Member) commented May 13, 2024

> At this moment, I would suggest adding a module parameter to disable zap_trunc(), so that zap_shrink() could work without zap_trunc() until the problem is fixed.

@alex-stetsenko @robn provided two possible solutions above, both of which look reasonable to me, and the first seems cleaner, not leaving a user buffer on a block we consider free. Do you see a problem with integrating one of them instead of introducing unneeded kill switches? I'd have expected a PR to be created about 4 days ago. Do you want one of us to create the PR?

alex-stetsenko (Contributor) commented May 14, 2024

> @robn provided two possible solutions above, both of which look reasonable to me, and the first seems cleaner, not leaving a user buffer on a block we consider free. Do you see a problem with integrating one of them instead of introducing unneeded kill switches? I'd have expected a PR to be created about 4 days ago. Do you want one of us to create the PR?

The first patch doesn't look correct.
The second one seems good. Though, I would like to have it tested by anyone who can reproduce the issue before creating a PR.

snajpa (Contributor, Author) commented May 14, 2024

@robn can you please post the second idea as a PR so it runs through the test suite? We could merge it into our staging right now, but that wouldn't tell us much, as the staging environment puts only pretty low pressure on the code. I'd say if it passes the test suite, let's merge it and then wait and see whether more reports pop up or not :)

(also if anyone could do a more comprehensive re-review of the codepaths that would be awesome, I'll be sure to send a few beers to everyone involved :D)

robn (Contributor) commented May 14, 2024

@snajpa Will do today.

@alex-stetsenko Can you elaborate on "doesn't look correct" please? (apart from the missing & heh, I mean, is the logic or concept wrong?)

alex-stetsenko (Contributor) commented:

> @alex-stetsenko Can you elaborate on "doesn't look correct" please? (apart from the missing & heh, I mean, is the logic or concept wrong?)

I don't think you can unconditionally destroy a leaf. As far as I can see, zap_cursor (fzap_cursor_retrieve()) can still reference it.

robn (Contributor) commented May 17, 2024

@alex-stetsenko mm, you might be right. That was partly why I said "maybe verify the refcount" there, mostly because I didn't want to think very hard for a drive-by.

In any case, the second seems like a safer thing to do in terms of the timeline - we know at that point nothing is using it, or we wouldn't be trying to create it, and we also know that the userdata being NULL or not can't race, because of the dance on db_mtx through holds and evicts.

So here's #16204; please take a look and let's see what falls out. @snajpa sorry about the delay!

behlendorf pushed a commit that referenced this issue May 25, 2024
If a shrink or truncate had recently freed a portion of the ZAP, the
dbuf could still be sitting on the dbuf cache waiting for eviction. If
it is then allocated for a new leaf before it can be evicted, the
zap_leaf_t is still attached as userdata, tripping the VERIFY.

Instead, just check for the userdata, and if we find it, reuse it.

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16157.
Closes #16204