At least since 3.0-dev7: peers/stick-table related crashes #2552
Comments
Thanks @idl0r, I'll take a look.
So at first glance, it seems not to affect 2.9.7 with AWS-LC, for now at least. The system running it has no dumps so far, while in the same period dev9 has 2 dumps, but that's no guarantee yet, I'd say.
Still no crashes on the 2.9.7 one, but way more on the dev9 one. So I'd say only 3.0-dev* is affected by this, and it's likely not AWS-LC related? peers/stick-tables don't do SSL anyway?
It seems the fix for #2525 (21447b1) was shipped with 3.0-dev8 but not 2.9.7. So it means that on 2.9.7 you are not affected by #2525, or it is just harder to trigger the bug. However, it may be good to test 2.9-HEAD to know whether the fix is related to your issue or not. But at first glance, it is not obvious to me. @idl0r, you may test to set
Another question: are you using the peer sharding mechanism?
Nope
It might take me about a week to do that. But it should have been fixed in dev9, shouldn't it? So we tried up to dev9. But I can test HEAD anyway, it just takes some time (vacation :D).
That's already set since #2094.
Sorry for the delay (vacation too). 3.0-dev9 contains the fix for #2525 but 2.9.7 does not; only 2.9-HEAD contains it. So if the issue is about 21447b1, this should crash too on 2.9-HEAD. If not, it means it is most probably related to my fixes about peers. In both cases, I cannot see anything obvious.
For the record: I have deployed dev11 on the same node that crashed repeatedly before and will now monitor this for a few days.
That's probably still the same, I guess?
It may be related somehow, but it's a different crash this time: the first one (from your original report) was watchdog-triggered because too much time was spent in a function. Now for the original crash (watchdog-triggered), maybe it's also
Thanks @idl0r. This one is interesting. And it is probably the same issue indeed. It seems the following patch should address it:

```diff
diff --git a/src/stick_table.c b/src/stick_table.c
index a1cc03c0ec..b0338abb7f 100644
--- a/src/stick_table.c
+++ b/src/stick_table.c
@@ -264,6 +264,7 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
     int max_per_shard = (to_batch + CONFIG_HAP_TBL_BUCKETS - 1) / CONFIG_HAP_TBL_BUCKETS;
     int done_per_shard;
     int batched = 0;
+    int updt_locked;
     int looped;
     int shard;
@@ -271,6 +272,7 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
     while (batched < to_batch) {
         done_per_shard = 0;
+        updt_locked = 0;
         looped = 0;
         HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock);
@@ -328,18 +330,35 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
                 continue;
             }
-            /* session expired, trash it */
-            ebmb_delete(&ts->key);
+            /* if the entry is in the update list, we must be extremely careful
+             * because peers can see it at any moment and start to use it. Peers
+             * will take the table's updt_lock for reading when doing that, and
+             * with that lock held, will grab a ref_cnt before releasing the
+             * lock. So we must take this lock as well and check the ref_cnt.
+             */
             if (ts->upd.node.leaf_p) {
-                HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->updt_lock);
-                eb32_delete(&ts->upd);
-                HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->updt_lock);
+                if (!updt_locked) {
+                    updt_locked = 1;
+                    HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->updt_lock);
+                }
+                /* now we're locked, new peers can't grab it anymore,
+                 * existing ones already have the ref_cnt.
+                 */
+                if (HA_ATOMIC_LOAD(&ts->ref_cnt))
+                    continue;
             }
+
+            /* session expired, trash it */
+            ebmb_delete(&ts->key);
+            eb32_delete(&ts->upd);
             __stksess_free(t, ts);
             batched++;
             done_per_shard++;
         }
+        if (updt_locked)
+            HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->updt_lock);
+
         HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock);
         if (max_search <= 0)
```

@wtarreau, any opinion on that? I used the same locking mechanism as the one used in
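The core pattern in the patch above, stripped of the stick-table specifics, is a lazily-acquired nested write lock: it is taken at most once per eviction batch, each candidate entry's reference count is re-checked once the lock is held, and the lock is released only after the whole batch. A minimal standalone sketch of that pattern, using pthread and hypothetical entry/lock names rather than HAProxy's internal API:

```c
#include <pthread.h>
#include <stdatomic.h>

struct entry {
    atomic_int ref_cnt;        /* peers take a reference under the update lock */
    int        in_update_list; /* entry is visible to peers */
};

/* Evict a batch of expired entries. The nested update lock is acquired
 * lazily, the first time an entry visible to peers is met, and released
 * only once the whole batch has been processed.
 */
static void trash_batch(pthread_rwlock_t *updt_lock, struct entry **batch, int n)
{
    int updt_locked = 0;

    for (int i = 0; i < n; i++) {
        struct entry *e = batch[i];

        if (e->in_update_list) {
            if (!updt_locked) {
                updt_locked = 1;
                pthread_rwlock_wrlock(updt_lock);
            }
            /* under the write lock no new reader can take a reference;
             * an existing reference means the entry must be skipped.
             */
            if (atomic_load(&e->ref_cnt))
                continue;
        }
        /* safe to unlink and free e here */
    }

    if (updt_locked)
        pthread_rwlock_unlock(updt_lock);
}
```

Taking the lock once per batch instead of once per entry trades a longer hold time for far fewer acquisitions, which is precisely the trade-off discussed further down in this thread.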
@capflam, do you want me to give it a try?
If it is possible on your side, it could be helpful indeed.
dev11 with your patch.
So here, we are back again with the first crash. The table's update lock seems to be held too long in
@idl0r, I don't know if you have an opportunity to test the following patches, on top of 0001-BUG-MEDIUM-stick-tables-Fix-race-with-peers-when-tra.patch.txt
BTW, if you apply the patches, please rebase on top of latest master (39caa20), as it fixes an issue which can cause some FD leaks or even some incorrectly set options on connections to the backend once the FD number reaches 4096!
No problem! I can do that tomorrow and might be able to give feedback in the evening already.
Ok, here we go:
Strange, looks like it was killed while working. @capflam, I didn't check the patches, but do you think there's a possibility that it was looping forever trying to find new entries?
Hmmm, I had not checked, but only the tables code was touched, nothing changed for peers, so maybe we just pushed the limit a bit further and now we're contemplating an exhausted thread that has slowly progressed. I really think that the peers should count the time spent waiting and quit earlier on contention. That's tricky to do, though; see the sketch below for the general idea.
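One generic way to implement the "quit earlier on contention" idea is to wrap the loop in a monotonic-clock time budget, so the worker bails out and reschedules itself instead of starving other threads on a hot lock. This is only an illustration of the principle with hypothetical helpers, not HAProxy code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Run `step` repeatedly until it reports completion (returns false) or
 * the time budget is exhausted. Returns true when all work was done,
 * false when we gave up so the caller can yield and try again later.
 */
static bool run_with_budget(bool (*step)(void *ctx), void *ctx, uint64_t budget_ns)
{
    uint64_t start = now_ns();

    while (step(ctx)) {               /* step() returns true while work remains */
        if (now_ns() - start > budget_ns)
            return false;             /* budget exhausted, come back later */
    }
    return true;
}
```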
@idl0r, you can limit the number of loops trying to send if you want. It defaults to 200 and is controlled by
Set to 10. Trying again.
And I confirm that dev6 doesn't crash (but it's way slower).
Mistake: 3.0-dev6 crashes as well, but it takes more time. I keep facing memory corruption detected in the pools. I'm worried that we could be detecting another bug that nullifies the bisect :-/
3.0-dev0 crashes similarly :-(
2.9.7 crashes similarly (but takes much more time, 2 crashes/min) and 2.8.9 doesn't seem to, but it's also 5 times slower than 2.9.7. At least it gives an entry point.
So with this commit which doubles the perf, I'm seeing about 1 segv every 2 minutes, and with the next one which gains another 30-50%, I get 10 cores in the first few seconds. I'm thus wondering how the performance affects these measurements (e.g. based on the table size, forced evictions, etc.), because this completely invalidates some of my bisects. I really want to be sure about the first faulty commit, not the first one that reproduces the issue on that platform.
Also, it seems like below 47-48 active peers I can't reproduce it anymore. I'll push the limit further, that machine has enough RAM :-)
Interesting. We had no segfault with dev6 and also none with 2.9.7, which is running rock stable on all other nodes until now.
I was already thinking about building an injector as well as some peers receivers, which could perhaps also be used for VTest, to simulate some high-load peers syncing.
For the record, I'm currently using this:
I start 50 instances of it like this: I have httpterm running on s1 & s2, and am sending the load from the machine itself using h1load this way:
I'll retry with varying parameters (more/fewer processes/threads, etc.) to make sure that with the very first commit it instantly dies.
f37ddbe is the first bad commit
Christian, we finally found it just now! The real culprit is this one: cfeca3a ("MEDIUM: stick-table: touch updates under an upgradable read lock"). It ran 5 minutes at 50 peers with zero problems, versus crashes in the first 2-5 seconds until now. The inter-dependencies and assumptions of the various functions are anything but obvious, but at some point we started to seriously suspect a double insertion in unprotected code. The protection was insufficient (TOCTOU) and it could indeed happen that the same key learned from two distinct peers on different threads at the same time causes this double insert. The only way to trigger this is indeed to have many peers... We can also confirm that the two patches marked BUG above are needed.

I'm going to write the fix, merge it as well as the two other fixes, and I'll let you know once this is done. Please note that we also noticed twice (in hundreds of tests) a crash on appctx_wakeup() in the peers code with a null appctx; it seems to have happened when several peers were lost at the exact same time (we lost them by packs of 10-20 due to the crashes). But this is independent and might have been there for a while. We'll try to reproduce and fix that one as well, but I'm much less worried anyway. Stay tuned!
…ession When a sticky session is killed, we must be sure no other entity is still referencing it. The session's ref_cnt must be 0. However, there is a race with peers, as described in 21447b1 ("BUG/MAJOR: stick-tables: fix race with peers in entry expiration"). When the update lock is acquired, we must recheck the ref_cnt value. This patch is part of a debugging session about issue #2552. It must be backported to 2.9.
…ntries In GH issue #2552, Christian Ruppert reported an increase in crashes with recent 3.0-dev versions, always related to stick-tables and peers. One particularity of his config is that it has a lot of peers.

While trying to reproduce, it was empirically found that firing 10 load generators at 10 different haproxy instances tracking a random key among 100k against a table of max 5k entries, on 8 threads and between a total of 50 parallel peers, managed to reproduce the crashes in seconds, very often in ebtree deletion or insertion code, but not only. The debugging revealed that the crashes are often caused by a parent node being corrupted while delete/insert tries to update it regarding a recently inserted/removed node, and that the corrupted node had always been proven to be deleted, then immediately freed, so it ought not to be visited in the tree from functions enclosed between a pair of lock/unlock. As such, the only possibility was that it had experienced unexpected inserts. Also, running with pool integrity checking would 90% of the time cause crashes during allocation based on corrupted contents in the node, likely because it was found at two places in the same tree and still present as a parent of a node being deleted or inserted (hence the __stksess_free and stktable_trash_oldest callers being visible on these items).

Indeed, the issue is in fact related to the test set (occasionally redundant keys, many peers). What happens is that sometimes, the same key is learned from two different peers. When it is learned for the first time, we end up in stktable_touch_with_exp() in the "else" branch, where the test for existence is made before taking the lock (since commit cfeca3a ("MEDIUM: stick-table: touch updates under an upgradable read lock") that was merged in 2.9), and from there the entry is added. But if one of the threads manages to insert it before the other thread takes the lock, then the second thread will try to insert this node again. And inserting an already inserted node will corrupt the tree (note that we never switched to enforcing a check in insertion code on this due to API history that would break various code parts).

Here the solution is simple: it requires rechecking leaf_p after getting the lock, to avoid touching anything if the entry has already been inserted in the meantime. Many thanks to Christian Ruppert for testing this and for his invaluable help on this hard-to-trigger issue. This fix needs to be backported to 2.9.
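The fix outlined in the last paragraph is a classic double-checked insertion: the cheap existence test stays outside the lock for the common case, but it must be repeated once the write lock is held, because another thread may have inserted the node in between. A simplified, self-contained sketch of the pattern on an intrusive list, where the pprev member plays the role of the ebtree node's leaf_p (non-NULL meaning "already inserted"); names are illustrative, not HAProxy's:

```c
#include <pthread.h>
#include <stddef.h>

struct node {
    struct node  *next;
    struct node **pprev; /* non-NULL means "already in the list" */
};

struct upd_list {
    struct node     *head;
    pthread_rwlock_t lock;
};

/* Insert `n` unless it is already linked. The unlocked test only skips
 * the lock in the common case; only the re-check under the write lock
 * is authoritative, since another thread may have inserted `n` in the
 * meantime, and a double insert would corrupt the structure.
 */
static void insert_once(struct upd_list *l, struct node *n)
{
    if (n->pprev) /* fast path, unlocked: already inserted */
        return;

    pthread_rwlock_wrlock(&l->lock);
    if (!n->pprev) { /* authoritative re-check under the lock */
        n->next = l->head;
        if (l->head)
            l->head->pprev = &n->next;
        n->pprev = &l->head;
        l->head = n;
    }
    pthread_rwlock_unlock(&l->lock);
}
```

The same reasoning applies to the actual fix: stktable_touch_with_exp() re-checks ts->upd.node.leaf_p after taking the table's updt_lock and only inserts while it is still NULL.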
OK Christian, I've pushed this branch with all needed fixes: 20240523-fix-table-locking-3
Very welcome!
Thanks. Note that the branch already contains all the latest fixes, so it should be as simple as just checking out that branch (if it's easier for you to pick the patches, a git log or git format-patch from master will list them all and they'll apply cleanly). It's running fine now, no crashes anymore here on the 80-core system with 50 processes all talking to all others.
20240523-fix-table-locking-3 built and deployed on that node. Let's see how it behaves :)
FWIW I've just merged the series now. I'm assuming it's still OK on your side.
Yup. Still looking good :)
Awesome, thank you!
Still running fine. No crashes. Running for ~7 hours already :)
Amazing! Thanks for the help, @idl0r.
FYI: Still looking good! No crashes. So I'd say that was it and it's fixed. Thanks, guys! :)
That's pretty cool, thanks Christian! It's one of these rare bugs that are uncovered by performance improvements. We'll keep this open till we backport the fix to 2.9.
Detailed Description of the Problem
Hi,
this time it seems to be / might be more complex.
We replaced OpenSSL with AWS-LC, and I built 3.0-dev7, 8 and 9.
While -dev6 was fine (with OpenSSL, though) and we didn't try -dev6 with AWS-LC, I can't tell for sure that it's not AWS-LC related.
However, it seems to be peers/stick-tables related.
Expected Behavior
No crash
Steps to Reproduce the Behavior
Many peers?
Do you have any idea what may have caused this?
No response
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of haproxy -vv
Last Outputs and Backtraces
backtrace:
https://gist.github.com/idl0r/934e0e5af7241f1f4c151c427d6047a6
Additional Information
No response