At least since 3.0-dev7 peers/stick table related crashes #2552

Open
idl0r opened this issue Apr 29, 2024 · 76 comments
Labels
2.9 This issue affects the HAProxy 2.9 stable branch. status: fixed This issue is a now-fixed bug. type: bug This issue describes a bug.

Comments

@idl0r
Contributor

idl0r commented Apr 29, 2024

Detailed Description of the Problem

Hi,

this time the problem seems to be more complex.
We replaced OpenSSL with AWS-LC and I built 3.0-dev7, -dev8 and -dev9.
-dev6 was fine, but only with OpenSSL; we didn't try -dev6 with AWS-LC, so I can't rule out that it's AWS-LC related.
However, it seems to be peers / stick-tables related.

Expected Behavior

No crash

Steps to Reproduce the Behavior

Many peers?

Do you have any idea what may have caused this?

No response

Do you have an idea how to solve the issue?

No response

What is your configuration?

I can share specific parts on request

Output of haproxy -vv

HAProxy version 3.0-dev9-ba0f8b5 2024/04/27 - https://haproxy.org/
Status: development branch - not safe for use in production.
Known bugs: https://github.com/haproxy/haproxy/issues?q=is:issue+is:open
Running on: Linux 5.10.0-28-amd64 #1 SMP Debian 5.10.209-2 (2024-01-31) x86_64
Build options :
  TARGET  = linux-glibc
  CC      = cc
  CFLAGS  = -O2 -g -fwrapv
  OPTIONS = USE_LIBCRYPT=1 USE_OPENSSL_AWSLC=1 USE_LUA=1 USE_ZLIB= USE_SLZ=1 USE_NS= USE_SYSTEMD=1 USE_PROMEX=1 USE_PCRE= USE_PCRE_JIT= USE_PCRE2=1 USE_PCRE2_JIT=
  DEBUG   = 

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER -NS -OBSOLETE_LINKER +OPENSSL +OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE +PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=4).
Built with OpenSSL version : OpenSSL 1.1.1 (compatible; AWS-LC 1.25.0)
Running on OpenSSL version : AWS-LC 1.25.0
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with the Prometheus exporter as a service
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.36 2020-12-04
PCRE2 library supports JIT : no (USE_PCRE2_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 10.2.1 20210110

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : prometheus-exporter
Available filters :
	[BWLIM] bwlim-in
	[BWLIM] bwlim-out
	[CACHE] cache
	[COMP] compression
	[FCGI] fcgi-app
	[SPOE] spoe
	[TRACE] trace

Last Outputs and Backtraces

Apr 29 08:44:33.067 n095157 haproxy[2368838]: Thread 3 is about to kill the process.
Apr 29 08:44:33.067 n095157 haproxy[2368838]:   Thread 1 : id=0x7f1c7f456100 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Apr 29 08:44:33.067 n095157 haproxy[2368838]:       1/1    stuck=0 prof=0 harmless=1 isolated=0
Apr 29 08:44:33.067 n095157 haproxy[2368838]:              cpu_ns: poll=11216720929 now=11216725769 diff=4840
Apr 29 08:44:33.067 n095157 haproxy[2368838]:              curr_task=0
Apr 29 08:44:33.067 n095157 haproxy[2368838]:   Thread 2 : id=0x7f1c7f44a700 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Apr 29 08:44:33.067 n095157 haproxy[2368838]:       1/2    stuck=0 prof=0 harmless=1 isolated=0
Apr 29 08:44:33.067 n095157 haproxy[2368838]:              cpu_ns: poll=9116973064 now=9116977435 diff=4371
Apr 29 08:44:33.067 n095157 haproxy[2368838]:              curr_task=0
Apr 29 08:44:33.068 n095157 haproxy[2368838]: *>Thread 3 : id=0x7f1c7e19a700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=1
Apr 29 08:44:33.068 n095157 haproxy[2368838]:       1/3    stuck=1 prof=0 harmless=0 isolated=0
Apr 29 08:44:33.068 n095157 haproxy[2368838]:              cpu_ns: poll=12582501326 now=15069193316 diff=2486691990
Apr 29 08:44:33.068 n095157 haproxy[2368838]:              curr_task=0x7f1ba4d741a0 (task) calls=7 last=0
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                fct=0x55bbcf5ede90(task_run_applet) ctx=0x55bbf031bf30(<PEER>)
Apr 29 08:44:33.068 n095157 haproxy[2368838]:              stream=0x7f1bb0059960: [29/Apr/2024:08:44:30.262153] id=0 proto=?
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                flags=0x42, conn_retries=0, conn_exp=<NEVER> conn_et=0x000 srv_conn=0x55bbd0fafe90, pend_pos=(nil) waiting=0 epoch=0
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                frontend=n095157 (id=4294967295 mode=peers), listener=? (id=0)
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                backend=n095157 (id=4294967295 mode=peers) addr=192.168.95.157:10124
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                server=n095157 (id=0) addr=192.168.95.157:10000
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                task=0x7f1bb0059dc0 (state=0x00 nice=0 calls=2 rate=0 exp=0s tid=2(1/2) age=19884d2h)
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                scf=0x7f1bb0059900 flags=0x00000040 ioto=5s state=EST endp=APPCTX,0x55bbf031bf30,0x50000002 sub=0 rex=5s wex=<NEVER> rto=5s wto=<NEVER>
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                  iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                    app0=0x55bbf031bf30 st0=7 st1=0 applet=<PEER> tid=2 nice=0 calls=7 rate=0
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                scb=0x7f1bb0059e50 flags=0x00000449 ioto=5s state=EST endp=CONN,0x7f1bb009e1c0,0x50640001 sub=1 rex=5s wex=<NEVER> rto=5s wto=<NEVER>
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                  iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                    co1=0x7f1c780a91b0 ctrl=tcpv4 xprt=RAW mux=PASS data=STRM target=SERVER:0x55bbd0fafe90
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                    flags=0x00000300 fd=9 fd.state=10121 updt=1 fd.tmask=0x4
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                req=0x7f1bb0059988 (f=0x848101 an=0x0 tofwd=-1 total=52070)
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                    an_exp=<NEVER> buf=0x7f1bb0059990 data=0x7f1bb004eea0 o=6947 p=6947 i=0 size=20480
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                res=0x7f1bb00599d0 (f=0x80048101 an=0x0 tofwd=-1 total=4)
Apr 29 08:44:33.068 n095157 haproxy[2368838]:                    an_exp=<NEVER> buf=0x7f1bb00599d8 data=(nil) o=0 p=0 i=0 size=0
Apr 29 08:44:33.068 n095157 haproxy[2368838]:              call trace(17):
Apr 29 08:44:33.068 n095157 haproxy[2368838]:              | 0x55bbcf592231 [eb ba 66 66 2e 0f 1f 84]: ha_thread_dump+0x91/0x93
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              | 0x55bbcf592351 [64 48 8b 53 10 64 48 8b]: ha_panic+0x111/0x3f4
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              | 0x7f1c7f785140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              | 0x55bbcf60bf14 [a8 01 74 50 0f b7 48 1f]: eb32_lookup_ge+0x34/0xd9
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              | 0x55bbcf51bd4c [48 85 c0 74 27 48 89 c2]: main+0x11d54c
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              | 0x55bbcf51d7fa [49 89 c5 48 85 c0 0f 84]: main+0x11effa
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              | 0x55bbcf51eaf5 [41 89 c4 85 c0 0f 8e 9a]: main+0x1202f5
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              | 0x55bbcf521a9e [85 c0 0f 8f c7 05 00 00]: main+0x12329e
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              | 0x55bbcf5edfa0 [44 8b 05 81 71 3d 00 45]: task_run_applet+0x110/0xb93
Apr 29 08:44:33.069 n095157 haproxy[2368838]:   Thread 4 : id=0x7f1c7d999700 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Apr 29 08:44:33.069 n095157 haproxy[2368838]:       1/4    stuck=0 prof=0 harmless=1 isolated=0
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              cpu_ns: poll=14478523193 now=14478530274 diff=7081
Apr 29 08:44:33.069 n095157 haproxy[2368838]:              curr_task=0

backtrace:
https://gist.github.com/idl0r/934e0e5af7241f1f4c151c427d6047a6

Additional Information

No response

@idl0r idl0r added status: needs-triage This issue needs to be triaged. type: bug This issue describes a bug. labels Apr 29, 2024
@capflam
Member

capflam commented Apr 29, 2024

Thanks @idl0r, I'll take a look.

@idl0r
Contributor Author

idl0r commented Apr 29, 2024

So at first glance, it seems not to affect 2.9.7 with AWS-LC, at least for now. The system running it has produced no dumps so far, while in the same period the dev9 one produced 2 dumps, but that's no guarantee yet I'd say.
Also, dev7 was already known to be broken for us due to #2525, so only dev8 and dev9 are confirmed broken in this way.

@idl0r
Contributor Author

idl0r commented Apr 30, 2024

Still no crashes on the 2.9.7 node, but many more on the dev9 one. So I'd say only 3.0-dev* is affected by this, and it's likely not AWS-LC related? Peers/stick-tables don't use SSL anyway, do they?

@capflam
Member

capflam commented Apr 30, 2024

It seems the fix for #2525 (21447b1) was shipped with 3.0-dev8 but not with 2.9.7. So either 2.9.7 is not affected by #2525, or the bug is just harder to trigger there. It would be good to test 2.9-HEAD to find out whether that fix is related to your issue. At first glance, it is not obvious to me.

@idl0r, you could try setting the no-memory-trimming global option to see if that changes anything. Do you know if your machine is especially loaded when the crash occurs?
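
For reference, a minimal sketch of what that suggestion looks like in the configuration (the rest of the global section is omitted here):

global
    # disable memory trimming (malloc_trim) attempts, as suggested above
    no-memory-trimming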

@capflam
Member

capflam commented Apr 30, 2024

Another question: are you using the peer sharding mechanism?

@capflam capflam added status: reviewed This issue was reviewed. A fix is required. and removed status: needs-triage This issue needs to be triaged. labels Apr 30, 2024
@idl0r
Contributor Author

idl0r commented Apr 30, 2024

Another question: are you using the peer sharding mechanism?

Nope

@idl0r
Contributor Author

idl0r commented Apr 30, 2024

It seems the fix for #2525 (21447b1) was shipped with 3.0-dev8 but not with 2.9.7. So either 2.9.7 is not affected by #2525, or the bug is just harder to trigger there. It would be good to test 2.9-HEAD to find out whether that fix is related to your issue. At first glance, it is not obvious to me.

It might take me about a week to do that. But the fix should have been included in dev9, shouldn't it? We tested up to dev9. I can test 2.9-HEAD anyway, it just takes some time (vacation :D).

@idl0r, you could try setting the no-memory-trimming global option to see if that changes anything. Do you know if your machine is especially loaded when the crash occurs?

That's already set since #2094.
No, that host is basically idle all the time, more or less.

@capflam
Member

capflam commented May 6, 2024

Sorry for the delay (vacation too). 3.0-dev9 contains the fix for #2525 but 2.9.7 does not; only 2.9-HEAD contains it. So if the issue is caused by 21447b1, 2.9-HEAD should crash too. If not, it is most probably related to my fixes to the peers. In both cases, I cannot see anything obvious.

@idl0r
Contributor Author

idl0r commented May 14, 2024

For the record: I have deployed dev11 on the same node that crashed repeatedly before, and will now monitor it for a few days.

@idl0r
Contributor Author

idl0r commented May 14, 2024

# gdb -batch -n -ex 'set pagination off' -ex 'set print elements 0' -ex 'thread apply all -ascending bt full' -core /tmp/core.994.1782809.1715699458 /usr/sbin/haproxy

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing

warning: Can't open file /dev/shm/haproxy_startup_logs_1291369 (deleted) during file-backed mapping note processing
[New LWP 1782809]
[New LWP 1782811]
[New LWP 1782813]
[New LWP 1782812]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/haproxy -sf 1751910 1694111 -x sockpair@4 -Ws -f /etc/haproxy/haproxy'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055649225c720 in __eb_delete (node=node@entry=0x7fb64847af48) at include/import/ebtree.h:709
709	include/import/ebtree.h: No such file or directory.
[Current thread is 1 (Thread 0x7fb71eb1a240 (LWP 1782809))]

Thread 1 (Thread 0x7fb71eb1a240 (LWP 1782809)):
#0  0x000055649225c720 in __eb_delete (node=node@entry=0x7fb64847af48) at include/import/ebtree.h:709
        pside = 1
        gpside = 0
        parent = 0x7fb6483ca7e8
        gparent = 0x0
        sibtype = <optimized out>
        delete_unlink = <optimized out>
        pside = <optimized out>
        gpside = <optimized out>
        sibtype = <optimized out>
        parent = <optimized out>
        gparent = <optimized out>
#1  eb_delete (node=node@entry=0x7fb64847af48) at src/ebtree.c:25
No locals.
#2  0x0000556492129668 in eb32_delete (eb32=0x7fb64847af48) at include/import/eb32tree.h:89
No locals.
#3  stktable_trash_oldest (t=0x5564941bd360, to_batch=to_batch@entry=16996) at src/stick_table.c:335
        ts = 0x7fb64847af08
        eb = 0x7fb6480dbf80
        max_search = 3090
        max_per_shard = 532
        done_per_shard = 198
        batched = 16363
        looped = 1
        shard = 31
#4  0x0000556492195124 in manage_proxy (t=0x5564b30aa150, context=0x5564941bb700, state=<optimized out>) at src/proxy.c:1957
        budget = 16996
        cleaned_up = <optimized out>
        p = 0x5564941bb700
        next = 0
        wait = <optimized out>
        __func__ = <optimized out>
        out = <optimized out>
#5  0x00005564921e98cd in run_tasks_from_lists (budgets=<optimized out>) at src/task.c:634
        process = <optimized out>
        tl_queues = 0x5564928550d0 <ha_thread_ctx+208>
        t = 0x5564b30aa150
        budget_mask = 15 '\017'
        profile_entry = 0x0
        done = 59
        queue = 1
        state = <optimized out>
        ctx = <optimized out>
        __func__ = <optimized out>
#6  0x00005564921ea243 in process_runnable_tasks () at src/task.c:876
        tt = 0x556492855000 <ha_thread_ctx>
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 80, 0, 0}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = <optimized out>
        max_processed = <optimized out>
        lpicked = <optimized out>
        gpicked = <optimized out>
        heavy_queued = 1
        budget = <optimized out>
        not_done_yet = <optimized out>
#7  0x00005564921b2d49 in run_poll_loop () at src/haproxy.c:3073
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#8  0x00005564921b33b1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x556492612540 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#9  0x0000556492050a1a in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3989
        err = <optimized out>
        retry = <optimized out>
        limit = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        pidfd = <optimized out>
        intovf = <optimized out>
        msg = <optimized out>

Thread 2 (Thread 0x7fb71eb0e6c0 (LWP 1782811)):
#0  0x00007fb71ed05e26 in epoll_wait (epfd=11, events=0x7fb718062cf0, maxevents=200, timeout=timeout@entry=20) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        sc_ret = -4
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x0000556492053bd0 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:232
        timeout = 20
        status = <optimized out>
        fd = <optimized out>
        count = <optimized out>
        updt_idx = <optimized out>
        wait_time = 20
        old_fd = <optimized out>
#2  0x00005564921b2d08 in run_poll_loop () at src/haproxy.c:3145
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#3  0x00005564921b33b1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x556492612540 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#4  0x00007fb71ec86134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140424470652608, 1594414262551347688, -43952, 22, 140726106604048, 140424462262272, -1635296826134330904, -1635296206209777176}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007fb71ed067dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 3 (Thread 0x7fb71d06b6c0 (LWP 1782813)):
#0  0x00007fb71ed05e26 in epoll_wait (epfd=1812, events=0x7fb63c061c70, maxevents=200, timeout=timeout@entry=20) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        sc_ret = -4
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x0000556492053bd0 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:232
        timeout = 20
        status = <optimized out>
        fd = <optimized out>
        count = <optimized out>
        updt_idx = <optimized out>
        wait_time = 20
        old_fd = <optimized out>
#2  0x00005564921b2d08 in run_poll_loop () at src/haproxy.c:3145
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#3  0x00005564921b33b1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x556492612540 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#4  0x00007fb71ec86134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140424442721984, 1594414262551347688, -43952, 22, 140726106604048, 140424434331648, -1635293340231499288, -1635296206209777176}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007fb71ed067dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 4 (Thread 0x7fb71d86c6c0 (LWP 1782812)):
#0  0x00007fb71ed05e26 in epoll_wait (epfd=1809, events=0x7fb648061c70, maxevents=200, timeout=timeout@entry=19) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        sc_ret = -4
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x0000556492053bd0 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:232
        timeout = 19
        status = <optimized out>
        fd = <optimized out>
        count = <optimized out>
        updt_idx = <optimized out>
        wait_time = 19
        old_fd = <optimized out>
#2  0x00005564921b2d08 in run_poll_loop () at src/haproxy.c:3145
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#3  0x00005564921b33b1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x556492612540 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#4  0x00007fb71ec86134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140424451114688, 1594414262551347688, -43952, 22, 140726106604048, 140424442724352, -1635294439206256152, -1635296206209777176}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007fb71ed067dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

@idl0r
Contributor Author

idl0r commented May 14, 2024

That's probably still the same issue, I guess?

@Darlelet
Contributor

Darlelet commented May 14, 2024

It may be related somehow, but it's a different crash this time:

The first one (from your original report) was watchdog-triggered because too much time was spent in a single function.
This last one, however, suggests a race somewhere: the process crashed with SIGSEGV in manage_proxy()->stktable_trash_oldest(), which runs periodically alongside the traffic during stopping to prune expired entries.
(i.e. the process was already busy doing heavy peer work on one thread when the automatic cleanup kicked in on another thread; we're missing a lock somewhere and end up reading garbage data due to unprotected access?)

As for the original crash (watchdog-triggered), maybe it's also manage_proxy()->stktable_trash_oldest() (which may prune up to 32K entries per batch) that causes contention with the heavy peers work running in parallel, as both threads may be competing for the same peers locks?
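
To make the suspected race easier to picture, here is a generic, self-contained C sketch of the pattern being described. It is not HAProxy code; the list, the lock and the function names are hypothetical stand-ins for the update tree, its updt_lock and the cleanup/teach paths:

#include <pthread.h>
#include <stdlib.h>

struct node {
        unsigned key;
        struct node *next;
};

static struct node *head;
static pthread_rwlock_t updt_lock = PTHREAD_RWLOCK_INITIALIZER;

/* "peer" side: find the first key >= x under the read lock, the way a teach
 * loop scans pending updates */
static unsigned lookup_ge(unsigned x)
{
        struct node *n;
        unsigned found = 0;

        pthread_rwlock_rdlock(&updt_lock);
        for (n = head; n; n = n->next) {
                if (n->key >= x) {
                        found = n->key;
                        break;
                }
        }
        pthread_rwlock_unlock(&updt_lock);
        return found;
}

/* "cleanup" side, buggy variant: unlinks and frees the oldest entry without
 * taking the write lock, so a concurrent lookup_ge() may still be walking it
 * and end up dereferencing freed memory */
static void trash_oldest_buggy(void)
{
        struct node *n = head;

        if (n) {
                head = n->next;   /* unprotected access: no wrlock taken */
                free(n);          /* a reader may still hold a pointer to n */
        }
}

/* fixed variant: the write lock guarantees readers are either done walking
 * or will only start once the node has been unlinked and freed */
static void trash_oldest_fixed(void)
{
        struct node *n;

        pthread_rwlock_wrlock(&updt_lock);
        n = head;
        if (n) {
                head = n->next;
                free(n);
        }
        pthread_rwlock_unlock(&updt_lock);
}

int main(void)
{
        struct node *n = calloc(1, sizeof(*n));

        if (!n)
                return 1;
        n->key = 42;
        head = n;
        (void)lookup_ge(40);
        trash_oldest_fixed();
        (void)trash_oldest_buggy; /* shown for contrast only, not called */
        return 0;
}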

@capflam
Member

capflam commented May 14, 2024

Thanks @idl0r. This one is interesting, and it is probably the same issue indeed. It seems stktable_trash_oldest() has the same bug as the one fixed by 21447b1. I guess the following patch should fix the issue:

diff --git a/src/stick_table.c b/src/stick_table.c
index a1cc03c0ec..b0338abb7f 100644
--- a/src/stick_table.c
+++ b/src/stick_table.c
@@ -264,6 +264,7 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
        int max_per_shard = (to_batch + CONFIG_HAP_TBL_BUCKETS - 1) / CONFIG_HAP_TBL_BUCKETS;
        int done_per_shard;
        int batched = 0;
+       int updt_locked;
        int looped;
        int shard;
 
@@ -271,6 +272,7 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
 
        while (batched < to_batch) {
                done_per_shard = 0;
+               updt_locked = 0;
                looped = 0;
 
                HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock);
@@ -328,18 +330,35 @@ int stktable_trash_oldest(struct stktable *t, int to_batch)
                                continue;
                        }
 
-                       /* session expired, trash it */
-                       ebmb_delete(&ts->key);
+                       /* if the entry is in the update list, we must be extremely careful
+                        * because peers can see it at any moment and start to use it. Peers
+                        * will take the table's updt_lock for reading when doing that, and
+                        * with that lock held, will grab a ref_cnt before releasing the
+                        * lock. So we must take this lock as well and check the ref_cnt.
+                        */
                        if (ts->upd.node.leaf_p) {
-                               HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->updt_lock);
-                               eb32_delete(&ts->upd);
-                               HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->updt_lock);
+                               if (!updt_locked) {
+                                       updt_locked = 1;
+                                       HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->updt_lock);
+                               }
+                               /* now we're locked, new peers can't grab it anymore,
+                                * existing ones already have the ref_cnt.
+                                */
+                               if (HA_ATOMIC_LOAD(&ts->ref_cnt))
+                                       continue;
                        }
+
+                       /* session expired, trash it */
+                       ebmb_delete(&ts->key);
+                       eb32_delete(&ts->upd);
                        __stksess_free(t, ts);
                        batched++;
                        done_per_shard++;
                }
 
+               if (updt_locked)
+                       HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->updt_lock);
+
                HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->shards[shard].sh_lock);
 
                if (max_search <= 0)

@wtarreau any opinion on that? I used the same locking mechanism as the one used in process_table_expire().

@idl0r
Contributor Author

idl0r commented May 15, 2024

@capflam do you want me to give it a try?

@capflam
Member

capflam commented May 15, 2024

If it is possible on your side, it could be helpful indeed.

@idl0r
Contributor Author

idl0r commented May 15, 2024

dev11 with your patch.

Core was generated by `/usr/sbin/haproxy -sf 2516592 -x sockpair@4 -Ws -f /etc/haproxy/haproxy.cfg -S'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f2c13a446c0 (LWP 2520030))]

Thread 1 (Thread 0x7f2c13a446c0 (LWP 2520030)):
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {0}}
        ret = <optimized out>
#1  0x00007f2c13bbde8f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
No locals.
#2  0x00007f2c13b6efb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x00007f2c13b59472 in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
#4  0x00005571cc0252a0 in ha_panic () at src/debug.c:624
        old = <optimized out>
        thr = <optimized out>
        old = <optimized out>
        thr = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
#5  ha_panic () at src/debug.c:569
        old = <optimized out>
        thr = 0
        __v = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
#6  <signal handler called>
No locals.
#7  __pl_wait_unlock_long (mask=<optimized out>, lock=<optimized out>) at include/import/plock.h:96
        loops = 130555
        ret = <optimized out>
        m = 225411
        ret = <optimized out>
        m = <optimized out>
        loops = <optimized out>
#8  pl_wait_unlock_long (lock=0x5571cdd68660, mask=mask@entry=4294967288) at include/import/plock.h:123
No locals.
#9  0x00005571cc0ff3a1 in process_table_expire (task=0x5571e72daa20, context=0x5571cdd67e60, state=<optimized out>) at src/stick_table.c:915
        __lk_r = 0x5571cdd68660
        __set_r = 21474836484
        __msk_r = 18446744069414584320
        __pl_r = <optimized out>
        t = 0x5571cdd67e60
        ts = 0x7f2b440b3b40
        eb = <optimized out>
        updt_locked = 1
        looped = <optimized out>
        exp_next = <optimized out>
        task_exp = 907762
        shard = <optimized out>
#10 0x00005571cc1c093d in run_tasks_from_lists (budgets=<optimized out>) at src/task.c:634
        process = <optimized out>
        tl_queues = 0x5571cc82d2d0 <ha_thread_ctx+720>
        t = 0x5571e72daa20
        budget_mask = 15 '\017'
        profile_entry = 0x0
        done = 0
        queue = 1
        state = <optimized out>
        ctx = <optimized out>
        __func__ = <optimized out>
#11 0x00005571cc1c12b3 in process_runnable_tasks () at src/task.c:876
        tt = 0x5571cc82d200 <ha_thread_ctx+512>
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 139, 0, 0}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = <optimized out>
        max_processed = <optimized out>
        lpicked = <optimized out>
        gpicked = <optimized out>
        heavy_queued = 1
        budget = <optimized out>
        not_done_yet = <optimized out>
#12 0x00005571cc189db9 in run_poll_loop () at src/haproxy.c:3073
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#13 0x00005571cc18a421 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x5571cc5ea540 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#14 0x00007f2c13bbc134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139827284821696, -7013473067111592739, -43952, 22, 140721269081200, 139827276431360, 6985931985399794909, 6985931917898632413}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#15 0x00007f2c13c3c7dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 2 (Thread 0x7f2c11fa16c0 (LWP 2520032)):
#0  0x00005571cc22ea9f in eb32_lookup_ge (root=<optimized out>, x=2147542070) at src/eb32tree.c:184
        node = 0x5571ed2316f0
        troot = 0x5571ed2316f1
#1  0x00005571cc14128c in peer_teach_stage1_stksess_lookup (st=st@entry=0x5571e8d29d50) at src/peers.c:1497
        eb = <optimized out>
        ret = <optimized out>
#2  0x00005571cc142d4a in peer_send_teachmsgs (appctx=appctx@entry=0x5571ed1db780, p=p@entry=0x5571cdb506f0, peer_stksess_lookup=peer_stksess_lookup@entry=0x5571cc141270 <peer_teach_stage1_stksess_lookup>, st=st@entry=0x5571e8d29d50) at src/peers.c:1578
        ts = <optimized out>
        updateid = <optimized out>
        ret = <optimized out>
        new_pushed = 0
        use_timed = 1
        updates_sent = 12
#3  0x00005571cc143f87 in peer_send_teach_stage1_msgs (st=0x5571e8d29d50, p=0x5571cdb506f0, appctx=0x5571ed1db780) at src/peers.c:1660
No locals.
#4  peer_send_msgs (appctx=appctx@entry=0x5571ed1db780, peer=peer@entry=0x5571cdb506f0, peers=peers@entry=0x5571cdb1ef00) at src/peers.c:2643
        st = 0x5571e8d29d50
        last_local_table = 0x5571e8d42280
        updates = 0
        repl = <optimized out>
#5  0x00005571cc147100 in peer_io_handler (appctx=<optimized out>) at src/peers.c:3120
        msg_len = <optimized out>
        msg_end = 0x7f2b3803b6f0 ' ' <repeats 17 times>, "iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0\n", ' ' <repeats 17 times>, "iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0\n"
        msg_head = "n\0178+\177\000"
        msg_cur = 0x7f2b3803b6f0 ' ' <repeats 17 times>, "iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0\n", ' ' <repeats 17 times>, "iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0\n"
        totl = <optimized out>
        sc = <optimized out>
        s = <optimized out>
        curpeers = <optimized out>
        curpeer = 0x5571cdb506f0
        reql = 0
        repl = <optimized out>
        maj_ver = <optimized out>
        min_ver = <optimized out>
        prev_state = 7
        switchstate = <optimized out>
        __func__ = <optimized out>
#6  0x00005571cc210f15 in task_run_applet (t=0x7f2c0c0bfba0, context=0x5571ed1db780, state=<optimized out>) at src/applet.c:815
        app = 0x5571ed1db780
        sc = <optimized out>
        sco = <optimized out>
        ic = <optimized out>
        oc = 0x7f2b380f6ea0
        rate = <optimized out>
        input = 2219008
        output = 0
        did_send = 0
        __FUNCTION__ = "task_run_applet"
#7  0x00005571cc1c093d in run_tasks_from_lists (budgets=<optimized out>) at src/task.c:634
        process = <optimized out>
        tl_queues = 0x5571cc82d6d0 <ha_thread_ctx+1744>
        t = 0x7f2c0c0bfba0
        budget_mask = 15 '\017'
        profile_entry = 0x0
        done = 0
        queue = 1
        state = <optimized out>
        ctx = <optimized out>
        __func__ = <optimized out>
#8  0x00005571cc1c12b3 in process_runnable_tasks () at src/task.c:876
        tt = 0x5571cc82d600 <ha_thread_ctx+1536>
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 107, 0, 0}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = <optimized out>
        max_processed = <optimized out>
        lpicked = <optimized out>
        gpicked = <optimized out>
        heavy_queued = 1
        budget = <optimized out>
        not_done_yet = <optimized out>
#9  0x00005571cc189db9 in run_poll_loop () at src/haproxy.c:3073
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#10 0x00005571cc18a421 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x5571cc5ea540 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#11 0x00007f2c13bbc134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139827256891072, -7013473067111592739, -43952, 22, 140721269081200, 139827248500736, 6985928088253844701, 6985931917898632413}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#12 0x00007f2c13c3c7dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 3 (Thread 0x7f2c13a50240 (LWP 2520027)):
#0  0x00007f2c13c3be26 in epoll_wait (epfd=5, events=0x5571ece4af90, maxevents=200, timeout=timeout@entry=2) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        sc_ret = -4
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x00005571cc02abd0 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:232
        timeout = 2
        status = <optimized out>
        fd = <optimized out>
        count = <optimized out>
        updt_idx = <optimized out>
        wait_time = 2
        old_fd = <optimized out>
#2  0x00005571cc189d78 in run_poll_loop () at src/haproxy.c:3145
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#3  0x00005571cc18a421 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x5571cc5ea540 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#4  0x00005571cc027a1a in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3989
        err = <optimized out>
        retry = <optimized out>
        limit = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        pidfd = <optimized out>
        intovf = <optimized out>
        msg = <optimized out>

Thread 4 (Thread 0x7f2c127a26c0 (LWP 2520031)):
#0  0x00007f2c13c3be26 in epoll_wait (epfd=1807, events=0x7f2b44061c70, maxevents=200, timeout=timeout@entry=2) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        sc_ret = -4
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x00005571cc02abd0 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:232
        timeout = 2
        status = <optimized out>
        fd = <optimized out>
        count = <optimized out>
        updt_idx = <optimized out>
        wait_time = 2
        old_fd = <optimized out>
#2  0x00005571cc189d78 in run_poll_loop () at src/haproxy.c:3145
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#3  0x00005571cc18a421 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x5571cc5ea540 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#4  0x00007f2c13bbc134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139827265283776, -7013473067111592739, -43952, 22, 140721269081200, 139827256893440, 6985935785372110045, 6985931917898632413}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007f2c13c3c7dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

@capflam
Member

capflam commented May 15, 2024

So here we are, back to the first crash. The table's update lock seems to be held for too long in peer_send_teachmsgs() and process_table_expire() (and probably in stktable_trash_oldest() too, because of my patch). I don't know if it is possible to reduce the scope of this lock.
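
As an aside, here is a minimal generic sketch (not the actual HAProxy code, all names are hypothetical) of one common way to keep a lock from being held too long: do the expensive walk in bounded chunks and release the lock between chunks, so that other waiters such as the peers teach loop can make progress in between:

#include <pthread.h>

#define CHUNK 64

static pthread_rwlock_t updt_lock = PTHREAD_RWLOCK_INITIALIZER;
static int remaining = 1000;

/* stand-in for expiring/teaching one entry; returns 0 once nothing is left */
static int process_one(void)
{
        return remaining > 0 ? remaining-- : 0;
}

static void purge_in_chunks(void)
{
        int more = 1;

        while (more) {
                int n = 0;

                pthread_rwlock_wrlock(&updt_lock);
                while (n < CHUNK && (more = process_one()))
                        n++;
                pthread_rwlock_unlock(&updt_lock);
                /* the lock is released here, letting any other waiter grab
                 * it before the next chunk is processed */
        }
}

int main(void)
{
        purge_in_chunks();
        return 0;
}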

@capflam
Member

capflam commented May 21, 2024

@idl0r, I don't know if you have an opportunity to test the following patches on top of 3.0-dev12. It would be really helpful to validate that they fix your issue before the release, planned for next week. I have performed only very limited tests so far. The idea is to prioritize the resynchronization on reload and to postpone the purge of the stick-table when contention is detected (see the rough sketch after the patch list below).

0001-BUG-MEDIUM-stick-tables-Fix-race-with-peers-when-tra.patch.txt
0002-MEDIUM-stick-tables-Don-t-own-update-lock-too-long-w.patch.txt
0003-MEDIUM-stick-tables-Postpone-expired-sessions-purge-.patch.txt
0004-MEDIUM-stick-tables-Deal-with-updt_lock-contentions-.patch.txt
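
As a rough illustration only (this is not what the patches above actually do, and every name here is hypothetical), the "postpone the purge on contention" idea could look like a try-lock that gives up and reschedules instead of blocking:

#include <pthread.h>

#define PURGE_RETRY_MS 100

static pthread_rwlock_t updt_lock = PTHREAD_RWLOCK_INITIALIZER;

/* stand-in for the purge of expired entries done under the update lock */
static void purge_some_entries(void)
{
}

/* returns the delay in ms before the purge task should run again:
 * 0 when the purge ran, PURGE_RETRY_MS when it was postponed */
static int try_purge(void)
{
        if (pthread_rwlock_trywrlock(&updt_lock) != 0) {
                /* someone (e.g. a peers teach loop) holds the lock:
                 * postpone rather than pile up behind it */
                return PURGE_RETRY_MS;
        }
        purge_some_entries();
        pthread_rwlock_unlock(&updt_lock);
        return 0;
}

int main(void)
{
        return try_purge();
}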

@capflam capflam added the status: feedback required The developers are waiting for a reply from the reporter. label May 21, 2024
@wtarreau
Member

BTW, if you apply the patches, please rebase on top of latest master (39caa20) as it fixes an issue which can cause some FD leaks or even some incorrectly set options on connections to the backend once the FD number reaches 4096!

@idl0r
Contributor Author

idl0r commented May 21, 2024

No problem! Can do that tomorrow and I might be able to give feedback in the evening already.
Thanks, guys!

@idl0r
Contributor Author

idl0r commented May 22, 2024

Ok, here we go:
3.0-ss-20240522, which contains 39caa20 according to the changelog (I also checked the source).
On top of that, the patches mentioned by @capflam.
This one looks different:

Core was generated by `/usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -S /run/haproxy-master.sock -'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7fd7c13926c0 (LWP 2600327))]

Thread 1 (Thread 0x7fd7c13926c0 (LWP 2600327)):
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {0}}
        ret = <optimized out>
#1  0x00007fd7c278ce8f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
No locals.
#2  0x00007fd7c273dfb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x00007fd7c2728472 in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
#4  0x0000562085966260 in ha_panic () at src/debug.c:624
        old = <optimized out>
        thr = <optimized out>
        old = <optimized out>
        thr = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
#5  ha_panic () at src/debug.c:569
        old = <optimized out>
        thr = 0
        __v = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
        __v = <optimized out>
#6  <signal handler called>
No locals.
#7  eb32_lookup_ge (root=<optimized out>, x=2147514982) at include/import/ebtree.h:414
        node = <optimized out>
        troot = 0x5620a7e17191
#8  0x0000562085a8368c in peer_teach_stage1_stksess_lookup (st=st@entry=0x5620a15786a0) at src/peers.c:1497
        eb = <optimized out>
        ret = <optimized out>
#9  0x0000562085a8514a in peer_send_teachmsgs (appctx=appctx@entry=0x5620a80c0ed0, p=p@entry=0x562087cdd4f0, peer_stksess_lookup=peer_stksess_lookup@entry=0x562085a83670 <peer_teach_stage1_stksess_lookup>, st=st@entry=0x5620a15786a0) at src/peers.c:1578
        ts = <optimized out>
        updateid = <optimized out>
        ret = <optimized out>
        new_pushed = 0
        use_timed = 1
        updates_sent = 45
#10 0x0000562085a86387 in peer_send_teach_stage1_msgs (st=0x5620a15786a0, p=0x562087cdd4f0, appctx=0x5620a80c0ed0) at src/peers.c:1660
No locals.
#11 peer_send_msgs (appctx=appctx@entry=0x5620a80c0ed0, peer=peer@entry=0x562087cdd4f0, peers=peers@entry=0x562087cabed0) at src/peers.c:2643
        st = 0x5620a15786a0
        last_local_table = 0x5620a1591350
        updates = 0
        repl = <optimized out>
#12 0x0000562085a89500 in peer_io_handler (appctx=<optimized out>) at src/peers.c:3120
        msg_len = <optimized out>
        msg_end = 0x7fd74003b6f0 "  Thread 4 : id=0x7fd7c0b916c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0\n      1/4    stuck=0 prof=0 harmless=1 isolated=0\n", ' ' <repeats 13 times>, "cpu_ns: poll=11754603650 now=11754622470 diff=18820\n", ' ' <repeats 13 times>, "curr_task=0\n"
        msg_head = "\212\a@\327\177\000"
        msg_cur = 0x7fd74003b6f0 "  Thread 4 : id=0x7fd7c0b916c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0\n      1/4    stuck=0 prof=0 harmless=1 isolated=0\n", ' ' <repeats 13 times>, "cpu_ns: poll=11754603650 now=11754622470 diff=18820\n", ' ' <repeats 13 times>, "curr_task=0\n"
        totl = <optimized out>
        sc = <optimized out>
        s = <optimized out>
        curpeers = <optimized out>
        curpeer = 0x562087cdd4f0
        reql = 0
        repl = <optimized out>
        maj_ver = <optimized out>
        min_ver = <optimized out>
        prev_state = 7
        switchstate = <optimized out>
        __func__ = <optimized out>
#13 0x0000562085b53565 in task_run_applet (t=0x7fd73019f9b0, context=0x5620a80c0ed0, state=<optimized out>) at src/applet.c:815
        app = 0x5620a80c0ed0
        sc = <optimized out>
        sco = <optimized out>
        ic = <optimized out>
        oc = 0x7fd740078af0
        rate = <optimized out>
        input = 2201824
        output = 0
        did_send = 0
        __FUNCTION__ = "task_run_applet"
#14 0x0000562085b02d7d in run_tasks_from_lists (budgets=<optimized out>) at src/task.c:634
        process = <optimized out>
        tl_queues = 0x56208616f5d0 <ha_thread_ctx+1232>
        t = 0x7fd73019f9b0
        budget_mask = 15 '\017'
        profile_entry = 0x0
        done = 0
        queue = 1
        state = <optimized out>
        ctx = <optimized out>
        __func__ = <optimized out>
#15 0x0000562085b036f3 in process_runnable_tasks () at src/task.c:876
        tt = 0x56208616f500 <ha_thread_ctx+1024>
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 107, 0, 0}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = <optimized out>
        max_processed = <optimized out>
        lpicked = <optimized out>
        gpicked = <optimized out>
        heavy_queued = 1
        budget = <optimized out>
        not_done_yet = <optimized out>
#16 0x0000562085acc1d9 in run_poll_loop () at src/haproxy.c:3073
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#17 0x0000562085acc841 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x562085f2c5c0 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#18 0x00007fd7c278b134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140564636444352, 4924449491120200739, -43952, 2, 140721212461136, 140564628054016, -4901980417091789789, -4901978760717231069}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#19 0x00007fd7c280b7dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 2 (Thread 0x7fd7c0b916c0 (LWP 2600328)):
#0  0x00007fd7c280ae26 in epoll_wait (epfd=1819, events=0x7fd730061be0, maxevents=200, timeout=timeout@entry=34) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        sc_ret = -4
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x000056208596bc20 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:232
        timeout = 34
        status = <optimized out>
        fd = <optimized out>
        count = <optimized out>
        updt_idx = <optimized out>
        wait_time = 34
        old_fd = <optimized out>
#2  0x0000562085acc198 in run_poll_loop () at src/haproxy.c:3145
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#3  0x0000562085acc841 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x562085f2c5c0 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#4  0x00007fd7c278b134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140564628051648, 4924449491120200739, -43952, 2, 140721212461136, 140564619661312, -4901981516066546653, -4901978760717231069}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007fd7c280b7dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 3 (Thread 0x7fd7c26136c0 (LWP 2600326)):
#0  0x00007fd7c280ae26 in epoll_wait (epfd=1811, events=0x7fd7bc062c60, maxevents=200, timeout=timeout@entry=51) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        sc_ret = -4
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x000056208596bc20 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:232
        timeout = 51
        status = <optimized out>
        fd = <optimized out>
        count = <optimized out>
        updt_idx = <optimized out>
        wait_time = 51
        old_fd = <optimized out>
#2  0x0000562085acc198 in run_poll_loop () at src/haproxy.c:3145
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#3  0x0000562085acc841 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x562085f2c5c0 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#4  0x00007fd7c278b134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140564655847104, 4924449491120200739, -43952, 2, 140721212461136, 140564647456768, -4901978555760337885, -4901978760717231069}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007fd7c280b7dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 4 (Thread 0x7fd7c261f240 (LWP 2600324)):
#0  0x0000562085a41b1b in process_table_expire (task=0x5620a156baf0, context=0x562087eeb060, state=<optimized out>) at src/stick_table.c:959
        __pl_r = 4
        t = 0x562087eeb060
        ts = 0x5620a733aed0
        eb = 0x5620a76d2a08
        updt_locked = 0
        looped = 1
        exp_next = <optimized out>
        task_exp = 391064
        shard = 14
        __func__ = <optimized out>
#1  0x0000562085b02d7d in run_tasks_from_lists (budgets=<optimized out>) at src/task.c:634
        process = <optimized out>
        tl_queues = 0x56208616f1d0 <ha_thread_ctx+208>
        t = 0x5620a156baf0
        budget_mask = 15 '\017'
        profile_entry = 0x0
        done = 0
        queue = 1
        state = <optimized out>
        ctx = <optimized out>
        __func__ = <optimized out>
#2  0x0000562085b036f3 in process_runnable_tasks () at src/task.c:876
        tt = 0x56208616f100 <ha_thread_ctx>
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 139, 0, 0}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = <optimized out>
        max_processed = <optimized out>
        lpicked = <optimized out>
        gpicked = <optimized out>
        heavy_queued = 1
        budget = <optimized out>
        not_done_yet = <optimized out>
#3  0x0000562085acc1d9 in run_poll_loop () at src/haproxy.c:3073
        next = <optimized out>
        wake = <optimized out>
        __func__ = <optimized out>
#4  0x0000562085acc841 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
        ptaf = <optimized out>
        ptif = 0x562085f2c5c0 <per_thread_init_list>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        init_cond = {__data = {__wseq = {__value64 = 13, __value32 = {__low = 13, __high = 0}}, __g1_start = {__value64 = 7, __value32 = {__low = 7, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 0, __g_signals = {0, 0}}, __size = "\r\000\000\000\000\000\000\000\a", '\000' <repeats 23 times>, "\f", '\000' <repeats 14 times>, __align = 13}
#5  0x00005620859689da in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3989
        err = <optimized out>
        retry = <optimized out>
        limit = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        pidfd = <optimized out>
        intovf = <optimized out>
        msg = <optimized out>

@wtarreau
Member

Strange, it looks like it was killed while working. @capflam I didn't check the patches, but do you think there's a possibility that it was looping forever trying to find new entries, maybe?

@wtarreau
Member

Hmmm, I hadn't checked, but only the stick-table code was touched, nothing changed for peers, so maybe we just pushed the limit a bit further and we're now looking at an exhausted thread that has slowly progressed. I really think the peers should count the time spent waiting and quit earlier on contention. That's tricky to do, though.

@wtarreau
Member

@idl0r you can limit the number of loops trying to send if you want. It defaults to 200, is controlled by tune.peers.max-updates-at-once in the global section, and it crashed at 45! You could try a smaller value, e.g. 10. I still don't understand how it did not progress more than that in this time frame. I'm wondering whether that function isn't itself called from another loop that keeps checking.
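
For reference, a minimal global-section sketch with the smaller value suggested above (the directive name comes from the comment; the value 10 is just the test value, not a recommendation):

global
        tune.peers.max-updates-at-once 10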

@idl0r
Contributor Author

idl0r commented May 22, 2024

Set to 10. Trying again

@wtarreau
Member

And I confirm that dev6 doesn't crash (but it's way slower).

@wtarreau
Member

My mistake: 3.0-dev6 crashes as well, it just takes more time. I keep hitting memory corruption detected in the pools. I'm worried that we could be detecting another bug that nullifies the bisect :-/

@wtarreau
Member

3.0-dev0 crashes similarly :-(

@wtarreau
Member

2.9.7 crashes similarly (but takes much more time, 2 crashes/min) and 2.8.9 doesn't seem to, but it's also 5 times slower than 2.9.7. At least it gives an entry point.

@wtarreau
Member

So with this commit which doubles the perf:
29982ea ("MEDIUM: peers: only read-lock peer_send_teachmsgs()")

I'm seeing about 1 segv every 2 minutes, and with the next one which gets 30-50%
more perf:
87e072e ("MEDIUM: stick-table: use a distinct lock for the updates tree")

I get 10 core dumps in the first few seconds. I'm thus wondering how much the performance level affects these measurements (e.g. depending on table size, forced evictions, etc.), because this completely invalidates some of my bisects. I really want to be sure about the first faulty commit, not the first one that happens to reproduce the issue on that platform.

@wtarreau
Member

Also, it seems that below 47-48 active peers I can't reproduce it anymore. I'll push the limit further; that machine has enough RAM :-)

@idl0r
Contributor Author

idl0r commented May 23, 2024

Interesting. We had no segfaults with dev6, and also none with 2.9.7, which has been running rock solid on all other nodes until now.

@idl0r
Contributor Author

idl0r commented May 23, 2024

I was already thinking about building an injector as well as some peer receivers, which could perhaps also be used for VTest, to simulate high-load peer syncing.

@wtarreau
Member

For the record, I'm currently using this:

global
        stats socket "/tmp/haproxy-port-80$HAPROXY_LOCALPEER.stat" mode 666 level admin
        stats timeout 1d
        nbthread 16
defaults
        mode http
        timeout connect 5s
        timeout http-request 5s
        timeout queue 30s
        timeout tarpit 1m
        timeout client 60s
        timeout server 60s

frontend stats
        bind ":81$HAPROXY_LOCALPEER"
        stats uri /

peers peers
        peer 01 127.0.0.1:10001
        peer 02 127.0.0.1:10002
        ...
        peer 49 127.0.0.1:10049
        peer 50 127.0.0.1:10050

frontend ft
        bind :"80$HAPROXY_LOCALPEER"
        http-request track-sc0 rand(100000) table tbl2
        filter bwlim-out lim-out limit 2047m key rand(100000000),ipmask(32) min-size 1 table tbl2
        http-request set-bandwidth-limit lim-out
        use_backend bk

backend bk
        server s1 198.18.0.30:8000
        server s2 198.18.0.34:8000

backend tbl2
        stick-table type ip size 5k expire 10s store http_req_cnt,bytes_in_rate(1s),bytes_out_rate(1s) peers peers

I start 50 instances of it like this:
$ for i in {1..50}; do ./haproxy -dMno-merge,global,caller,tag,integrity,cold-first -D -f stktbl-bwlim-50peers2.cfg -L $(printf "%02d" $i); done

I have httpterm running on s1 & s2, and I'm sending the load from the machine itself using h1load this way:

$ ~/h1load/h1load -e -c 100 0:8001

I'll retry with varying parameters (more/fewer processes, threads, etc.) to make sure that with the very first commit it instantly dies.

@idl0r
Contributor Author

idl0r commented May 23, 2024

The bisect says "f37ddbe is the first bad commit", but I doubt that very much. It's difficult to say whether a commit was good, because sometimes it crashed fast and sometimes it ran for hours :/
I may have to redo that bisect and let it run for much longer.

@wtarreau
Member

Christian, we finally found it just now! The real culprit is this one:

cfeca3a ("MEDIUM: stick-table: touch updates under an upgradable read lock")

It ran 5 minutes at 50 peers with zero problems, versus crashes within the first 2-5 seconds until now. The inter-dependencies and assumptions of the various functions are anything but obvious, but at some point we started to seriously suspect a double insertion in unprotected code. The protection was insufficient (a TOCTOU race), and it could indeed happen that the same key, learned from two distinct peers on different threads at the same time, causes this double insert. The only way to trigger this is indeed to have many peers...

We can also confirm that the two patches marked BUG above are needed. I'm going to write the fix, merge it along with the two other fixes, and I'll let you know once this is done. Please note that we also noticed twice (in hundreds of tests) a crash on appctx_wakeup() in the peers code with a null appctx; it seems to have happened when several peers were lost at the exact same time (we lost them in packs of 10-20 due to the crashes). But this is independent and might have been there for a while. We'll try to reproduce and fix that one as well, but I'm much less worried about it anyway. Stay tuned!
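
To make the double-insert race concrete, here is a minimal, self-contained C sketch of the pattern the fix relies on (recheck after taking the lock). This is not HAProxy's actual code: the struct, the leaf_p marker and the lock are simplified stand-ins for the real stksess/ebtree node and the update-tree lock.

#include <pthread.h>

/* Illustrative stand-ins only: in the real code, an ebtree node's
 * leaf_p pointer is non-NULL once the node is linked into a tree. */
struct entry {
        unsigned int key;
        void *leaf_p;       /* non-NULL once in the update tree */
};

static pthread_rwlock_t updt_lock = PTHREAD_RWLOCK_INITIALIZER;

/* placeholder for the real tree insertion */
static void tree_insert(struct entry *e)
{
        e->leaf_p = e;
}

void touch_entry(struct entry *e)
{
        /* cheap pre-check outside the lock: racy on its own (TOCTOU) */
        if (e->leaf_p)
                return;

        pthread_rwlock_wrlock(&updt_lock);
        /* recheck under the lock: another thread learning the same key
         * from a different peer may have inserted it in the meantime */
        if (!e->leaf_p)
                tree_insert(e);
        pthread_rwlock_unlock(&updt_lock);
}

Without the recheck, two threads that both passed the cheap pre-check would insert the same node twice, which is exactly the kind of double insertion that corrupts the tree.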

haproxy-mirror pushed a commit that referenced this issue May 23, 2024
…ession

When a sticky session is killed, we must be sure no other entity is still
referencing it. The session's ref_cnt must be 0. However, there is a race
with peers, as described in 21447b1 ("BUG/MAJOR: stick-tables: fix race
with peers in entry expiration"). When the update lock is acquired, we must
recheck the ref_cnt value.

This patch is part of a debugging session about issue #2552. It must be
backported to 2.9.
haproxy-mirror pushed a commit that referenced this issue May 23, 2024
…ntries

In GH issue #2552, Christian Ruppert reported an increase in crashes
with recent 3.0-dev versions, always related with stick-tables and peers.
One particularity of his config is that it has a lot of peers.

While trying to reproduce, it empirically was found that firing 10 load
generators at 10 different haproxy instances tracking a random key among
100k against a table of max 5k entries, on 8 threads and between a total
of 50 parallel peers managed to reproduce the crashes in seconds, very
often in ebtree deletion or insertion code, but not only.

The debugging revealed that the crashes are often caused by a parent node
being corrupted while delete/insert tries to update it regarding a recently
inserted/removed node, and that that corrupted node had always been proven
to be deleted, then immediately freed, so it ought not be visited in the
tree from functions enclosed between a pair of lock/unlock. As such the
only possibility was that it had experienced unexpected inserts. Also,
running with pool integrity checking would 90% of the time cause crashes
during allocation based on corrupted contents in the node, likely because
it was found at two places in the same tree and still present as a parent
of a node being deleted or inserted (hence the __stksess_free and
stktable_trash_oldest callers being visible on these items).

Indeed the issue is in fact related to the test set (occasionally redundant
keys, many peers). What happens is that sometimes, the same key is learned
from two different peers. When it is learned for the first time, we end up
in stktable_touch_with_exp() in the "else" branch, where the test for
existence is made before taking the lock (since commit cfeca3a
("MEDIUM: stick-table: touch updates under an upgradable read lock") that
was merged in 2.9), and from there the entry is added. But if one of the
threads manages to insert it before the other thread takes the lock, then
the second thread will try to insert this node again. And inserting an
already inserted node will corrupt the tree (note that we never switched
to enforcing a check in insertion code on this due to API history that
would break various code parts).

Here the solution is simple, it requires to recheck leaf_p after getting
the lock, to avoid touching anything if the entry has already been
inserted in the meantime.

Many thanks to Christian Ruppert for testing this and for his invaluable
help on this hard-to-trigger issue.

This fix needs to be backported to 2.9.
@wtarreau
Member

OK Christian, I've pushed this branch with all needed fixes: 20240523-fix-table-locking-3
I intend to merge it, but since I'm totally exhausted by this afternoon's debugging session, I'd appreciate a confirmation before I do ;-) If you don't have time, no worries, I'll re-check everything tomorrow morning and merge it then. Thanks again for your invaluable help on this one, we would never have figured it out without your thorough tests and observations!

@idl0r
Contributor Author

idl0r commented May 23, 2024

Very welcome!
I'll apply it tomorrow morning on top of the latest snapshot and let it run on that one node.

@wtarreau
Member

Thanks. Note that the branch already contains all the latest fixes, so it should be as simple as just checking out that branch (if it's easier for you to pick the patches, a git log or git format-patch from master will list them all and they'll apply cleanly). It's running fine now, no more crashes here on the 80-core system with 50 processes all talking to all the others.
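
If picking the patches is easier than checking out the branch, something along these lines should list and export them (the branch name is taken from the comment above):

$ git log --oneline master..20240523-fix-table-locking-3
$ git format-patch master..20240523-fix-table-locking-3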

@idl0r
Contributor Author

idl0r commented May 24, 2024

20240523-fix-table-locking-3 built and deployed on that node. Let's see how it behaves :)

haproxy-mirror pushed a commit that referenced this issue May 24, 2024
…ession
haproxy-mirror pushed a commit that referenced this issue May 24, 2024
…ntries
@wtarreau
Member

FWIW I've just merged the series now. I'm assuming it's still OK on your side.

@idl0r
Contributor Author

idl0r commented May 24, 2024

Yup. Still looking good :)

@wtarreau
Member

Awesome, thank you!

@idl0r
Contributor Author

idl0r commented May 24, 2024

Still running fine. No crashes. Running for ~7 hours already :)
I'll let it run at least until Monday.

@capflam added the status: fixed (This issue is a now-fixed bug.) and 2.9 (This issue affects the HAProxy 2.9 stable branch.) labels, and removed the status: feedback required (The developers are waiting for a reply from the reporter.) and status: reviewed (This issue was reviewed. A fix is required.) labels on May 24, 2024
@capflam
Member

capflam commented May 24, 2024

Amazing! Thanks for the help, @idl0r.

@idl0r
Contributor Author

idl0r commented May 27, 2024

FYI: Still looking good! No crashes. So I'd say that was it and it's fixed. Thanks, guys! :)

@wtarreau
Member

That's pretty cool, thanks Christian! It's one of those rare bugs that are uncovered by performance improvements. We'll keep this open until we backport the fix to 2.9.
