I've been observing frequent segfaults when running the datakit filter on a `kong/kong:nightly` container, which uses ngx_wasm_module commit d5e4c69, with a filter configuration that dispatches multiple HTTP subrequests. (I could not reproduce the problem with a configuration that performs a single HTTP dispatch.)
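To make the scenario concrete, here is a minimal sketch (proxy-wasm Rust SDK) of a filter of the same general shape: two dispatches issued from the request headers phase, then a pause. The upstream name, paths, and types are illustrative assumptions, not the actual datakit code.

```rust
// Illustrative only: a proxy-wasm filter (Rust SDK) of the same general shape
// as the failing setup -- two HTTP dispatches issued during the request
// headers phase, then Pause. Upstream name, paths, and struct are made up;
// this is not the actual datakit code.
use std::time::Duration;

use proxy_wasm::traits::{Context, HttpContext};
use proxy_wasm::types::{Action, LogLevel};

proxy_wasm::main! {{
    proxy_wasm::set_log_level(LogLevel::Debug);
    proxy_wasm::set_http_context(|_, _| -> Box<dyn HttpContext> {
        Box::new(MultiDispatch { pending: 0 })
    });
}}

struct MultiDispatch {
    pending: usize, // dispatches still waiting for a response
}

impl Context for MultiDispatch {
    // Response handling (decrementing `pending` and resuming the request) is
    // sketched further below, where the expected pause/resume behavior is
    // discussed.
}

impl HttpContext for MultiDispatch {
    fn on_http_request_headers(&mut self, _num_headers: usize, _eof: bool) -> Action {
        for path in ["/first", "/second"] {
            let headers = vec![
                (":method", "GET"),
                (":path", path),
                (":authority", "example.com"),
            ];
            if self
                .dispatch_http_call("my_upstream", headers, None, vec![], Duration::from_secs(5))
                .is_ok()
            {
                self.pending += 1;
            }
        }
        // Pause the request until the dispatched calls have come back.
        Action::Pause
    }
}
```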
This was one of the segfaults:
```
Program received signal SIGSEGV, Segmentation fault.
0x000057f6f83fade1 in ngx_palloc_large (pool=pool@entry=0x57f6fa0e2710, size=size@entry=1120) at src/core/ngx_palloc.c:244
244 src/core/ngx_palloc.c: No such file or directory.
(gdb) bt
#0 0x000057f6f83fade1 in ngx_palloc_large (pool=pool@entry=0x57f6fa0e2710, size=size@entry=1120) at src/core/ngx_palloc.c:244
#1 0x000057f6f83fafdf in ngx_palloc (pool=pool@entry=0x57f6fa0e2710, size=size@entry=1120) at src/core/ngx_palloc.c:131
#2 0x00007723db389217 in ngx_list_init (size=56, n=20, pool=0x57f6fa0e2710, list=0x57f6f9f32ec0) at src/core/ngx_list.h:39
#3 ngx_wasm_socket_read_http_response (sock=0x57f6f9f328e8, bytes=539, ctx=0x57f6f9f32cc8)
at /home/runner/.cache/bazel/_bazel_runner/5f5cb99aeb7e2b4db19484ea5022e949/execroot/kong/external/ngx_wasm_module/src/common/ngx_wasm_socket_tcp.c:968
#4 0x00007723db3894ba in ngx_wasm_socket_tcp_read (sock=sock@entry=0x57f6f9f328e8, reader=0x7723db389150 ,
reader_ctx=reader_ctx@entry=0x57f6f9f32cc8)
at /home/runner/.cache/bazel/_bazel_runner/5f5cb99aeb7e2b4db19484ea5022e949/execroot/kong/external/ngx_wasm_module/src/common/ngx_wasm_socket_tcp.c:1064
#5 0x00007723db39b934 in ngx_http_proxy_wasm_dispatch_resume_handler (sock=)
at /home/runner/.cache/bazel/_bazel_runner/5f5cb99aeb7e2b4db19484ea5022e949/execroot/kong/external/ngx_wasm_module/src/http/proxy_wasm/ngx_http_proxy_wasm_dispatch.c:785
#6 0x00007723db387f1f in ngx_wasm_socket_tcp_resume (sock=)
at /home/runner/.cache/bazel/_bazel_runner/5f5cb99aeb7e2b4db19484ea5022e949/execroot/kong/external/ngx_wasm_module/src/common/ngx_wasm_socket_tcp.c:90
#7 ngx_wasm_socket_tcp_resume (sock=)
at /home/runner/.cache/bazel/_bazel_runner/5f5cb99aeb7e2b4db19484ea5022e949/execroot/kong/external/ngx_wasm_module/src/common/ngx_wasm_socket_tcp.c:76
#8 0x000057f6f8425a6f in ngx_epoll_process_events (cycle=, timer=, flags=)
at src/event/modules/ngx_epoll_module.c:901
#9 0x000057f6f841b79a in ngx_process_events_and_timers (cycle=cycle@entry=0x57f6f87991b0) at src/event/ngx_event.c:258
#10 0x000057f6f842384e in ngx_worker_process_cycle (cycle=cycle@entry=0x57f6f87991b0, data=data@entry=0x0) at src/os/unix/ngx_process_cycle.c:793
#11 0x000057f6f8421c0f in ngx_spawn_process (cycle=cycle@entry=0x57f6f87991b0, proc=0x57f6f84237d0 , data=0x0,
name=0x57f6f85f4004 "worker process", respawn=respawn@entry=0) at src/os/unix/ngx_process.c:199
#12 0x000057f6f8424a10 in ngx_reap_children (cycle=0x57f6f87991b0) at src/os/unix/ngx_process_cycle.c:665
#13 ngx_master_process_cycle (cycle=0x57f6f87991b0) at src/os/unix/ngx_process_cycle.c:180
#14 0x000057f6f83f884e in main (argc=, argv=) at src/core/nginx.c:387
```
I got others with different backtraces, which suggested an earlier memory corruption.
I installed Valgrind in the container and got a ton of memory errors, all of which seem to follow from the same cause as the first one:
```
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm] resetting filter chain: pwctx->exec_index 0 to 0 (pwctx: 00000000056B6A50)
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_headers" step in "header_filter" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_body" step in "body_filter" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_response_body" step in "body_filter" phase
172.17.0.1 - - [09/Apr/2024:21:56:42 +0000] "GET /demo HTTP/1.1" 404 648 "-" "HTTPie/2.6.0" kong_request_id: "0dcdce20288d1e0b60aee81a85b68465"
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_log" step in "log" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 resuming "on_done" step in "done" phase
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm]["datakit_filter" #1] filter 1/1 finalizing context
2024/04/09 21:56:42 [debug] 1321#0: *645 [proxy-wasm] "datakit_filter" filter freeing context #1 (1/1)
2024/04/09 21:56:42 [info] 1321#0: *645 client 172.17.0.1 closed keepalive connection
==1321== Invalid read of size 8
==1321== at 0x5D21811: ngx_http_proxy_wasm_dispatch_resume_handler (ngx_http_proxy_wasm_dispatch.c:708)
==1321== by 0x5D0DF1E: ngx_wasm_socket_tcp_resume (ngx_wasm_socket_tcp.c:90)
==1321== by 0x5D0DF1E: ngx_wasm_socket_tcp_resume (ngx_wasm_socket_tcp.c:76)
==1321== by 0x177A6E: ngx_epoll_process_events (ngx_epoll_module.c:901)
==1321== by 0x16D799: ngx_process_events_and_timers (ngx_event.c:258)
==1321== by 0x17584D: ngx_worker_process_cycle (ngx_process_cycle.c:793)
==1321== by 0x173C0E: ngx_spawn_process (ngx_process.c:199)
==1321== by 0x174F0F: ngx_start_worker_processes (ngx_process_cycle.c:382)
==1321== by 0x1762B3: ngx_master_process_cycle (ngx_process_cycle.c:135)
==1321== by 0x14A84D: main (nginx.c:387)
==1321== Address 0x6fd9260 is 2,704 bytes inside a block of size 4,096 free'd
==1321== at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1321== by 0x14CF36: ngx_destroy_pool (ngx_palloc.c:90)
==1321== by 0x18FF2C: ngx_http_free_request (ngx_http_request.c:3771)
==1321== by 0x190B53: ngx_http_set_keepalive (ngx_http_request.c:3151)
==1321== by 0x190B53: ngx_http_finalize_connection.part.0 (ngx_http_request.c:2788)
==1321== by 0x1A24C1: ngx_http_upstream_handler (ngx_http_upstream.c:1309)
==1321== by 0x177A6E: ngx_epoll_process_events (ngx_epoll_module.c:901)
==1321== by 0x16D799: ngx_process_events_and_timers (ngx_event.c:258)
==1321== by 0x17584D: ngx_worker_process_cycle (ngx_process_cycle.c:793)
==1321== by 0x173C0E: ngx_spawn_process (ngx_process.c:199)
==1321== by 0x174F0F: ngx_start_worker_processes (ngx_process_cycle.c:382)
==1321== by 0x1762B3: ngx_master_process_cycle (ngx_process_cycle.c:135)
==1321== by 0x14A84D: main (nginx.c:387)
==1321== Block was alloc'd at
==1321== at 0x484DE30: memalign (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1321== by 0x484DF92: posix_memalign (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1321== by 0x1711BA: ngx_memalign (ngx_alloc.c:57)
==1321== by 0x14CC5A: ngx_palloc_block (ngx_palloc.c:186)
==1321== by 0x14D195: ngx_pcalloc (ngx_palloc.c:302)
==1321== by 0x18F393: ngx_http_alloc_request (ngx_http_request.c:622)
==1321== by 0x18F814: ngx_http_create_request (ngx_http_request.c:534)
==1321== by 0x192D81: ngx_http_wait_request_handler (ngx_http_request.c:516)
==1321== by 0x177A6E: ngx_epoll_process_events (ngx_epoll_module.c:901)
==1321== by 0x16D799: ngx_process_events_and_timers (ngx_event.c:258)
==1321== by 0x17584D: ngx_worker_process_cycle (ngx_process_cycle.c:793)
==1321== by 0x173C0E: ngx_spawn_process (ngx_process.c:199)
==1321==
```
`ngx_wasm_socket_tcp_resume` triggers the `ngx_http_proxy_wasm_dispatch_resume_handler`, but the request data has already been destroyed by `ngx_http_free_request`.
The invalid data in the Valgrind trace above is `call->pwexec`, where `call` comes from `sock->data`, which is the input argument of the function.
The deallocation at `ngx_http_free_request` is `ngx_destroy_pool` for `r->pool`.
One odd quirk I noticed in the logs before the first Valgrind error is that I did get two `call: run` lines (which my filter triggers from `on_http_request_headers`) but only one `call: resume` line (which it triggers from `on_http_call_response`). The resume operation seems to be delayed until after `ngx_http_free_request` has already destroyed the main request's data, taking the `pwexec` context with it.
This is unexpected because in normal operation the filter should be paused waiting for the last dispatch to finish.
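For reference, the filter-side bookkeeping I would expect looks roughly like the sketch below (again illustrative, continuing the hypothetical filter above rather than the real datakit logic): the paused request is only resumed once the last pending dispatch has produced a response.

```rust
// Illustrative only (continuing the sketch above): the request should stay
// paused until the response for the last outstanding dispatch has arrived.
use proxy_wasm::hostcalls;
use proxy_wasm::traits::Context;

struct MultiDispatch {
    pending: usize, // same counter as in the earlier sketch
}

impl Context for MultiDispatch {
    fn on_http_call_response(
        &mut self,
        _token_id: u32,
        _num_headers: usize,
        _body_size: usize,
        _num_trailers: usize,
    ) {
        self.pending = self.pending.saturating_sub(1);
        if self.pending == 0 {
            // Only now should the paused request resume; in the crashing trace
            // above, CONTINUE appears to be set after the *first* response,
            // letting the request finish (and its pool be destroyed) while a
            // dispatch is still outstanding.
            let _ = hostcalls::resume_http_request();
        }
    }
}
```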
Normally, in my local build, I get this sequence of operations for two dispatch calls:

- Data for the first dispatch arrives.
- The first dispatch resumes; it PAUSEs at the end because there's another one pending.
- Data for the second dispatch arrives.
- The second dispatch resumes; since it's done, it sets CONTINUE and the next phases (`on_response_headers` etc.) continue.

Contrast that with what happens in the crashing case (`wasm tcp socket` logs and the like did not show up in the container logs for some reason):

- I trigger two calls, with dispatch ids 0 and 1.
- The first one resumes, but CONTINUE is set right away.
- This causes all other phases to run and the filter context to be freed.

And then, when the resume handler runs, it's operating on dead data:

```
==1321== Invalid read of size 8
==1321==    at 0x5D21811: ngx_http_proxy_wasm_dispatch_resume_handler (ngx_http_proxy_wasm_dispatch.c:708)
```

This seems to indicate that the crash is caused by something being scheduled out of order earlier on (rather than something like "this piece of data is allocated from a pool with the wrong lifetime").
I also ran my local Gateway builds this morning with Valgrind and the same filter configuration as above to see if it would report something even without a crash, and no memory access violations were reported there.
Full log attached: full_log.txt