Parallel dispatch calls freeze when Lua resolver yields #524
Interesting: running #523 on CI with Valgrind produced the following after the test cases timed out: https://github.com/Kong/ngx_wasm_module/actions/runs/8438836263/job/23111996715
...so it seems that an ...
Thanks for the test case! Long story short, yes: as per the Valgrind report, it appears our Lua bridge cannot handle more than one "Lua thread" at a time.

At first I enabled some of the DDEBUG macros and ran the test as-is. I commented the error.log output here, which shows what is currently going on. "posted event" followed by "delete posted event" is not of concern, since it's part of ...

I then commented out the `lua_reset_ctx` call, which prevented the waiting file descriptor for call 1 from being prematurely closed, but also showed that call 1 was never resumed, even after the Lua thread of call 2 had yielded and resumed all the way through.

It then became apparent that `rctx` only supports a single `lctx` object, which is problematic since the http module resumes yielded Lua threads via its single `lctx` reference. Maybe we should make ...
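As a rough plain-Lua analogy (this is not the module's code, only an illustration of the single-`lctx` limitation described above), a bridge that keeps a single slot for the currently yielded coroutine loses the first coroutine as soon as a second one is parked, so the first is never resumed when its answer arrives:

```lua
-- Plain-Lua analogy of the single-lctx problem: one slot for the yielded
-- coroutine means a second dispatch overwrites the first. Not the module's
-- actual code; names here are purely illustrative.
local bridge = { yielded = nil }       -- like rctx holding a single lctx

local function park(co)
    bridge.yielded = co                -- a second park() silently overwrites the first
end

local function resume_parked(...)
    local co = bridge.yielded
    bridge.yielded = nil
    if co then
        coroutine.resume(co, ...)
    end
end

local function dispatch(name)
    local co = coroutine.create(function()
        print(name, "waiting for DNS answer")
        local answer = coroutine.yield()      -- yield until "DNS" data arrives
        print(name, "got", answer)
    end)
    coroutine.resume(co)                      -- runs until the yield above
    park(co)
end

dispatch("call 1")
dispatch("call 2")         -- overwrites call 1's slot
resume_parked("1.2.3.4")   -- only call 2 resumes; call 1 waits forever
resume_parked("5.6.7.8")   -- nothing left to resume
```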
Thanks for taking a look at it!
Good, so I think I was looking at the right thing: that ...

Something along those lines was the first thing that came to my mind when I saw the Valgrind output. I'll give that a try!
I did try this approach: https://github.com/Kong/ngx_wasm_module/tree/wip/lctxs-queue I managed to collect all lctxs and resume them, but it doesn't quite work, because the socket data did not arrive correctly. What is happening is:
Apparently, the socket data connection is only plugged correctly to the most-recently-created lctx? I don't fully understand how data is supposed to flow there.
If there is no way to map incoming data to the correct lctx at the "wev_handler", then that would be the way to go. But then, OpenResty does have its own socket scheduler, right? (i.e. it implements the equivalents of both luasocket and copas). Not sure how we would be able to put another scheduling layer on top of that. Lua coroutine schedulers are not composable out-of-the-box...
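As an aside on that last point, here is a tiny plain-Lua sketch (purely illustrative, nothing from this codebase) of why coroutine schedulers don't stack: `coroutine.yield()` hands control back to whichever scheduler performed the resume, so a task written against one scheduler's yield protocol misbehaves when driven by another.

```lua
-- Purely illustrative: scheduler A understands tasks that yield "io-wait",
-- while b_style_task yields a { sleep = n } request meant for some other
-- scheduler. The yield still lands in A, which cannot handle it.
local function scheduler_a(task)
    local co = coroutine.create(task)
    while coroutine.status(co) ~= "dead" do
        local ok, req = coroutine.resume(co)
        assert(ok, req)
        if req == "io-wait" then
            print("A: task is waiting on I/O")   -- A would register an event here
        elseif req ~= nil then
            print("A: unknown yield request, dropping task")
            return                               -- B's protocol leaked into A
        end
    end
end

local function b_style_task()
    coroutine.yield({ sleep = 1 })   -- written for a scheduler that understands sleeps
    print("B-style task finished")   -- never reached when driven by scheduler A
end

scheduler_a(b_style_task)
```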
At first I noticed that in ...

Since ...
Yep, that's what I feared.
This is going to be quite the journey. I've always been uncomfortable with that part of the codebase, not knowing what exactly the various ctx-this-and-that variables mean; about time I built a proper understanding of those bits. I'm going to do a deep dive into this, since it's a blocker for the other stuff I was doing. Thanks for the pointers!
I think we will figure it out; I made good progress this morning. So far, both calls resume with the proper ...
Major refactor of the Lua bridge to support multiple concurrent yielding Lua threads. The old implementation would break down when scheduling more than one yielding Lua thread at a time.

The new implementation "tricks" OpenResty by scheduling uthreads via C and passing these threads to the OpenResty runloop as if they were created from Lua (via `ngx.thread`). Because all uthreads must resume their "parent thread" when finished (as per OpenResty's implementation), we schedule a stub "entry thread" whenever we are trying to use the Lua bridge. This entry thread itself does nothing and is collected at request pool cleanup.

List of significant changes for this refactor:

- **Breaking:** the `proxy_wasm.start()` FFI function is **removed**. Only `proxy_wasm.attach()` is now necessary, and the filter chain is only resumed once the ngx_http_wasm_module `rewrite` or `access` phases are entered. Previously, `proxy_wasm.start()` would resume the filter chain during the ngx_http_lua_module phase handlers, which was incompatible with Lua threads yielding.
- The `ngx.semaphore` API can be used in the Lua bridge. The default Lua resolver now has synchronization enabled.
- In ngx_wasm_socket_tcp, the `sock->env` member is now a pointer to the request's `env` instead of a copy, so as to manipulate the `env->state` control variable.
- The `wasm_call` directive can now yield, which allows for sanity testing of the Lua bridge yielding functionality.
- A new `rctx->resume_handler` pointer holds the resume entry point back from yielding facilities into `ngx_http_core_run_phases`. For now, only the Lua bridge uses it, but other yielding facilities should be refactored to use it so as to factorize our resuming code.

Fix #524
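For FFI users, the breaking change amounts to dropping the explicit `start()` call. The sketch below is illustrative only: the module path (`resty.wasmx.proxy_wasm`), the `new()`/`load()` setup and the filter name are assumptions about the surrounding code, not taken from this thread; only `attach()` and `start()` come from the change description above.

```lua
-- Illustrative before/after of the breaking FFI change. Module path and the
-- new()/load() setup are assumptions; only attach()/start() are referenced
-- by the change description above.
local proxy_wasm = require "resty.wasmx.proxy_wasm"

-- Setup (e.g. init_worker): build and load a filter chain plan.
local c_plan = assert(proxy_wasm.new({
    { name = "my_filter" },            -- hypothetical filter name
}))
assert(proxy_wasm.load(c_plan))

-- Before: in an access-phase handler, attach and explicitly resume the chain.
assert(proxy_wasm.attach(c_plan))
assert(proxy_wasm.start())             -- removed by this refactor

-- After: attach only; the filter chain resumes by itself once the
-- ngx_http_wasm_module rewrite/access phases are entered, which is
-- compatible with yielding Lua threads.
assert(proxy_wasm.attach(c_plan))
```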
Problem summary: If you dispatch two parallel HTTP calls, and both call into Lua for DNS resolution, and both yield to wait for the DNS response, only the second one gets a response. The first one waits forever.
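For context, the yielding resolver call looks roughly like the sketch below: the DNS query goes out over a UDP cosocket and the current Lua thread yields until the answer (or a timeout) arrives. The use of `resty.dns.resolver` and the nameserver address are illustrative assumptions, not necessarily the resolver actually wired into the Lua bridge.

```lua
-- Illustrative sketch of a yielding Lua resolver; not the actual resolver
-- used here. resty.dns.resolver sends the query over a UDP cosocket and
-- yields the current Lua thread until the response arrives.
local dns = require "resty.dns.resolver"

local function resolve(hostname)
    local r, err = dns:new({
        nameservers = { "8.8.8.8" },   -- assumed nameserver, for illustration
        timeout = 2000,                -- ms
    })
    if not r then
        return nil, "failed creating resolver: " .. err
    end

    -- query() yields here while waiting for the UDP response; with two
    -- parallel dispatch calls, two Lua threads end up parked at this point.
    local answers, qerr = r:query(hostname, { qtype = r.TYPE_A })
    if not answers then
        return nil, "failed querying: " .. qerr
    end
    if answers.errcode then
        return nil, "DNS server error: " .. tostring(answers.errcode)
    end

    return answers[1] and answers[1].address
end
```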
Background and some observations of the behavior
This was first observed with Kong Gateway, but it is also reproducible in this module's test suite (see the test cases in PR #523).
Using two different hostnames allows us to see in the logs that the second DNS request is the one that gets a response.
Yielding affects the behavior:
In short, it seems that triggering the Lua resolver a second time discards the pending DNS resolution coroutine.
Logs
Enabling "debug_mail" logs on the Kong Gateway showed some interesting bits which may provide some clues. Here's what I've observed:
First we create a new Lua thread to resolve `gobolinux.org`; all is normal so far:

This causes the DNS client to trigger the UDP message for DNS resolution, epoll fd 34. It does not get an immediate response, so it yields:

(we then get this, which I don't know if it's related at all)

What follows immediately is the next dispatch, which needs to resolve `example.com`:

But unlike the dispatch above, where the three lines `lua reset ctx`, `http lua finalize threads` and `lua run thread` were next to each other, we instead get this:

Looks like it deleted the coroutine ending in -62C0, which was the one from the `gobolinux.org` DNS resolution above!?

The query for `example.com` then proceeds in a similar fashion to the previous one... (even with the same fd:34)

...but then after a while we do get data coming from fd:34 and we get a query result for `example.com`.