
Chttpd request processing does not reset pdict predictably #4909

Open · chewbranca opened this issue Dec 11, 2023 · 7 comments

@chewbranca (Contributor)

Description

The Mochiweb HTTP server uses a shared, reused pool of connection processes, with chttpd:handle_request as the entry point for the Mochiweb request-processing handoff.

The bug is that we don't start with a fresh process dictionary between requests: because the chttpd processes are reused, pdict values that are not expressly cleared persist. Even worse, pdict values that are only set by specific code paths persist until the next incoming request that reaches this particular chttpd worker in the pool happens to update them.

An analogy: imagine we logged each request on a paper receipt, but instead of starting with a fresh form every time, we reused the old form and only erased the old values when we needed to modify them, so any other form entries would persist from whichever prior request filled them in.

We also don't set the nonce value until erlang:put(nonce, Nonce), nearly 100 lines into the request handling. This means that any runtime error happening prior to the setting of the nonce value will re-use the previously generated nonce and return that as the "unique" value of the request, and will continue to return the old value until a request is processed that reaches that line of code and generates a new nonce.
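The failure mode is easy to reproduce in isolation. Here is a self-contained sketch (hypothetical module, standing in for a reused keepalive worker; not CouchDB code) showing how a nonce set by one request leaks into the next when the pdict is never cleared:

```erlang
-module(pdict_leak_demo).
-export([start/0, request/2]).

%% A long-lived worker process, like a reused keepalive connection
%% handler. Nothing clears the pdict between "requests", so the nonce
%% from one request is still visible in the next.
start() ->
    spawn(fun loop/0).

loop() ->
    receive
        {req, From, NewNonce, CrashEarly} ->
            case CrashEarly of
                true ->
                    %% Simulates an error before erlang:put(nonce, ...):
                    %% the previous request's nonce gets reported.
                    From ! {nonce_seen, erlang:get(nonce)};
                false ->
                    erlang:put(nonce, NewNonce),
                    From ! {nonce_seen, erlang:get(nonce)}
            end,
            loop()
    end.

%% Synchronous helper: send a "request" and wait for the nonce the
%% worker believes belongs to it.
request(Pid, {Nonce, CrashEarly}) ->
    Pid ! {req, self(), Nonce, CrashEarly},
    receive {nonce_seen, N} -> N end.
```

Calling request(Pid, {<<"a">>, false}) and then request(Pid, {<<"b">>, true}) has the second "failed" request report the stale <<"a">> nonce from the first.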

I propose we clear the pdict between requests. I don't think it's safe to do erlang:erase() as that'll also clear out the Mochiweb pdict values from https://github.com/apache/couchdb-mochiweb/blob/main/src/mochiweb_request.erl#L64-L78 so instead I propose a filtered clearing of the pdict values, like so:

diff --git a/src/chttpd/src/chttpd.erl b/src/chttpd/src/chttpd.erl
index ab8e1e9a3..ca55a8ac8 100644
--- a/src/chttpd/src/chttpd.erl
+++ b/src/chttpd/src/chttpd.erl
@@ -221,6 +221,7 @@ stop() ->
     mochiweb_http:stop(?MODULE).
 
 handle_request(MochiReq0) ->
+    couch_util:clear_pdict(), %% Make sure we start clean, every time
     erlang:put(?REWRITE_COUNT, 0),
     MochiReq = couch_httpd_vhost:dispatch_host(MochiReq0),
     handle_request_int(MochiReq).
diff --git a/src/couch/src/couch_util.erl b/src/couch/src/couch_util.erl
index 739df28e5..0bc9a94ba 100644
--- a/src/couch/src/couch_util.erl
+++ b/src/couch/src/couch_util.erl
@@ -46,6 +46,7 @@
 -export([verify_hash_names/2]).
 -export([get_config_hash_algorithms/0]).
 -export([remove_sensitive_data/1]).
+-export([clear_pdict/0, clear_pdict/1]).
 
 -include_lib("couch/include/couch_db.hrl").
 
@@ -870,3 +871,31 @@ remove_sensitive_data(KVList) ->
     KVList1 = lists:keyreplace(<<"password">>, 1, KVList, {<<"password">>, <<"****">>}),
     % some KVList entries are atoms, so test fo this too
     lists:keyreplace(password, 1, KVList1, {password, <<"****">>}).
+
+-spec clear_pdict() -> ok.
+clear_pdict() ->
+    clear_pdict(erlang:get()).
+
+%% Exclude mochiweb markers, otherwise just use erlang:erase/0
+-spec clear_pdict(list()) -> ok.
+clear_pdict([]) ->
+    ok;
+clear_pdict([{mochiweb_request_body, _V} | Rest]) ->
+    clear_pdict(Rest);
+clear_pdict([{mochiweb_request_body_length, _V} | Rest]) ->
+    clear_pdict(Rest);
+clear_pdict([{mochiweb_request_cookie, _V} | Rest]) ->
+    clear_pdict(Rest);
+clear_pdict([{mochiweb_request_force_close, _V} | Rest]) ->
+    clear_pdict(Rest);
+clear_pdict([{mochiweb_request_path, _V} | Rest]) ->
+    clear_pdict(Rest);
+clear_pdict([{mochiweb_request_post, _V} | Rest]) ->
+    clear_pdict(Rest);
+clear_pdict([{mochiweb_request_qs, _V} | Rest]) ->
+    clear_pdict(Rest);
+clear_pdict([{mochiweb_request_recv, _V} | Rest]) ->
+    clear_pdict(Rest);
+clear_pdict([{Key, _V} | Rest]) ->
+    erlang:erase(Key),
+    clear_pdict(Rest).

Where the skipped values are the pdict names from the linked Mochiweb code above. I'm not entirely sure about the legal nuances of copying in the Mochiweb names, so before I PR'ed that I figured it would be best to hear back from @janl / @rnewson / etc. on the preferred approach here.

Alternatively, we could explicitly delete the values we set, but I really dislike how any value that falls through then persists for an unknown amount of time until a similar request clears it out. So I proposed the "clear everything but Mochiweb entries" approach. Let me know what you think.

@rnewson (Member) commented Dec 11, 2023

reusing an erlang process between requests is surprising to me but I can't find docs that reassure me that this was ever a promise from mochiweb.

however, if true, it does point out our folly in using the process dictionary so freely given the blanket advice not to do so.

I'd rather see us fix the real problem(s) instead of this PR. Or, perhaps, a patch for mochiweb to ensure a new process at the start of each request, even for keepalive connections.

@rnewson (Member) commented Dec 11, 2023

chatted with davisp and he confirms the general problem too. 🤷

he did have a better suggestion for a fix: 1) record the pdict keys at the start of our handle_request, 2) delete any keys not in that set at the end of handle_request (in an after). And, yes, I think we'd have to store that list in the pdict...
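A minimal sketch of that suggestion against chttpd:handle_request (names taken from the diff earlier in the thread; the try/after placement is my assumption, not a tested patch):

```erlang
handle_request(MochiReq0) ->
    %% 1) Snapshot whatever keys are already in the pdict; by definition
    %%    these are Mochiweb's entries (plus anything set before we ran).
    BaseKeys = [K || {K, _V} <- erlang:get()],
    try
        erlang:put(?REWRITE_COUNT, 0),
        MochiReq = couch_httpd_vhost:dispatch_host(MochiReq0),
        handle_request_int(MochiReq)
    after
        %% 2) Erase every key that appeared during request handling,
        %%    leaving the pre-existing (Mochiweb) keys untouched.
        [erlang:erase(K) || {K, _V} <- erlang:get(),
                            not lists:member(K, BaseKeys)]
    end.
```

One nuance: because the snapshot lives in a local variable that is still in scope inside the after clause, it may not strictly need to be stored in the pdict itself.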

@rnewson (Member) commented Dec 11, 2023

noting that neither davisp nor I think a mochiweb erlang process is reused across sockets, though. If that is happening we have a big problem, as that would be state/data passing between different clients.

@chewbranca (Contributor, Author)

Yeah, I was surprised as well about the chttpd process reuse; I stumbled upon this because I thought it was not doing that, lol. @nickva mentioned we probably use the same processes to keep direct access to the same socket, allowing for connection re-use from load balancers, so I didn't think injecting the spawning of a new process would be a clean solution here.

Hmmmm... what do you consider "better" about that suggestion? It avoids the licensing question and the bleed-over of including the names of the Mochiweb pdict values, but I personally prefer an explicit whitelist of the fields we will allow, rather than a dynamic approach that assumes any pdict value found live is allowed.

We already hardcode mochiweb in chttpd.erl, so it's not like the use of Mochiweb is pluggable. An alternative option would be to patch into Mochiweb a function that exposes a list of atoms used in the pdicts, like mochiweb:pdict_values_used() or something and then exclude that list. We could then achieve similar results as my suggested diff with [erlang:erase(K) || K <- erlang:get() -- mochiweb:pdict_values_used()].
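One wrinkle with the one-liner above: erlang:get/0 returns a list of {Key, Value} pairs rather than bare keys, so subtracting a list of key atoms from it won't match anything. A corrected sketch of the same idea (mochiweb:pdict_values_used/0 remains the hypothetical API proposed above):

```erlang
%% Corrected sketch of the proposed filtered clear. erlang:get/0 yields
%% {Key, Value} pairs, so project out the keys before filtering against
%% the (hypothetical) list of Mochiweb-reserved pdict keys.
clear_non_mochiweb_pdict() ->
    Reserved = mochiweb:pdict_values_used(),
    [erlang:erase(K) || {K, _V} <- erlang:get(),
                        not lists:member(K, Reserved)],
    ok.
```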

I'm not really attached to any particular approach, but I don't really like the dynamic nature of the alternative suggestion.

@rnewson (Member) commented Dec 11, 2023

my point is that whatever is in the pdict before handle_request is called is, by definition, mochiweb's entries. removing anything other than those at the end should avoid any "leak" now and in the future. your static approach invites problems down the line if mochiweb adds to its set and we forget to update our hard-coded list.

The true fix is to remove all reliance on the process dict in our chttpd handling.

have you a full list of potential issues caused by this process reuse across requests?

@nickva (Contributor) commented Dec 11, 2023

stashing away the mochiweb pdict entries and clearing everything except those at the end makes sense.

we may also have some stray messages in the mailbox; it could be worth checking whether we return early to the client while there are still workers finishing doc updates or something like that.
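One way to check for (and drain) such stray messages is a zero-timeout flush at a request boundary. A sketch, assuming couch_log for visibility (the function name is hypothetical):

```erlang
%% Drain any messages left over from the previous request. The zero
%% timeout means this never blocks; it only sweeps what is already
%% queued in the mailbox, logging each message before discarding it.
flush_stray_messages() ->
    receive
        Msg ->
            couch_log:warning("stray message in chttpd worker: ~p", [Msg]),
            flush_stray_messages()
    after 0 ->
        ok
    end.
```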

@chewbranca (Contributor, Author) commented Dec 11, 2023

> stashing away the mochiweb pdict entries and clearing everything except those at the end makes sense.

I think this makes sense, but how do you assert that those pdict entries are Mochiweb's without having a list of all Mochiweb pdict entries, or without doing something we should avoid like string_starts_with("mochi", atom_to_list(Key))?

chewbranca added a commit that referenced this issue Dec 18, 2023
We need to potentially extract the usage delta from the incoming RPC
message. Rather than pattern match on all possible message formats that
could potentially include usage deltas, we instead utilize
rexi_util:extract_delta which matches against tuples ending in
`{delta, Delta}`, and separates that out from the underlying message.

The subtlety here is that receiving the message to extract the delta
changes the behavior as this was previously doing a selective receive
keyed off of the Ref, and then ignoring any other messages that arrived.
I don't know if the selective receive was intended, but I don't think
it's appropriate to leave unexpected messages floating around,
especially given things like issue #4909.

Instead of utilizing a selective receive, this switches to extracting
the message and delta like we need to do, and then in the event it finds
unexpected messages they're logged and skipped.

This selective receive was masking the lack of an unlink on the linked
rexi_mon pid, fixed in #4906. I've also noticed some RPC responses
arriving late, but I haven't tracked that down, so let's log when it
does happen.
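For reference, the kind of matching the commit message describes (tuples ending in a {delta, Delta} element) could look like the following; the exact shape of rexi_util:extract_delta is an assumption on my part, not the actual implementation:

```erlang
%% Assumed sketch: peel a trailing {delta, Delta} element off an RPC
%% message tuple, returning the underlying message and the delta
%% separately. Non-tuple or delta-free messages pass through unchanged.
extract_delta(Msg) when is_tuple(Msg), tuple_size(Msg) >= 2 ->
    Last = tuple_size(Msg),
    case element(Last, Msg) of
        {delta, Delta} ->
            {erlang:delete_element(Last, Msg), Delta};
        _ ->
            {Msg, undefined}
    end;
extract_delta(Msg) ->
    {Msg, undefined}.
```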
chewbranca added a commit that referenced this issue Feb 12, 2024