You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We (@neighboorhoodie, @janl, myself) have observed some incorrect state conditions in CouchDB’s use of ibrowse in the replicator that lead to incorrect parsing of some HTTP responses. This is most likely to occur when a connection has been interrupted or a response not fully consumed, and it can lead to an ibrowse process retaining headers from previous responses and therefore incorrectly parsing later ones. In particular it can lead to non-chunked responses being parsed as chunked ones, which leads to an error where hexlist_to_integer and to_ascii are called with bad input.
Description
We observed this when the following error showed up in customer logs:
[error] 2023-11-20T00:14:06.509562Z nonode@nohost <0.3929.5131> --------
gen_server <0.3929.5131> terminated with reason: no function clause matching ibrowse_http_client:to_ascii(114)(line:1794)
at
ibrowse_http_client:hexlist_to_integer/3(line:1790)
<= ibrowse_http_client:parse_11_response/2(line:1218)
<= ibrowse_http_client:handle_sock_data/2(line:320)
<= gen_server:try_dispatch/4(line:616)
<= gen_server:handle_msg/6(line:686)
<= proc_lib:init_p_do_apply/3(line:247)
ibrowse_http_client:to_ascii is being called with the integer 114, which denotes the letter r in ASCII, which is not a valid hex digit, so the call fails.
The gen_server error includes the last message received by the process:
The error stacktrace passes through a clause of parse_11_response that handles chunked encoding, but the above response message does not have a Transfer-Encoding header. This is being invoked because it's reacting to the gen_serverstate, which in this case included the following header list:
(The full state object is quite large, but includes some other values indicating chunked encoding. We can provide the full state with client info redacted if needed.)
These headers do not match those of the incoming message. An HTTP client is expected to handle streamed data, and so needs to keep some state to record headers it's seen so far. However the incoming message is a complete response, not a chunk of a stream, and so it should be parsed from a clean starting state.
Our client has reported that they're seeing a lot of partial/interrupted responses, and our best theory for why this is happening is that an ibrowse_http_client instance is partially processing a response, which is interrupted, and the process is then reused for another request without being cleaned up first. Two main scenarios seem likely causes for this:
A code path that retries requests: a client makes one request, whose response is partially received, and the client then retries the request using the same ibrowse_http_client process, leading to confused state. For example, couch_replicator_api_wrap:open_doc_revs performs retries, and the above headers indicating a chunked multipart response are consistent with fetching a doc with open_revs=all. The gen_server state also indicates this is the kind of request being made.
Reuse of a client from the pool without resetting its state. The release handler in couch_replicator_connection demonitors the worker and sets its mref to undefined, but otherwise sends no signal to the worker to reset its state. This could lead to a client being reused when it still contains headers from a previous response.
Steps to Reproduce
This is tricky to reproduce mechanically because it requires getting responses interrupted in order to trigger the bad state. We have only observed its effects in customer logs and then diagnosed possible causes by code review.
Expected Behaviour
When reusing an ibrowse process either immediately or by releasing it back to the pool, its state should be reset so that it does not conflate the state from multiple responses. If this cannot be done then the connection should not be reused.
Your Environment
CouchDB version used: observed in production on 2.3.1, but code review makes us believe this is also possible in 3.3.2
Browser name and version: N/A
Operating system and version: Ubuntu 20.04
The text was updated successfully, but these errors were encountered:
I think in this case the first step might be to show that it's still happening in the latest release. 2.3.1 is a few years old (4?) and we've had a number of fixes in the ibrowse / replicator since then.
So perhaps try to set up a mirrored test cluster with the same data, access patterns, and settings, but with 3.3.2. Or, construct a reproducer script to generate a similar scenario synthetically. See if in the original environment tweaking timeouts, worker processes, batch sizes or other settings makes the problem more pronounced, and crank those up in the test script, to make it more likely for the problem to show up. For instance, say shortening fabric request timeouts to 10 sec, makes the issue more likely to happen, then shorten it to 1 sec in the reproducer script.
@nickva that’s a fair ask :) — We tracked it as far as checking that the retry logic has not changed (much) since 2.3.1.
The problem for us is that the app and environment that causes this fails completely with 3.x, but that’s a thing we are working on separately. Will report back :)
We (@neighboorhoodie, @janl, myself) have observed some incorrect state conditions in CouchDB’s use of
ibrowse
in the replicator that lead to incorrect parsing of some HTTP responses. This is most likely to occur when a connection has been interrupted or a response not fully consumed, and it can lead to anibrowse
process retaining headers from previous responses and therefore incorrectly parsing later ones. In particular it can lead to non-chunked responses being parsed as chunked ones, which leads to an error wherehexlist_to_integer
andto_ascii
are called with bad input.Description
We observed this when the following error showed up in customer logs:
ibrowse_http_client:to_ascii
is being called with the integer114
, which denotes the letterr
in ASCII, which is not a valid hex digit, so the call fails.The
gen_server
error includes the last message received by the process:The error stacktrace passes through a clause of
parse_11_response
that handles chunked encoding, but the above response message does not have aTransfer-Encoding
header. This is being invoked because it's reacting to thegen_server
state, which in this case included the following header list:(The full state object is quite large, but includes some other values indicating chunked encoding. We can provide the full state with client info redacted if needed.)
These headers do not match those of the incoming message. An HTTP client is expected to handle streamed data, and so needs to keep some state to record headers it's seen so far. However the incoming message is a complete response, not a chunk of a stream, and so it should be parsed from a clean starting state.
Our client has reported that they're seeing a lot of partial/interrupted responses, and our best theory for why this is happening is that an
ibrowse_http_client
instance is partially processing a response, which is interrupted, and the process is then reused for another request without being cleaned up first. Two main scenarios seem likely causes for this:A code path that retries requests: a client makes one request, whose response is partially received, and the client then retries the request using the same
ibrowse_http_client
process, leading to confused state. For example,couch_replicator_api_wrap:open_doc_revs
performs retries, and the above headers indicating a chunked multipart response are consistent with fetching a doc withopen_revs=all
. Thegen_server
state also indicates this is the kind of request being made.Reuse of a client from the pool without resetting its state. The
release
handler incouch_replicator_connection
demonitors the worker and sets itsmref
toundefined
, but otherwise sends no signal to the worker to reset its state. This could lead to a client being reused when it still contains headers from a previous response.Steps to Reproduce
This is tricky to reproduce mechanically because it requires getting responses interrupted in order to trigger the bad state. We have only observed its effects in customer logs and then diagnosed possible causes by code review.
Expected Behaviour
When reusing an
ibrowse
process either immediately or by releasing it back to the pool, its state should be reset so that it does not conflate the state from multiple responses. If this cannot be done then the connection should not be reused.Your Environment
The text was updated successfully, but these errors were encountered: