
Previous response state retained when an ibrowse client is reused #4870

Open
jcoglan opened this issue Nov 23, 2023 · 2 comments

Comments

@jcoglan
Contributor

jcoglan commented Nov 23, 2023

We (@neighboorhoodie, @janl, myself) have observed incorrect state handling in CouchDB’s use of ibrowse in the replicator that leads to some HTTP responses being parsed incorrectly. This is most likely to occur when a connection has been interrupted or a response has not been fully consumed, and it can leave an ibrowse process retaining headers from a previous response and therefore mis-parsing later ones. In particular, it can cause non-chunked responses to be parsed as chunked ones, which produces an error where hexlist_to_integer and to_ascii are called with bad input.

Description

We observed this when the following error showed up in customer logs:

[error] 2023-11-20T00:14:06.509562Z nonode@nohost <0.3929.5131> --------
gen_server <0.3929.5131> terminated with reason: no function clause matching ibrowse_http_client:to_ascii(114)(line:1794)
at
     ibrowse_http_client:hexlist_to_integer/3(line:1790)
  <= ibrowse_http_client:parse_11_response/2(line:1218)
  <= ibrowse_http_client:handle_sock_data/2(line:320)
  <= gen_server:try_dispatch/4(line:616)
  <= gen_server:handle_msg/6(line:686)
  <= proc_lib:init_p_do_apply/3(line:247)

ibrowse_http_client:to_ascii is being called with the integer 114, the ASCII code for the letter r, which is not a valid hex digit, so the call fails with a function clause error.
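
As a quick illustration (our own Erlang shell session, not taken from the logs), 114 is the code point of $r, which falls outside the set of hex digit characters that to_ascii appears to have clauses for:

1> $r.
114
2> lists:member($r, "0123456789abcdefABCDEF").
false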

The gen_server error includes the last message received by the process:

{tcp,#Port<0.17503260>,
    <<"HTTP/1.1 500 Internal Server Error\r\n
       Cache-Control: must-revalidate\r\n
       Content-Length: 65\r\n
       Content-Type: application/json\r\n
       Date: Mon, 20 Nov 2023 00:14:06 GMT\r\n
       Server: CouchDB/2.3.1 (Erlang OTP/20)\r\n
       X-Couch-Request-ID: 5cc908776b\r\n
       X-Couch-Stack-Hash: 3377714225\r\n
       X-CouchDB-Body-Time: 0\r\n
       \r\n
       {\"error\":\"badmatch\",\"reason\":\"{'EXIT',noproc}\",\"ref\":3377714225}\n">>}

The error stacktrace passes through a clause of parse_11_response that handles chunked encoding, but the above response message has no Transfer-Encoding header. That clause is selected because the parser is driven by the gen_server state, which in this case included the following header list:

[
  {"Content-Type","multipart/mixed; boundary=\"7f6cd44f44a5d757879834cf6e8530f7\""},
  {"Date","Mon, 20 Nov 2023 00:13:34 GMT"},
  {"Server","CouchDB/2.3.1 (Erlang OTP/20)"},
  {"Transfer-Encoding","chunked"}
]

(The full state object is quite large, but includes some other values indicating chunked encoding. We can provide the full state with client info redacted if needed.)

These headers do not match those of the incoming message. An HTTP client is expected to handle streamed data, so it needs to keep some state recording the headers it has seen so far. However, the incoming message is a complete response, not a chunk of a stream, and so it should be parsed from a clean starting state.

Our client has reported that they are seeing a lot of partial/interrupted responses, and our best theory is that an ibrowse_http_client instance partially processes a response, the response is interrupted, and the process is then reused for another request without being cleaned up first. Two scenarios seem like likely causes:

  • A code path that retries requests: a client makes one request, whose response is only partially received, and then retries the request using the same ibrowse_http_client process, leading to confused state (a sketch of this scenario follows the list). For example, couch_replicator_api_wrap:open_doc_revs performs retries, and the above headers indicating a chunked multipart response are consistent with fetching a doc with open_revs=all. The gen_server state also indicates this is the kind of request being made.

  • Reuse of a client from the pool without resetting its state. The release handler in couch_replicator_connection demonitors the worker and sets its mref to undefined, but otherwise sends no signal to the worker to reset its state. This could lead to a client being reused when it still contains headers from a previous response.
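
The following is a minimal sketch of the first scenario, written against ibrowse's public API; it is not CouchDB's actual replicator code, and the URL, method, and timeout values are placeholders:

retry_on_same_worker(Url) ->
    {ok, Worker} = ibrowse:spawn_link_worker_process(Url),
    case ibrowse:send_req_direct(Worker, Url, [], get, [], [], 5000) of
        {error, _Reason} ->
            %% The retry reuses Worker as-is. If the first response was only
            %% partially received and parsed, the worker may still hold that
            %% response's headers and chunked-encoding flags while it parses
            %% the retried response.
            ibrowse:send_req_direct(Worker, Url, [], get, [], [], 5000);
        Ok ->
            Ok
    end.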

Steps to Reproduce

This is tricky to reproduce mechanically because it requires responses to be interrupted in order to trigger the bad state. We have only observed its effects in customer logs and have diagnosed possible causes by code review.

Expected Behaviour

When reusing an ibrowse process, either immediately or by releasing it back to the pool, its state should be reset so that it does not conflate state from multiple responses. If this cannot be done, the connection should not be reused.
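
As an illustration of what we mean, here is a minimal sketch of the "do not reuse" alternative; release_or_replace/2 is a hypothetical helper, not existing CouchDB code, and it assumes the caller still knows the connection URL:

release_or_replace(Worker, Url) ->
    %% Rather than returning a possibly-dirty worker to the pool, stop it
    %% and hand back a freshly spawned one with clean parser state.
    ibrowse:stop_worker_process(Worker),
    {ok, Fresh} = ibrowse:spawn_link_worker_process(Url),
    {ok, Fresh}.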

Your Environment

  • CouchDB version used: observed in production on 2.3.1, but code review makes us believe this is also possible in 3.3.2
  • Browser name and version: N/A
  • Operating system and version: Ubuntu 20.04
@nickva
Contributor

nickva commented Nov 25, 2023

Thanks for your report @jcoglan!

I think in this case the first step might be to show that it's still happening in the latest release. 2.3.1 is a few years old (4?) and we've had a number of fixes in ibrowse and the replicator since then.

So perhaps try to set up a mirrored test cluster with the same data, access patterns, and settings, but running 3.3.2. Or construct a reproducer script that generates a similar scenario synthetically. See if, in the original environment, tweaking timeouts, worker processes, batch sizes, or other settings makes the problem more pronounced, and then crank those up in the test script to make it more likely for the problem to show up. For instance, if shortening fabric request timeouts to 10 sec makes the issue more likely to happen, then shorten it to 1 sec in the reproducer script.
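
Something like the following could be a starting point for that reproducer (untested; the URLs and timeouts are placeholders, and abandoning a streamed response mid-flight is only one guess at how to leave the worker in a dirty state):

repro(ChunkedUrl, PlainUrl) ->
    {ok, W} = ibrowse:spawn_link_worker_process(ChunkedUrl),
    %% Start streaming a chunked response, then stop consuming it so the
    %% worker may be left with partially parsed state.
    _ = ibrowse:send_req_direct(W, ChunkedUrl, [], get, [],
                                [{stream_to, self()}], 1000),
    timer:sleep(2000),
    %% Reuse the same worker for a plain, non-chunked request and check
    %% whether its response is parsed correctly.
    ibrowse:send_req_direct(W, PlainUrl, [], get, [], [], 5000).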

@janl
Member

janl commented Dec 1, 2023

@nickva that’s a fair ask :) — We tracked it as far as checking that the retry logic has not changed (much) since 2.3.1.

The problem for us is that the app and environment that cause this fail completely with 3.x, but that's something we are working on separately. Will report back :)
