Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large file uploads fail when using TLS and Cowboy 2.7+ #1523

Open
Shemeikka opened this issue Jun 16, 2021 · 15 comments
Open

Large file uploads fail when using TLS and Cowboy 2.7+ #1523

Shemeikka opened this issue Jun 16, 2021 · 15 comments

Comments

@Shemeikka
Copy link

Hi,

We are experiencing failed file uploads when using TLS and Cowboy 2.7+ (2.7 - 2.9.0). Large file uploads (1GB+) will be shown as failed and stalled in Chrome and Firefox, and when using Curl, the response is curl: (56) Unexpected EOF.

Uploading smaller files about 100MB works with TLS and uploading large files works if TLS is not used. However, if we use TLS and Cowboy 2.6.3, then uploading large files works.

Test file can be generated using command fallocate -l 10G gentoo_root.img. Uploading the file can be done using command curl -F 'file=@gentoo_root.img' <URL>.

How we configure Cowboy in our Elixir project. Certificates are generated using mkcert.

# Configuring Cowboy in Elixir project's application.ex
{Plug.Cowboy,
scheme: server_options[:scheme],
plug: Knlgw.Endpoint,
options: [
    ip: server_host_ip,
    port: server_port,
    certfile: Keyword.get(server_config, :cert_file),
    keyfile: Keyword.get(server_config, :key_file),
    cacertfile: Keyword.get(server_config, :ca_cert_file),
    cipher_suite: :strong,
  ]
}

We use Plug.Parser for reading the multipart file upload. Those options i.e. length and others, don't have any effect on this problem.

# Use Plug.Parser.Multipart for parsing the request body if it's multipart
plug(Plug.Parsers,
  parsers: [
    {:multipart,
     length: 2_000_000_000, read_length: 5_000_000, read_timeout: 60_000, send_timeout: 10_000}
  ],
  pass: ["*/*"]
)

Versions:


hexpm/elixir:1.11.2-erlang-23.1.5-alpine-3.12.1

cowboy 2.9.0
plug_cowboy 2.5.0

Also tried with Erlang 23.3 versions and same experience.

@essen
Copy link
Member

essen commented Jun 16, 2021

Would be good to trace the Cowboy connection process to see what it's doing. Or get a process_info(Pid) as a start.

@Shemeikka
Copy link
Author

Hi,

Here's a process info for the cowboy supervisor

[
  registered_name: :cowboy_sup,
  current_function: {:gen_server, :loop, 7},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 0,
  links: [#PID<0.240.0>, #PID<0.242.0>],
  dictionary: [
    "$ancestors": [#PID<0.240.0>],
    "$initial_call": {:supervisor, :cowboy_sup, 1}
  ],
  trap_exit: true,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.239.0>,
  total_heap_size: 986,
  heap_size: 610,
  stack_size: 12,
  reductions: 1530,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 4
  ],
  suspending: []
]

Only change to the process info during the failing upload is slightly larger 'reductions'. Don't know how I could get process info for the actual connection process since the connection closes usually within 2-3 seconds.

Here's an link to a Gist which has output of Observer trace of Cowboy during the failed upload. I had to remove large amount of lines, mostly binary data, from the output.

@essen
Copy link
Member

essen commented Jun 17, 2021

No not cowboy_sup, the connection process. If it gets stuck in a function it'll tell you which but you have to find the process first. But nevermind that, it sounds like the connection is being closed, rather than it being stuck?

If it's a crash then you're going to have a crash report. If it's "normal" then you won't. Your best bet then would be to investigate what happens in terminate, it'll give you the close reason.

@Shemeikka
Copy link
Author

Should the tracing show if the connection process closes? The linked Gist only shows when the connection is spawned.

Any ideas how to catch cowboy connection process termination? I'm not receiving any crash reports or other messages to log when using standard plug_cowboy with added logging.

I have tried logging with plug by setting setting

config :plug_cowboy,
    log_exceptions_with_status_code: [1..599]

I also tried using PhoenixCowboyLogging which uses protocol_options: :onresponse but that didn't have any effect.

@essen
Copy link
Member

essen commented Jun 18, 2021

OK I have missed the gist. You can trace for proc events exclusively. That should be enough. No need for calls or messages.

@Shemeikka
Copy link
Author

Hi,

Here's a link to a Gist. This time it's more interesting than the previous.

Here's a process info of PID 0.3924.0 during the upload. Not sure if that's the connection process.

> Process.info(pid("0.3924.0"))
[
  current_function: {:gen_statem, :loop_receive, 3},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 0,
  links: [#PID<0.3923.0>, #Port<0.704>, #PID<0.132.0>],
  dictionary: [
    "$initial_call": {:ssl_gen_statem, :init, 1},
    ssl_pem_cache: :ssl_pem_cache,
    "$ancestors": [:tls_connection_sup, :tls_sup, :ssl_connection_sup, :ssl_sup,
     #PID<0.121.0>],
    ssl_manager: :ssl_manager
  ],
  trap_exit: true,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.120.0>,
  total_heap_size: 15143,
  heap_size: 4185,
  stack_size: 18,
  reductions: 29117680,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 1350
  ],
  suspending: []
]

I did some more testing and noticed that this upload failure happens only when using HTTP/2 with TLS. Uploading large files works when client forces the use of HTTP/1.1 with TLS.

@essen
Copy link
Member

essen commented Jun 21, 2021

There might have been interesting events but they have been dropped. The exit reason is normal and that doesn't help.

OK then I suppose figuring out what cowboy_http2 is doing is the next step. So again a trace this time dbg:tpl(cowboy_http2, []). dbg:p(all, c). - trace to file, and see at what point it goes into the terminate functions. What happens immediately before is likely the reason why the upload fails.

@Shemeikka
Copy link
Author

Hi,

I had to try several different settings due to the large amount of data but I finally got something interesting

08:12:30:156917 (<0.525.0>) in {code_server,call,1}
08:12:30:156987 (<0.525.0>) spawn <0.530.0> as erlang:apply(#Fun<gen_statem.0.120824088>,[])
08:12:30:156999 (<0.530.0>) spawned <0.525.0> {erlang,apply,[#Fun<gen_statem.0.120824088>,[]]}
08:12:30:157010 (<0.525.0>) out {gen_statem,call_clean,4}
08:12:30:157177 (<0.527.0>) in {cowboy_http2,terminate_linger_loop,3}
08:12:30:157200 (<0.527.0>) exit {shutdown,{connection_error,enhance_your_calm,
                            'Frame rate larger than configuration allows. Flood? (CVE-2019-9512, CVE-2019-9515, CVE-2019-9518)'}}
08:12:30:157210 (<0.527.0>) out_exited 0
08:12:30:157291 (<0.530.0>) in {erlang,apply,2}
08:12:30:157298 (<0.530.0>) out {gen,do_call,4}
08:12:30:157315 (<0.525.0>) in {gen_statem,call_clean,4}
08:12:30:157321 (<0.525.0>) out {gen_statem,call_clean,4}
08:12:30:157329 (<0.524.0>) in {gen_statem,loop_receive,3}
08:12:30:157363 (<0.524.0>) out {gen_statem,loop_receive,3}
08:12:30:157369 (<0.530.0>) in {gen,do_call,4}
08:12:30:157374 (<0.530.0>) exit normal
08:12:30:157377 (<0.530.0>) out_exited 0
08:12:30:157383 (<0.524.0>) in {gen_statem,loop_receive,3}
08:12:30:157386 (<0.524.0>) out {gen_statem,loop_receive,3}
08:12:30:157390 (<0.525.0>) in {gen_statem,call_clean,4}
08:12:30:157407 (<0.525.0>) spawn <0.531.0> as erlang:apply(#Fun<tls_connection.0.3952457>,[])
08:12:30:157410 (<0.531.0>) spawned <0.525.0> {erlang,apply,[#Fun<tls_connection.0.3952457>,[]]}
08:12:30:157417 (<0.525.0>) exit {shutdown,peer_close}
08:12:30:157422 (<0.525.0>) out_exited 0
08:12:30:157457 (<0.531.0>) in {erlang,apply,2}
08:12:30:157463 (<0.531.0>) out {tls_connection,'-ensure_sender_terminate/2-fun-0-',1}
08:12:30:157467 (<0.524.0>) in {gen_statem,loop_receive,3}
08:12:30:157472 (<0.524.0>) exit {shutdown,peer_close}
08:12:30:157478 (<0.524.0>) out_exited 0
08:12:35:158429 (<0.531.0>) in {tls_connection,'-ensure_sender_terminate/2-fun-0-',1}
08:12:35:158452 (<0.531.0>) exit normal
08:12:35:158457 (<0.531.0>) out_exited 0

This seems to be same problem as in #1497 ?

I tried setting max_received_frame like protocol_options: [max_received_frame_rate: {150_000, 10_000}] and with that value, the upload works with HTTP/2 and TLS.

Our use-case should be able to handle large files (10-20GB) and I'm thinking if we enable large (500_000+?) values for max_received_frame, then we are vulnerable against those mentioned CVEs. Are there any other mitigations in Cowboy for those CVEs?

@essen
Copy link
Member

essen commented Jun 22, 2021

I guess we should increase it again. I had increased to 10k/10s but if that's not enough then it should be increased again.

Those CVEs are not super important, especially in Erlang, and it's basically impossible to get a value that works out for everyone. So I think we could increase the default to whatever works out for you.

Alternatively, increasing max_frame_size_received should reduce the number of frames you're getting. The default is very conservative (it's the protocol's default). It might make sense to provide a different default for Cowboy.

A third solution would be to produce a warning log for these types of connection errors (enhance_your_calm) with the message included and perhaps some help to remedy to this situation.

@Shemeikka
Copy link
Author

Okay, I think it's enough if we just set both max_received_frame and max_frame_size_received to some good values in our application. Will have to test to figure out what those values could be.

@linchengcs
Copy link

linchengcs commented Jan 24, 2022

I once had the a similar issue, and fixed it by setting idle_timeout larger.
I guess the connection terminated before the file transfer finished.

@essen essen added this to the 2.11 milestone Nov 23, 2023
@essen
Copy link
Member

essen commented Jan 16, 2024

Hello everyone, I am looking for feedback as to what the max_received_frame and max_frame_size_received values should be. What did you end up using? Thank you.

For idle_timeout there is now an option reset_idle_timeout_on_send that should help. It will default to false for backward compatibility reasons but the default will likely change in 3.0.

@essen
Copy link
Member

essen commented Jan 16, 2024

Nginx allows 16MB (the maximum possible): https://github.com/nginx/nginx/blob/6ceef19/src/http/v2/ngx_http_v2.h#L21 and https://github.com/nginx/nginx/blob/6ceef19/src/http/v2/ngx_http_v2.c#L2758

I don't think we want to increase it that much but we can definitely use a higher default value. If we increase it to 1MB then this doesn't invalidate the CVE protection and we can still receive large files. I think I will also increase the max_receive_frame_sent so that we send responses faster by default. Using 1MB (1048576) for both.

@essen
Copy link
Member

essen commented Jan 16, 2024

OK the max frame size sent is already infinity so nothing to do there. For the max_frame_size_received I will only update the manual to indicate the importance of higher values. The default will have to wait for Cowboy 2.12 though because I am not updating Cowlib in 2.11. So this ticket will be left open for now.

@essen
Copy link
Member

essen commented Jan 16, 2024

Documentation update is done. Leaving this ticket open, thanks!

@essen essen removed this from the 2.11 milestone Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants