
Frequent OOM kills - Isolated to http2 - 50 GB of RAM usage on MacBook, causing system OOM #361

Open
huntharo opened this issue Jan 3, 2024 · 5 comments

Comments

huntharo (Contributor) commented Jan 3, 2024

Intro

First off - awesome program! This solves the problem I have with hey, where it just gets slower as RTT increases even though the remote service can support the throughput. The animated UI really helps understand what's happening without having to wait for the whole thing to finish, which I love. Thanks for this!

Pairing

I can pair on this with you if you want. Google Meet or similar is fine. My email is on my profile.

Problem

  • I've encountered numerous OOM kills when using the program
  • The problems seem to happen more frequently with http2 and/or may only happen with http2
  • Had a case where my MacBook said it was out of memory (32 GB of RAM)
    • oha was using 50 GB of RAM!
    • This was not an unusual test; it just started failing in some way that causes a memory leak or accumulation
  • The case below, run on AWS CloudShell, lasts ~40 seconds each time before getting OOM killed
    • Everything is initially fine with CPU usage around 30% and memory usage around 0.5%
    • The program appears to freeze around 40 seconds
    • Memory usage shoots up to 25%, 35%, and more after the freeze
    • CPU usage shoots up to 100% after the freeze starts
    • The program exits reliably with exit code 137 (pretty sure this is an OOM kill)
    • Runs to completion if --http2 is removed and -c is adjusted so the total matches -c * -p from the --http2 run (example below)
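
For example (illustrative only, reusing the flags from the CloudShell command below), the equivalent HTTP1.1 run that completes looks something like this, with -c 200 standing in for the 10 × 20 = 200 concurrent streams of the --http2 run:

./oha-linux-amd64 -t 15s -z 5m -c 200 https://lambdadispatch.ghpublic.pwrdrvr.com/read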

Killed Terminations on CloudShell

Note: this endpoint is not open to random IPs but if you want to test against it, it is located in us-east-2 and I can add an IP to the security group for you to test with if you'd like.

./oha-linux-amd64 -t 15s -z 5m --http2 -c 10 -p 20 https://lambdadispatch.ghpublic.pwrdrvr.com/read

Killed at 43 seconds

[screenshot]

Happens with --no-tui Too

[screenshot]

Does NOT Happen without --http2 with Same Worker Count

[screenshot]

Nearly Final Memory

[screenshot]

Memory as the Problem Starts - ~30 Seconds of Operation

[screenshot]

Initial Memory - 2,000 RPS and Stable (0.4% memory usage)

[screenshot]
hatoo (Owner) commented Jan 4, 2024

Thanks for your report!

I did some investigating and found some weirdly huge memory consumption from oha against a specific HTTP server backend (such as Node 18's https module), although it isn't exactly the same phenomenon as this issue.

Could you tell me about your server's technology? (language, library, etc..)

huntharo (Contributor, Author) commented Jan 4, 2024

Could you tell me about your server's technology? (language, library, etc..)

The pwrdrvr.com domain name above points at an AWS ALB, which supports both HTTP2 and HTTP1.1. The ALB, in turn, points at a dotnet 8 Kestrel web server using HTTP1.1, which in turn points at a proxy inside Lambda functions using HTTP2, which finally proxies the request to a Node.js app. The Node.js app reads from DynamoDB and returns an average of 1.6 KB of payload through the layers above.

But essentially, the problem appears to happen when speaking to an AWS ALB with HTTP2.

The local problem I had where it used 50 GB of RAM was potentially pointed at the Node.js app using HTTP2 or at the dotnet 8 Kestrel server using HTTP1.1 (I don't have HTTP2 enabled for the Kestrel server, but I can). The details are fuzzy on this because it only happened once so far.

I could probably set up an AWS ALB route for you that just returns a constant response string, and I bet the issue would happen with that.

huntharo (Contributor, Author) commented Jan 5, 2024

I've got a solid lead now. I was running the code under the debugger and looking at heap profiles using pprof.

What I noticed from normal operation is that memory accumulates steadily. The primary usage, I think, comes from a vector that holds all the results, so that makes sense. Maybe that could be reduced if the detailed stats are not needed until the end, but maybe it cannot.
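
As a rough illustration (not oha's actual data structures): keeping one record per request in a Vec grows memory linearly with request count, whereas folding each result into a running summary stays constant-size; percentiles, though, would still need either all the samples or a quantile sketch.

```rust
use std::time::Duration;

/// A constant-size running summary, as an alternative to keeping every sample.
#[derive(Default)]
struct Summary {
    count: u64,
    total: Duration,
    min: Option<Duration>,
    max: Option<Duration>,
}

impl Summary {
    fn record(&mut self, d: Duration) {
        self.count += 1;
        self.total += d;
        self.min = Some(self.min.map_or(d, |m| m.min(d)));
        self.max = Some(self.max.map_or(d, |m| m.max(d)));
    }
}

fn main() {
    // Keeping every result (what a per-request Vec does):
    let mut all: Vec<Duration> = Vec::new();
    // Versus a fixed-size running summary:
    let mut summary = Summary::default();
    // ~2,000 RPS for 5 minutes is roughly 600,000 samples.
    for i in 0..600_000u64 {
        let d = Duration::from_micros(500 + i % 100);
        all.push(d); // grows linearly with request count
        summary.record(d); // stays a few machine words
    }
    println!("vec entries: {}, summary count: {}", all.len(), summary.count);
}
```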

Then it occurred to me that Node.js, locally, never sends a GOAWAY frame on an http2 socket, while the ALB likely does send one after some number of requests, say 10,000 or 100,000 per socket.

I realized the problem likely starts when the sockets are gracefully closed by the server. To simulate that, I just ctrl-c'd my Node.js process and, sure enough, oha started racing at 800% CPU, and RAM usage in the dev container went from 3% of total to 20%, 40%, 80%, and then an OOM kill.
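
To sketch the mechanism I suspect (hypothetical code, not oha's): once the server goes away, a connect or send attempt fails almost instantly, and if the failure is retried immediately, the loop spins as fast as the CPU allows:

```rust
use std::time::Duration;
use tokio::net::TcpStream;
use tokio::time::sleep;

// Hypothetical reconnect loop for one worker, once the server has gone away.
async fn reconnect_loop(addr: &str) {
    loop {
        match TcpStream::connect(addr).await {
            Ok(_stream) => {
                // ... drive HTTP/2 requests here until the connection dies ...
            }
            Err(e) => {
                eprintln!("connect failed: {e}");
                // Without this delay a refused connection fails in microseconds,
                // so the loop spins at tens of thousands of attempts per second
                // per worker, pegging the CPU; if each attempt is also recorded,
                // memory climbs with it.
                sleep(Duration::from_secs(1)).await;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    reconnect_loop("127.0.0.1:3001").await;
}
```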

tl;dr - To Reproduce

  1. For tests with server: Start node app locally with TLS cert and http2 support (probably any other stack would be fine too, just have http2)
  2. Start server, start http2 test: cargo run --release -- -c 20 -p 20 -z 5m --http2 --insecure https://host.docker.internal:3001/ping
    a. Ctrl-c the http2 server
    b. Observe memory usage of oha with top - it will climb rapidly until the process is OOM killed within a few seconds to a few tens of seconds
    c. CPU usage will jump to 800%, if available
    d. UI becomes unresponsive and prints no further info
    e. UI cannot be ctrl-c'd
  3. Do NOT restart server, start http2 test: cargo run --release -- -c 20 -p 20 -z 5m --http2 --insecure https://host.docker.internal:3001/ping
    a. If oha is started when the server is not running, it will report 20 refused connections and immediately exit. This is not the same as what happens if the connections are established but then lost.
  4. Start server, start http1.1 test: cargo run --release -- -c 200 -z 10m --insecure https://host.docker.internal:3001/ping
    a. Ctrl-C the http server
    b. Observe that oha reports refused connections and remains responsive for http1.1 when the server goes away
    c. Observe that the UI remains responsive and can be ctrl-c'd

hatoo (Owner) commented Jan 5, 2024

Thank you. It's very helpful!

I've succeeded in reproducing it. I will work on this on Saturday.

huntharo added a commit to huntharo/oha that referenced this issue Jan 5, 2024
- Partial fix for hatoo#361
- ONLY implemented for the `-z 10s` (work_until) case
- TODO:
   - [ ] The futures are not aborted when the timer is hit, which will cause long running requests to delay the program exit - this is only due to a borrow/move problem that I cannot figure out
   - [ ] Implement for the non-`work_until` cases
   - [ ] Add a timeout to the TCP socket setup - this appears to be where some of the delay on shutdown is happening if the server closes after startup
   - [ ] Consider adding a delay to the reconnect loop so that it will not try to connect more than 1 time per second per concurrent connection - Without this the connect loop will spin at ~23k connect attempts/second for `-c 20`, for example
- Test cases:
  - Start with the server not running at all (never connects)
    - Currently this will exit on time
    - IMPROVED: Previously this would attempt to connect once for each `-c`, fail, and immediately exit
    - IMPROVED: Currently this will repeatedly try to connect until the specified timeout expires, then it will exit
  - Start with the server running and leave it running
    - This works fine as before
  - Start with the server running, exit the server, then restart the server before the test completes
     - This initially makes requests
     - IMPROVED: Previously this would OOM even if the server restarted
     - IMPROVED: Currently this will reconnect and continue making requests if the server restarts
huntharo (Contributor, Author) commented Jan 5, 2024

I have submitted a partial PR that handles this mostly the same way it is handled for HTTP1.1: #363

I have a couple of to-dos in the PR description.
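
For the to-do about aborting in-flight futures when the timer fires, the general shape is roughly this (a minimal sketch assuming a tokio runtime, not the PR's actual code); tokio::select! drops the request future when the deadline branch wins, which is what cancels the in-flight work:

```rust
use std::time::Duration;
use tokio::time::{sleep, sleep_until, Instant};

// Hypothetical worker: race the next request against a hard deadline and drop
// whatever is still in flight when the timer fires.
async fn run_until_deadline(deadline: Instant) {
    loop {
        tokio::select! {
            _ = sleep_until(deadline) => {
                // Deadline hit: the request branch's future is dropped here,
                // which cancels the in-flight work instead of waiting for it.
                break;
            }
            _ = do_one_request() => {
                // Finished before the deadline; loop and send another.
            }
        }
    }
}

async fn do_one_request() {
    // Stand-in for sending one HTTP request.
    sleep(Duration::from_millis(50)).await;
}

#[tokio::main]
async fn main() {
    run_until_deadline(Instant::now() + Duration::from_secs(10)).await;
}
```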
