
Frequent OOM kills - Isolated to http2 - 50 GB of RAM usage on MacBook, causing system OOM #361

Open
huntharo opened this issue Jan 3, 2024 · 5 comments

Comments

huntharo (Contributor) commented Jan 3, 2024

Intro

First off - awesome program! This solves the problem I have with hey, where it just gets slower as RTT increases even though the remote service can support the throughput. The animated UI really helps understand what's happening without having to wait for the whole thing to finish, which I love. Thanks for this!

Pairing

I can pair on this with you if you want. Google Meet or similar is fine. My email is on my profile.

Problem

  • I've encountered numerous OOM kills when using the program
  • The problems seem to happen more frequently with http2 and/or may only happen with http2
  • Had a case where my MacBook said it was out of memory (32 GB of RAM)
    • oha was using 50 GB of RAM!
    • This was not an unusual test; it just started failing in some way that causes a memory leak or accumulation
  • The case below, run on AWS CloudShell, lasts ~40 seconds each time before getting OOM killed
    • Everything is initially fine with CPU usage around 30% and memory usage around 0.5%
    • The program appears to freeze around 40 seconds
    • Memory usage shoots up to 25%, 35%, and more after the freeze
    • CPU usage shoots up to 100% after the freeze starts
    • The program exits reliably with exit code 137 (pretty sure this is an OOM kill)
    • Runs to completion if --http2 is removed and -c is adjusted so the total matches -c * -p from the --http2 run (example below)
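
For example (illustrative only, reusing the flags from the CloudShell command below), the equivalent HTTP1.1 run that completes looks something like this, with -c 200 standing in for the 10 × 20 = 200 concurrent streams of the --http2 run:

./oha-linux-amd64 -t 15s -z 5m -c 200 https://lambdadispatch.ghpublic.pwrdrvr.com/read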

Killed Terminations on CloudShell

Note: this endpoint is not open to random IPs but if you want to test against it, it is located in us-east-2 and I can add an IP to the security group for you to test with if you'd like.

./oha-linux-amd64 -t 15s -z 5m --http2 -c 10 -p 20 https://lambdadispatch.ghpublic.pwrdrvr.com/read

Killed at 43 seconds

[screenshot]

Happens with --no-tui Too

[screenshot]

Does NOT Happen without --http2 with Same Worker Count

[screenshot]

Nearly Final Memory

[screenshot]

Memory as the Problem Starts - ~30 Seconds of Operation

[screenshot]

Initial Memory - 2,000 RPS and Stable (0.4% memory usage)

[screenshot]
hatoo (Owner) commented Jan 4, 2024

Thanks for your report!

I did some investigating and found some weirdly huge memory consumption from oha against a specific HTTP server backend (such as Node 18's https module), although it isn't exactly the same phenomenon as this issue.

Could you tell me about your server's technology? (language, library, etc..)

huntharo (Contributor, Author) commented Jan 4, 2024

Could you tell me about your server's technology? (language, library, etc..)

The pwrdrvr.com domain name above points at an AWS ALB, which supports both HTTP2 and HTTP1.1. The ALB, in turn, points at a dotnet 8 Kestrel web server using HTTP1.1, which in turn points at a proxy inside Lambda functions using HTTP2, which finally proxies the request to a Node.js app. The Node.js app reads from DynamoDB and returns an average of 1.6 KB of payload through the layers above.

But essentially, the problem appears to happen when speaking to an AWS ALB with HTTP2.

The local problem I had where it used 50 GB of RAM was potentially pointed at the Node.js app using HTTP2 or at the dotnet 8 Kestrel server using HTTP1.1 (I don't have HTTP2 enabled for the Kestrel server, but I can). The details are fuzzy on this because it only happened once so far.

I could probably set up an AWS ALB route for you that just returns a constant response string, and I bet the issue would happen with that.

huntharo (Contributor, Author) commented Jan 5, 2024

I've got a solid lead now. I was running the code under the debugger and looking at heap profiles using pprof.

What I noticed from normal operation is that memory accumulates steadily. The primary usage, I think, comes from a vector that holds all the results, so that makes sense. Maybe that could be reduced if the detailed stats are not needed until the end, but maybe it cannot.
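
As a rough illustration (not oha's actual data structures): keeping one record per request in a Vec grows memory linearly with request count, whereas folding each result into a running summary stays constant-size; percentiles, though, would still need either all the samples or a quantile sketch.

```rust
use std::time::Duration;

/// A constant-size running summary, as an alternative to keeping every sample.
#[derive(Default)]
struct Summary {
    count: u64,
    total: Duration,
    min: Option<Duration>,
    max: Option<Duration>,
}

impl Summary {
    fn record(&mut self, d: Duration) {
        self.count += 1;
        self.total += d;
        self.min = Some(self.min.map_or(d, |m| m.min(d)));
        self.max = Some(self.max.map_or(d, |m| m.max(d)));
    }
}

fn main() {
    // Keeping every result (what a per-request Vec does):
    let mut all: Vec<Duration> = Vec::new();
    // Versus a fixed-size running summary:
    let mut summary = Summary::default();
    // ~2,000 RPS for 5 minutes is roughly 600,000 samples.
    for i in 0..600_000u64 {
        let d = Duration::from_micros(500 + i % 100);
        all.push(d); // grows linearly with request count
        summary.record(d); // stays a few machine words
    }
    println!("vec entries: {}, summary count: {}", all.len(), summary.count);
}
```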

Then it occurred to me that Node.js, locally, never sends a GOAWAY frame on an http2 socket, while the ALB likely does send one after some number of requests, say 10,000 or 100,000 per socket.

I realized the problem likely starts when the sockets are gracefully closed by the server. To simulate that, I just ctrl-c'd my Node.js process and, sure enough, oha started racing at 800% CPU, and RAM usage in the dev container went from 3% of total to 20%, 40%, 80%, and then an OOM kill.
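
To sketch the mechanism I suspect (hypothetical code, not oha's): once the server goes away, a connect or send attempt fails almost instantly, and if the failure is retried immediately, the loop spins as fast as the CPU allows:

```rust
use std::time::Duration;
use tokio::net::TcpStream;
use tokio::time::sleep;

// Hypothetical reconnect loop for one worker, once the server has gone away.
async fn reconnect_loop(addr: &str) {
    loop {
        match TcpStream::connect(addr).await {
            Ok(_stream) => {
                // ... drive HTTP/2 requests here until the connection dies ...
            }
            Err(e) => {
                eprintln!("connect failed: {e}");
                // Without this delay a refused connection fails in microseconds,
                // so the loop spins at tens of thousands of attempts per second
                // per worker, pegging the CPU; if each attempt is also recorded,
                // memory climbs with it.
                sleep(Duration::from_secs(1)).await;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    reconnect_loop("127.0.0.1:3001").await;
}
```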

tl;dr - To Reproduce

  1. For tests with server: Start node app locally with TLS cert and http2 support (probably any other stack would be fine too, just have http2)
  2. Start server, start http2 test: cargo run --release -- -c 20 -p 20 -z 5m --http2 --insecure https://host.docker.internal:3001/ping
    a. Ctrl-c the http2 server
    b. Observe memory usage of oha with top - it will climb rapidly until the process is OOM killed within a few seconds to a few tens of seconds
    c. CPU usage will jump to 800%, if available
    d. UI becomes unresponsive and prints no further info
    e. UI cannot be ctrl-c'd
  3. Do NOT restart server, start http2 test: cargo run --release -- -c 20 -p 20 -z 5m --http2 --insecure https://host.docker.internal:3001/ping
    a. If oha is started when the server is not running, it will report 20 refused connections and immediately exit. This is not the same as what happens if the connections are established but then lost.
  4. Start server, start http1.1 test: cargo run --release -- -c 200 -z 10m --insecure https://host.docker.internal:3001/ping
    a. Ctrl-C the http server
    b. Observe that oha reports refused connections and remains responsive for http1.1 when the server goes away
    c. Observe that the UI remains responsive and can be ctrl-c'd

hatoo (Owner) commented Jan 5, 2024

Thank you. It's very helpful!

I've succeeded in reproducing it. I will work on this on Saturday.

huntharo added a commit to huntharo/oha that referenced this issue Jan 5, 2024
- Partial fix for hatoo#361
- ONLY implemented for the `-z 10s` (work_until) case
- TODO:
   - [ ] The futures are not aborted when the timer is hit, which will cause long running requests to delay the program exit - this is only due to a borrow/move problem that I cannot figure out
   - [ ] Implement for the non-`work_until` cases
   - [ ] Add a timeout to the TCP socket setup - this appears to be where some of the delay on shutdown is happening if the server closes after startup
   - [ ] Consider adding a delay to the reconnect loop so that it will not try to connect more than 1 time per second per concurrent connection - Without this the connect loop will spin at ~23k connect attempts/second for `-c 20`, for example
- Test cases:
  - Start with the server not running at all (never connects)
    - Currently this will exit on time
    - IMPROVED: Previously this would attempt to connect once for each `-c`, fail, and immediately exit
    - IMPROVED: Currently this will repeatedly try to connect until the specified timeout expires, then it will exit
  - Start with the server running and leave it running
    - This works fine as before
  - Start with the server running, exit the server, then restart the server before the test completes
     - This initially makes requests
     - IMPROVED: Previously this would OOM even if the server restarted
     - IMPROVED: Currently this will reconnect and continue making requests if the server restarts
huntharo (Contributor, Author) commented Jan 5, 2024

I have submitted a partial PR that handles this mostly the same way it is handled for HTTP1.1: #363

I have a couple of to-dos in the PR description.
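
For the to-do about aborting in-flight futures when the timer fires, the general shape is roughly this (a minimal sketch assuming a tokio runtime, not the PR's actual code); tokio::select! drops the request future when the deadline branch wins, which is what cancels the in-flight work:

```rust
use std::time::Duration;
use tokio::time::{sleep, sleep_until, Instant};

// Hypothetical worker: race the next request against a hard deadline and drop
// whatever is still in flight when the timer fires.
async fn run_until_deadline(deadline: Instant) {
    loop {
        tokio::select! {
            _ = sleep_until(deadline) => {
                // Deadline hit: the request branch's future is dropped here,
                // which cancels the in-flight work instead of waiting for it.
                break;
            }
            _ = do_one_request() => {
                // Finished before the deadline; loop and send another.
            }
        }
    }
}

async fn do_one_request() {
    // Stand-in for sending one HTTP request.
    sleep(Duration::from_millis(50)).await;
}

#[tokio::main]
async fn main() {
    run_until_deadline(Instant::now() + Duration::from_secs(10)).await;
}
```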
