
reth killed by out-of-memory killer #8115

Open
gbrew opened this issue May 5, 2024 · 10 comments
Labels
C-bug An unexpected or incorrect behavior

Comments

gbrew (Contributor) commented May 5, 2024

Describe the bug

Reth 0.2.0-beta.6 running on Ubuntu 22.04 was killed by the OOM killer.

It had been running at the chain tip for more than two weeks before the jemalloc stats started to increase linearly from ~3.5GB to ~40GB over less than an hour, at which point it was killed. The machine has 64GB RAM.

The machine is running reth, Lighthouse, an Arbitrum node which is using reth, and an RPC client program which is primarily calling eth_call, debug_traceCall and eth_subscribe (for new blocks) on reth. It's not a particularly heavy RPC load, but there may be some bursts of activity. The client primarily uses the IPC transport for RPC. I don't see any signs of anything bad happening right before the OOM in any of the client logs or in the reth log.

Memory graphs are attached along with the relevant bit of reth.log. Happy to upload more info, but none of the other graphs on the dashboard showed any obvious issues.

Screenshot 2024-05-05 at 2 06 25 PM

reth.log

Steps to reproduce

1. Sync reth to the Ethereum chain tip
2. Subscribe to new blocks and run debug_traceCall to get logs and state changes from mempool txns, using IPC (see the sketch after this list)
3. Wait several weeks
4. Boom!
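
A rough sketch of the client workload described in step 2, assuming an ethers-rs program talking to reth over IPC. The socket path, the call object, and the tracer options are placeholders, not the reporter's actual code; assumed deps are ethers (with the "ipc" feature), tokio, futures-util, and serde_json.

```rust
use ethers::providers::{Ipc, Middleware, Provider};
use futures_util::StreamExt;
use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to reth's IPC endpoint (path is a placeholder).
    let provider = Provider::<Ipc>::connect_ipc("/path/to/reth.ipc").await?;

    // eth_subscribe("newHeads") over IPC.
    let mut blocks = provider.subscribe_blocks().await?;

    while let Some(block) = blocks.next().await {
        // debug_traceCall on each new head; the call object is a dummy.
        let call = json!({
            "to": "0x0000000000000000000000000000000000000000",
            "data": "0x"
        });
        let trace: Value = provider
            .request(
                "debug_traceCall",
                (call, "latest", json!({ "tracer": "prestateTracer" })),
            )
            .await?;
        println!("block {:?}: trace is {} bytes", block.number, trace.to_string().len());
    }
    Ok(())
}
```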

Node logs

No response

Platform(s)

Linux (x86)

What version/commit are you on?

reth Version: 0.2.0-beta.6
Commit SHA: ac29b4b
Build Timestamp: 2024-04-26T04:52:11.095680376Z
Build Features: jemalloc
Build Profile: maxperf

What database version are you on?

Current database version: 2
Local database version: 1

Which chain / network are you on?

mainnet

What type of node are you running?

Archive (default)

What prune config do you use, if any?

None

If you've built Reth from source, provide the full command you used

cargo build --profile maxperf --bin reth

Code of Conduct

  • I agree to follow the Code of Conduct
gbrew added the C-bug and S-needs-triage labels on May 5, 2024
gbrew (Contributor) commented May 5, 2024

I just restarted reth with jemalloc heap profiling enabled. I'll update the bug if I can trigger this again with the profiling, but it may take quite a while to reproduce.

DaniPopes removed the S-needs-triage label on May 6, 2024
Rjected (Member) commented May 6, 2024

hey @gbrew do you have debug logs you can send over? They would be in ~/.cache/reth/logs

gbrew (Contributor) commented May 6, 2024

> hey @gbrew do you have debug logs you can send over? They would be in ~/.cache/reth/logs

Sorry, the only logs I found there were not current, so I didn't include them. Will see if I need to change my config to get debug logs enabled again.

Rjected (Member) commented May 6, 2024

> hey @gbrew do you have debug logs you can send over? They would be in ~/.cache/reth/logs
>
> Sorry, the only logs I found there were not current, so I didn't include them. Will see if I need to change my config to get debug logs enabled again.

hmm, they should be enabled automatically, if this is running in a service of some sort, or as root? if it's running as a systemd system service or similar, then /root/.cache/reth/logs is the place to look

mattsse (Collaborator) commented May 6, 2024

maybe this happens because requests are coming in faster than we can write responses so the output buffer runs full
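
Not reth's code, just a minimal illustration of this hypothesis: if a peer's requests are turned into responses faster than the connection can write them out, an unbounded outgoing buffer grows without limit. The sizes and rates below are arbitrary.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

struct Response(Vec<u8>); // stand-in for an encoded p2p response

#[tokio::main]
async fn main() {
    // Unbounded queue between the request handler and the socket writer.
    let (out_tx, mut out_rx) = mpsc::unbounded_channel::<Response>();

    // Slow writer task: drains roughly 10 responses per second.
    tokio::spawn(async move {
        while let Some(_resp) = out_rx.recv().await {
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    });

    // Fast request handler: produces far more responses than the writer drains.
    // `send` on an unbounded channel never waits, so every queued response
    // stays resident on the heap until the writer catches up (if ever).
    for _ in 0..100_000 {
        let _ = out_tx.send(Response(vec![0u8; 64 * 1024]));
    }

    // With a bounded channel (`mpsc::channel(N)`), `send(...).await` would
    // instead suspend this loop once N responses are in flight: backpressure.
}
```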

gbrew (Contributor) commented May 6, 2024

> hey @gbrew do you have debug logs you can send over? They would be in ~/.cache/reth/logs
>
> Sorry, the only logs I found there were not current, so I didn't include them. Will see if I need to change my config to get debug logs enabled again.
>
> hmm, they should be enabled automatically, if this is running in a service of some sort, or as root? if it's running as a systemd system service or similar, then /root/.cache/reth/logs is the place to look

I just checked, and I'm running with --log.file.filter info on the command line, so I'm assuming that's why I'm not getting debug logs now. I'll change that to be debug on my next restart, but I don't want to interrupt my heap profiling run currently. I'm definitely running as a regular user directly from the command line, not via systemd or as root.

gbrew (Contributor) commented May 6, 2024

> maybe this happens because requests are coming in faster than we can write responses so the output buffer runs full

Are you thinking of a specific buffer which I can see on the dashboard? None of the ones I saw looked to be increasing in size during the memory growth FWIW.

gbrew (Contributor) commented May 17, 2024

I just caught this crash again, this time with heap profiling enabled.
Attached are a couple of heap profiles: the overall profile taken right before death, and a diff profile covering roughly the last minute before death.

If I'm reading them correctly, about 32GB of memory is attributed to the Vec allocated in get_headers_response().
I'm not sure whether this is serving RPC or p2p requests. The code already tries to limit the size of this response Vec using MAX_HEADERS_SERVE and SOFT_RESPONSE_LIMIT, so it seems like either the limiting is broken, or there are a huge number of header requests live at the same time.

diff_prof
last_prof
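
For context, the kind of limiting the comment refers to presumably looks roughly like the sketch below. This is not the actual reth implementation; the constant values, the Header type, and the size estimate are assumptions. The point is that each individual response is capped by both a header count and a soft byte limit, so a 32GB allocation would have to come from many responses being built or queued concurrently rather than from one oversized response.

```rust
const MAX_HEADERS_SERVE: usize = 1024; // assumed value
const SOFT_RESPONSE_LIMIT: usize = 2 * 1024 * 1024; // assumed value, bytes

struct Header { /* fields omitted */ }

fn approximate_size(_h: &Header) -> usize {
    500 // placeholder per-header size estimate
}

fn get_headers_response(mut next: impl FnMut() -> Option<Header>) -> Vec<Header> {
    let mut headers = Vec::new();
    let mut total_bytes = 0usize;

    while let Some(header) = next() {
        total_bytes += approximate_size(&header);
        headers.push(header);

        // Stop once either limit is hit, keeping each response bounded.
        if headers.len() >= MAX_HEADERS_SERVE || total_bytes >= SOFT_RESPONSE_LIMIT {
            break;
        }
    }
    headers
}
```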

gbrew (Contributor) commented May 18, 2024

Here are the Grafana jemalloc stats for the latest crash. Interestingly, there are 5 episodes of linear memory growth in the 3 hours before the crash, but in all but the last of them memory recovers before it runs out.

It does look consistent with reth continuing to process p2p requests while not sending the responses for long periods of time, along the lines of what @mattsse was suggesting above. It seems a bit crazy that this would go on for more than an hour straight, though. Maybe some explicit form of backpressure is needed to prioritize sending responses?

Screenshot 2024-05-18 at 9 14 15 AM
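
One possible shape for the explicit backpressure suggested here, purely as an illustration rather than a proposed patch: put outgoing responses on a bounded channel, so that when the writer falls behind, request handling awaits instead of letting responses accumulate on the heap. The types, channel sizes, and response payload below are placeholders.

```rust
use tokio::sync::mpsc;

struct HeadersRequest;
struct HeadersResponse(Vec<u8>);

// Serve requests, but push responses onto a *bounded* channel: when the
// socket writer falls behind, `send(...).await` suspends this loop, which
// stops pulling new requests instead of letting responses pile up.
async fn handle_requests(
    mut requests: mpsc::Receiver<HeadersRequest>,
    responses: mpsc::Sender<HeadersResponse>,
) {
    while let Some(_req) = requests.recv().await {
        let resp = HeadersResponse(vec![0u8; 64 * 1024]); // build the response
        if responses.send(resp).await.is_err() {
            break; // writer side closed; drop the connection
        }
    }
}

#[tokio::main]
async fn main() {
    let (req_tx, req_rx) = mpsc::channel::<HeadersRequest>(64);
    let (resp_tx, mut resp_rx) = mpsc::channel::<HeadersResponse>(64);

    tokio::spawn(handle_requests(req_rx, resp_tx));

    // Stand-ins for the peer connection: one side feeds a request,
    // the other drains the response.
    let _ = req_tx.send(HeadersRequest).await;
    if let Some(resp) = resp_rx.recv().await {
        println!("wrote {} bytes", resp.0.len());
    }
}
```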

gbrew (Contributor) commented May 20, 2024

I just found that the rate of incoming p2p header requests goes way up when the memory growth happens: it is about 1-2 reqs/s beforehand, then rises to about 25 reqs/s during the uncontrolled memory growth. So it does seem like reth is failing to handle that rate of header requests: the sending of responses gets starved, and they pile up until memory is exhausted.

Screenshot 2024-05-19 at 9 14 57 PM

Projects
Status: Todo
Development

No branches or pull requests

4 participants