
reth killed by out-of-memory killer #8115

Open
gbrew opened this issue May 5, 2024 · 10 comments
Labels
C-bug An unexpected or incorrect behavior

Comments

gbrew (Contributor) commented May 5, 2024

Describe the bug

Reth 0.2.0-beta.6 running on Ubuntu 22.04 was killed by the OOM killer.

It had been running at the chain tip for more than two weeks before the jemalloc stats started to increase linearly from ~3.5GB to ~40GB over less than an hour, at which point it was killed. The machine has 64GB RAM.

The machine is running reth, Lighthouse, an Arbitrum node which is using reth, and an RPC client program which is primarily calling eth_call, debug_traceCall and eth_subscribe (for new blocks) on reth. It's not a particularly heavy RPC load, but there may be some bursts of activity. The client primarily uses the IPC transport for RPC. I don't see any signs of anything bad happening right before the OOM in any of the client logs or in the reth log.

Memory graphs are attached along with the relevant bit of reth.log. Happy to upload more info, but none of the other graphs on the dashboard showed any obvious issues.

Screenshot 2024-05-05 at 2 06 25 PM

reth.log

Steps to reproduce

1. Sync reth to the Ethereum chain tip
2. Subscribe to new blocks and run debug_traceCall to get logs and state changes from mempool txns, using IPC (see the sketch after this list)
3. Wait several weeks
4. Boom!
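
A rough sketch of the client workload described in step 2, assuming an ethers-rs program talking to reth over IPC. The socket path, the call object, and the tracer options are placeholders, not the reporter's actual code; assumed deps are ethers (with the "ipc" feature), tokio, futures-util, and serde_json.

```rust
use ethers::providers::{Ipc, Middleware, Provider};
use futures_util::StreamExt;
use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to reth's IPC endpoint (path is a placeholder).
    let provider = Provider::<Ipc>::connect_ipc("/path/to/reth.ipc").await?;

    // eth_subscribe("newHeads") over IPC.
    let mut blocks = provider.subscribe_blocks().await?;

    while let Some(block) = blocks.next().await {
        // debug_traceCall on each new head; the call object is a dummy.
        let call = json!({
            "to": "0x0000000000000000000000000000000000000000",
            "data": "0x"
        });
        let trace: Value = provider
            .request(
                "debug_traceCall",
                (call, "latest", json!({ "tracer": "prestateTracer" })),
            )
            .await?;
        println!("block {:?}: trace is {} bytes", block.number, trace.to_string().len());
    }
    Ok(())
}
```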

Node logs

No response

Platform(s)

Linux (x86)

What version/commit are you on?

reth Version: 0.2.0-beta.6
Commit SHA: ac29b4b
Build Timestamp: 2024-04-26T04:52:11.095680376Z
Build Features: jemalloc
Build Profile: maxperf

What database version are you on?

Current database version: 2
Local database version: 1

Which chain / network are you on?

mainnet

What type of node are you running?

Archive (default)

What prune config do you use, if any?

None

If you've built Reth from source, provide the full command you used

cargo build --profile maxperf --bin reth

Code of Conduct

  • I agree to follow the Code of Conduct
gbrew added the C-bug and S-needs-triage labels on May 5, 2024
gbrew (Contributor) commented May 5, 2024

I just restarted reth with jemalloc heap profiling enabled. I'll update the bug if I can trigger this again with the profiling, but it may take quite a while to reproduce.

DaniPopes removed the S-needs-triage label on May 6, 2024
Rjected (Member) commented May 6, 2024

hey @gbrew do you have debug logs you can send over? They would be in ~/.cache/reth/logs

gbrew (Contributor) commented May 6, 2024

> hey @gbrew do you have debug logs you can send over? They would be in ~/.cache/reth/logs

Sorry, the only logs I found there were not current, so I didn't include them. Will see if I need to change my config to get debug logs enabled again.

Rjected (Member) commented May 6, 2024

> hey @gbrew do you have debug logs you can send over? They would be in ~/.cache/reth/logs
>
> Sorry, the only logs I found there were not current, so I didn't include them. Will see if I need to change my config to get debug logs enabled again.

hmm, they should be enabled automatically, if this is running in a service of some sort, or as root? if it's running as a systemd system service or similar, then /root/.cache/reth/logs is the place to look

mattsse (Collaborator) commented May 6, 2024

maybe this happens because requests are coming in faster than we can write responses so the output buffer runs full
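
Not reth's code, just a minimal illustration of this hypothesis: if a peer's requests are turned into responses faster than the connection can write them out, an unbounded outgoing buffer grows without limit. The sizes and rates below are arbitrary.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

struct Response(Vec<u8>); // stand-in for an encoded p2p response

#[tokio::main]
async fn main() {
    // Unbounded queue between the request handler and the socket writer.
    let (out_tx, mut out_rx) = mpsc::unbounded_channel::<Response>();

    // Slow writer task: drains roughly 10 responses per second.
    tokio::spawn(async move {
        while let Some(_resp) = out_rx.recv().await {
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    });

    // Fast request handler: produces far more responses than the writer drains.
    // `send` on an unbounded channel never waits, so every queued response
    // stays resident on the heap until the writer catches up (if ever).
    for _ in 0..100_000 {
        let _ = out_tx.send(Response(vec![0u8; 64 * 1024]));
    }

    // With a bounded channel (`mpsc::channel(N)`), `send(...).await` would
    // instead suspend this loop once N responses are in flight: backpressure.
}
```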

gbrew (Contributor) commented May 6, 2024

> hey @gbrew do you have debug logs you can send over? They would be in ~/.cache/reth/logs
>
> Sorry, the only logs I found there were not current, so I didn't include them. Will see if I need to change my config to get debug logs enabled again.
>
> hmm, they should be enabled automatically, if this is running in a service of some sort, or as root? if it's running as a systemd system service or similar, then /root/.cache/reth/logs is the place to look

I just checked, and I'm running with --log.file.filter info on the command line, so I'm assuming that's why I'm not getting debug logs now. I'll change that to be debug on my next restart, but I don't want to interrupt my heap profiling run currently. I'm definitely running as a regular user directly from the command line, not via systemd or as root.

gbrew (Contributor) commented May 6, 2024

> maybe this happens because requests are coming in faster than we can write responses so the output buffer runs full

Are you thinking of a specific buffer which I can see on the dashboard? None of the ones I saw looked to be increasing in size during the memory growth FWIW.

gbrew (Contributor) commented May 17, 2024

I just caught this crash again, this time with heap profiling enabled.
Attached are a couple of heap profiles: the overall profile taken right before death, and a diff profile covering roughly the last minute before death.

If I'm reading them correctly, about 32GB of memory is attributed to the Vec allocated in get_headers_response().
I'm not sure whether this is serving RPC or p2p requests. The code already tries to limit the size of this response Vec using MAX_HEADERS_SERVE and SOFT_RESPONSE_LIMIT, so it seems like either the limiting is broken, or there are a huge number of header requests live at the same time.

diff_prof
last_prof
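
For context, the kind of limiting the comment refers to presumably looks roughly like the sketch below. This is not the actual reth implementation; the constant values, the Header type, and the size estimate are assumptions. The point is that each individual response is capped by both a header count and a soft byte limit, so a 32GB allocation would have to come from many responses being built or queued concurrently rather than from one oversized response.

```rust
const MAX_HEADERS_SERVE: usize = 1024; // assumed value
const SOFT_RESPONSE_LIMIT: usize = 2 * 1024 * 1024; // assumed value, bytes

struct Header { /* fields omitted */ }

fn approximate_size(_h: &Header) -> usize {
    500 // placeholder per-header size estimate
}

fn get_headers_response(mut next: impl FnMut() -> Option<Header>) -> Vec<Header> {
    let mut headers = Vec::new();
    let mut total_bytes = 0usize;

    while let Some(header) = next() {
        total_bytes += approximate_size(&header);
        headers.push(header);

        // Stop once either limit is hit, keeping each response bounded.
        if headers.len() >= MAX_HEADERS_SERVE || total_bytes >= SOFT_RESPONSE_LIMIT {
            break;
        }
    }
    headers
}
```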

gbrew (Contributor) commented May 18, 2024

Here are the Grafana jemalloc stats for the latest crash. Interestingly, there are 5 episodes of linear memory growth in the 3 hours before the crash, but in all but the last of them memory recovers before it runs out.

It does look consistent with reth continuing to process p2p requests while not sending the responses for long periods of time, along the lines of what @mattsse was suggesting above. It seems a bit crazy that this would go on for more than an hour straight, though. Maybe some explicit form of backpressure is needed to prioritize sending responses?

Screenshot 2024-05-18 at 9 14 15 AM
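
One possible shape for the explicit backpressure suggested here, purely as an illustration rather than a proposed patch: put outgoing responses on a bounded channel, so that when the writer falls behind, request handling awaits instead of letting responses accumulate on the heap. The types, channel sizes, and response payload below are placeholders.

```rust
use tokio::sync::mpsc;

struct HeadersRequest;
struct HeadersResponse(Vec<u8>);

// Serve requests, but push responses onto a *bounded* channel: when the
// socket writer falls behind, `send(...).await` suspends this loop, which
// stops pulling new requests instead of letting responses pile up.
async fn handle_requests(
    mut requests: mpsc::Receiver<HeadersRequest>,
    responses: mpsc::Sender<HeadersResponse>,
) {
    while let Some(_req) = requests.recv().await {
        let resp = HeadersResponse(vec![0u8; 64 * 1024]); // build the response
        if responses.send(resp).await.is_err() {
            break; // writer side closed; drop the connection
        }
    }
}

#[tokio::main]
async fn main() {
    let (req_tx, req_rx) = mpsc::channel::<HeadersRequest>(64);
    let (resp_tx, mut resp_rx) = mpsc::channel::<HeadersResponse>(64);

    tokio::spawn(handle_requests(req_rx, resp_tx));

    // Stand-ins for the peer connection: one side feeds a request,
    // the other drains the response.
    let _ = req_tx.send(HeadersRequest).await;
    if let Some(resp) = resp_rx.recv().await {
        println!("wrote {} bytes", resp.0.len());
    }
}
```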

gbrew (Contributor) commented May 20, 2024

I just found that the rate of incoming p2p header requests goes way up when the memory growth happens: it is about 1-2 reqs/s beforehand, then rises to about 25 reqs/s during the uncontrolled memory growth. So it does seem like reth is failing to handle that rate of header requests: the sending of responses gets starved, and they pile up until memory is exhausted.

Screenshot 2024-05-19 at 9 14 57 PM

Projects
Status: Todo
Development

No branches or pull requests

4 participants