
1.26 on hash consumed almost all memory and CPU, did not process blocks #6965

Open
yorickdowne opened this issue Apr 30, 2024 · 12 comments


yorickdowne commented Apr 30, 2024

Description

Nethermind 1.26 stopped processing blocks, gradually consumed almost all memory, and showed high CPU usage.

A restart fixed it and the node caught up again. It would not quit cleanly, however; Docker reaped it after the configured 5-minute container-stop timeout.
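For reference, a minimal sketch of how such a stop grace period is typically configured in Docker (the container name here is hypothetical):

# Give the container up to 300 s to exit cleanly before SIGKILL
docker stop --time 300 nethermind

# or set it when the container is created
docker run --stop-timeout 300 nethermind/nethermind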

The CL is Nimbus v24.4.0.

Logs don't appear to show a clear root cause.

Nethermind is started with the following pruning parameters (it was not running a full prune at the time): --Pruning.FullPruningMaxDegreeOfParallelism=3 --Pruning.FullPruningTrigger=VolumeFreeSpace --Pruning.FullPruningThresholdMb=375810 --Pruning.CacheMb=4096 --Pruning.FullPruningMemoryBudgetMb=16384 --Init.StateDbKeyScheme=HalfPath
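For readability, the same flags one per line; the annotations are my reading of the Nethermind pruning options, not part of the original command:

--Pruning.FullPruningMaxDegreeOfParallelism=3   # cap full-prune parallelism at 3 threads
--Pruning.FullPruningTrigger=VolumeFreeSpace    # start a full prune when volume free space is low...
--Pruning.FullPruningThresholdMb=375810         # ...below this threshold (~375 GB)
--Pruning.CacheMb=4096                          # in-memory pruning cache size
--Pruning.FullPruningMemoryBudgetMb=16384       # memory budget for the full-prune copy
--Init.StateDbKeyScheme=HalfPath                # write new state with the HalfPath key scheme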

Full startup parameters as shown by ps auxww:

/nethermind/nethermind --datadir /var/lib/nethermind --Init.WebSocketsEnabled true --Network.DiscoveryPort 30303 --Network.P2PPort 30303 --Network.MaxActivePeers 50 --HealthChecks.Enabled true --HealthChecks.UIEnabled true --JsonRpc.Enabled true --JsonRpc.Host 0.0.0.0 --JsonRpc.Port 8545 --JsonRpc.WebSocketsPort 8546 --JsonRpc.EngineHost 0.0.0.0 --JsonRpc.EnginePort 8551 --JsonRpc.AdditionalRpcUrls=http://127.0.0.1:1337|http|admin --JsonRpc.JwtSecretFile=/var/lib/nethermind/ee-secret/jwtsecret --Metrics.Enabled true --Metrics.PushGatewayUrl  --Metrics.ExposeHost 0.0.0.0 --Metrics.ExposePort 6060 --Pruning.FullPruningCompletionBehavior AlwaysShutdown --log info --config mainnet --JsonRpc.EnabledModules Web3,Eth,Subscribe,Net,Health,Parity,Proof,Trace,TxPool --Pruning.FullPruningMaxDegreeOfParallelism=3 --Pruning.FullPruningTrigger=VolumeFreeSpace --Pruning.FullPruningThresholdMb=375810 --Pruning.CacheMb=4096 --Pruning.FullPruningMemoryBudgetMb=16384 --Init.StateDbKeyScheme=HalfPath

Steps to Reproduce

Unsure

Setup information:

  • Operating System: Linux
  • Version: 1.26.0+0068729c
  • Installation Method: Docker
  • Consensus Client: Nimbus v24.4.0

Logs

  • nimbus.log.gz: Nimbus logs
  • nm-hang.log.gz: Nethermind in its failure state
  • nm-catchup.log.gz: Nethermind after restart, while catching up


yorickdowne commented May 1, 2024

This happened again on another machine, with a Lodestar CL. It's unlikely to be an interop issue, then. I will remove --Pruning.CacheMb=4096 to see whether that makes a difference.
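(A quick sanity check after restart, sketched against the same ps auxww listing as above, to confirm the flag is really gone:)

ps auxww | grep -v grep | grep -o -- '--Pruning.CacheMb=[0-9]*' \
  || echo 'Pruning.CacheMb not set; using the default'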


asdacap commented May 2, 2024

Hi, do you have logs from the machine with the Lodestar CL? Also, can you double-check that the CL is synced?


yorickdowne commented May 3, 2024

We have the logs in Loki, let me see what I can pull out. The CL is synced, yes.

Since removing --Pruning.CacheMb=4096 from all our nodes, we haven't seen a failure in 2 days, whereas previously 2 different nodes failed within 2 days.

yorickdowne referenced this issue in eth-educators/eth-docker May 5, 2024

asdacap commented May 8, 2024

May I know why --Init.StateDbKeyScheme=HalfPath is set while running on the Hash scheme? Do you plan to migrate via full pruning?

yorickdowne commented:

Yes and yes. The intent was to allow the auto prune to migrate.
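For anyone following along, the intended migration path as I understand it (a sketch assembled from the flags above; the step descriptions are my reading, not quoted from the Nethermind docs):

# 1. Run the node on the existing Hash DB with --Init.StateDbKeyScheme=HalfPath set.
# 2. When volume free space falls below --Pruning.FullPruningThresholdMb, the
#    VolumeFreeSpace trigger starts a full prune, which copies the live state
#    into a fresh database written with HalfPath keys.
# 3. With --Pruning.FullPruningCompletionBehavior AlwaysShutdown, the node shuts
#    down once the prune completes; on restart it serves from the migrated
#    HalfPath state DB.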

Have not seen another failure since removing the pruning cache parameter.


asdacap commented May 20, 2024

Could not reproduce it when forward syncing. The problem may come from outside block processing.

yorickdowne commented:

I’ve seen this once more; it just took those 2 weeks.

I’ve configured one of my servers with the cache parameter and debug logs, to hopefully catch this when it happens.


asdacap commented May 20, 2024

Are those nodes running Hash as well, or HalfPath?

yorickdowne commented:

This is Hash, with the HalfPath parameter set so it converts during full pruning.


asdacap commented May 20, 2024

Are these nodes validators?

yorickdowne commented:

Yes, they are part of my Lido setup: Nimbus or Lodestar depending on the server, with 1,000 validators via Vouch per cluster. That means there is a decent amount of block building going on.


asdacap commented May 21, 2024

Hmm... simply re-running blocks with the block producer's block processor does not reproduce it. It needs actual block production.
