Full node unable to apply election block #1704

Open
Eligioo opened this issue Jul 13, 2023 · 9 comments · May be fixed by #2370
Labels: bug (Something isn't working)

Eligioo (Member) commented Jul 13, 2023

Currently, fully synced full nodes exhibit an issue where, after some event, they lose the ability to store new incoming blocks and therefore can't keep up.

Multiple incidents have been observed, and so far they all show the same pattern just before the node stops storing blocks:

2023-07-13T13:04:40.637552192Z INFO  micro                | Skip block pushed block_number=233038
2023-07-13T13:04:40.663131683Z DEBUG validator            | Initializing block producer next_block_number=233039
2023-07-13T13:04:40.672803841Z DEBUG micro                | Not our turn, waiting for micro block #233039 block_number=233039 slot_band=4
2023-07-13T13:04:40.674427687Z DEBUG block_queue          | Received block via gossipsub block=#233039:MI:5633601d56
2023-07-13T13:04:40.674597462Z DEBUG block_queue          | Received block via gossipsub block=#233038:MI:0d572aaf9d
2023-07-13T13:04:40.992743659Z INFO  chain_ordering       | New block in Unknown chain fork_block_number=233037 current_block_number=233038 new_block_number=233038
2023-07-13T13:04:40.992827717Z DEBUG push                 | Storing block - on fork block=#233038:MI:74af9b8a35
2023-07-13T13:04:42.066183858Z INFO  chain_ordering       | New block in Inferior chain fork_block_number=233037 current_block_number=233038 new_block_number=233038
2023-07-13T13:04:42.066278922Z DEBUG push                 | Storing block - on inferior chain block=#233038:MI:0d572aaf9d
2023-07-13T13:04:42.681963371Z DEBUG block_queue          | Received block via gossipsub block=#233040:MA:686b97ed50

After this, only three kinds of log messages are displayed, and interestingly, receiving blocks via gossipsub is one of them. This gives the impression that the connections to other nodes aren't actually dropped, but that the node is unable to process the incoming blocks. A deadlock?

2023-07-13T13:04:43.736174015Z ERROR network_impl         | could not send request peer_id=12D3KooWAjftHreEDsNMK3qPBSe63LZcQorv6q8H3FR7CWg57CxV error=Inbound error: No receiver for request
2023-07-13T13:04:43.889421145Z DEBUG consensus            | Initiating head requests (to 11 peers)
2023-07-13T13:04:45.422861040Z DEBUG block_queue          | Received block via gossipsub block=#233040:MA:686b97ed50

Note that the consensus head hangs at #233038 while new gossipsub messages keep coming in. This looks similar to #1692.

2023-07-13T13:06:18.681751953Z DEBUG block_queue          | Received block via gossipsub block=#233127:MI:2373e55674
2023-07-13T13:06:09.389811588Z INFO  nimiq_client         | Consensus status: true - Head: #233038- 5506df7b33945b78f7985aa0b20a876b55ee43e90575349aa0480dc53303c1a2 block_number=233038 num_peers=14 status=true
2023-07-13T13:06:29.675350336Z DEBUG block_queue          | Received block via gossipsub block=#233138:MI:0ac26a9b0e
2023-07-13T13:06:29.389776842Z INFO  nimiq_client         | Consensus status: true - Head: #233038- 5506df7b33945b78f7985aa0b20a876b55ee43e90575349aa0480dc53303c1a2 block_number=233038 num_peers=14 status=true
2023-07-13T13:06:48.813880555Z DEBUG block_queue          | Received block via gossipsub block=#233157:MI:379d8d4a25
2023-07-13T13:06:49.389738268Z INFO  nimiq_client         | Consensus status: true - Head: #233038- 5506df7b33945b78f7985aa0b20a876b55ee43e90575349aa0480dc53303c1a2 block_number=233038 num_peers=14 status=true

At some point the node falls so far behind that verification of mempool transactions starts to fail:

2023-07-13T13:05:52.071047472Z DEBUG verify               | Mempool-verify tx invalid at this block height block_number=233039 validity_start_height=233100
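
The mempool failure itself follows from the stalled head: a transaction is only accepted while the chain head lies within its validity window, which begins at `validity_start_height`. Since the head is stuck at #233039 while incoming transactions already target #233100 and later, verification necessarily fails. A minimal sketch of such a window check (illustrative only; `window_len` and the function are made up and do not mirror the actual mempool code):

```rust
/// Illustrative only, not the actual Nimiq mempool code: a transaction is
/// valid only while the chain head lies inside its validity window.
fn tx_valid_at_height(validity_start_height: u32, window_len: u32, block_number: u32) -> bool {
    block_number >= validity_start_height
        && block_number < validity_start_height.saturating_add(window_len)
}

fn main() {
    // Values from the log above: the head is stuck at #233039 while the
    // transaction only becomes valid at #233100, so verification fails.
    let (validity_start_height, head) = (233_100, 233_039);
    assert!(!tx_valid_at_height(validity_start_height, 120, head));
    println!(
        "tx valid at stalled head #{head}: {}",
        tx_valid_at_height(validity_start_height, 120, head)
    );
}
```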

full-node-cant-keep-up.log

Eligioo added the bug label on Jul 17, 2023
jsdanielh added this to the Nimiq PoS Mainnet milestone on Aug 1, 2023
Eligioo (Member, Author) commented Sep 13, 2023

Here is another example with version 0.15.0. This log contains two epochs of data. In both epochs, at the transition from block 2980799 to 2980800 (election block) and from 3002399 to 3002400 (election block), the error messages

Failed to dispatch gossipsub message topic=blocks error=no available capacity
and
Mempool-verify tx invalid at this block height
start to show up. The node is unable to apply the election block and falls behind; after some time it catches up. If this node happens to be an elected validator, it will be deactivated, and because it can no longer enforce the validity window, it will not reactivate.
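
The "no available capacity" error fits the picture of a bounded dispatch channel whose consumer has stalled: gossipsub keeps delivering blocks, but once the hand-off queue between the network layer and block processing is full, further messages can no longer be dispatched. A minimal sketch of that backpressure pattern with a bounded std channel (the node's actual network and queue types are different; this only illustrates the failure mode):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Bounded channel standing in for the network -> block-queue hand-off.
    let (tx, _rx) = sync_channel::<u32>(4);

    // The consumer (block processing) is stalled, so the queue never drains.
    for block_number in 2_980_795..=2_980_805 {
        match tx.try_send(block_number) {
            Ok(()) => println!("dispatched block #{block_number}"),
            // Mirrors "Failed to dispatch gossipsub message ... no available capacity".
            Err(TrySendError::Full(_)) => {
                println!("could not dispatch block #{block_number}: no available capacity")
            }
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
}
```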

nimiq-client.log

Eligioo changed the title from "Full node unable extend own chain" to "Full node unable to apply election block" on Sep 13, 2023
Eligioo (Member, Author) commented Dec 22, 2023

As requested by @jsdanielh, here is an example with version 0.18.0.

2023-12-22T05:23:15.950207573Z DEBUG push                 | Accepted block block=#11458449:MI:8f7c8e758a num_transactions=127 kind="extend"
2023-12-22T05:23:26.920513733Z DEBUG block_queue          | Received block via gossipsub block=#11458450:MA:ddcf6efd90 peer_id=12D3KooWGbKfQvV4r59BCYHoRzGbARTRFsxbx5SpUQb8C3ciQSiM

Then, 15 minutes later:

2023-12-22T05:38:21.353751307Z DEBUG push                 | Accepted block block=#11458450:MA:ddcf6efd90 num_transactions=0 kind="extend"
2023-12-22T05:38:21.483950188Z DEBUG push                 | Accepted block block=#11458451:MI:a7a98a8ab8 num_transactions=0 kind="extend"
2023-12-22T05:38:21.506463786Z DEBUG push                 | Accepted block block=#11458452:MI:027e23ebd2 num_transactions=0 kind="extend"

It is able to accept blocks up to #11458465 and then halts, since it never requests or receives block #11458466. Gossipsub messages for other blocks keep coming in, but nothing is done with them.

Full log of this example:
full-node-election-block-issue.log

jsdanielh (Contributor) commented:

Per @viquezclaudio's reproduction, this slowness is caused by history pruning on election blocks.
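
To illustrate why that stalls the node (a toy sketch, not the actual history store, which is database-backed): if the entire finished epoch is pruned synchronously on the block-push path, the push cannot return, and no further blocks are applied, until the deletion completes.

```rust
use std::collections::BTreeMap;
use std::time::Instant;

/// Toy "history store" keyed by block number; illustration only.
fn prune_epoch(history: &mut BTreeMap<u32, Vec<u8>>, epoch_start: u32, epoch_end: u32) {
    let keys: Vec<u32> = history.range(epoch_start..=epoch_end).map(|(k, _)| *k).collect();
    for key in keys {
        history.remove(&key);
    }
}

fn main() {
    // One epoch worth of dummy history items.
    let mut history: BTreeMap<u32, Vec<u8>> =
        (0..100_000u32).map(|n| (n, vec![0u8; 64])).collect();

    let start = Instant::now();
    // At the election block the whole epoch is pruned before the push returns,
    // which is where the stall at epoch boundaries comes from.
    prune_epoch(&mut history, 0, 99_999);
    println!("pruned epoch in {:?}", start.elapsed());
}
```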

ii-cruz commented Feb 6, 2024

We have pushed a temporary solution to this issue that makes the history full nodes keep the history instead of pruning upon epoch finalization.
The real solution would be to have a lighter version of the history store for full nodes.
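
Roughly, the shape of that workaround looks like the sketch below: gate the epoch pruning on the node type so that full nodes simply keep the history. All names here are made up for illustration and do not mirror the actual change.

```rust
/// Illustrative node roles; not the actual Nimiq types.
#[derive(PartialEq)]
#[allow(dead_code)]
enum NodeType {
    History,
    Full,
}

struct Chain {
    node_type: NodeType,
    /// Flip back to true once a lighter history store makes pruning cheap again.
    prune_full_node_history: bool,
}

impl Chain {
    /// Called when an election block finalizes an epoch.
    fn on_epoch_finalized(&mut self, finished_epoch: u32) {
        // Previously full nodes pruned the finished epoch here; the workaround
        // skips that step, so they keep the history like history nodes do.
        if self.node_type == NodeType::Full && self.prune_full_node_history {
            self.prune_epoch(finished_epoch);
        }
    }

    fn prune_epoch(&mut self, epoch: u32) {
        // Expensive deletion of the epoch's history items (elided).
        let _ = epoch;
    }
}

fn main() {
    let mut chain = Chain { node_type: NodeType::Full, prune_full_node_history: false };
    chain.on_epoch_finalized(1);
}
```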

sisou (Member) commented Feb 6, 2024

I'm guessing you are talking about e4bb4a6. You have a typo in your comment: this is for full nodes, not history nodes.

Also, does this commit mean that full nodes are now history nodes with this "workaround" and need much more disk space?

ii-cruz commented Feb 6, 2024

For now, yes. Testnet full nodes will start storing everything from the next testnet restart.
We are working on a Light History Store that is efficient at pruning and also stores much less data.

sisou (Member) commented Feb 7, 2024

That's a pretty significant change to full node behavior, even if just as a workaround.

However, I guess it will not sync the whole history on startup? It only stops deleting history? Which means that if, after a while, I stop the node, delete the database, and restart it, the node will not sync back the deleted history, but will start with a small database that only grows with new blocks?

Eligioo (Member, Author) commented Feb 7, 2024

However, I guess it will not sync the whole history on startup? It only stops deleting history? Which means that if, after a while, I stop the node, delete the database, and restart it, the node will not sync back the deleted history, but will start with a small database that only grows with new blocks?

That is correct. It will use the same syncing mechanism as before to get to the head of the chain. But from there on, it won't remove old epochs as soon as we hit an election block.

viquezclaudio (Contributor) commented Feb 7, 2024

Correct, this is just a temporary workaround while we implement the right solution: a special lightweight version of the history store for full nodes that allows us to properly construct (and verify) the history root without having to store the full transactions.
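
The key property such a lightweight store relies on is that the history root depends only on the hashes of the history items, not on the full transaction bodies. A toy sketch of that idea (as I understand it, the real store is built on a Merkle Mountain Range with its own hashing; the plain Merkle fold and std hasher below are only meant to show why the bodies aren't needed to reproduce the root):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy hash standing in for the real hashing; illustration only.
fn h<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Fold leaf hashes pairwise into a single root, Merkle-tree style.
/// The point: only the leaf *hashes* are required, not the transaction bodies.
fn root_from_leaf_hashes(mut level: Vec<u64>) -> u64 {
    if level.is_empty() {
        return 0;
    }
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| if pair.len() == 2 { h(&(pair[0], pair[1])) } else { pair[0] })
            .collect();
    }
    level[0]
}

fn main() {
    // A full store hashes the complete transactions...
    let transactions = vec![b"tx-a".to_vec(), b"tx-b".to_vec(), b"tx-c".to_vec()];
    let leaf_hashes: Vec<u64> = transactions.iter().map(h).collect();

    // ...while a lightweight store can keep only these leaf hashes and still
    // reproduce (and verify) the same history root.
    let root_from_full_store = root_from_leaf_hashes(leaf_hashes.clone());
    let root_from_light_store = root_from_leaf_hashes(leaf_hashes);
    assert_eq!(root_from_full_store, root_from_light_store);
    println!("history root (toy): {root_from_full_store:#x}");
}
```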

sisou linked a pull request Apr 16, 2024 that will close this issue
viquezclaudio added a commit that referenced this issue Apr 16, 2024
Use the light history store for full node clients
Necessary changes to use the history interface
Several changes to usage of the history store

Fixes #1704