Full node unable to apply election block #1704

Open
Eligioo opened this issue Jul 13, 2023 · 9 comments · May be fixed by #2370
Labels: bug (Something isn't working)

Eligioo (Member) commented Jul 13, 2023

Currently, fully synced full nodes exhibit an issue where, after some event, they lose the ability to store new incoming blocks and therefore can't keep up.

Multiple incidents have been observed, and so far they all show the same pattern just before the node stops storing blocks:

2023-07-13T13:04:40.637552192Z INFO  micro                | Skip block pushed block_number=233038
2023-07-13T13:04:40.663131683Z DEBUG validator            | Initializing block producer next_block_number=233039
2023-07-13T13:04:40.672803841Z DEBUG micro                | Not our turn, waiting for micro block #233039 block_number=233039 slot_band=4
2023-07-13T13:04:40.674427687Z DEBUG block_queue          | Received block via gossipsub block=#233039:MI:5633601d56
2023-07-13T13:04:40.674597462Z DEBUG block_queue          | Received block via gossipsub block=#233038:MI:0d572aaf9d
2023-07-13T13:04:40.992743659Z INFO  chain_ordering       | New block in Unknown chain fork_block_number=233037 current_block_number=233038 new_block_number=233038
2023-07-13T13:04:40.992827717Z DEBUG push                 | Storing block - on fork block=#233038:MI:74af9b8a35
2023-07-13T13:04:42.066183858Z INFO  chain_ordering       | New block in Inferior chain fork_block_number=233037 current_block_number=233038 new_block_number=233038
2023-07-13T13:04:42.066278922Z DEBUG push                 | Storing block - on inferior chain block=#233038:MI:0d572aaf9d
2023-07-13T13:04:42.681963371Z DEBUG block_queue          | Received block via gossipsub block=#233040:MA:686b97ed50

After this, only three kinds of log messages are displayed, and interestingly, receiving blocks via gossipsub is one of them. This gives the impression that the connections to other nodes aren't actually dropped, but that the node is unable to process the incoming blocks. A deadlock?

2023-07-13T13:04:43.736174015Z ERROR network_impl         | could not send request peer_id=12D3KooWAjftHreEDsNMK3qPBSe63LZcQorv6q8H3FR7CWg57CxV error=Inbound error: No receiver for request
2023-07-13T13:04:43.889421145Z DEBUG consensus            | Initiating head requests (to 11 peers)
2023-07-13T13:04:45.422861040Z DEBUG block_queue          | Received block via gossipsub block=#233040:MA:686b97ed50

Note that the consensus head hangs at #233038 while new gossipsub messages keep coming in. This looks similar to #1692.

2023-07-13T13:06:18.681751953Z DEBUG block_queue          | Received block via gossipsub block=#233127:MI:2373e55674
2023-07-13T13:06:09.389811588Z INFO  nimiq_client         | Consensus status: true - Head: #233038- 5506df7b33945b78f7985aa0b20a876b55ee43e90575349aa0480dc53303c1a2 block_number=233038 num_peers=14 status=true
2023-07-13T13:06:29.675350336Z DEBUG block_queue          | Received block via gossipsub block=#233138:MI:0ac26a9b0e
2023-07-13T13:06:29.389776842Z INFO  nimiq_client         | Consensus status: true - Head: #233038- 5506df7b33945b78f7985aa0b20a876b55ee43e90575349aa0480dc53303c1a2 block_number=233038 num_peers=14 status=true
2023-07-13T13:06:48.813880555Z DEBUG block_queue          | Received block via gossipsub block=#233157:MI:379d8d4a25
2023-07-13T13:06:49.389738268Z INFO  nimiq_client         | Consensus status: true - Head: #233038- 5506df7b33945b78f7985aa0b20a876b55ee43e90575349aa0480dc53303c1a2 block_number=233038 num_peers=14 status=true

At some point the node falls so far behind that verification of mempool transactions starts to fail:

2023-07-13T13:05:52.071047472Z DEBUG verify               | Mempool-verify tx invalid at this block height block_number=233039 validity_start_height=233100
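
The mempool failure itself follows from the stalled head: a transaction is only accepted while the chain head lies within its validity window, which begins at `validity_start_height`. Since the head is stuck at #233039 while incoming transactions already target #233100 and later, verification necessarily fails. A minimal sketch of such a window check (illustrative only; `window_len` and the function are made up and do not mirror the actual mempool code):

```rust
/// Illustrative only, not the actual Nimiq mempool code: a transaction is
/// valid only while the chain head lies inside its validity window.
fn tx_valid_at_height(validity_start_height: u32, window_len: u32, block_number: u32) -> bool {
    block_number >= validity_start_height
        && block_number < validity_start_height.saturating_add(window_len)
}

fn main() {
    // Values from the log above: the head is stuck at #233039 while the
    // transaction only becomes valid at #233100, so verification fails.
    let (validity_start_height, head) = (233_100, 233_039);
    assert!(!tx_valid_at_height(validity_start_height, 120, head));
    println!(
        "tx valid at stalled head #{head}: {}",
        tx_valid_at_height(validity_start_height, 120, head)
    );
}
```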

full-node-cant-keep-up.log

Eligioo added the bug label on Jul 17, 2023
jsdanielh added this to the Nimiq PoS Mainnet milestone on Aug 1, 2023
Eligioo (Member, Author) commented Sep 13, 2023

Here is another example with version 0.15.0. This log contains two epochs of data. In both epochs, at the transition from block 2980799 to 2980800 (election block) and from 3002399 to 3002400 (election block), the error messages

Failed to dispatch gossipsub message topic=blocks error=no available capacity
and
Mempool-verify tx invalid at this block height
start to show up. The node is unable to apply the election block and falls behind; after some time it catches up. If this node happens to be an elected validator, it will be deactivated, and because it can no longer enforce the validity window, it will not reactivate.
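
The "no available capacity" error fits the picture of a bounded dispatch channel whose consumer has stalled: gossipsub keeps delivering blocks, but once the hand-off queue between the network layer and block processing is full, further messages can no longer be dispatched. A minimal sketch of that backpressure pattern with a bounded std channel (the node's actual network and queue types are different; this only illustrates the failure mode):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Bounded channel standing in for the network -> block-queue hand-off.
    let (tx, _rx) = sync_channel::<u32>(4);

    // The consumer (block processing) is stalled, so the queue never drains.
    for block_number in 2_980_795..=2_980_805 {
        match tx.try_send(block_number) {
            Ok(()) => println!("dispatched block #{block_number}"),
            // Mirrors "Failed to dispatch gossipsub message ... no available capacity".
            Err(TrySendError::Full(_)) => {
                println!("could not dispatch block #{block_number}: no available capacity")
            }
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
}
```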

nimiq-client.log

Eligioo changed the title from "Full node unable extend own chain" to "Full node unable to apply election block" on Sep 13, 2023
Eligioo (Member, Author) commented Dec 22, 2023

As requested by @jsdanielh, here is an example with version 0.18.0.

2023-12-22T05:23:15.950207573Z DEBUG push                 | Accepted block block=#11458449:MI:8f7c8e758a num_transactions=127 kind="extend"
2023-12-22T05:23:26.920513733Z DEBUG block_queue          | Received block via gossipsub block=#11458450:MA:ddcf6efd90 peer_id=12D3KooWGbKfQvV4r59BCYHoRzGbARTRFsxbx5SpUQb8C3ciQSiM

Then, 15 minutes later:

2023-12-22T05:38:21.353751307Z DEBUG push                 | Accepted block block=#11458450:MA:ddcf6efd90 num_transactions=0 kind="extend"
2023-12-22T05:38:21.483950188Z DEBUG push                 | Accepted block block=#11458451:MI:a7a98a8ab8 num_transactions=0 kind="extend"
2023-12-22T05:38:21.506463786Z DEBUG push                 | Accepted block block=#11458452:MI:027e23ebd2 num_transactions=0 kind="extend"

It is able to accept blocks up to #11458465 and then halts, since it never requests or receives block #11458466. Gossipsub messages for other blocks keep coming in, but nothing is done with them.

Full log of this example:
full-node-election-block-issue.log

jsdanielh (Contributor) commented:

Per @viquezclaudio's reproduction, this slowness is caused by history pruning on election blocks.
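
To illustrate why that stalls the node (a toy sketch, not the actual history store, which is database-backed): if the entire finished epoch is pruned synchronously on the block-push path, the push cannot return, and no further blocks are applied, until the deletion completes.

```rust
use std::collections::BTreeMap;
use std::time::Instant;

/// Toy "history store" keyed by block number; illustration only.
fn prune_epoch(history: &mut BTreeMap<u32, Vec<u8>>, epoch_start: u32, epoch_end: u32) {
    let keys: Vec<u32> = history.range(epoch_start..=epoch_end).map(|(k, _)| *k).collect();
    for key in keys {
        history.remove(&key);
    }
}

fn main() {
    // One epoch worth of dummy history items.
    let mut history: BTreeMap<u32, Vec<u8>> =
        (0..100_000u32).map(|n| (n, vec![0u8; 64])).collect();

    let start = Instant::now();
    // At the election block the whole epoch is pruned before the push returns,
    // which is where the stall at epoch boundaries comes from.
    prune_epoch(&mut history, 0, 99_999);
    println!("pruned epoch in {:?}", start.elapsed());
}
```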

ii-cruz commented Feb 6, 2024

We have pushed a temporary solution to this issue that makes the history full nodes keep the history instead of pruning upon epoch finalization.
The real solution would be to have a lighter version of the history store for full nodes.
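
Roughly, the shape of that workaround looks like the sketch below: gate the epoch pruning on the node type so that full nodes simply keep the history. All names here are made up for illustration and do not mirror the actual change.

```rust
/// Illustrative node roles; not the actual Nimiq types.
#[derive(PartialEq)]
#[allow(dead_code)]
enum NodeType {
    History,
    Full,
}

struct Chain {
    node_type: NodeType,
    /// Flip back to true once a lighter history store makes pruning cheap again.
    prune_full_node_history: bool,
}

impl Chain {
    /// Called when an election block finalizes an epoch.
    fn on_epoch_finalized(&mut self, finished_epoch: u32) {
        // Previously full nodes pruned the finished epoch here; the workaround
        // skips that step, so they keep the history like history nodes do.
        if self.node_type == NodeType::Full && self.prune_full_node_history {
            self.prune_epoch(finished_epoch);
        }
    }

    fn prune_epoch(&mut self, epoch: u32) {
        // Expensive deletion of the epoch's history items (elided).
        let _ = epoch;
    }
}

fn main() {
    let mut chain = Chain { node_type: NodeType::Full, prune_full_node_history: false };
    chain.on_epoch_finalized(1);
}
```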

sisou (Member) commented Feb 6, 2024

I'm guessing you are talking about e4bb4a6. You have a typo in your comment: this is for full nodes, not history nodes.

Also, does this commit mean that full nodes are now history nodes with this "workaround" and need much more disk space?

ii-cruz commented Feb 6, 2024

For now, yes. Testnet full nodes will start storing everything from the next testnet restart.
We are working on a Light History Store that is efficient at pruning and also stores much less data.

sisou (Member) commented Feb 7, 2024

That's a pretty significant change to full node behavior, even if just as a workaround.

However, I guess it will not sync the whole history on startup? It only stops deleting history? Which means that if, after a while, I stop the node, delete the database, and restart it, the node will not sync back the deleted history, but will start with a small database that only grows with new blocks?

Eligioo (Member, Author) commented Feb 7, 2024

However, I guess it will not sync the whole history on startup? It only stops deleting history? Which means that if, after a while, I stop the node, delete the database, and restart it, the node will not sync back the deleted history, but will start with a small database that only grows with new blocks?

That is correct. It will use the same syncing mechanism as before to get to the head of the chain. But from there on, it won't remove old epochs as soon as we hit an election block.

viquezclaudio (Contributor) commented Feb 7, 2024

Correct, this is just a temporary workaround while we implement the right solution: a special lightweight version of the history store for full nodes that allows us to properly construct (and verify) the history root without having to store the full transactions.
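
The key property such a lightweight store relies on is that the history root depends only on the hashes of the history items, not on the full transaction bodies. A toy sketch of that idea (as I understand it, the real store is built on a Merkle Mountain Range with its own hashing; the plain Merkle fold and std hasher below are only meant to show why the bodies aren't needed to reproduce the root):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy hash standing in for the real hashing; illustration only.
fn h<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Fold leaf hashes pairwise into a single root, Merkle-tree style.
/// The point: only the leaf *hashes* are required, not the transaction bodies.
fn root_from_leaf_hashes(mut level: Vec<u64>) -> u64 {
    if level.is_empty() {
        return 0;
    }
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| if pair.len() == 2 { h(&(pair[0], pair[1])) } else { pair[0] })
            .collect();
    }
    level[0]
}

fn main() {
    // A full store hashes the complete transactions...
    let transactions = vec![b"tx-a".to_vec(), b"tx-b".to_vec(), b"tx-c".to_vec()];
    let leaf_hashes: Vec<u64> = transactions.iter().map(h).collect();

    // ...while a lightweight store can keep only these leaf hashes and still
    // reproduce (and verify) the same history root.
    let root_from_full_store = root_from_leaf_hashes(leaf_hashes.clone());
    let root_from_light_store = root_from_leaf_hashes(leaf_hashes);
    assert_eq!(root_from_full_store, root_from_light_store);
    println!("history root (toy): {root_from_full_store:#x}");
}
```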

sisou linked a pull request Apr 16, 2024 that will close this issue
viquezclaudio added a commit that referenced this issue Apr 16, 2024
Use the light history store for full node clients
Necessary changes to use the history interface
Several changes to usage of the history store

Fixes #1704