
feat(bootstrap): More efficient snarked ledger synchronization #292

Merged: 4 commits merged into develop on Mar 29, 2024

Conversation

@tizoc (Collaborator) commented Mar 21, 2024

  • Starts with a num_accounts query to fetch the number of accounts in the ledger
  • Uses the result of the num_accounts query to skip all the empty tree levels and go straight to the first node containing child accounts (sketched below)
  • Uses the previously available ledger (genesis -> staking -> next epoch -> transition frontier root) as a base and queries only the nodes that changed
  • Fetches accounts in groups of 64 (up from 2) to reduce the number of queries
  • Validates the num_accounts response, hashes and accounts
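
The level-skipping step from the second bullet can be illustrated with a minimal sketch (not the PR's actual code; `LEDGER_DEPTH` and the helper name are assumptions): given the claimed number of accounts, find the height of the smallest subtree that can hold them all, and start the address queries at that subtree's root instead of the ledger root.

```rust
// Hypothetical sketch, not the PR's actual code.
const LEDGER_DEPTH: usize = 35; // assumed tree depth

/// Height of the smallest subtree that can hold `num_accounts` leaves,
/// i.e. the smallest h with 2^h >= num_accounts. All levels above it
/// hash to known "empty" values and don't need to be queried.
fn first_non_empty_subtree_height(num_accounts: u64) -> usize {
    let mut height = 0;
    while (1u64 << height) < num_accounts && height < LEDGER_DEPTH {
        height += 1;
    }
    height
}

fn main() {
    // With ~300k accounts only a 19-level subtree needs to be synced,
    // rather than the full 35-level tree.
    assert_eq!(first_non_empty_subtree_height(300_000), 19);
}
```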

@tizoc marked this pull request as ready for review on March 22, 2024 17:43
@tizoc requested a review from binier on March 22, 2024 17:49

@tizoc (Collaborator, Author) commented Mar 22, 2024

@binier I still need to clean it up a bit more, and maybe finish a few small things that are still missing, but if you have time please take a look at the general approach (and action naming, etc.) and let me know if there is anything you would change.

let content_hash = content_hash.0.to_field();

let computed_hash = (subtree_height..LEDGER_DEPTH).fold(content_hash, |prev_hash, height| {
let empty_right = ledger::V2::empty_hash_at_height(height);

@sebastiencs (Contributor) commented:
ledger::V2::empty_hash_at_height is quite slow, you can use HASH_EMPTIES instead:

static HASH_EMPTIES: Lazy<Mutex<Vec<Fp>>> = Lazy::new(|| {

If you could also change it to Lazy<Arc<[Fp; 36]>>, that would be good; not sure why I've used a mutex there.

@tizoc (Collaborator, Author) replied:
@sebastiencs good catch. Just pushed the version that uses the cached values
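
A sketch of what the suggested cache could look like with the Mutex dropped, reusing the `Fp` type and `ledger::V2::empty_hash_at_height` helper from the quoted code (the exact initialization in the repository may differ):

```rust
// Sketch only: precompute the 36 empty-subtree hashes once and share them
// read-only, so lookups replace the slow empty_hash_at_height computation.
use once_cell::sync::Lazy;
use std::sync::Arc;

static HASH_EMPTIES: Lazy<Arc<[Fp; 36]>> = Lazy::new(|| {
    Arc::new(std::array::from_fn(|height| {
        ledger::V2::empty_hash_at_height(height)
    }))
});

// Inside the fold shown above, the per-level hash becomes a cheap lookup:
// let empty_right = HASH_EMPTIES[height];
```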

@tizoc force-pushed the feat/optimized-bootstrap branch 4 times, most recently from c17fac8 to 7e038f6 on March 25, 2024 17:34

@binier (Collaborator) left a comment:
Looks good.
The one must-change item is about storing the pending mask in self.sync.snarked_ledgers (unless I'm wrong there); the rest can be ignored.

return Ok(false);
}

let origin = self

@binier (Collaborator) commented:
Maybe we should use self.mask(hash) here instead, so that we copy the mask either from self.snarked_ledgers or from self.sync.snarked_ledgers? Otherwise we won't be able to reuse mid-sync data (not sure if that falls within the scope of this PR yet, though).

@tizoc (Collaborator, Author) replied:
@binier it falls within the scope of this PR, yes; I just don't handle that case yet.

@tizoc (Collaborator, Author) added:
Regarding the use of mask, I made the calls more strict (new pending_sync_snarked_ledger_mask method) so that they only check for snarked ledgers, and only for ones that already exist (instead of creating a new empty mask).

For this case in particular I check both origins (already synced first, then syncing).
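
A rough sketch of the stricter lookup described here, with stand-in types (the real method signatures and collections in the PR may differ): check the already-synced ledgers first, then the mid-sync ones, and fail instead of silently creating an empty mask.

```rust
use std::collections::BTreeMap;

// Minimal stand-in types for illustration only.
type LedgerHash = String;
struct Mask;
impl Mask {
    fn copy(&self) -> Mask {
        Mask
    }
}

struct SyncState {
    snarked_ledgers: BTreeMap<LedgerHash, Mask>,
}

struct LedgerCtx {
    snarked_ledgers: BTreeMap<LedgerHash, Mask>,
    sync: SyncState,
}

impl LedgerCtx {
    // Return an existing snarked-ledger mask, preferring already-synced
    // ledgers over mid-sync ones; error out instead of creating a new
    // empty mask.
    fn pending_sync_snarked_ledger_mask(&self, hash: &LedgerHash) -> Result<Mask, String> {
        self.snarked_ledgers
            .get(hash)
            .or_else(|| self.sync.snarked_ledgers.get(hash))
            .map(Mask::copy)
            .ok_or_else(|| "snarked ledger mask not found".to_string())
    }
}
```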

))?;

let target = origin.copy();
self.snarked_ledgers

@binier (Collaborator) commented:
We should insert that snarked ledger into self.sync.snarked_ledgers instead, right? Because until we sync the mismatched parts, target_snarked_ledger_hash != target.merkle_root().

@tizoc (Collaborator, Author) replied:
@binier ufff, good catch. When cleaning up some code at some point I got the impression that I was starting to see less reuse than I was expecting; this would explain it.

@tizoc (Collaborator, Author) added:
FYI this small change reduced the 3 snarked ledgers sync time from 60s to 20s, so it was indeed skipping all the reuse because of that mistake.
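
A minimal sketch of what the fix amounts to, reusing the naming from the quoted diff (illustrative only, wrapped in a helper for completeness): the copied mask is stored in the sync collection under the target hash rather than among the synced ledgers.

```rust
// Illustrative sketch; real field and method names may differ. The copied
// mask is still being synchronized, so it belongs in `sync.snarked_ledgers`
// keyed by the *target* hash: until the mismatched parts are fetched,
// target.merkle_root() != target_snarked_ledger_hash.
use std::collections::BTreeMap;

type LedgerHash = String;
struct Mask;
impl Mask {
    fn copy(&self) -> Mask {
        Mask
    }
}

struct SyncState {
    snarked_ledgers: BTreeMap<LedgerHash, Mask>,
}

fn start_snarked_ledger_sync(
    sync: &mut SyncState,
    origin: &Mask,
    target_snarked_ledger_hash: LedgerHash,
) {
    let target = origin.copy();
    sync.snarked_ledgers.insert(target_snarked_ledger_hash, target);
}
```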

snarked_ledger_hash: LedgerHash,
parent: &LedgerAddress,
) -> Result<(LedgerHash, LedgerHash), String> {
let mask = self.ctx_mut().sync.snarked_ledger_mut(snarked_ledger_hash);

@binier (Collaborator) commented:
I think we should use .mask() here as well, because the name of the fn doesn't imply it will only get hashes if the ledger is mid-sync.

let accounts = Vec::<ledger::Account>::binprot_read(&mut bytes)?;

let (mut mask, total_currency) = Self::build_ledger_from_accounts(accounts);
let (mut mask, total_currency) =
Self::build_ledger_from_accounts_and_hashes(accounts, hashes);
let ledger_hash = ledger_hash(&mut mask);
if let Some(expected_hash) = expected_hash.filter(|h| h != &ledger_hash) {

@binier (Collaborator) commented:
Now that we will be using cached hashes, the hash mismatch check seems useless.

How slow is recomputing those hashes? Is it worth storing them?

@tizoc (Collaborator, Author) replied:
@binier for the berkeleynet ledger it took about 10 secs here, on mainnet it would take much longer.

@binier (Collaborator) replied:
Yeah, fairly slow then... What if we cache the "merge" hashes, but recompute and compare the account hashes? We could also parallelize hashing.

Any idea how the OCaml node handles this? Is it slow there as well?
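
A sketch of that trade-off, assuming rayon for the parallelism and an `Account::hash()` leaf-hash method (both are assumptions, not necessarily the crate's API; `Account` and `Fp` are the types from the code quoted above): keep the cached inner ("merge") hashes, but recompute the account leaf hashes in parallel and compare them against the values received from the peer.

```rust
// Illustrative sketch: trust cached "merge" (inner node) hashes, but
// recompute each account's leaf hash in parallel and compare it to the
// hash received from the peer.
use rayon::prelude::*;

fn verify_account_hashes(accounts: &[Account], expected: &[Fp]) -> Result<(), String> {
    if accounts.len() != expected.len() {
        return Err("account/hash count mismatch".to_string());
    }
    accounts
        .par_iter()
        .zip(expected.par_iter())
        .try_for_each(|(account, expected_hash)| {
            // `Account::hash()` is assumed to produce the leaf hash.
            if &account.hash() == expected_hash {
                Ok(())
            } else {
                Err("account hash mismatch".to_string())
            }
        })
}
```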

@tizoc (Collaborator, Author) replied:
In OCaml it is just slow (remember they load from the JSON directly), but it's only an issue the first time, because after the initial load they keep a persisted version of all the important ledgers, which includes all the relevant data. We are not doing that yet, so this binprot loading we are doing is just a workaround.

end_addr: LedgerAddress,
pending_addresses: BTreeMap<LedgerAddress, LedgerAddressQueryPending>,
/// Pending num account query attempts
pending_num_accounts: Option<LedgerNumAccountsQueryPending>,

@binier (Collaborator) commented:
Can we combine num_accounts and pending_num_accounts to avoid impossible states? So that num_accounts can't exist unless pending_num_accounts is in the accepted state.

@binier (Collaborator) added:
Also I'd split this TransitionFrontierSyncLedgerSnarkedState::Pending state into more states.

Like:

NumAccountsFetchPending { ... },
NumAccountsFetchSuccess { ... },
TreeSyncPending { .. },
TreeSyncSuccess{ .. },

Of course this assumes that fetching of addresses and num accounts doesn't happen in parallel.
If it's sequential, that should be clear from the state.

@binier (Collaborator) added:
Also the naming is confusing between num_accounts and num_accounts_accepted. At first glance I saw num_accounts_accepted and thought it was the response received from Num_accounts that we've verified; of course the comment made it clear that wasn't the case.
Maybe total_accounts and accounts_synced or something like that would be better.

@tizoc (Collaborator, Author) replied:
num_accounts_expected + num_accounts_added? The validation of the number of accounts claimed by the peer is kind of weak, because at most we validate that the depth of the subtree is correct, but any value for num_accounts that matches that depth will be accepted (so it should just be considered an estimate).

@tizoc (Collaborator, Author) added:
> Of course assuming that fetching of addresses and num accounts doesn't happen in parallel. If it's sequential, that should be clear based on state.

It is fully sequential (only after we know the number of accounts do we know the root node containing all the accounts), so yes, that split makes total sense; will do that.

@tizoc (Collaborator, Author) added:
I split the Pending state in two (one for the num accounts query and another for the merkle tree sync). In your comment above you added some Success states, which I didn't add. Can you elaborate on that? Should I add them too?

@tizoc (Collaborator, Author) added:
Actually, nevermind, it is just cleaner that way, I will add them.
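
A rough sketch of the resulting state split (variant and field names are illustrative, built from the types quoted in the snippets above; the enum actually committed in the PR may differ): num-accounts fetching and merkle-tree sync become sequential states, each with its own success state, so the per-account sync counters can only exist once the count has been accepted.

```rust
// Illustrative shape only, using the field types from the snippets above.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub enum TransitionFrontierSyncLedgerSnarkedState {
    /// Waiting for (and validating) the peer's Num_accounts response.
    NumAccountsPending {
        target: SyncLedgerTarget,
        pending_num_accounts: Option<LedgerNumAccountsQueryPending>,
    },
    /// Accepted count; determines the non-empty subtree to sync.
    NumAccountsSuccess {
        target: SyncLedgerTarget,
        num_accounts_expected: u64,
    },
    /// Fetching hashes and accounts for the non-empty subtree.
    MerkleTreeSyncPending {
        target: SyncLedgerTarget,
        num_accounts_expected: u64,
        num_accounts_added: u64,
        pending_addresses: BTreeMap<LedgerAddress, LedgerAddressQueryPending>,
    },
    MerkleTreeSyncSuccess {
        target: SyncLedgerTarget,
    },
    /// Whole snarked ledger synced and verified.
    Success {
        target: SyncLedgerTarget,
    },
}
```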

@@ -35,7 +42,23 @@ pub enum TransitionFrontierSyncLedgerSnarkedState {
}

#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct LedgerQueryPending {
pub enum LedgerQueryQueued {
NumAccounts,

@binier (Collaborator) commented:
If the state gets split into parts as suggested, this can be a struct instead of an enum:

NumAccountsFetchPending { ... },
NumAccountsFetchSuccess { ... },
TreeSyncPending { .. },
TreeSyncSuccess{ .. },

@tizoc (Collaborator, Author) replied:
I got rid of this one, now the queue is just for the address queries.

// we use it as a base to save work during synchronization.
let best_tip = store.state().transition_frontier.sync.best_tip().unwrap();
let target = super::ledger::SyncLedgerTarget::staking_epoch(best_tip);
let origin = best_tip.genesis_ledger_hash().clone();

@binier (Collaborator) commented:
For a very unlikely edge case: if we have any ledgers for store.state().transition_frontier.best_tip() (the already synced best tip), use that instead of genesis.

@tizoc (Collaborator, Author) replied:
@binier can you expand on this one? Not sure I follow.

@tizoc force-pushed the feat/optimized-bootstrap branch 4 times, most recently from f9f83da to 28d1114 on March 28, 2024 18:52
@tizoc requested a review from binier on March 28, 2024 19:32

@tizoc (Collaborator, Author) commented Mar 28, 2024

@binier I will probably split the genesis ledger format changes and loading into a separate PR when I start rebasing and squashing these commits before merging. The async staged ledger reconstruction is at #308.

@tizoc force-pushed the feat/optimized-bootstrap branch 2 times, most recently from e689423 to b0fda52 on March 29, 2024 00:34
If this is not done, the cloned mask will need to recompute its internal hashes, which is expensive.
@tizoc force-pushed the feat/optimized-bootstrap branch 2 times, most recently from ee63a83 to b8b830e on March 29, 2024 15:32
fixup(bootstrap): adjust enabling conditions

review: fix to where mid-sync ledgers are stored and retrieved from

review: Split num accounts fetch vs merkle tree sync states in transition frontier

review: rename some sync status fields to make them less ambiguous

fix(transition-frontier): Check that the ledger doesn't exist in the sync snarked ledgers

feat(transition-frontier): Make copying of ledgers for sync more strict

fix(transition-frontier): Fix ledger copying handling for current state machine that produces the genesis block and syncs immediately

refactor(transition-frontier): Remove some code duplication

refactor(transition-frontier): Add success states for NumAccounts and MerkleTreeSync (WIP)

feat(transition-frontier): Implement missing enabling conditions
@tizoc merged commit d88cfa2 into develop on Mar 29, 2024 (3 of 4 checks passed)
@tizoc deleted the feat/optimized-bootstrap branch on March 29, 2024 15:37