feat(bootstrap): More efficient snarked ledger synchronization #292
Conversation
force-pushed from 76cdf13 to 8186e0f
force-pushed from ee83b4c to 77d8634
@binier I still need to clean it up a bit more, and maybe finish a few small things that are still missing, but if you have time please take a look at the general approach (and action naming, etc.) and let me know if there is anything you would change.
force-pushed from df11465 to f44c2ad
node/src/ledger/mod.rs
Outdated
```rust
let content_hash = content_hash.0.to_field();

let computed_hash = (subtree_height..LEDGER_DEPTH).fold(content_hash, |prev_hash, height| {
    let empty_right = ledger::V2::empty_hash_at_height(height);
```
`ledger::V2::empty_hash_at_height` is quite slow; you can use `HASH_EMPTIES` instead (line 199 in adf6f81):

```rust
static HASH_EMPTIES: Lazy<Mutex<Vec<Fp>>> = Lazy::new(|| {
```

If you could also change it to be `Lazy<Arc<[Fp; 36]>>`; not sure why I've used a mutex there.
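For illustration, a minimal sketch of that change. `Fp` and the per-height empty-hash function below are placeholder stand-ins for the real ones in the ledger crate:

```rust
use std::sync::Arc;
use once_cell::sync::Lazy;

// Stand-in for the real field element type used by the node.
type Fp = [u64; 4];

// Hypothetical stand-in for `ledger::V2::empty_hash_at_height`.
fn empty_hash_at_height(height: usize) -> Fp {
    [height as u64; 4]
}

// One empty-subtree hash per height (0..=35). Once built, the table is
// read-only, so `Lazy` alone gives safe shared access; no `Mutex` needed.
static HASH_EMPTIES: Lazy<Arc<[Fp; 36]>> =
    Lazy::new(|| Arc::new(std::array::from_fn(empty_hash_at_height)));
```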
@sebastiencs good catch. Just pushed the version that uses the cached values
force-pushed from c17fac8 to 7e038f6
Looks good. The one must-change is about storing the pending mask in `self.sync.snarked_ledgers` (unless I'm wrong there); the rest can be ignored.
```rust
    return Ok(false);
}

let origin = self
```
Maybe we should use `self.mask(hash)` here instead, so that we copy the mask either from `self.snarked_ledgers` or `self.sync.snarked_ledgers`? Otherwise we won't be able to reuse mid-sync data (not sure if that falls within the scope of this PR yet, though).
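For illustration, the lookup order being suggested, sketched with simplified stand-ins for the real `LedgerHash` and `Mask` types:

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the real `LedgerHash` and `Mask` types.
type LedgerHash = String;
#[derive(Clone)]
struct Mask;

struct SyncState {
    snarked_ledgers: BTreeMap<LedgerHash, Mask>,
}

struct LedgerCtx {
    snarked_ledgers: BTreeMap<LedgerHash, Mask>,
    sync: SyncState,
}

impl LedgerCtx {
    /// Try the already-synced ledgers first, then fall back to ledgers
    /// that are still mid-sync, so partial data can be reused as an origin.
    fn mask(&self, hash: &LedgerHash) -> Option<Mask> {
        self.snarked_ledgers
            .get(hash)
            .or_else(|| self.sync.snarked_ledgers.get(hash))
            .cloned()
    }
}
```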
@binier it falls in the scope of this PR, yes; I just don't handle that case yet.
Regarding the use of `mask`, I made the calls more strict (new `pending_sync_snarked_ledger_mask` method) so that they only check for snarked ledgers, and only ones that exist (instead of creating a new empty mask).
For this case in particular I check both origins (already synced, then syncing).
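Roughly, a stricter accessor along these lines (the name matches the comment; the types are simplified stand-ins):

```rust
use std::collections::BTreeMap;

type LedgerHash = String;
struct Mask;

struct SyncState {
    snarked_ledgers: BTreeMap<LedgerHash, Mask>,
}

impl SyncState {
    /// Only returns a snarked ledger that is already mid-sync; unlike a
    /// get-or-create accessor, it never silently creates an empty mask.
    fn pending_sync_snarked_ledger_mask(
        &mut self,
        hash: &LedgerHash,
    ) -> Result<&mut Mask, String> {
        self.snarked_ledgers
            .get_mut(hash)
            .ok_or_else(|| format!("pending sync snarked ledger not found: {hash}"))
    }
}
```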
node/src/ledger/ledger_service.rs
Outdated
```rust
))?;

let target = origin.copy();
self.snarked_ledgers
```
We should insert that snarked ledger in `self.sync.snarked_ledgers` instead, right? Because until we sync the mismatched parts, `target_snarked_ledger_hash != target.merkle_root()`.
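In other words, something along these lines (a sketch with stand-in types, not the actual code):

```rust
use std::collections::BTreeMap;

type LedgerHash = String;
#[derive(Clone)]
struct Mask;

struct SyncState {
    snarked_ledgers: BTreeMap<LedgerHash, Mask>,
}

/// The copy still has the origin's root hash, which won't equal the target
/// hash until every mismatched subtree has been synced, so it is registered
/// under the *target* hash in the mid-sync map rather than the synced one.
fn start_snarked_ledger_sync(
    sync: &mut SyncState,
    origin: &Mask,
    target_snarked_ledger_hash: LedgerHash,
) {
    let target = origin.clone();
    sync.snarked_ledgers.insert(target_snarked_ledger_hash, target);
}
```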
@binier ufff, good catch. When cleaning up some code at some point I got the impression that I was starting to see less reuse than I was expecting; this would explain it.
FYI, this small change reduced the sync time for the 3 snarked ledgers from 60s to 20s, so it was indeed skipping all the reuse because of that mistake.
node/src/ledger/ledger_service.rs
Outdated
```rust
    snarked_ledger_hash: LedgerHash,
    parent: &LedgerAddress,
) -> Result<(LedgerHash, LedgerHash), String> {
    let mask = self.ctx_mut().sync.snarked_ledger_mut(snarked_ledger_hash);
```
I think we should use `.mask()` here as well, because the name of the fn doesn't imply it will only get hashes if the ledger is mid-sync.
```diff
 let accounts = Vec::<ledger::Account>::binprot_read(&mut bytes)?;

-let (mut mask, total_currency) = Self::build_ledger_from_accounts(accounts);
+let (mut mask, total_currency) =
+    Self::build_ledger_from_accounts_and_hashes(accounts, hashes);
 let ledger_hash = ledger_hash(&mut mask);
 if let Some(expected_hash) = expected_hash.filter(|h| h != &ledger_hash) {
```
Now, since we will be using cached hashes, the hash mismatch check seems useless. How slow is recomputing those hashes? Is it worth storing them?
@binier for the berkeleynet ledger it took about 10 seconds here; on mainnet it would take much longer.
Yeah, fairly slow then... What if we cache the "merge" hashes, but recompute and compare the account hashes? We could also parallelize the hashing.
Any ideas how the OCaml node handles this? Is it slow there as well?
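A sketch of that middle ground with rayon: keep the cached inner ("merge") hashes, but recompute the leaf (account) hashes in parallel and compare them against what was loaded. The account type and hash function below are placeholders, not the real ones:

```rust
use rayon::prelude::*;

// Placeholder account type; the real one is `ledger::Account`.
#[derive(Clone)]
struct Account {
    balance: u64,
}

type AccountHash = u64;

// Placeholder leaf hash; the real code uses a Poseidon hash over the
// account's fields.
fn hash_account(account: &Account) -> AccountHash {
    account.balance.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Recompute only the leaf hashes (in parallel) and check them against the
/// cached values; the merge hashes above them stay cached.
fn verify_account_hashes(accounts: &[Account], cached: &[AccountHash]) -> bool {
    accounts.len() == cached.len()
        && accounts
            .par_iter()
            .zip(cached.par_iter())
            .all(|(account, hash)| hash_account(account) == *hash)
}
```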
In OCaml it is just slow (remember they load from the JSON directly), but it is only an issue the first time, because after the initial load they keep a persisted version of all the important ledgers, which includes all the relevant data. We are not doing that yet, so this binprot loading we are doing is just a workaround.
```rust
    end_addr: LedgerAddress,
    pending_addresses: BTreeMap<LedgerAddress, LedgerAddressQueryPending>,
    /// Pending num account query attempts
    pending_num_accounts: Option<LedgerNumAccountsQueryPending>,
```
Can we combine `num_accounts` and `pending_num_accounts` to avoid impossible states? So that `num_accounts` can't exist unless `pending_num_accounts` is in the accepted state.
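One way to encode that invariant in the type itself (illustrative names, not the actual code): the verified count simply doesn't exist until the query has been accepted.

```rust
// Illustrative sketch: `num_accounts` cannot be observed unless the
// NumAccounts query has been accepted, so the impossible combination
// (a count with no accepted query) is unrepresentable.
enum NumAccountsQuery {
    /// Query in flight; no verified count yet.
    Pending { attempts: u32 },
    /// Response received and validated.
    Accepted { num_accounts: u64 },
}
```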
I'd also split this `TransitionFrontierSyncLedgerSnarkedState::Pending` state into more states, like:

```rust
NumAccountsFetchPending { ... },
NumAccountsFetchSuccess { ... },
TreeSyncPending { .. },
TreeSyncSuccess { .. },
```

Of course this assumes that the fetching of addresses and num accounts doesn't happen in parallel. If it's sequential, that should be clear based on the state.
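Fleshed out a little, with placeholder fields (the actual state would carry targets, attempts, etc.), the split could look like this; because the variants form a pipeline, each reducer step only has to match the one phase it belongs to:

```rust
// Sketch with placeholder fields. The variants are strictly sequential:
// the NumAccounts fetch must succeed before the tree sync can start.
enum TransitionFrontierSyncLedgerSnarkedState {
    NumAccountsFetchPending { attempts: u32 },
    NumAccountsFetchSuccess { num_accounts: u64 },
    TreeSyncPending { num_accounts: u64, synced_accounts: u64 },
    TreeSyncSuccess,
}
```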
Also, the naming makes it confusing: `num_accounts` and `num_accounts_accepted`. At first glance I saw `num_accounts_accepted` and thought it was the response received from `Num_accounts` that we've verified. Of course the comment made it clear that wasn't the case.
Maybe `total_accounts` and `accounts_synced` or something like that would be better.
`num_accounts_expected` + `num_accounts_added`? The validation of the number of accounts claimed by the peer is kind of weak, because at most we validate that the depth of the subtree is correct, but any value for `num_accounts` that matches that depth will be accepted (so it should just be considered an estimate).
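To make the weakness concrete: the claimed count only pins down the height of the smallest subtree that can contain it, so whole ranges of counts pass the same check. A small illustration, assuming left-packed accounts:

```rust
/// Height of the smallest subtree that can hold `num_accounts`
/// left-packed accounts: ceil(log2(num_accounts)).
fn subtree_height_for(num_accounts: u64) -> u32 {
    num_accounts.next_power_of_two().trailing_zeros()
}

fn main() {
    // Every claimed count from 513 to 1024 implies the same height-10
    // subtree, so the depth check alone accepts all of them.
    assert_eq!(subtree_height_for(513), 10);
    assert_eq!(subtree_height_for(1024), 10);
    assert_ne!(subtree_height_for(1025), 10);
}
```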
> Of course assuming that fetching of addresses and num accounts doesn't happen in parallel. If it's sequential, that should be clear based on state.

It is fully sequential (only after we know the number of accounts do we know the root node containing all the accounts), so yes, that split makes total sense; will do that.
I split the `Pending` state in two (one for the num accounts and another for the merkle tree sync). In your comment above you added some `Success` states, which I didn't add. Can you elaborate on that? Should I add them too?
Actually, never mind, it is just cleaner that way; I will add them.
```diff
@@ -35,7 +42,23 @@ pub enum TransitionFrontierSyncLedgerSnarkedState {
 }

 #[derive(Serialize, Deserialize, Debug, Clone)]
-pub struct LedgerQueryPending {
+pub enum LedgerQueryQueued {
+    NumAccounts,
```
If the split of the state into parts suggested above (`NumAccountsFetchPending`, `NumAccountsFetchSuccess`, `TreeSyncPending`, `TreeSyncSuccess`) is taken into account, this can be a struct instead of an enum.
I got rid of this one, now the queue is just for the address queries.
```rust
// we use it as a base to save work during synchronization.
let best_tip = store.state().transition_frontier.sync.best_tip().unwrap();
let target = super::ledger::SyncLedgerTarget::staking_epoch(best_tip);
let origin = best_tip.genesis_ledger_hash().clone();
```
For a very unlikely edge case: if we have any ledgers for `store.state().transition_frontier.best_tip()` (the already-synced best tip), use that instead of genesis.
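One possible reading of this suggestion, sketched as a fallback chain with stand-in types and accessors:

```rust
type LedgerHash = String;

/// Prefer a ledger we already hold for the synced best tip; otherwise
/// fall back to the genesis ledger as the copy origin.
fn pick_sync_origin(
    synced_best_tip_ledger: Option<LedgerHash>,
    genesis_ledger_hash: LedgerHash,
) -> LedgerHash {
    synced_best_tip_ledger.unwrap_or(genesis_ledger_hash)
}
```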
@binier can you expand on this one? Not sure I follow.
force-pushed from f9f83da to 28d1114
force-pushed from e689423 to b0fda52
If this is not done, the cloned mask will need to recompute its internal hashes, which is expensive.
force-pushed from ee63a83 to b8b830e
- fixup(bootstrap): adjust enabling conditions
- review: fix to where mid-sync ledgers are stored and retrieved from
- review: Split num accounts fetch vs merkle tree sync states in transition frontier
- review: rename some sync status fields to make them less ambiguous
- fix(transition-frontier): Check that the ledger doesn't exist in the sync snarked ledgers
- feat(transition-frontier): Make copying of ledgers for sync more strict
- fix(transition-frontier): Fix ledger copying handling for current state machine that produces the genesis block and syncs immediately
- refactor(transition-frontier): Remove some code duplication
- refactor(transition-frontier): Add success states for NumAccounts and MerkleTreeSync (WIP)
- feat(transition-frontier): Implement missing enabling conditions
force-pushed from b8b830e to 549e600
- `num_accounts` query to fetch the number of accounts in the ledger
- From the `num_accounts` query results, skip all the empty tree levels and go straight to the first node containing child accounts.
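A sketch of the second point, assuming a depth-35 tree (matching the 36-entry empty-hash table) and left-packed accounts: everything above the smallest subtree that encloses all accounts pairs a known subtree with precomputed empty hashes, so querying can start that many levels down.

```rust
// Assumed constant: 36 hash levels means a tree depth of 35.
const LEDGER_DEPTH: u32 = 35;

/// How many levels below the root the first node containing child accounts
/// sits; every level above it can be derived locally from empty hashes.
fn skippable_levels(num_accounts: u64) -> u32 {
    let subtree_height = num_accounts.next_power_of_two().trailing_zeros();
    LEDGER_DEPTH - subtree_height
}

fn main() {
    // With one million accounts, they all fit in a height-20 subtree,
    // so the top 15 levels never need to be queried from peers.
    assert_eq!(skippable_levels(1_000_000), 15);
}
```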