benchmark parachain and standalone chain #232

ltfschoen opened this issue Oct 7, 2021 · 0 comments

Based on my review of a previous discussion between Alan S, Basti, and Sergei in Element's "Parachain Technical" room, Alan S shared how he profiled their parachain's block authoring execution time for benchmarking and stack analysis with trace debugging, as follows:

  • run your node with the flags --dev, -lsync=trace, -lsub-libp2p=trace
  • run perf record -F 999 -p <pid_of_your_node> --call-graph dwarf
  • wait for your node to produce a block, then Ctrl+C to stop perf (you can keep the node running to repeat later)
  • dump the profile with perf script --no-inline > perf.script.data
  • open the output at https://www.speedscope.app to view the execution profile (e.g. perf.basti-cache-runtime-fix.data from PR #9611, shared in Element's "Parachain Technical" room)

They were using the default Cumulus authorship deadline of 500ms (i.e. 12000 * (1/24) = SLOT_DURATION * block_proposal_slot_portion), where SLOT_DURATION equals their MILLISECS_PER_BLOCK of 12000.

But for DataHighway's Westlake we're currently using 4320 for MILLISECS_PER_BLOCK, so our authorship deadline is much shorter at 180ms (4320 * (1/24)). If we also want a ~500ms Cumulus authorship deadline, we may need to change the proportions to roughly 500/4320 ≈ 1/8 and 750/4320 ≈ 1/6:

// ~540ms for proposing (4320ms * 1/8), close to the 500ms target
block_proposal_slot_portion: SlotProportion::new(1f32 / 8f32),
// And a maximum of ~720ms if slots are skipped (4320ms * 1/6)
max_block_proposal_slot_portion: Some(SlotProportion::new(1f32 / 6f32)),
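
As a quick sanity check of the arithmetic (my own sketch, not from the Element discussion), the deadline is simply SLOT_DURATION multiplied by the slot proportion:

// Sanity-check of the authorship-deadline arithmetic (illustrative only):
// deadline = SLOT_DURATION * slot proportion.
fn deadline_ms(slot_duration_ms: u64, numer: u64, denom: u64) -> u64 {
    slot_duration_ms * numer / denom
}

fn main() {
    // Cumulus defaults: 12000ms slots, proportions 1/24 and 1/16.
    assert_eq!(deadline_ms(12000, 1, 24), 500); // proposing deadline
    assert_eq!(deadline_ms(12000, 1, 16), 750); // max if slots are skipped

    // DataHighway Westlake: 4320ms slots, proposed proportions 1/8 and 1/6.
    assert_eq!(deadline_ms(4320, 1, 8), 540); // close to the 500ms target
    assert_eq!(deadline_ms(4320, 1, 6), 720); // close to the 750ms target
}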

Note that in the polkadot repo (https://github.com/paritytech/polkadot), both Millau and Rialto use 6000 for MILLISECS_PER_BLOCK, with block_proposal_slot_portion: SlotProportion::new(2f32 / 3f32) and max_block_proposal_slot_portion: None.
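
(With a 6000ms slot, 2/3 gives a 4000ms proposing deadline and no separate cap when slots are skipped.)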

Alan S discovered that their 500ms was split up as follows:

500ms - parachain block authoring
  • 140ms - reserved for initialization/finalization (i.e. sc_basic_authorship::basic_authorship)
    • 65% - block production (including verifying extrinsic signatures for inclusion)
    • 35% - block finalization
  • 360ms - applying extrinsics and overhead (apply_extrinsic)
    • 25% - overhead of retrieving runtime_code() from the storage cache (sc_client_db::storage_cache); only if there is no new runtime code, otherwise it is fetched from the TrieBackend
    • 50% - blake2-related overhead of runtime_code() before each extrinsic is applied (apply_extrinsic_call_at...contextual_call/runtime_code); this overhead is absent when running the node with --dev
    • 25% - applying the extrinsics themselves via extrinsic.check (i.e. ecdsa signature verification); ~100ms for 100 extrinsics using system::remark

Basti then created PR paritytech/substrate#9611, which improved throughput with basic extrinsics from a maximum of 180 tx/block to 450 tx/block (a ~2.5x improvement).

I believe we need to:

  • profile our parachain using perf as described above, with the kinds of extrinsics we'll actually be using, to benchmark and stack-analyse the block authoring execution time, and use trace debugging to determine whether we need to change the constants shown in the extracts below (e.g. the slot proportions and weight limits)

Note: one user mentioned that "transactions take progressively longer the later they go into a block in a linear way".

Here are extracts of the relevant parts of the codebase that we should consider changing in our 'ilya/parachain-update' branch:

pub const MILLISECS_PER_BLOCK: u64 = 12000;
pub const SLOT_DURATION: u64 = MILLISECS_PER_BLOCK;

// We got around 500ms for proposing
block_proposal_slot_portion: SlotProportion::new(1f32 / 24f32),
// And a maximum of 750ms if slots are skipped
max_block_proposal_slot_portion: Some(SlotProportion::new(1f32 / 16f32)),
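
(These are the upstream defaults: with MILLISECS_PER_BLOCK = 12000, the 1/24 and 1/16 proportions give the 500ms and 750ms deadlines discussed above.)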

...

/// We assume that ~10% of the block weight is consumed by `on_initialize` handlers.
/// This is used to limit the maximal weight of a single extrinsic.
const AVERAGE_ON_INITIALIZE_RATIO: Perbill = Perbill::from_percent(10);
/// We allow `Normal` extrinsics to fill up the block up to 75%, the rest can be used
/// by `Operational` extrinsics.
const NORMAL_DISPATCH_RATIO: Perbill = Perbill::from_percent(75);
/// We allow for 0.5 of a second of compute with a 12 second average block time.
const MAXIMUM_BLOCK_WEIGHT: Weight = WEIGHT_PER_SECOND / 2;
pub const WEIGHT_PER_SECOND: Weight = 1_000_000_000_000;
pub const WEIGHT_PER_MILLIS: Weight = WEIGHT_PER_SECOND / 1000; // 1_000_000_000
pub const WEIGHT_PER_MICROS: Weight = WEIGHT_PER_MILLIS / 1000; // 1_000_000

/// Executing 10,000 System remarks (no-op) txs takes ~1.26 seconds -> ~125 µs per tx
pub const ExtrinsicBaseWeight: Weight = 125 * WEIGHT_PER_MICROS;
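
As a sketch of how these constants typically feed into the runtime's frame_system configuration (assumed, not quoted from our codebase; the exact builder APIs vary by Substrate version):

// A sketch of wiring the constants above into frame_system::limits::BlockWeights,
// following the pattern used by Substrate runtimes of this vintage.
use frame_support::{parameter_types, weights::DispatchClass};
use frame_support::weights::constants::BlockExecutionWeight;

parameter_types! {
    pub RuntimeBlockWeights: frame_system::limits::BlockWeights =
        frame_system::limits::BlockWeights::builder()
            // Fixed cost of executing an empty block.
            .base_block(BlockExecutionWeight::get())
            .for_class(DispatchClass::all(), |weights| {
                // Fixed per-extrinsic cost (the ~125µs measured above).
                weights.base_extrinsic = ExtrinsicBaseWeight;
            })
            .for_class(DispatchClass::Normal, |weights| {
                // Normal extrinsics may fill up to 75% of the block.
                weights.max_total = Some(NORMAL_DISPATCH_RATIO * MAXIMUM_BLOCK_WEIGHT);
            })
            .for_class(DispatchClass::Operational, |weights| {
                // Operational extrinsics may use the full block.
                weights.max_total = Some(MAXIMUM_BLOCK_WEIGHT);
            })
            // Reserve ~10% of the block weight for `on_initialize` handlers.
            .avg_block_initialization(AVERAGE_ON_INITIALIZE_RATIO)
            .build_or_panic();
}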