[EXP] perf(decode): approx. 100x speed improvement w/ various optimizations #310
Conversation
Hey! Thank you for opening! I'll take a look today :) If you wouldn't mind sharing profiling results & flamegraphs, you can comment them here or send to
Hey Jon! I've attached a flamegraph I made before making changes to Heimdall, and one from the latest iteration (with damerau toggled off). Please note my code uses both parallelism and concurrency, resolving transaction inputs in batches of 10,000. You can easily run your own flamegraph using https://github.com/flamegraph-rs/flamegraph, or profile your app using https://github.com/cmyr/cargo-instruments if on macOS (I can't share mine as they contain sensitive information).
I've removed the email. I'll take a look at implementing more optimizations from this PR shortly <3
Any questions, just shoot! Thanks a lot for having created heimdall-rs in any case! 🫡
Motivation
Faced with millions of transactions, Heimdall wasn't nearly fast enough, decoding calldata at a rate of approx. 10,000 txs per minute, with a calldata length cap of 15,000 bytes (I know gas or gas-normalized metrics would be more accurate, sorry!).
As such, after profiling both the CPU and IO components, I've picked the low-hanging fruit to make Heimdall more than just a single-run tool.
Please note, I have added "Mock" to the title because, while this works well, it would require some cleanup to be accepted as a legitimate PR. So, see this as a form of heads-up.
Solution
The solution consisted of three main components:
Obviously, you'll notice that the cache update could be done better, e.g. flushed at the end of the process into the /cache directory. I am happy with these limitations, as my goal is met here. Ultimately, these elements justify making this PR a "mock", subject to relatively mild but sensible final polishing work.
I've saved some flamegraphs and other CPU/IO profiling output from along the way (even though I'd argue most of the performance gains were found via flamegraph; profiling on macOS requires more upfront work than on Linux...). If you want me to pass these along, feel free to ask!