[EXP] perf(decode): approx. 100x speed improvement w/ various optimizations #310
Conversation
Hey! Thank you for opening! I'll take a look today :) If you wouldn't mind sharing profiling results & flamegraphs, you can comment them here or send to
Hey Jon! I've attached a flamegraph I made before making changes to Heimdall, and one from the latest iteration (with damerau toggled off). Please note my code uses both parallelism and concurrency, resolving transaction inputs in batches of 10,000. You can easily run your own flamegraph using https://github.com/flamegraph-rs/flamegraph, or profile your app using https://github.com/cmyr/cargo-instruments if on macOS (I can't share mine as they contain sensitive information).
I've removed the email. I'll take a look at implementing more optimizations from this PR shortly <3
Any questions, just shoot! Thanks a lot for having created heimdall-rs in any case! 🫡
Motivation
Faced with millions of transactions, Heimdall wasn't nearly fast enough, decoding calldata at a rate of approx. 10,000 txs per minute, with a calldata length cap of 15,000 bytes (I know gas or gas-normalized metrics would be more accurate, sorry!).
As such, after profiling both the CPU and IO components, I've picked the low-hanging fruit to make Heimdall more than just a single-run tool.
Please note, I have added "Mock" to the title because, while this works well, it would require some cleanup to be accepted as a legitimate PR. So, see this as a form of heads-up.
Solution
The solution consisted of three main components:
Obviously, you'll notice that the cache update could be done better, e.g. flushed at the end of the process into the /cache directory. I am happy with these limitations, as my goal is met here. Ultimately, these elements justify making this PR a "mock", subject to relatively mild but sensible final polishing work.
I've saved some flamegraphs and other CPU/IO profiling output from along the way (even though I'd argue most of the performance gains were found via flamegraph; profiling on macOS requires more upfront work than on Linux...). If you want me to pass these along, feel free to ask!