
Low-level cluster linearization code #30126

Open
wants to merge 24 commits into base: master

Conversation

sipa (Member) commented May 16, 2024

Depends on #30160 and #30161. Eventually #28676 will end up being based on this.

This introduces low-level optimized cluster linearization code, including tests and some benchmarks. It is currently not hooked up to anything.

Roughly, the commits are organized into three groups:

  • A repeat of part of Several randomness improvements (#29625).
  • Introduce unoptimized versions of candidate finding and linearization, plus benchmarks and tests.
  • Add various optimizations, step by step.

Ultimately, what this PR adds is the functions Linearize, PostLinearize, and MergeLinearizations, which operate on DepGraph instances (pre-processed representations of transaction clusters) to produce and/or improve linearizations for those clusters.
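
As a rough illustration of how the three functions fit together, here is a hedged usage sketch. The function names and parameters come from this PR's description; the DepGraph construction, the BitSet<32> set type, and the return shape of Linearize are assumptions, not verbatim API:

```cpp
// Sketch only; exact types/signatures live in the PR and may differ.
#include <cstdint>
#include <vector>

// Hypothetical helper that builds a DepGraph for some small cluster.
DepGraph<BitSet<32>> MakeExampleCluster();

void Example()
{
    DepGraph<BitSet<32>> depgraph = MakeExampleCluster();
    std::vector<ClusterIndex> old_lin{0, 1, 2};  // some existing linearization

    // Produce a linearization: bounded randomized search on top of ancestor
    // sets, never worse than old_lin. Assumed to return (linearization, done).
    auto [lin, optimal] = Linearize(depgraph, /*iteration_limit=*/10000,
                                    /*rng_seed=*/0xc0ffee, old_lin);

    // Improve it in place (two post-processing passes).
    PostLinearize(depgraph, lin);

    // Combine with another linearization; at least as good as both inputs.
    auto merged = MergeLinearizations(depgraph, lin, old_lin);
    (void)optimal; (void)merged;
}
```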

Along the way two new data structures are introduced (util/bitset.h and util/ringbuffer.h), which could be useful more broadly. They have their own commits, which include tests.
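
For instance, a minimal BitSet sketch, assuming the method names (Set, First, Last, and range-for iteration over set positions) match the commit description; treat the exact API as an assumption:

```cpp
#include <util/bitset.h>  // added by this PR

#include <cstdio>

int main()
{
    // Unlike std::bitset, BitSet<N> exposes first/last set bit and iteration
    // over set positions without scanning bit by bit.
    BitSet<64> s;
    s.Set(3);
    s.Set(17);
    s.Set(41);
    for (unsigned pos : s) std::printf("%u ", pos);  // prints "3 17 41"
    std::printf("\nfirst=%u last=%u\n", s.First(), s.Last());
    return 0;
}
```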


To provide assurance, the code relies heavily on fuzz tests. A novel approach is used here: the fuzz input is parsed using the serialization.h framework rather than FuzzedDataProvider, with a custom serializer/deserializer for DepGraph objects. Including serialization makes it possible to ascertain that the format can represent every relevant cluster, and it potentially permits constructing ad-hoc fuzz inputs from real clusters (not included in this PR, but used during development).
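
Schematically, the approach looks something like the following (a sketch, assuming names like DepGraphFormatter and TestBitSet; the real fuzz targets in the PR differ in detail):

```cpp
FUZZ_TARGET(clusterlin_example)
{
    // Parse the raw fuzz input as a serialized DepGraph, instead of feeding
    // it through FuzzedDataProvider.
    SpanReader reader(buffer);
    DepGraph<TestBitSet> depgraph;
    try {
        // Custom formatter: every byte string decodes to some valid cluster,
        // and every cluster has an encoding, so the input space maps onto
        // the space of relevant clusters.
        reader >> Using<DepGraphFormatter>(depgraph);
    } catch (const std::ios_base::failure&) {}

    // ... exercise Linearize()/PostLinearize()/MergeLinearizations() ...
}
```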


The Linearize(depgraph, iteration_limit, rng_seed, old_linearization) function is an implementation of the (single) LIMO algorithm, with the $S$ in every iteration found as the best out of (a) the best remaining ancestor set and (b) randomized computationally-bounded search. It incrementally builds up a linearization by finding good topologically-valid subsets to move to the front, in such a way that the resulting linearization has a diagram that is at least as good as the old_linearization passed in (if any). A sketch of this loop follows the notes below.

  • Despite using both best ancestor set and search, this is not Double LIMO, as no intersections between these are involved; just the best of the two.
  • The iteration_limit and rng_seed only control the (b) randomized search. Even with 0 iterations, the result will be as good as the old linearization, and the included sets at every point will have a feerate at least as high as the best remaining ancestor set at that point.
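
Here is that sketch, showing roughly how the loop combines (a) and (b). The helper names (AncestorCandidateFinder, SearchCandidateFinder) come from the commit list below, but the details are illustrative, not the PR's literal code:

```cpp
std::vector<ClusterIndex> lin;
AncestorCandidateFinder anc_finder(depgraph);
SearchCandidateFinder srch_finder(depgraph, rng_seed);
while (!anc_finder.AllDone()) {
    // (a) the best remaining ancestor set...
    auto cand = anc_finder.FindCandidateSet();
    // ...used as the initial best for (b) bounded randomized search.
    cand = srch_finder.FindCandidateSet(iteration_limit, cand);
    // Move the winning subset to the front of what remains (in valid
    // topological order; glossed over here).
    for (ClusterIndex i : cand.transactions) lin.push_back(i);
    anc_finder.MarkDone(cand.transactions);
    srch_finder.MarkDone(cand.transactions);
}
```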

The search algorithm used in the (b) step above largely follows Section 2 of How to linearize your cluster, though with a few changes (a sketch of the work-item representation follows this list):

  • Connected component analysis is performed inside the search algorithm (creating initial work items per component for each candidate), rather than once at a higher level. This duplicates some work but is significantly simpler in implementation.
  • No ancestor-set based presplitting inside the search is performed; instead, the best value is initialized with the best topologically valid set known to the LIMO algorithm before search starts: the better one out of the highest-feerate remaining ancestor set, and the highest-feerate prefix of remaining transactions in old_linearization.
  • Work items are represented using an included set inc and an undefined set und, rather than included and excluded.
  • Potential sets pot are not computed for work items with empty inc.
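
Tying the last two bullets together, a work item might look roughly like this (a sketch as referenced above; field names are illustrative assumptions):

```cpp
template<typename SetType>
struct WorkItem {
    SetInfo<SetType> inc;  // transactions included so far, with cached fee/size
    SetType und;           // transactions still undecided
    FeeFrac pot_feerate;   // feerate of the cached potential set pot
                           // (not computed while inc is empty)
};
// Excluded transactions need no explicit set: they are simply the ones in
// neither inc nor und. Splitting on transaction t yields one item with t
// (plus its undecided ancestors) added to inc, and one with t (plus its
// descendants) removed from und.
```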

At a high level, the only missing optimization from that post is bottleneck analysis; my thinking is that it only really helps with clusters that are already relatively cheap to linearize (doing so would need to be done at a higher level, not inside the search algorithm).

The PostLinearize(depgraph, linearization) function performs an in-place improvement of linearization, using two passes of the linearization post-processing algorithm: the first running from back to front, the second from front to back.

The MergeLinearizations(depgraph, linearization1, linearization2) function takes two existing linearizations for the provided cluster and computes a new linearization that is at least as good as both inputs. The algorithm is described at a high level in merging incomparable linearizations.

DrahtBot (Contributor) commented May 16, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #30161 (util: add VecDeque by sipa)
  • #29625 (Several randomness improvements by sipa)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot (Contributor) commented

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

Leave a comment here if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/25072594213

sipa (Member, Author) commented May 20, 2024

Benchmarks on my Ryzen 5950X system:

ns/op op/s err% total benchmark
2,373.94 421,240.11 0.1% 1.10 LinearizeNoIters16TxWorstCase
7,530.22 132,798.26 0.0% 1.07 LinearizeNoIters32TxWorstCase
16,585.34 60,294.20 0.1% 1.10 LinearizeNoIters48TxWorstCase
28,591.70 34,975.18 0.1% 1.10 LinearizeNoIters64TxWorstCase
53,918.56 18,546.49 0.0% 1.10 LinearizeNoIters75TxWorstCase
93,589.21 10,684.99 0.1% 1.10 LinearizeNoIters99TxWorstCase
ns/iters iters/s err% total benchmark
45.36 22,045,550.98 0.5% 1.10 LinearizePerIter16TxWorstCase
35.57 28,111,376.58 0.1% 1.10 LinearizePerIter32TxWorstCase
33.04 30,262,951.89 0.0% 1.10 LinearizePerIter48TxWorstCase
33.21 30,107,745.17 0.1% 1.10 LinearizePerIter64TxWorstCase
75.98 13,161,530.63 0.4% 1.07 LinearizePerIter75TxWorstCase
76.62 13,051,066.77 0.5% 1.08 LinearizePerIter99TxWorstCase
ns/op op/s err% total benchmark
332.97 3,003,274.74 0.0% 1.10 PostLinearize16TxWorstCase
1,121.92 891,330.77 0.0% 1.10 PostLinearize32TxWorstCase
3,358.33 297,767.01 0.3% 1.13 PostLinearize48TxWorstCase
5,826.72 171,623.05 0.5% 1.11 PostLinearize64TxWorstCase
7,453.31 134,168.55 0.1% 1.07 PostLinearize75TxWorstCase
12,476.44 80,151.09 0.1% 1.10 PostLinearize99TxWorstCase

This means that for a 64-transaction cluster, it should be possible to linearize (28.59 µs) with 100 candidate search iterations (3.32 µs) plus postlinearize (5.83 µs), within a total of 37.74 µs, on my system.

src/util/bitset.h (resolved review thread)
sipa (Member, Author) commented May 23, 2024

I've dropped the dependency on #29625 and switched to using FastRandomContext instead. There is a measurable slowdown from using the (ChaCha20-based) FastRandomContext over the (xoroshiro128++-based) InsecureRandomContext introduced there, but it's no more than 1-2%. I can switch back to that approach if #29625 makes it in.

DrahtBot (Contributor) commented

Guix builds (on x86_64) [untrusted test-only build, possibly unsafe, not for production use]

File | commit 83ae1ba (master) | commit e5cbc23 (master and this pull)
SHA256SUMS.part 24fd016e03e8c7da... 15fae3483445e33b...
*-aarch64-linux-gnu-debug.tar.gz 94942cf7dedf3604... 23eeccf77ee5799d...
*-aarch64-linux-gnu.tar.gz 4b30ca93b6788f48... ed8e5024d960f53e...
*-arm-linux-gnueabihf-debug.tar.gz a0f57c45e5f02bb1... f22f89c1eba49dda...
*-arm-linux-gnueabihf.tar.gz 9f0376baaf54b988... 17da8a968635c492...
*-arm64-apple-darwin-unsigned.tar.gz 9b952b32db70d099... 16d805ab4bcf8d54...
*-arm64-apple-darwin-unsigned.zip d49361bbbc5529fc... e225d79a24b058a5...
*-arm64-apple-darwin.tar.gz 34e9cf4b79cbc190... 29b28e6d57761201...
*-powerpc64-linux-gnu-debug.tar.gz 5f322a7b213e244e... cb5f37b036b5c52c...
*-powerpc64-linux-gnu.tar.gz bb57b46482c5b1e6... 57adf954458a27d5...
*-riscv64-linux-gnu-debug.tar.gz d1a3a405c5b45fff... 237eb467f8547d22...
*-riscv64-linux-gnu.tar.gz 68d7e6671e2dba30... 29d9f1e9052e96d3...
*-x86_64-apple-darwin-unsigned.tar.gz 6fb22000e8c14c40... 67e5bd5b86483c8a...
*-x86_64-apple-darwin-unsigned.zip 1c5f2a216e87cbf5... abdbca97fafc146f...
*-x86_64-apple-darwin.tar.gz 66f17a574163ecaf... f002830b4b8da330...
*-x86_64-linux-gnu-debug.tar.gz a5044f956a824228... 0791685e39e80672...
*-x86_64-linux-gnu.tar.gz 23af1dc6cb921b37... ff9625165c3f19c2...
*.tar.gz caac4a182deb1e04... ba8abeef4165dafb...
guix_build.log c7cc0190f7085f04... 100da60c2f0e6686...
guix_build.log.diff 7c460aa3b1aafc32...

sipa added 23 commits May 29, 2024 11:44
This adds a bitset module that implements a BitSet<N> class, a variant
of std::bitset with a few additional features that cannot be implemented
in a wrapper without performance loss (specifically, finding first and
last bit set, or iterating over all set bits).
…ypes

This primarily adds the DepGraph class, which encapsulates precomputed
ancestor/descendant information for a given transaction cluster, with a
number of utility features (inspectors for set feerates, computing
reduced parents/children, adding transactions, adding dependencies), which
will be needed in future commits.
This introduces a bespoke fuzzing-focused serialization format for DepGraphs,
tests that this format can represent any graph and roundtrips, and then uses
it to test the correctness of DepGraph itself.

This forms the basis for future fuzz tests that need to work with interesting
graphs.
This is a class that encapsulates precomputed ancestor set feerates, and
presents an interface for getting the best remaining ancestor set.
Similar to AncestorCandidateFinder, this encapsulates the state needed for
finding good candidate sets using a search algorithm.
This adds a first version of the overall linearization interface, which given
a DepGraph constructs a good linearization, by incrementally including good
candidate sets (found using AncestorCandidateFinder and SearchCandidateFinder).
Add benchmarks for known bad graphs for the purpose of search (as
an upper bound on work per search iteration) and ancestor sorting
(as an upper bound on linearization work with no search iterations).
Add a correctness test for the overall linearization algorithm.
Add utility functions to DepGraph for finding connected components.
Before this commit, the worst case for linearization involves clusters which
break apart into several smaller components after the first candidate is
included in the output linearization.

Address this by never considering work items that span multiple components
of what remains of the cluster.
This is an STL-like container that interface-wise looks like std::deque, but
is backed by a (fixed size, with vector-like capacity/reserve) circular buffer.
Switch to BFS exploration of the search tree in SearchCandidateFinder
instead of DFS exploration. This appears to behave better for real
world clusters.

As BFS has the downside of needing far larger search queues, switch
back to DFS temporarily when the queue grows too large.
To make search non-deterministic, change the BFS logic from always picking
the first queue item to randomly picking the first or second queue item.
This implements the LIMO algorithm for linearizing by improving an existing
linearization. See
https://delvingbitcoin.org/t/limo-combining-the-best-parts-of-linearization-search-and-merging
for details.
This is a requirement for a future commit, which will rely on quickly iterating
over transaction sets in decreasing individual feerate order.
…ion)

In each work item, keep track of a conservative overestimate of the best
possible feerate that can be reached from it, and then use these bounds to
avoid exploring hopeless work items (a sketch of this check follows the
commit list).
Keep track of which transactions in the graph have an individual
feerate that is better than the best included set so far. Others do not
need to be added to the pot set, as they cannot possibly help beat
best.
Automatically add topologically-valid subsets of the potential set pot
to inc. It can be proven that these must be part of the best reachable
topologically-valid set from that work item.
Empirically, this approach seems to be more efficient in common real-life
clusters, and does not change the worst case.
…ion)

Cache the potential set inside work items, and use it to skip part of
the computation for the work items split off from it.
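
A compressed sketch of the feerate-bound pruning mentioned a few commits up, as referenced there. Names are assumed for illustration; the >> comparison here is taken to mean "strictly higher feerate":

```cpp
// Inside the search loop, before exploring a dequeued work item:
// pot_feerate conservatively overestimates the best feerate reachable from
// this item; best is the best candidate found so far. If the bound cannot
// beat best, the item is hopeless and can be dropped unexplored.
if (!best.feerate.IsEmpty() && !(item.pot_feerate >> best.feerate)) continue;
```
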
sipa (Member, Author) commented May 29, 2024

I've added support for merging linearizations to this PR (MergeLinearizations() function), plus benchmarks and tests.

DrahtBot (Contributor) commented

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

Leave a comment here if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/25568469819

Projects: Priority Project (Status: Primary Blocker)

6 participants