contrib: add tool to convert compact-serialized UTXO set to SQLite database #27432

theStack · 2023-04-06T17:59:23Z

Problem description

There is demand from users to get the UTXO set in the form of a SQLite database (#24628). Bitcoin Core currently only supports dumping the UTXO set in a binary compact-serialized format, which was crafted specifically for AssumeUTXO snapshots (see PR #16899), with the primary goal of being as compact as possible. Previous PRs tried to extend the dumptxoutset RPC with new formats, either in human-readable form (e.g. #18689, #24202), or most recently, directly as a SQLite database (#24952). Neither approach is optimal: due to the huge size of the ever-growing UTXO set, with already more than 80 million entries on mainnet, human-readable formats are practically useless, and very likely one of the first steps would be to put them in some form of database anyway. Directly adding SQLite3 dumping support, on the other hand, introduces an additional dependency to the non-wallet part of bitcoind and the risk of increased maintenance burden (see e.g. #24952 (comment), #24628 (comment)).

Proposed solution

This PR follows the "external tooling" route by adding a simple Python script for achieving the same goal in a two-step process (first create compact-serialized UTXO set via dumptxoutset, then convert it to SQLite via the new script). Executive summary:

single file, no extra dependencies (sqlite3 is included in Python's standard library [1])
~150 LOC, mostly deserialization/decompression routines ported from the Core codebase and (probably the most difficult part) a little elliptic curve / finite field math to decompress pubkeys (essentially solving the secp256k1 curve equation y^2 = x^3 + 7 for y given x, choosing the y parity indicated by the compression tag)
creates a database with only one table utxos with the following schema:
(txid TEXT, vout INT, value INT, coinbase INT, height INT, scriptpubkey TEXT)
the resulting file has roughly 2x the size of the compact-serialized UTXO set (this is mostly due to encoding txids and scriptpubkeys as hex-strings rather than bytes)

[1] note that there are some rare cases of operating systems like FreeBSD, though, where the sqlite3 module has to be installed explicitly (see #26819)
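The pubkey decompression step mentioned in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from the script: the function name `decompress_pubkey` is made up here, and it only handles the standard 0x02/0x03 compression tags. It relies on the fact that the secp256k1 field prime is congruent to 3 mod 4, so a square root can be computed with a single modular exponentiation.

```python
# secp256k1 field prime; the curve equation is y^2 = x^3 + 7 (mod P).
P = 2**256 - 2**32 - 977


def decompress_pubkey(compressed: bytes):
    """Return the 65-byte uncompressed encoding, or None if x is not on the curve.

    Illustrative helper (hypothetical name), not taken from utxo_to_sqlite.py.
    """
    assert len(compressed) == 33 and compressed[0] in (2, 3)
    x = int.from_bytes(compressed[1:], "big")
    if x >= P:
        return None  # not a valid field element
    rhs = (pow(x, 3, P) + 7) % P
    # P % 4 == 3, so if rhs has a square root, rhs^((P+1)/4) mod P is one.
    y = pow(rhs, (P + 1) // 4, P)
    if pow(y, 2, P) != rhs:
        return None  # x^3 + 7 is not a square mod P: no curve point has this x
    # The tag encodes the parity of y: 0x02 = even, 0x03 = odd.
    if (y & 1) != (compressed[0] & 1):
        y = P - y
    return b"\x04" + x.to_bytes(32, "big") + y.to_bytes(32, "big")
```

Returning `None` for an x coordinate without a matching y mirrors the "not on the curve" case, where decompression cannot succeed.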

A functional test is also added that creates UTXO set entries with various output script types (standard and also non-standard, for e.g. large scripts) and verifies that the UTXO sets of both formats match by comparing corresponding MuHashes. One MuHash is supplied by the bitcoind instance via gettxoutsetinfo muhash, the other is calculated in the test by reading back the created SQLite database entries and hashing them with the test framework's MuHash3072 module.

Manual test instructions

I'd suggest also doing manual tests by comparing MuHashes. For that, I wrote a Go tool some time ago which calculates the MuHash of a SQLite database in the created format (I tried writing a similar tool in Python, but it was painfully slow).

$ [run bitcoind instance with -coinstatsindex]
$ ./src/bitcoin-cli dumptxoutset ~/utxos.dat
$ ./src/bitcoin-cli gettxoutsetinfo muhash <block height returned in previous call>
(outputs MuHash calculated from node)

$ ./contrib/utxo-tools/utxo_to_sqlite.py ~/utxos.dat ~/utxos.sqlite
$ git clone https://github.com/theStack/utxo_dump_tools
$ cd utxo_dump_tools/calc_utxo_hash
$ go run calc_utxo_hash.go ~/utxos.sqlite
(outputs MuHash calculated from the SQLite UTXO set)

=> verify that both MuHashes are equal

For a demonstration of what can be done with the resulting database, see #24952 (review) for some example queries. Thanks go to LarryRuane, who gave me the idea of rewriting this script in Python and adding it to contrib.
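As a rough idea of what such queries could look like, here is a small sketch using Python's built-in sqlite3 module against the schema from the description. The three rows are made-up placeholder data in an in-memory database, not real UTXOs; with a real conversion result one would connect to the generated file instead.

```python
import sqlite3
from contextlib import closing

# In-memory stand-in for the generated database; the rows are invented.
with closing(sqlite3.connect(":memory:")) as con:
    con.execute(
        "CREATE TABLE utxos(txid TEXT, vout INT, value INT, "
        "coinbase INT, height INT, scriptpubkey TEXT)"
    )
    con.executemany(
        "INSERT INTO utxos VALUES(?, ?, ?, ?, ?, ?)",
        [
            ("aa" * 32, 0, 5_000_000_000, 1, 1, "51"),
            ("bb" * 32, 1, 1234, 0, 100, "0014" + "00" * 20),
            ("bb" * 32, 2, 546, 0, 100, "6a"),
        ],
    )

    # Number of UTXOs and total amount (in satoshis).
    utxo_count, total_value = con.execute(
        "SELECT COUNT(*), SUM(value) FROM utxos"
    ).fetchone()

    # Number of unspent coinbase outputs.
    coinbase_count = con.execute(
        "SELECT COUNT(*) FROM utxos WHERE coinbase = 1"
    ).fetchone()[0]

    # Total value per block height, largest first.
    value_by_height = con.execute(
        "SELECT height, SUM(value) FROM utxos "
        "GROUP BY height ORDER BY SUM(value) DESC"
    ).fetchall()
```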

DrahtBot · 2023-04-06T17:59:26Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept ACK	jamesob, dunxen, pablomartin4btc, Sjors
Approach ACK	ajtowns
Stale ACK	willcl-ark

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#28984 (Cluster size 2 package rbf by instagibbs)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

pinheadmz · 2023-04-06T18:01:47Z

This also closes #21670 ;-)

theStack · 2023-04-06T18:03:49Z

Pinging users who worked on or reviewed / expressed interest in the previous attempt to solve this issue (PR #24952):
@dunxen @0xB10C @jamesob @prusnak @willcl-ark @w0xlt @jonatack @brunoerg @laanwj @fanquake

achow101 · 2023-04-20T19:43:27Z

the resulting file has roughly 2x the size of the compact-serialized UTXO set (this is mostly due to encoding txids and scriptpubkeys as hex-strings rather than bytes)

What is the rationale for encoding as text rather than bytes? SQLite can store byte values as BLOBs.

theStack · 2023-04-20T20:45:12Z

the resulting file has roughly 2x the size of the compact-serialized UTXO set (this is mostly due to encoding txids and scriptpubkeys as hex-strings rather than bytes)

What is the rationale for encoding as text rather than bytes? SQLite can store byte values as BLOBs.

Fair question. There was already some discussion in #24952 about whether to store txids/scriptPubKeys as TEXT or BLOB, see #24952 (review), #24952 (comment) and #24952 (comment). The two main points were:

conversion from this database to json/csv etc should be as simple as possible, and ideally "select * from utxos" should already lead to human-readable output. Converting would then be as trivial as a short one-liner sqlite3 -json utxo.db "SELECT * FROM utxos" > utxo.json, without even having to specify the columns
more annoying: if using BLOB for the txid, we would have to decide whether to store the txid in little or big endian byte order. Big endian would be more natural, as that's how we internally store the txid and also serialize it on the wire, but we show everything in little endian byte order in Bitcoin Core, so a simple "select hex(txid) from utxos" would just show the txid in the wrong order, and something like "reverse(...)" doesn't exist in SQLite (though some hacky workarounds have been proposed: #24952 (comment) and #24952 (comment)). There remains the possibility of just storing the txid in little endian order, but that's the opposite of what we do in leveldb or in the serialization of outpoints, so it could lead to huge confusion for users if not clearly documented. For TEXT the order is clear: what is stored is exactly what is shown everywhere in Bitcoin Core and in all wallets, block explorers etc.
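To make the byte-order issue concrete, here is a small sketch of the conversion between the serialized 32 bytes and the hex string shown in user interfaces. The helper names are made up for this illustration; the example value is the txid of the genesis block's coinbase transaction.

```python
def displayed_txid(serialized: bytes) -> str:
    """Hex string as shown by RPCs and block explorers: the serialized bytes, reversed."""
    assert len(serialized) == 32
    return serialized[::-1].hex()


def serialized_txid(displayed: str) -> bytes:
    """Inverse of displayed_txid(): bytes as they appear in serialized outpoints."""
    raw = bytes.fromhex(displayed)
    assert len(raw) == 32
    return raw[::-1]


# Genesis block coinbase txid, in display order.
GENESIS_TXID = "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b"
genesis_serialized = serialized_txid(GENESIS_TXID)
```

A BLOB column would hold one of these two byte sequences, and `hex(txid)` in SQLite would only match the familiar string if the display-order bytes were stored.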

Considering the scriptPubKey column individually, there is no good reason to use TEXT rather than BLOB, but I went for TEXT mostly for consistency reasons, to not mix TEXT and BLOB in different columns when it's both binary data.

That said, I'm also very open to using BLOB instead; it's just a matter of trade-offs.

ajtowns · 2023-07-27T14:29:31Z

Approach ACK. Seems like a fine idea to me.

What is the rationale for encoding as text rather than bytes? SQLite can store byte values as BLOBs.

It's a python conversion script: can't you just add a command-line option for the resulting db to have hex txids or big/little endian blobs if there's user demand for it? Hex encoding seems a fine default to me, for what it's worth.
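A command-line option along these lines could look roughly like the sketch below. To be clear, `--txid-format` and its choices are hypothetical names invented for this illustration; no such option is part of the script in this PR.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Sketch: convert a compact-serialized UTXO set to SQLite"
    )
    parser.add_argument("infile", help="file created by dumptxoutset")
    parser.add_argument("outfile", help="SQLite database to create")
    # Hypothetical option, invented for this example only.
    parser.add_argument(
        "--txid-format",
        choices=["hex", "blob-serialized", "blob-display"],
        default="hex",
        help="how to store txids (default: hex string in display order)",
    )
    return parser.parse_args(argv)


def encode_txid(serialized: bytes, fmt: str):
    """Convert the 32 serialized txid bytes into the value stored in the database."""
    if fmt == "hex":
        return serialized[::-1].hex()  # TEXT, as shown by RPCs and explorers
    if fmt == "blob-serialized":
        return serialized  # BLOB, same byte order as in the dump file
    if fmt == "blob-display":
        return serialized[::-1]  # BLOB, so that hex(txid) matches the display form
    raise ValueError(f"unknown txid format: {fmt}")
```

Keeping hex as the default would preserve the current behaviour, with the BLOB variants as opt-in.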

If people end up wanting lots of different options (convert scriptPubKeys to addresses? some way to update the db to a new state, rather than just create a new one?) maybe it would make sense for this script to have its own repo even; but while it stays simple/small, seems fine for contrib.

jamesob · 2023-07-27T14:33:54Z

Concept ACK, will test soon

dunxen · 2023-07-27T17:08:14Z

Concept ACK

willcl-ark

tACK 3ce180a

Left two nits which don't need addressing unless being re-touched, but overall this works well in testing and seems like a useful contrib script. Converting the output to json also worked as described in the comments above.

contrib/utxo-tools/utxo_to_sqlite.py

pablomartin4btc

Concept ACK

theStack · 2024-01-06T22:54:39Z

Could mark as draft while CI is red?

Sorry for the extra-late reply, missed this message and the CI fail. Rebased on master and resolved the silent merge conflict (caused by the module move test_framework.muhash -> test_framework.crypto.muhash in #28374). Also fixed the Windows CI issue by closing the sqlite connections properly with explicit con.close() calls (see e.g. #28204). CI is green now.
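For reference, a minimal sketch of the connection-closing pattern (not the actual test code): in Python, using a sqlite3 connection as a context manager only commits or rolls back the transaction and does not close the connection, so the close has to be explicit, for example via `contextlib.closing`. On Windows an open connection keeps the database file locked, which can make the later cleanup of the temporary directory fail.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def write_and_count(db_path):
    # closing() guarantees con.close() is called, even if a query raises.
    with closing(sqlite3.connect(db_path)) as con:
        con.execute("CREATE TABLE utxos(txid TEXT, vout INT)")
        con.execute("INSERT INTO utxos VALUES('00', 0)")
        con.commit()
    with closing(sqlite3.connect(db_path)) as con:
        return con.execute("SELECT COUNT(*) FROM utxos").fetchone()[0]


with tempfile.TemporaryDirectory() as tmpdir:
    # The directory can only be removed cleanly once all connections are closed.
    row_count = write_and_count(os.path.join(tmpdir, "utxos.sqlite"))
```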

What is the rationale for encoding as text rather than bytes? SQLite can store byte values as BLOBs.

It's a python conversion script: can't you just add a command-line option for the resulting db to have hex txids or big/little endian blobs if there's user demand for it? Hex encoding seems a fine default to me, for what it's worth.

Good idea, planning to tackle this as a follow-up.

…P2PK outputs

28287cf test: add script compression coverage for not-on-curve P2PK outputs (Sebastian Falbesoner)

Pull request description:

This PR adds unit test coverage for the script compression functions `{Compress,Decompress}Script` in the special case of uncompressed P2PK outputs (scriptPubKey: OP_PUSH65 <0x04 ....> OP_CHECKSIG) with [pubkeys that are not fully valid](https://github.com/bitcoin/bitcoin/blob/44b05bf3fef2468783dcebf651654fdd30717e7e/src/pubkey.cpp#L297-L302), i.e. where the encoded point is not on the secp256k1 curve. For those outputs, script compression is not possible, as the y coordinate of the pubkey can't be recovered (see also call-site of `IsToPubKey`): https://github.com/bitcoin/bitcoin/blob/44b05bf3fef2468783dcebf651654fdd30717e7e/src/compressor.cpp#L49-L50

Likewise, for a compressed script of an uncompressed P2PK script (i.e. compression ids 4 and 5) where the x coordinate is not on the curve, decompression fails: https://github.com/bitcoin/bitcoin/blob/44b05bf3fef2468783dcebf651654fdd30717e7e/src/compressor.cpp#L122-L129

Note that the term "compression" is used here in two different meanings (though they are related), which might be a little confusing. The encoding of a pubkey can either be compressed (33-bytes with 0x02/0x03 prefixes) or uncompressed (65-bytes with 0x04 prefix). On the other hand there is also compression for whole output scripts, which is used for storing scriptPubKeys in the UTXO set in a compact way (and also for the `dumptxoutset` result, accordingly). P2PK output scripts with uncompressed pubkeys get compressed by storing only the x-coordinate and the sign as a prefix (0x04 = even, 0x05 = odd).

Was diving deeper into the subject while working on #27432, where the script decompression of uncompressed P2PK needed special handling (see also #24628 (comment)).

Trivia: as of now (block 801066), there are 13 uncompressed P2PK outputs in the UTXO set with a pubkey not on the curve (which obviously means they are unspendable).

ACKs for top commit:
achow101: ACK 28287cf
tdb3: ACK for 28287cf.
cbergqvist: ACK 28287cf!
marcofleon: Nicely done, ACK 28287cf. Built the PR branch, ran the unit and functional tests, everything passed.

Tree-SHA512: 777b6c3065654fbfa1ce94926f4cadb91a9ca9dc4dd4af6008ad77bd1da5416f156ad0dfa880d26faab2e168bf9b27e0a068abc9a2be2534d82bee61ee055c65

fjahr · 2024-03-14T13:08:25Z

Unfortunately, this will need to be updated again once #29612 is in, so probably best to put it on hold until then.

theStack · 2024-03-14T13:11:34Z

Unfortunately, this will need to be updated again once #29612 is in, so probably best to put it on hold until then.

Good point, changed to draft state for now.

DrahtBot · 2024-03-31T22:33:05Z

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.

Leave a comment here, if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/23283557207

Sjors · 2024-04-03T14:48:01Z

Concept ACK

Sjors · 2024-04-03T14:50:20Z

contrib/utxo-tools/utxo_to_sqlite.py

+
+def decompress_script(f):
+    """Equivalent of `DecompressScript()` (see compressor module)."""
+    size = read_varint(f)  # sizes 0-5 encode compressed script types


TIL we compress certain standard scriptPubKey types.
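For readers unfamiliar with the encoding behind `read_varint` here, a rough sketch of Bitcoin Core's VARINT format (not to be confused with the CompactSize encoding used in P2P messages) follows. It is written from my understanding of the format and may differ in detail from the script's own helper; `write_varint` is added only to make the example round-trippable.

```python
import io


def read_varint(f):
    """Decode a Core-style VARINT: base-128, most significant group first,
    with 1 added per continuation byte so that every number has a unique encoding."""
    n = 0
    while True:
        dat = f.read(1)
        if not dat:
            raise EOFError("unexpected end of stream while reading varint")
        ch = dat[0]
        n = (n << 7) | (ch & 0x7F)
        if ch & 0x80:  # continuation bit set: more bytes follow
            n += 1
        else:
            return n


def write_varint(n):
    """Inverse of read_varint(), used here only for the round-trip check."""
    tmp = []
    while True:
        # The last byte of the encoding (first one generated) has no continuation bit.
        tmp.append((n & 0x7F) | (0x80 if tmp else 0x00))
        if n <= 0x7F:
            break
        n = (n >> 7) - 1
    return bytes(reversed(tmp))


decoded = [read_varint(io.BytesIO(write_varint(v))) for v in (0, 5, 127, 128, 255, 256, 2**32)]
```

In `decompress_script`, the value read this way doubles as a type tag: as the code comment notes, sizes 0-5 denote the special compressed script types, and anything larger encodes the length of a raw script.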

theStack · 2024-05-05T09:15:01Z

Rebased on #29612, supporting the latest format with enhanced metadata (magic bytes, version, network magic, block height, block hash, coins count).


DrahtBot added the Scripts and tools label Apr 6, 2023

theStack force-pushed the add-utxo_to_sqlite-conversion-tool branch 2 times, most recently from 494be8c to 3ce180a Compare April 6, 2023 20:05

DrahtBot mentioned this pull request Apr 7, 2023

test: autogenerate bash completion #25243

Closed

bitcoin deleted a comment from meuamigopedro Apr 12, 2023

This was referenced Apr 18, 2023

policy: Ephemeral anchors #26403

Closed

policy: nVersion=3 and Package RBF #25038

Closed

theStack mentioned this pull request Jul 31, 2023

test: add script compression coverage for not-on-curve P2PK outputs #28193

Merged

willcl-ark approved these changes Jul 31, 2023


achow101 requested a review from josibake September 20, 2023 17:21

pablomartin4btc reviewed Oct 13, 2023


DrahtBot requested review from jamesob and ajtowns October 13, 2023 02:42

DrahtBot added the CI failed label Oct 25, 2023

This was referenced Dec 2, 2023

Cluster size 2 package rbf #28984

Open

Ephemeral Anchors #29001

Draft

theStack force-pushed the add-utxo_to_sqlite-conversion-tool branch 2 times, most recently from 73b9cab to a1c1cf3 Compare January 6, 2024 21:36

DrahtBot removed the CI failed label Jan 6, 2024

DrahtBot added the CI failed label Jan 13, 2024

theStack force-pushed the add-utxo_to_sqlite-conversion-tool branch from a1c1cf3 to 9668255 Compare January 22, 2024 12:37

DrahtBot removed the CI failed label Jan 22, 2024

theStack marked this pull request as draft March 14, 2024 13:11

theStack force-pushed the add-utxo_to_sqlite-conversion-tool branch from 9668255 to 6e34600 Compare March 31, 2024 21:31

theStack mentioned this pull request Mar 31, 2024

rpc: Optimize serialization and enhance metadata of dumptxoutset output #29612

Merged

DrahtBot added the CI failed label Mar 31, 2024

theStack force-pushed the add-utxo_to_sqlite-conversion-tool branch from 6e34600 to 254633a Compare April 1, 2024 10:55

This was referenced Apr 1, 2024

scripted-diff: Use LogInfo/LogDebug over LogPrintf/LogPrint #29641

Draft

util: check for errors after close and read in AutoFile #29307

Open

Sjors reviewed Apr 3, 2024


This was referenced Apr 10, 2024

test: Validate UTXO snapshot with coin height > base height & amount > MAX_MONEY supply #29617

Merged

assumeutxo, rpc: Improve EOF error when reading snapshot metadata in loadtxoutset #28670

Closed

DrahtBot added the Needs rebase label May 2, 2024

theStack force-pushed the add-utxo_to_sqlite-conversion-tool branch from 254633a to 217bc3b Compare May 5, 2024 09:11

theStack force-pushed the add-utxo_to_sqlite-conversion-tool branch from 217bc3b to 15b0c48 Compare May 5, 2024 09:33

DrahtBot removed Needs rebase CI failed labels May 5, 2024

DrahtBot mentioned this pull request May 5, 2024

Testnet4 including PoW difficulty adjustment fix #29775

Open

theStack marked this pull request as ready for review May 5, 2024 14:19

DrahtBot added the Needs rebase label May 23, 2024

theStack added 2 commits May 24, 2024 00:30

contrib: add tool to convert compact-serialized UTXO set to SQLite database

8a44db1

test: add test for utxo-to-sqlite conversion script

0a89179

theStack force-pushed the add-utxo_to_sqlite-conversion-tool branch from 15b0c48 to 0a89179 Compare May 23, 2024 22:39

DrahtBot removed the Needs rebase label May 23, 2024
