Split out addresses table from tx_out #1333
Comments
> Considering the size of the …
I haven't found a case where this query takes more than a few ms
It also looks strange to me for a query with an indexed search to take so long.
The query was shared just as an example - as the actual ID might vary. The number of records returned for this stake_address is pretty high (~2.7M); you might wanna try the below:

```sql
select count(1) from tx_out
 where stake_address_id = (select id from stake_address
                            where view = 'stake1uxqh9rn76n8nynsnyvf4ulndjv0srcc8jtvumut3989cqmgjt49h6');
```

Of course - once the filtered result is part of the Postgres cache, the query time reduces to a few ms - but that'd be unpredictable (and unrealistic). The gains achieved from the absence of those fields will still be beneficial even if this was not an issue.
I understand the benefits in space and IOPs from splitting into two tables. However, I'm not convinced queries will be faster. Working on two tables requires joins, which can't be faster than working with a single table.
I'm not following that one. The …
Querying against address fields would be much faster purely due to the smaller index lookup against tx_id (the volume of duplicates is eliminated, as is the size of the lookup string). IMO this is exactly the same classification as stake_address: having a separate lookup table vs having multiple duplicated fields for the encoding. While the query performance increase will not be massive, it is certainly not negligible, specifically for looking up transactions by addresses (I can drill down into specific use cases if needed, but IMO the gains to dbsync resources even outside of performance are substantial).
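For illustration only - a sketch of how such a lookup changes under the proposal, where the `address` table and `addr_id` column are hypothetical names from this issue, not existing schema:

```sql
-- Current schema: the index covers a long, heavily duplicated varchar
select count(1) from tx_out
 where address = 'addr1...';  -- placeholder address

-- Proposed split (hypothetical schema): resolve the address once, then
-- hit a small integer-keyed index on tx_out
select count(1) from tx_out
 where addr_id = (select id from address
                   where address = 'addr1...');  -- placeholder address
```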
IMO - a UTxO does not represent an address, so address_has_script cannot be a permanent state, only a current state. If the intent is to showcase a UTxO, it could instead be better named (but I'm not too fussed about this one). Overall I'd reckon splitting addresses out to a separate table is a fairly unanimous decision from the data-savings PoV (and IOPs) alone, while the other points are smaller benefits.
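As a quick illustration of the permanent-vs-current distinction, a sketch of a query against the existing tx_out schema that checks whether the flag ever varies for the same address:

```sql
-- Illustrative sanity check: if this returns no rows, address_has_script
-- never varies per address, i.e. it behaves as a property of the address
select address
  from tx_out
 group by address
having count(distinct address_has_script) > 1;
```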
Sorry for the extensive comment, but this change will help me massively. Hopefully others too. I'm using …
The problem comes when you have to use …
Having to use …
I'm also doing a "view calculation" to distinguish between …
Example: count total addresses:
Example listing addresses (smart contracts):
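A minimal sketch of what such queries can look like on the current schema, assuming only the existing `tx_out.address` and `tx_out.address_has_script` columns (illustrative, not the exact queries used):

```sql
-- Counting distinct addresses means de-duplicating a long varchar
-- across every tx_out row
select count(distinct address) from tx_out;

-- Listing script addresses leans on the flag duplicated per output
select distinct address
  from tx_out
 where address_has_script = true;
```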
IMO separating into two tables will help reduce overall table sizes and index sizes, and improve versatility of queries, since (just as exposed in a simple example) we will not depend on … BTW, these examples are using …
This is something that would greatly help with some of the things we do.
I try to rely on pulling as much of my data directly from the db as possible. In the case of a query like this, where I want the bech32 address: while it's trivial to convert using python/js/cli, I like to make some of my larger queries "backendless", and looking up an address from a raw one (say, one in a datum) can be painfully slow. With this query a limit 1 could speed it up, but it's no good for querying multiple addrs at a time.
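A sketch of the kind of lookup described - recovering the bech32 rendering from raw address bytes via the duplicated tx_out columns (illustrative; the hex literal is a placeholder):

```sql
-- Illustrative: map raw address bytes (e.g. pulled from a datum) back to
-- their bech32 rendering using the duplicated columns in tx_out
select address
  from tx_out
 where address_raw = decode('0123abcd', 'hex')  -- placeholder bytes
 limit 1;
```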
Versions
The db-sync version (eg `cardano-db-sync --version`): 13.0.5
PostgreSQL version: 14.4

Build/Install Method
The method you use to build or install cardano-db-sync: N.A.

Run method
The method you used to run cardano-db-sync (eg Nix/Docker/systemd/none): systemd

Additional context
The background for this request comes from Koios, for an endpoint that allows users to query the addresses associated with a stake_address - which takes too long for stake addresses that have millions of transaction entries - but this applies to almost all query layers that use dbsync. When not part of the Postgres cache, simply running the query could take a while (at times 15-50s):
```
cexplorer=# select count(1) from tx_out where stake_address_id = 3350960;
  count
---------
 2648923
(1 row)

Time: 23476.114 ms (00:23.476)
```
Of course - once in cache, this would run in a few milliseconds.
However, the outcome being sought from the same query (a unique address list) was simply 12 addresses.
This made us think of starting to maintain a separate cache table for the stake_address-to-address mapping, but it makes little sense to only do this on the client side rather than in dbsync itself, as the advantages reach well beyond this one use case.
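For context, the result actually being sought is tiny compared to the rows scanned - an illustrative version of the query:

```sql
-- Illustrative: the dozen-or-so distinct addresses under one stake key,
-- extracted today by scanning millions of duplicated tx_out rows
select distinct address
  from tx_out
 where stake_address_id = 3350960;
```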
Problem Report
The request here is to move the address field to its own table, along with the duplicated information that currently resides in the tx_out table.
A new table addresses that might contain the columns below, which are currently in tx_out:

- address
- address_raw
- payment_cred
- address_has_script

Similarly, the above columns could be removed from tx_out and replaced with an addr_id column, which would link to the address.id field (a rough DDL sketch follows the advantages below).

Advantages:
Currently, the size of the tx_out table is ~40G and its indexes ~20G. A huge chunk of this is due to repetition of address information across multiple rows every time a transaction references an address. In turn, this also increases index size, which means a much larger footprint to process; replacing the duplicated fields with a single address_id saves on both fronts, as well as on lookups in the tx_out table. The average savings achieved might be ~25-30G atm (+ indexes).
Columns address, address_raw and payment_cred can all be derived from the address itself. While this may initially have been done for UX, the duplication of each of these long char fields per tx does more harm than good in the tx_out table. They can still be maintained in the separate address table.

address_has_script: My understanding here might be incorrect, but address_has_script seems like a permanent state rather than a transient state - thus, it does not seem to be the right candidate for the tx_out table.

For cases like the one mentioned in the Additional context section above, the query outcomes will be much faster for such endpoints.

A separate cache table was being considered to provide this information based on a trigger from the tx table (but that would not have been an optimal solution, given the above).
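A rough DDL sketch of the proposed split, using hypothetical names and types (the issue does not prescribe exact DDL):

```sql
-- Hypothetical sketch only; names, types and constraints are illustrative
CREATE TABLE addresses (
    id                 bigserial PRIMARY KEY,
    address            varchar NOT NULL,  -- bech32/base58 rendering
    address_raw        bytea   NOT NULL,  -- raw bytes, derivable from address
    payment_cred       bytea,             -- derivable from address
    address_has_script boolean NOT NULL   -- a property of the address itself
);

-- tx_out then carries a single small foreign key instead of the
-- duplicated varchar/bytea columns
ALTER TABLE tx_out
    ADD COLUMN addr_id bigint REFERENCES addresses (id);
```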