Blockfrost RYO specific improvements #1686

Open
1000101 opened this issue Apr 26, 2024 · 4 comments
Labels
bug Something isn't working

Comments

1000101 (Contributor) commented Apr 26, 2024

Hi, I would like to propose some additional improvements/additions to the db-sync project that would help improve the speed of RYO (Blockfrost's run-your-own backend) in the future.

A separate --blockfrost switch (or several, to allow for granularity) could be introduced, since some of this data is generally not needed and would take a considerably large amount of disk space, given that some of the improvements would require additional denormalized tables. Other improvements include pre-calculation of historical data, which is not live but takes a considerable amount of resources to calculate on the fly.


Running balances

A starting point is to introduce address and payment_cred tables containing each address just once (as opposed to tx_out), similar to the stake_address table, which could then be used in several ways.

e.g.:

CREATE TABLE address (
  id BIGSERIAL PRIMARY KEY,
  address TEXT NOT NULL UNIQUE  -- each address stored exactly once
);

CREATE TABLE payment_cred (
  id BIGSERIAL PRIMARY KEY,
  payment_cred TEXT NOT NULL UNIQUE
);
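
One way these tables could then be used (a sketch only; these columns are an assumption, not an existing or planned db-sync change) is to have tx_out reference the deduplicated rows by id instead of repeating the full address text on every output:

ALTER TABLE tx_out
  -- hypothetical columns replacing the repeated address text
  ADD COLUMN address_id BIGINT REFERENCES address (id),
  ADD COLUMN payment_cred_id BIGINT REFERENCES payment_cred (id);

Besides shrinking tx_out, this gives a stable id to key the per-address data proposed below.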
  • Introduce a "running" balance (i.e. the current state) for address/payment_cred and stake_address (the stake_address table would need an additional balance field). This would greatly improve the overall speed of fetching balances. There are issues with multi-assets (for address/payment_cred), though, as those could bloat some addresses, especially ones related to exchanges etc., so this would require further thought; calculating the ADA balance, however, is doable right off the bat. Having UTxOs available certainly helps, but I believe it would be even better to also keep the balances.
  • Ideally, have balance_in and balance_out fields as well, or use those instead of a single balance field (see the sketch after this list).
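
A minimal sketch of the balance_in/balance_out idea, keyed by the address table proposed above (the address_balance table and its columns are illustrative assumptions, not existing schema):

CREATE TABLE address_balance (
  address_id BIGINT PRIMARY KEY REFERENCES address (id),
  balance_in NUMERIC NOT NULL DEFAULT 0,  -- cumulative lovelace received
  balance_out NUMERIC NOT NULL DEFAULT 0  -- cumulative lovelace spent
);

-- Fetching the current ADA balance then becomes a single indexed lookup:
SELECT balance_in - balance_out AS balance
FROM address_balance
WHERE address_id = $1;

Keeping in/out separately also preserves the total-received/total-sent figures that a single balance column would lose.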

Live data

In addition, I would like to propose the following:

  • Introduce new fields to the pool_hash table, or create a new table, with live_stake, active_stake, live_delegators, and historical blocks_minted information, which is expensive to calculate on the fly.
  • Introduce an assets table, which would contain each asset with an id (appended/ordered by MIN(ma_tx_mint.id) to allow for pagination) and pre-calculated quantity and mint_or_burn_count.
  • Introduce a stats table containing the network supply (total, circulating, locked, treasury, reserves) and total stake stats (live and active); this info is needed, for example, to calculate pool saturation, live size, etc. A sketch of both tables follows this list.
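
A rough sketch of what these two tables could look like (all names and column choices here are illustrative assumptions; multi_asset is the existing db-sync table):

CREATE TABLE asset (
  id BIGSERIAL PRIMARY KEY,           -- assigned in MIN(ma_tx_mint.id) order for pagination
  multi_asset_id BIGINT NOT NULL REFERENCES multi_asset (id),
  quantity NUMERIC NOT NULL,          -- pre-calculated current supply
  mint_or_burn_count BIGINT NOT NULL  -- pre-calculated number of mint/burn events
);

-- A single-row table, updated as the chain advances:
CREATE TABLE stats (
  supply_total PUBLIC.LOVELACE NOT NULL,
  supply_circulating PUBLIC.LOVELACE NOT NULL,
  supply_locked PUBLIC.LOVELACE NOT NULL,
  supply_treasury PUBLIC.LOVELACE NOT NULL,
  supply_reserves PUBLIC.LOVELACE NOT NULL,
  stake_live PUBLIC.LOVELACE NOT NULL,
  stake_active PUBLIC.LOVELACE NOT NULL
);

Pool saturation, live size and similar figures could then be computed from a pool's stake and these network-level numbers without any aggregation at query time.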

Auxiliary data

Next, data which is not live, but is still expensive to calculate on the fly:

  • Introduce stake_amount (a SUM) into the epoch table (helpful in the calculation of the historical live_size of a pool; see the ALTER TABLE sketch after the example below).
  • Introduce a pool_history table, which would contain data grouped by epoch, e.g.:
CREATE TABLE pool_history (
  epoch_no BIGINT NOT NULL,
  pool_id BIGINT NOT NULL,
  blocks INTEGER NOT NULL,
  delegators INTEGER NOT NULL,
  active_size NUMERIC NOT NULL,
  active_stake PUBLIC.LOVELACE,
  rewards PUBLIC.LOVELACE NOT NULL,
  fees PUBLIC.LOVELACE NOT NULL,
  PRIMARY KEY (epoch_no, pool_id)
);
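
And a minimal sketch of the epoch-level stake sum from the first bullet (the stake_amount column is the proposal here, not an existing field; epoch.no is the existing epoch number column):

ALTER TABLE epoch
  ADD COLUMN stake_amount PUBLIC.LOVELACE;  -- SUM of all stake in the epoch

-- Historical pool size per epoch then becomes a cheap join:
SELECT ph.active_stake::NUMERIC / e.stake_amount AS size
FROM pool_history ph
JOIN epoch e ON e.no = ph.epoch_no
WHERE ph.pool_id = $1;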

Denormalized transactions

Due to how pagination works in RYO, the transaction history of address, payment_cred and asset will require denormalized tables in order to speed up lookups, e.g.:

CREATE TABLE transactions_address (
  tx_id BIGINT NOT NULL,
  -- one such table per entity: address_id, payment_cred_id or asset_id
  address_id BIGINT NOT NULL,
  hash TEXT NOT NULL,
  block_height WORD31TYPE,
  block_id BIGINT NOT NULL,
  tx_index WORD31TYPE NOT NULL,
  block_time INTEGER NOT NULL
);

with the requirement that tx_id be consecutive, i.e. without any gaps; otherwise the lookups won't be optimal. This means not relying on Postgres's serial, but maintaining the ids internally (or using row_number() OVER ... and always selecting the previous MAX), as sketched below.
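
A minimal sketch of that append step (new_rows is an assumed staging set holding one batch of freshly synced rows): row_number() continued from the previous MAX stays dense, whereas a serial leaves gaps whenever a transaction aborts.

INSERT INTO transactions_address
  (tx_id, address_id, hash, block_height, block_id, tx_index, block_time)
SELECT COALESCE((SELECT MAX(tx_id) FROM transactions_address), 0)
         + row_number() OVER (ORDER BY n.block_id, n.tx_index),
       n.address_id, n.hash, n.block_height, n.block_id, n.tx_index, n.block_time
FROM new_rows n;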


Practically all of the aforementioned improvements have been tested one way or another (in the form of triggers or matviews), so in case further SQL implementations/examples are needed, I am more than happy to share them with the team.

Note: this is a first attempt to address RYO improvements directly in db-sync, so it will definitely get amended and/or edited.

THANKS!

1000101 added the bug label on Apr 26, 2024
kderme (Contributor) commented May 1, 2024

Some initial thoughts on this:

address and payment_cred tables

We've done some work on this. It slowed down syncing speed, so we decided to park it. It could become an optional feature, though, or the default if we neutralise its effect with separate performance improvements.

Introduce "running" balance
Live data

A challenge with these cases is that each tx triggers an update instead of an insertion. When there are updates, handling rollbacks is never easy.

Auxiliary data

DBSync should be able to extract this from the epoch boundary ledger state.

Denormalized transactions

I don't really understand this. Would this have an entry for each tx and address?

rdlrt commented May 1, 2024

separate --blockfrost switch

Wut? 😄 I hope that if this is prioritized, the nomenclature continues to be based on utility, not companies. There are multiple API providers, as well as projects that have requested parts of this over months (besides other feature sets) 🙂. Sorry for nitpicking this, but I just want to point it out now that it is part of the IO umbrella.

It slowed down syncing speed, so we decided to park it

@kderme - Given the recent config additions, this could maybe be made a flag. As discussed in #1333, the advantage from the query POV (plus resource consumption, particularly storage and IOPS) is abundantly higher than the cost of the initial one-time sync.

Cmdv (Contributor) commented May 1, 2024

Wut? 😄 I hope that if this is prioritized, the nomenclature continues to be based on utility, not companies. There are multiple API providers, as well as projects that have requested parts of this over months (besides other feature sets) 🙂. Sorry for nitpicking this, but I just want to point it out now that it is part of the IO umbrella.

Fear not, our goal is to improve usability/functionality/performance for the whole community who use cardano-db-sync; no favouritism 😅

In all honesty, we really want to hear back from the community about their real-life use cases so we can better steer what is and isn't needed, and so that we as a team can better explain the constraints we always have to balance when trying to provide new features, fix existing issues, keep up with new standards, etc.

We have started a public Matrix channel (which we'll publicly advertise ASAP); we hope it will be a great place for informal discussions/questions and will open the communication channels up more. Then, if things of importance arise on Matrix, we can move them to GitHub issues with agreed specs and such. 😄

1000101 (Contributor, Author) commented May 20, 2024

Some initial thoughts on this:

address and payment_cred tables

We've done some work on this. It slowed down syncing speed, so we decided to park it. It could become an optional feature, though, or the default if we neutralise its effect with separate performance improvements.

Optional sounds reasonable.

Introduce "running" balance
Live data

A challenge with these cases is that each tx triggers an update instead of an insertion. When there are updates, handling rollbacks is never easy.

Yes, I understand that. If it helps, I believe I have a working and battle-tested trigger solution (inserts/upserts/rollbacks, ...), although just for ADA balances. The rough shape of the idea is sketched below.
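
Not the actual solution, but a minimal sketch of the shape it takes, assuming a simple address_balance table keyed by the tx_out.address text. Since db-sync rollbacks delete the affected rows, DELETE triggers unwind the balances; the deletion order of tx_in vs. tx_out rows during a rollback would need care in a real implementation.

CREATE TABLE IF NOT EXISTS address_balance (
  address TEXT PRIMARY KEY,
  balance NUMERIC NOT NULL DEFAULT 0
);

-- Outputs: add the value on insert, take it back on rollback (delete).
CREATE OR REPLACE FUNCTION track_tx_out () RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    INSERT INTO address_balance (address, balance)
    VALUES (NEW.address, NEW.value)
    ON CONFLICT (address) DO UPDATE
      SET balance = address_balance.balance + EXCLUDED.balance;
  ELSE
    UPDATE address_balance SET balance = balance - OLD.value
    WHERE address = OLD.address;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Inputs: spending an output subtracts its value from the owning address;
-- deleting the input (rollback) adds it back.
CREATE OR REPLACE FUNCTION track_tx_in () RETURNS TRIGGER AS $$
DECLARE
  spent RECORD;
BEGIN
  IF TG_OP = 'INSERT' THEN
    SELECT address, value INTO spent FROM tx_out
    WHERE tx_id = NEW.tx_out_id AND index = NEW.tx_out_index;
    IF FOUND THEN
      UPDATE address_balance SET balance = balance - spent.value
      WHERE address = spent.address;
    END IF;
  ELSE
    SELECT address, value INTO spent FROM tx_out
    WHERE tx_id = OLD.tx_out_id AND index = OLD.tx_out_index;
    IF FOUND THEN
      UPDATE address_balance SET balance = balance + spent.value
      WHERE address = spent.address;
    END IF;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tx_out_balance AFTER INSERT OR DELETE ON tx_out
  FOR EACH ROW EXECUTE FUNCTION track_tx_out ();
CREATE TRIGGER tx_in_balance AFTER INSERT OR DELETE ON tx_in
  FOR EACH ROW EXECUTE FUNCTION track_tx_in ();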

Auxiliary data

DBSync should be able to extract this from the epoch boundary ledger state.

Great to hear that!

Denormalized transactions

I don't really understand this. Would this have an entry for each tx and address?

Basically, yes. The way we currently retrieve txs is per address. Since some addresses can be very large, and we are using offset pagination instead of cursors (for multiple reasons), the data can be very sparse. We need it dense, and we ideally want to know exactly where each page lands, so we can retrieve the data with constant complexity. See the sketch below.
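
To illustrate: assuming a dense, gapless per-address sequence column (here called seq_no, an illustrative addition to the transactions_address sketch above), page $2 of size $3 maps directly onto an index range, so the cost no longer grows with the offset:

-- page $2 (1-based) of size $3 for one address: a pure index-range scan,
-- unlike OFFSET, which must walk and discard all preceding rows
SELECT *
FROM transactions_address
WHERE address_id = $1
  AND seq_no BETWEEN ($2 - 1) * $3 + 1 AND $2 * $3
ORDER BY seq_no;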
