Skip to content

Substrate ETL represents Polkadot + Kusama dataset in Google BigQuery. All finalized data is published in a public dataset in BigQuery daily and Dune on an hourly basis.

Notifications You must be signed in to change notification settings

colorfulnotion/substrate-etl

Repository files navigation

Substrate ETL

Using Substrate ETL, users can query Polkadot and Kusama networks for large scale analysis data of blocks, extrinsics, calls, events, balances, and transfers. Substrate ETL relies on Colorful Notion's Polkaholic.io indexing of Polkadot + Kusama networks into public Google BigQuery datasets (one dataset for each relay chain).

As of September 2023, substrate-etl datasets are now in Google's BigQuery Public Datasets: bigquery-public-data.crypto_polkadot and bigquery-public-data.crypto_kusama! See announcements from Google Cloud Web3 + Parity

NEW: As part of OpenGov #366, Colorful Notion is integrating Polkadot data (relay chain, system chains and all known parachains as of January 2024) with Dune. If you are interested in long-term support for your parachain, please email michael@colorfulnotion.com.

Quick Start: Analytics Hub

You can add these datasets to your Google Cloud project with the following Analytics Hub links:

When you add your project, you will get a Google Cloud assigned project ID (e.g. bumbleholt_14234) and will be able to get linked datasets to the crypto_polkadot and crypto_kusama datasets -- use this ID in place of the bigquery-public-data below.

Quick Start: BigQuery Public Datasets

bigquery-public-data is a public project within BigQuery which anyone can access. You see hundreds of BigQuery tables in the public crypto_polkadot and crypto_kusama datasets. Just open the query editor and try some of the queries below:

See all available tables/views via AAA_tableschema (Schema)

select * from `bigquery-public-data.crypto_polkadot.AAA_tableschema`

See all available tables/views of paraid 2000

select * from `bigquery-public-data.crypto_polkadot.AAA_tableschema` where table_id like '%2000'

Get blocks of paraid 2000 (Schema):

select * from `bigquery-public-data.crypto_polkadot.blocks2000` where DATE(block_time) >= "2024-03-01" and DATE(block_time) <= "2024-03-31"

Get extrinsics of paraid 2000 (Schema):

select * from `bigquery-public-data.crypto_polkadot.extrinsics2000` where DATE(block_time) >= "2024-03-01" and DATE(block_time) <= "2024-03-31"

Get calls of paraid 2000 (Schema):

select * from `bigquery-public-data.crypto_polkadot.calls2000` where DATE(block_time) >= "2024-03-01" and DATE(block_time) <= "2024-03-31"

Schemas for several of the most common source tables are listed below -- a full list of schemas can be found here with further details below.

Public Datasets in BigQuery

Substrate data for each chain is held in 9 tables in one of 2 public datasets, with one dataset for each relay chain and all its parachains. By convention, relaychain data is considered "paraid=0".

Project: (Location: US)

  • bigquery-public-data

Datasets:

  • crypto_polkadot
  • crypto_kusama

Tables: (replace {paraID} with a specific para ID, e.g. 2000 for acala)

  • Blocks: bigquery-public-data.crypto_${relayChain}.blocks${paraID} (date-partitioned by block_time) - Schema
  • Extrinsics: bigquery-public-data.crypto_${relayChain}.extrinsics${paraID} (date-partitioned by block_time) - Schema
  • Calls: bigquery-public-data.crypto_${relayChain}.calls${paraID} (date-partitioned by block_time) - Schema
  • Events: bigquery-public-data.crypto_${relayChain}.events${paraID} (date-partitioned by block_time) - Schema
  • Transfers: bigquery-public-data.crypto_${relayChain}.transfers${paraID} (date-partitioned by block_time) - Schema
  • Balances: bigquery-public-data.crypto_${relayChain}.balances${paraID} (date-partitioned by ts) - Schema

Thus polkadot relay chain blocks are held in bigquery-public-data.crypto_polkadot.blocks0, acala blocks are stored in bigquery-public-data.crypto_polkadot.blocks2000, and similarly for any chain / table name.

See Definitions for how the tables are constructed and tentative definitions.

Every chain has a auto generated README with the chains tables explicitly enumerated, and includes sample queries.

Notes:

  • All tables are date-partitioned to support low cost, high speed scans.
  • If a parachain has a renewal, the first paraid assigned is used for subsequent renewals.

AAA_tableschema.json

See:

Field BigQuery Type
table_id STRING
time_partitioning_field STRING
table_cols STRING REPEATED
table_schema INTEGER

For a quick overview of the available tables/views within the dataset, please query AAA_tableschema view

blocks.json

See:

Field Type BigQuery Type
hash hex_string STRING
parent_hash hex_string STRING
number bigint INTEGER
state_root hex_string STRING
extrinsics_root hex_string STRING
block_time bigint TIMESTAMP
author_ss58 string STRING
author_pub_key string STRING
spec_version bigint INTEGER
relay_block_number bigint INTEGER
relay_state_root hex_string STRING
extrinsic_count bigint INTEGER
event_count bigint INTEGER
transfer_count bigint INTEGER

extrinsics.json

See:

Field Type BigQuery Type
hash hex_string STRING
extrinsic_id string STRING
block_time bigint TIMESTAMP
block_number bigint INTEGER
block_hash hex_string STRING
lifetime JSON JSON
section string STRING
method string STRING
params JSON JSON
fee bigint INTEGER
weight bigint INTEGER
signed boolean BOOLEAN
signer_ss58 string STRING
signer_pub_key hex_string STRING

events.json

Field Type BigQuery Type
event_id string STRING
section string STRING
method string STRING
data JSON JSON
extrinsic_id string STRING
extrinsic_hash hex_string STRING
block_time bigint TIMESTAMP
block_number bigint INTEGER
block_hash hex_string STRING

transfers.json

Field Type BigQuery Type
event_id string STRING
section string STRING
method string STRING
data JSON JSON
extrinsic_id string STRING
extrinsic_hash hex_string STRING
block_time bigint TIMESTAMP
block_number bigint INTEGER
block_hash hex_string STRING
from_ss58 string STRING
to_ss58 string STRING
from_pub_key string STRING
to_pub_key string STRING
asset string STRING
price_usd float FLOAT64
amount_usd float FLOAT64
symbol string STRING
decimals int8 INTEGER
amount float FLOAT64
raw_amount bigint INTEGER

balances.json

Field Type BigQuery Type
symbol string STRING
address_ss58 string STRING
address_pubkey string STRING
ts bigint TIMESTAMP
id string STRING
chain_name string STRING
para_id int INTEGER
free bigint FLOAT
free_usd float FLOAT
reserved float FLOAT
reserved_usd float FLOAT
misc_frozen float FLOAT
misc_frozen_usd float FLOAT
frozen float FLOAT
frozen_usd float FLOAT

Reporting / Missing data

Generally the data is complete as can be, but the sole data source is the Polkaholic.io indexer.

From this single source, the primary causes of missing data stem from:

  • Chains that do not provide a public RPC node. Most of the time, these chains are new with very little actual activity.
  • Chains that have a public RPC node, but no RPC Endpoint is an archive nodes
  • Chains that are being onboarded
  • Some blocks fail decoding due to chain halting, or are missing an up-to-date node.js API package for type definitions.

A daily/hourly github actions process summarizes the state of the index for:

and for every single chain that is being indexed. See the report Issues column for chains with systemic issues or blocks that are missing. Generally the last 24 hrs have blocks that are missing that are filled in by the end of the day.

Included in each summary are sample queries and a complete breakdown. Chain data is appended daily.

Design choices:

  • All temporal BigQuery datasets are date-partitioned and split into multiple tables by {paraId} to enable low-cost low-latency BigQuery scans for specific date, parachain combinations. Timestamped data use BigQuery TIMESTAMP date types.
  • Addresses are provided in “public key” (signer_pub_key) and SS58 Address (signer_ss58) form to support multi-chain queries with wild card table selection egselect * from polkadot.extrinsics* where signer_pub_key='<pubkey>' aggregates multi-chain transactional history for a given account.
  • When assets are mentioned (eg transfers), we "decimalize" the output and include basic USD price valuation if possible. Many assets are not valued with USD values in this way.

Roadmap

Spring/Summer 2023

  • Initial table designs {blocks, extrinsics, events, trasnfers, balances, xcmTransfers}
  • Daily/hourly dump via Github workflow
  • Hourly summary report for all reachable parachains
  • GKE systematization, Reliability Improvements
  • Integration with XCM Global-Asset Registry repo

Fall/Winter 2023

  • Basic WASM contract support (psp22, events, bytecode) using ChainIDE integration (ink!ubator)
  • Deep Account Analytics: Staking + Democracy + Relay Chain Trace Support (supported by Web3F)
  • Dune PoC

Winter 2024

  • Dune Integration of Relay Chain

Spring 2024

  • Dune Integration of All Polkadot Parachains
  • EVM Support within Dune

Your feedback and your ideas are important -- please submit an issue or reach out to us on Telegram (@sourabhniyogi) or Matrix.

Contributions

Contributions are welcome. Contributors will be invited to a dedicated Telegram group and are held to the Polkadot communities' Code of Conduct.

About

Substrate ETL represents Polkadot + Kusama dataset in Google BigQuery. All finalized data is published in a public dataset in BigQuery daily and Dune on an hourly basis.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published