
Streaming smart contract events and topics (ABI processing in python instead of BQ) #50

Open
medvedev1088 opened this issue Feb 4, 2020 · 8 comments
Labels: enhancement (New feature or request)

medvedev1088 (Member) commented Feb 4, 2020

Right now we have ~500 events that we parse, and every day ~500 MB of log data is generated. Since each event's parsing query scans the full day's logs, this adds up to 250 GB parsed in BigQuery daily, ~7.5 TB per month, which totals ~$37 per month.

With 1500 events we'll spend ~$100 per month.

An alternative to parsing logs in BigQuery is to export a JSON file to GCS, download it locally in Airflow, filter all events in a dataset at once, then load the results to BigQuery (free). There is a PoC for how to parse logs in Python here: blockchain-etl/ethereum-etl@6710e6b.
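
For reference, a minimal sketch of what ABI-based log decoding in Python could look like. This is not the linked PoC; it assumes the eth-abi and eth-utils packages, and the `decode_log` helper and its inputs are illustrative:

```python
# Minimal sketch: match one eth_getLogs entry against one ABI event and decode it.
# Assumes eth-abi v2.x (decode_abi; newer releases expose eth_abi.decode instead)
# and eth-utils. Indexed dynamic types (string/bytes) are stored as hashes in the
# topics, so this simplified version only handles static indexed types.
from eth_abi import decode_abi
from eth_utils import decode_hex, event_abi_to_log_topic


def decode_log(event_abi, log):
    # topics[0] is keccak("EventName(type1,type2,...)") and identifies the event
    if decode_hex(log["topics"][0]) != event_abi_to_log_topic(event_abi):
        return None

    indexed = [p for p in event_abi["inputs"] if p["indexed"]]
    non_indexed = [p for p in event_abi["inputs"] if not p["indexed"]]

    args = {}
    # indexed parameters live in topics[1:], one 32-byte word each
    for param, topic in zip(indexed, log["topics"][1:]):
        args[param["name"]] = decode_abi([param["type"]], decode_hex(topic))[0]

    # non-indexed parameters are ABI-encoded together in the data field
    values = decode_abi([p["type"] for p in non_indexed], decode_hex(log["data"]))
    args.update(zip((p["name"] for p in non_indexed), values))

    return {"event": event_abi["name"], "args": args}
```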

medvedev1088 added the enhancement (New feature or request) label on Feb 4, 2020
medvedev1088 (Member Author) commented:

Another optimisation idea is to separate parsing into 2 steps:

  1. select * from logs where contract_address in (<all_contract_addresses_from_table_definitions>) into filtered_logs.
  2. parse each table from filtered_logs (a rough sketch follows below).
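
A rough sketch of what step 1 could look like with the BigQuery Python client; the project, dataset and table-definition shape below are placeholders, not the project's actual DAG code:

```python
# Step 1 sketch: materialize filtered_logs once, so the per-event parsing queries
# in step 2 scan a much smaller table. Assumes google-cloud-bigquery; names are
# placeholders.
from google.cloud import bigquery

client = bigquery.Client()

def build_filtered_logs(table_definitions,
                        source_table="bigquery-public-data.crypto_ethereum.logs",
                        destination="my_project.ethereum_parsed.filtered_logs"):
    # collect every contract address referenced by the table definitions
    addresses = sorted({d["contract_address"].lower() for d in table_definitions})
    address_list = ", ".join("'{}'".format(a) for a in addresses)

    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string(destination),
        write_disposition="WRITE_TRUNCATE",
    )
    sql = "SELECT * FROM `{}` WHERE address IN ({})".format(source_table, address_list)
    client.query(sql, job_config=job_config).result()
    # Step 2 then runs each per-event parsing query against `destination`
    # instead of the full logs table.
```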

allenday (Member) commented:

We are going to encounter this on other chains that support smart contracts in general, and on multiple networks that support EVM-compatible contracts specifically. Here are a few examples:

I propose that we abstract this pattern above Ethereum.

allenday (Member) commented:

Related issues, with emphasis on stream processing.
blockchain-etl/ethereum-etl#216
blockchain-etl/ethereum-etl#215

medvedev1088 changed the title from "Parse logs in Python instead of BigQuery to optimized costs" to "Parse logs in Python instead of BigQuery to optimize costs" on May 18, 2020
allenday (Member) commented:

Another related issue for indexing: #28

Here the proposal is to cluster/partition the results in BQ to reduce IO.

I suggest we take this a step further to generally support all contracts for which we have parsing capability (i.e. those for which we have an ABI).

Specifically, at transaction processing time in the stream:

  1. detect contracts with a matching ABI using a registry (implementation TBD)
  2. process the contract method call and gather emitted topics and events.
  3. emit events to a sharded and denormalized set of topics (a routing sketch follows after this list):
    • first byte of the contract address (to support streaming contract monitoring), e.g. contract_events_0x00
    • first byte of the event topic (to support streaming topic monitoring, such as for templated contracts), e.g. topic_events_0x00
  4. emit events to an all_events topic with an additional, possibly composite, key composed of the contract byte and/or event byte.
  5. streaming insert into a new events table using all_events. This table contains a JSON blob field that is not further optimized for storage, but for which we can leverage BQ native JSON parsing functions to select data of interest with JSONPath (a query sketch follows at the end of this comment), see:
    How to parse JSON in Standard SQL BigQuery?
  6. create a new JS file that can be included to do dynamic data marshaling from events as needed.
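
One way the sharded publish in steps 3–4 could look, assuming Cloud Pub/Sub with pre-created topics; the project name, topic names and payload shape are illustrative, not a spec:

```python
# Sketch of steps 3-4: route one decoded event to the per-shard topics and to
# all_events. Assumes google-cloud-pubsub and that the topics already exist;
# PROJECT and the payload are placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
PROJECT = "my-project"  # placeholder

def publish_event(contract_address, topic0, event_payload):
    contract_shard = contract_address[2:4].lower()  # first byte of the contract address
    topic_shard = topic0[2:4].lower()               # first byte of the event topic hash
    data = json.dumps(event_payload).encode("utf-8")

    # step 3: denormalized, sharded topics for contract and topic monitoring
    publisher.publish(
        publisher.topic_path(PROJECT, "contract_events_0x" + contract_shard), data)
    publisher.publish(
        publisher.topic_path(PROJECT, "topic_events_0x" + topic_shard), data)

    # step 4: single all_events topic, with the composite key carried as an attribute
    publisher.publish(
        publisher.topic_path(PROJECT, "all_events"), data,
        shard_key="0x{}_0x{}".format(contract_shard, topic_shard))
```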

The overall effect of this design:

  • make streams available for events and contracts, with a common schema
  • introduce an additional overhead of 1 byte for sharding in pub/sub and BQ access; the tradeoff is reduced I/O.
  • introduce additional overhead of JSON data marshaling for clients of pub/sub and BQ; the tradeoff is reduced cognitive load and fewer data-discovery problems compared to a large clutter of tiny, specialized tables.
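
And a hedged example of the step-5 querying pattern against such an events table, using BigQuery's native JSON functions with JSONPath; the table name, the event_json column and the event chosen are illustrative:

```python
# Sketch of querying the generic events table from step 5. The table, column and
# event names are placeholders; the point is that one JSON blob column can be
# unpacked ad hoc with JSON_EXTRACT_SCALAR instead of keeping one table per event.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT
  block_timestamp,
  contract_address,
  JSON_EXTRACT_SCALAR(event_json, '$.args.from')  AS from_address,
  JSON_EXTRACT_SCALAR(event_json, '$.args.value') AS value
FROM `my_project.crypto_ethereum.events`
WHERE event_name = 'Transfer'
  AND DATE(block_timestamp) = '2020-05-01'
"""
for row in client.query(sql).result():
    print(dict(row))
```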

medvedev1088 (Member Author) commented May 21, 2020

Regarding this:

streaming insert into a new events table using all_events. This table contains a JSON blob field that is not further optimized for storage, but for which we can leverage BQ native JSON parsing functions to select data of interest with JSONPath, see:
How to parse JSON in Standard SQL BigQuery?

we need to think about how we process historical data for newly added ABIs, so that this data appears in the events table in BQ, e.g. have a separate process that reprocesses past partitions whenever a new ABI is added to the ABI repository.
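
A possible shape for that reprocessing step, purely as a sketch; the `decode_partition` and `load_to_bq` hooks are hypothetical stand-ins for the existing export/parse/load tasks:

```python
# Hypothetical backfill: when a new ABI is added to the registry, replay historical
# daily partitions through the same Python decoder so past events reach the BQ
# events table. decode_partition/load_to_bq are illustrative hooks, not real APIs.
from datetime import date, timedelta

def backfill_new_abi(abi, decode_partition, load_to_bq, start=date(2015, 7, 30)):
    day = start
    while day < date.today():
        rows = decode_partition(abi, day)   # decode that day's logs with the new ABI only
        if rows:
            load_to_bq(rows, day)           # append into the matching events partition
        day += timedelta(days=1)
```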

allenday (Member) commented May 21, 2020

@sinahab @epheph @jieyilong FYI we are designing a generic solution for ABI event generation in this issue and also blockchain-etl/ethereum-etl#216

allenday changed the title from "Parse logs in Python instead of BigQuery to optimize costs" to "Streaming smart contract events and topics (ABI processing in python instead of BQ)" on May 21, 2020
jieyilong (Member) commented May 21, 2020 via email

franckc commented May 27, 2020

Disclosure: I'm very new to this project... so pardon me if I missed or misunderstood things :)

Here are my thoughts, in no particular order, on the proposal by @allenday above. I focused more heavily on the use case I'm trying to solve for Origin Protocol since that's the one I'm the most familiar with.

  1. Overall, I like the idea of stream processing for blockchain events.
  • The main advantage I see is that it would allow events to be processed with less delay than batch processing. This could be advantageous not only for analytics use cases but also for external systems that rely on blockchain events as "triggers" to perform actions. For the Origin Protocol use case, specific events emitted by our smart contract act as triggers for several different centralized external systems (ex: search indexing, notification, analytics, ...). And freshness matters.
  • One small drawback: it's not a problem for the Origin Protocol use case (and I suspect not for most other use cases), but it's probably worth mentioning that ordering is not guaranteed when using pub/sub, so events may get processed out of order.
  2. A backfill solution for processing legacy events would definitely be required. A simple one-off batch job (using ethereum-etl?) that enqueues legacy events into the relevant pub/sub topics would be sufficient, I think. For Origin Protocol's use case, the data processing is idempotent, so it does not matter if the same event gets enqueued more than once (by the backfill and by the live system).
  3. I don't fully understand the need for having both an all_events queue and a sharded queue. From a consumer perspective, even if the sharded queue has topics sharded by the first byte of the contract address + the first byte of the event topic, the consumer will still need to apply extra filtering on the messages it receives to determine whether it should process them.
  4. A registry of ABIs makes sense. Origin Protocol would be happy to include the ABI for its marketplace smart contract (it's under the MIT license). Perhaps to start with, it could simply be a new GitHub repository under the blockchain-etl project? Then developers/companies interested in the blockchain-etl project could add their ABIs to it.
  5. Regarding the events table: I am a BigQuery newbie, so I'd be curious about the performance (e.g. query latency) of using JavaScript in BigQuery to demarshal the data stored as JSON. Though I guess if that's a concern, people could always derive their own BigQuery tables with a structured schema (a sketch of such a derived view follows below). And I do agree that this is a minor concern compared to the advantage of having a single table that includes all the events, versus hundreds of smaller specialized tables populated by the parsing stage of the ethereum-etl-airflow project. My argument is that it's more cumbersome to write queries against lots of small tables. For example, the Origin Protocol contract emits 16 different events and we have 2 versions of our contracts, which means 16×2 = 32 different tables to join when analyzing all the events.
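
On point 5, a hedged sketch of what such a derived, typed table could look like; all names, the event and the contract address are placeholders, not Origin's actual schema:

```python
# Illustrative only: a per-event typed view over the generic JSON events table, so
# analysts query columns instead of JSONPath. All names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE OR REPLACE VIEW `my_project.origin.listing_created` AS
SELECT
  block_timestamp,
  transaction_hash,
  JSON_EXTRACT_SCALAR(event_json, '$.args.party') AS party,
  CAST(JSON_EXTRACT_SCALAR(event_json, '$.args.listingID') AS INT64) AS listing_id
FROM `my_project.crypto_ethereum.events`
WHERE contract_address = '0x0000000000000000000000000000000000000000'  -- placeholder
  AND event_name = 'ListingCreated'
""").result()
```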
