[Lake][DuckDB] ETL - Implement Update Queries #1001

Open · 10 tasks
idiom-bytes opened this issue May 6, 2024 · 3 comments
Labels: Type: Enhancement (New feature or request)

@idiom-bytes (Member) commented May 6, 2024

Motivation

We have now verified that the basic lake functionality is working as expected.

We now want to verify the data quality and completeness.

This means running additional SQL queries, so that more tables are processed and richer data is generated:

  • more SQL queries are run in the ETL step
  • the slot and other tables are processed
  • tables like pdr_payouts cause bronze_predictions to be updated
  • null entries inside bronze_predictions are eventually filled in

Update Step - Incrementally updating the Lake

When you run the "lake update" command, the later SQL queries are responsible for updating the lake with the most recent information.

  1. When the lake updates, new records have arrived that need to be processed.
  2. These new records (such as pdr_payout), where applicable, should (a) be cleaned up into their raw/bronze table, and (b) update other tables to reflect the arrival of this event.
  3. After all records have been yielded to temp tables and the pipeline ends, the records should then be available on the live/production tables.

[Screenshot from 2024-05-06 13-27-25]

Data Workflows
All data workflows should operate in the same way.

  1. All data that needs to be written out is first written into a temp table.
  2. As temp tables are created with new data, views are made available so that downstream queries can access both old and new data from a single query.
  3. Once all the processes have completed and the data is written out to temp tables, we do a final merge/update of rows into the final/live/production tables (see the sketch below).
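
As a rough sketch of this pattern in DuckDB (table names like new_prediction_events, pdr_predictions, and the view name are illustrative assumptions, not the actual lake schema):

```sql
-- 1. Write incoming data into a temp table (names are illustrative).
CREATE TEMP TABLE _temp_pdr_predictions AS
SELECT * FROM new_prediction_events;

-- 2. Expose old + new data behind a single view so downstream queries
--    can read both from one place.
CREATE OR REPLACE TEMP VIEW pdr_predictions_view AS
SELECT * FROM pdr_predictions
UNION ALL
SELECT * FROM _temp_pdr_predictions;

-- 3. Once every process has completed, merge the temp rows into the
--    live/production table.
INSERT INTO pdr_predictions
SELECT * FROM _temp_pdr_predictions;
```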

DoD:

  • Tables like truevals and payouts are being processed
  • bronze_predictions is updated as a result of truevals and payouts being processed
  • Other tables and bronze tables are not yet processed

Task:

  • Process new pdr-payouts into duckdb
  • Process new pdr-truevals into duckdb
  • Verify the incremental update step works #982
  • Create SQL that processes new pdr-payouts into updates to bronze-predictions
  • Create SQL that processes new pdr-truevals into updates to bronze-predictions
  • Verify that null records inside bronze-predictions are being updated correctly
  • Verify everything is working e2e
idiom-bytes added the "Type: Enhancement (New feature or request)" label on May 6, 2024
idiom-bytes changed the title from "[Component name] Benefit_yyy, via building_xxx" to "[Lake][DuckDB] Verify Incremental Update" on May 6, 2024
idiom-bytes changed the title from "[Lake][DuckDB] Verify Incremental Update" to "[Lake][DuckDB] ETL - Implement Update Queries" on May 22, 2024
@idiom-bytes (Member Author) commented May 27, 2024

To implement this ticket, we should start by simply updating predictions when truevals and payouts show up.

[How this ticket grows]

In the future...

  • Basically, 1 event -> multiple table inserts & updates.
  • We need to consider what each event does across all tables.
  • We need to make sure that each event leads to all new records being generated and all existing records being updated.

subscription event
-> new subscription record

slot event
-> new slot record

prediction event
-> new bronze prediction record
-> update bronze slot record

trueval event
-> new trueval record
-> update N bronze prediction records
-> update 1 bronze slot record

payout event
-> new payout record
-> update N bronze prediction records
-> update 1 bronze slot record
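
For example, the logical effect of a single trueval event could look roughly like this in DuckDB. This is only a sketch: the table and column names (pdr_truevals, bronze_pdr_predictions, bronze_pdr_slots, slot_id, etc.) are assumptions for illustration, and the approach discussed later in this thread batches these writes via staging tables rather than applying them one event at a time.

```sql
-- A new trueval record lands in its raw table (illustrative schema).
INSERT INTO pdr_truevals (ID, slot_id, trueval, "timestamp")
VALUES ('0xcontract-1696000000', '0xcontract-1696000000', TRUE, 1696000100);

-- The trueval fans out to the N bronze prediction rows for that slot...
UPDATE bronze_pdr_predictions
SET trueval = t.trueval
FROM pdr_truevals AS t
WHERE bronze_pdr_predictions.slot_id = t.slot_id
  AND bronze_pdr_predictions.trueval IS NULL;

-- ...and to the single bronze slot row for that slot.
UPDATE bronze_pdr_slots
SET trueval = t.trueval
FROM pdr_truevals AS t
WHERE bronze_pdr_slots.ID = t.slot_id;
```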

@idiom-bytes (Member Author) commented May 28, 2024

Here is one way I'd approach it...

  1. We update the ETL so tables aren't attached to queries; it's just a set of queries.
  2. We then add multiple queries and another flow ("_update" tables) so we can reconcile everything (see the sketch below).
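
A sketch of what that "_update" flow could look like (the table and its columns are assumptions for illustration, not the actual schema): one staging table collects every pending change to bronze_predictions, regardless of which event produced it.

```sql
-- Hypothetical staging table: one row per pending update to bronze_predictions.
CREATE TEMP TABLE _update_bronze_predictions (
    ID VARCHAR,                  -- prediction ID to update (NULL if only the slot is known)
    slot_id VARCHAR,             -- slot the update applies to
    trueval BOOLEAN,             -- filled by trueval-derived updates
    payout DOUBLE,               -- filled by payout-derived updates
    last_event_timestamp BIGINT  -- when the originating event happened
);
```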

1 - Process predictions

  • We can simplify the query to just the insert logic; we don't need to join anymore.
  • We then process truevals.
  • We could insert new ones into _bronze_truevals if we wanted to, but we're not doing that right now... so skip.
  • We want to create update events for the bronze_prediction table, so we extract the id, slot, and trueval, and add them to update_bronze_prediction (see the sketch below).
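
A sketch of that extraction, reusing the hypothetical _update_bronze_predictions table above (a _temp_pdr_truevals table holding only the newly arrived truevals is also assumed):

```sql
-- Extract (slot, trueval) update events from the newly arrived truevals.
-- Truevals carry no user ID, so the update keys on the slot and will fan
-- out to N bronze predictions at the reduce step.
INSERT INTO _update_bronze_predictions
    (ID, slot_id, trueval, payout, last_event_timestamp)
SELECT
    NULL          AS ID,        -- unknown here; resolved via slot_id later
    t.slot_id,
    t.trueval,
    NULL          AS payout,
    t."timestamp" AS last_event_timestamp
FROM _temp_pdr_truevals AS t;
```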

2 - Process payouts

  • We could insert new ones into _bronze_payouts if we wanted to, but we're not doing that right now... so skip.
  • We want to create update events for the bronze_prediction table, so we extract the id, slot, and payout, and add them to update_bronze_prediction (see the sketch below).
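
The payout side mirrors the trueval extraction, but it can target a specific prediction ID because payouts carry the user (again, names and the ID format are illustrative assumptions):

```sql
-- Extract (id, slot, payout) update events from the newly arrived payouts.
INSERT INTO _update_bronze_predictions
    (ID, slot_id, trueval, payout, last_event_timestamp)
SELECT
    pay.ID,                     -- e.g. "{contract}-{slot}-{user}" in this sketch
    pay.slot_id,
    NULL            AS trueval,
    pay.payout,
    pay."timestamp" AS last_event_timestamp
FROM _temp_pdr_payouts AS pay;
```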

3 - Reduce updates

  • We now have all update rows generated and can begin the swap/finalize process.
  • In the final step we "reduce", i.e. join _etl_bronze_predictions with _update_bronze_predictions, so that all rows get written to the final/live bronze_predictions table (see the sketch below).
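
A sketch of that reduce/finalize step, again with illustrative table and column names: the pending updates are grouped first, so the large historical table is scanned only once in the final pass.

```sql
-- Reduce slot-keyed updates (from truevals) and ID-keyed updates (from
-- payouts) to one row per key; both staging tables are small.
CREATE TEMP TABLE _slot_updates AS
SELECT slot_id,
       BOOL_OR(trueval)          AS trueval,
       MAX(last_event_timestamp) AS last_event_timestamp
FROM _update_bronze_predictions
WHERE ID IS NULL
GROUP BY slot_id;

CREATE TEMP TABLE _id_updates AS
SELECT ID,
       MAX(payout)               AS payout,
       MAX(last_event_timestamp) AS last_event_timestamp
FROM _update_bronze_predictions
WHERE ID IS NOT NULL
GROUP BY ID;

-- One pass over the large historical table, picking up both kinds of
-- update, then swap the result in as the live table.
CREATE OR REPLACE TABLE bronze_pdr_predictions AS
SELECT
    b.ID,
    b.slot_id,
    b.user_addr,
    b.stake,
    COALESCE(s.trueval, b.trueval) AS trueval,
    COALESCE(i.payout,  b.payout)  AS payout,
    GREATEST(
        COALESCE(s.last_event_timestamp, 0),
        COALESCE(i.last_event_timestamp, 0),
        COALESCE(b.last_event_timestamp, 0)
    ) AS last_event_timestamp
FROM _etl_bronze_predictions AS b
LEFT JOIN _slot_updates AS s ON s.slot_id = b.slot_id
LEFT JOIN _id_updates   AS i ON i.ID = b.ID;
```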

Although this takes a couple of extra steps, the overall number of rows scanned/computed/joined is far lower, increasing the overall performance of the workflow.

Most of this work should look like SQL queries and a swap logic update at the end of the ETL update logic.

1 - Extract prediction update events from trueval

[screenshot: 1_extract_truevals]

2 - Extract prediction update events from payout

[screenshot: 2_extract_payouts]

3 - Prepare prediction updates and merge to final table

[screenshot: 3_update_to_final]

All prediction events per source

Note that in the end, we should expect a smaller number of payouts relative to predictions made, and a lot of bronze_predictions with null payouts. But, 100% of all payouts should be registered in the bronze_predictions table.
[screenshot: 4_all_prediction_events]

@idiom-bytes (Member Author)

[Feedback Mustafa]
With reference to the code/design provided, here is what I explained to Mustafa after reviewing his proposal:

  • doing an onEventHandler approach will likely lead to more scans/compute than required
  • the way it's proposed, it looks to be doing more copies than needed

[Effective Processing of Events]
I have instead done a pseudo-implementation of the SQL queries + logic required to get this working.

  • It only reads from each raw table once (small, incoming tables).
  • It only joins with each bronze table once (large historical tables).
  • Rather than joining all update records individually, it groups/reduces them so there is only a single 1:1 join with the large historical table.

[Simplify Requirements even Further]
I have also emphasized how much simpler all of this can be, in delivering on the goal of a predictoor revenue dashboard, by not requiring the trueval table.

Trueval does not contain the user ID anywhere, so it cannot update the prediction table directly.
It would first need to update a slot, and we don't care about that at the moment.

Literally, all we need is to join payouts with predictions.
The rest can come later.
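
For the revenue goal, that reduces to something like the following (a hedged sketch; column names such as stake, payout, and user_addr, and the assumption that the payout ID matches the prediction ID, are illustrative):

```sql
-- Join payouts directly onto predictions; truevals are not needed to
-- compute predictoor revenue.
SELECT
    p.ID,
    p.user_addr,
    p.slot_id,
    p.stake,
    pay.payout,
    pay.payout - p.stake AS net_income   -- NULL while no payout has arrived yet
FROM bronze_pdr_predictions AS p
LEFT JOIN pdr_payouts AS pay
  ON pay.ID = p.ID;                      -- payout ID == prediction ID in this sketch
```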
