[Lake][DuckDB] ETL - Implement Update Queries #1001

Open · 10 tasks
idiom-bytes opened this issue May 6, 2024 · 3 comments
Labels: Type: Enhancement (New feature or request)

@idiom-bytes (Member) commented May 6, 2024

Motivation

We have now verified that the basic lake functionality is working as expected.

We now want to verify the data quality and completeness.

This means running additional SQL queries, so that more tables are processed and richer data is generated:

  • more SQL queries are run in the ETL step
  • the slot and other tables are processed
  • tables like pdr_payouts cause bronze_predictions to be updated
  • null entries inside bronze_predictions are eventually filled in

Update Step - Incrementally updating the Lake

When you run the "lake update" command, the later SQL queries are responsible for updating the lake with the most recent information.

  1. When the lake updates, new records have arrived that need to be processed.
  2. These new records (such as pdr_payout), where applicable, should (a) be cleaned up into their raw/bronze table, and (b) update other tables to reflect the arrival of this event.
  3. After all records have been yielded to temp tables and the pipeline ends, the records should then be available on the live/production tables.

[Screenshot from 2024-05-06 13-27-25]

Data Workflows
All data workflows should operate in the same way.

  1. All data that needs to be written out is first written into a temp table.
  2. As temp tables are created with new data, views are made available so that downstream queries can access both old and new data from a single query.
  3. Once all the processes have completed and the data is written out to temp tables, we do a final merge/update of rows into the final/live/production tables (see the sketch below).
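
As a rough sketch of this pattern in DuckDB (table names like new_prediction_events, pdr_predictions, and the view name are illustrative assumptions, not the actual lake schema):

```sql
-- 1. Write incoming data into a temp table (names are illustrative).
CREATE TEMP TABLE _temp_pdr_predictions AS
SELECT * FROM new_prediction_events;

-- 2. Expose old + new data behind a single view so downstream queries
--    can read both from one place.
CREATE OR REPLACE TEMP VIEW pdr_predictions_view AS
SELECT * FROM pdr_predictions
UNION ALL
SELECT * FROM _temp_pdr_predictions;

-- 3. Once every process has completed, merge the temp rows into the
--    live/production table.
INSERT INTO pdr_predictions
SELECT * FROM _temp_pdr_predictions;
```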

DoD:

  • Tables like truevals and payouts are being processed
  • bronze_predictions is updated as a result of truevals and payouts being processed
  • Other tables and bronze tables are not yet processed

Task:

  • Process new pdr-payouts into duckdb
  • Process new pdr-truevals into duckdb
  • Verify the incremental update step works #982
  • Create SQL that processes new pdr-payouts into updates to bronze-predictions
  • Create SQL that processes new pdr-truevals into updates to bronze-predictions
  • Verify that null records inside bronze-predictions are being updated correctly
  • Verify everything is working e2e
idiom-bytes added the "Type: Enhancement (New feature or request)" label on May 6, 2024
idiom-bytes changed the title from "[Component name] Benefit_yyy, via building_xxx" to "[Lake][DuckDB] Verify Incremental Update" on May 6, 2024
idiom-bytes changed the title from "[Lake][DuckDB] Verify Incremental Update" to "[Lake][DuckDB] ETL - Implement Update Queries" on May 22, 2024
@idiom-bytes (Member Author) commented May 27, 2024

To implement this ticket, we should start by simply updating predictions when truevals and payouts show up.

[How this ticket grows]

In the future...

  • Basically, 1 event -> multiple table inserts & updates.
  • We need to consider what each event does across all tables.
  • We need to make sure that each event leads to all new records being generated and all existing records being updated.

subscription event
-> new subscription record

slot event
-> new slot record

prediction event
-> new bronze prediction record
-> update bronze slot record

trueval event
-> new trueval record
-> update N bronze prediction records
-> update 1 bronze slot record

payout event
-> new payout record
-> update N bronze prediction records
-> update 1 bronze slot record
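
For example, the logical effect of a single trueval event could look roughly like this in DuckDB. This is only a sketch: the table and column names (pdr_truevals, bronze_pdr_predictions, bronze_pdr_slots, slot_id, etc.) are assumptions for illustration, and the approach discussed later in this thread batches these writes via staging tables rather than applying them one event at a time.

```sql
-- A new trueval record lands in its raw table (illustrative schema).
INSERT INTO pdr_truevals (ID, slot_id, trueval, "timestamp")
VALUES ('0xcontract-1696000000', '0xcontract-1696000000', TRUE, 1696000100);

-- The trueval fans out to the N bronze prediction rows for that slot...
UPDATE bronze_pdr_predictions
SET trueval = t.trueval
FROM pdr_truevals AS t
WHERE bronze_pdr_predictions.slot_id = t.slot_id
  AND bronze_pdr_predictions.trueval IS NULL;

-- ...and to the single bronze slot row for that slot.
UPDATE bronze_pdr_slots
SET trueval = t.trueval
FROM pdr_truevals AS t
WHERE bronze_pdr_slots.ID = t.slot_id;
```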

@idiom-bytes (Member Author) commented May 28, 2024

Here is one way I'd approach it...

  1. We update the ETL so tables aren't attached to queries; it's just a set of queries.
  2. We then add multiple queries and another flow ("_update" tables) so we can reconcile everything (see the sketch below).
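
A sketch of what that "_update" flow could look like (the table and its columns are assumptions for illustration, not the actual schema): one staging table collects every pending change to bronze_predictions, regardless of which event produced it.

```sql
-- Hypothetical staging table: one row per pending update to bronze_predictions.
CREATE TEMP TABLE _update_bronze_predictions (
    ID VARCHAR,                  -- prediction ID to update (NULL if only the slot is known)
    slot_id VARCHAR,             -- slot the update applies to
    trueval BOOLEAN,             -- filled by trueval-derived updates
    payout DOUBLE,               -- filled by payout-derived updates
    last_event_timestamp BIGINT  -- when the originating event happened
);
```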

1 - Process predictions

  • We can simplify the query to just the insert logic; we don't need to join anymore.
  • We then process truevals.
  • We could insert new ones into _bronze_truevals if we wanted to, but we're not doing that right now... so skip.
  • We want to create update events for the bronze_prediction table, so we extract the id, slot, and trueval, and add them to update_bronze_prediction (see the sketch below).
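
A sketch of that extraction, reusing the hypothetical _update_bronze_predictions table above (a _temp_pdr_truevals table holding only the newly arrived truevals is also assumed):

```sql
-- Extract (slot, trueval) update events from the newly arrived truevals.
-- Truevals carry no user ID, so the update keys on the slot and will fan
-- out to N bronze predictions at the reduce step.
INSERT INTO _update_bronze_predictions
    (ID, slot_id, trueval, payout, last_event_timestamp)
SELECT
    NULL          AS ID,        -- unknown here; resolved via slot_id later
    t.slot_id,
    t.trueval,
    NULL          AS payout,
    t."timestamp" AS last_event_timestamp
FROM _temp_pdr_truevals AS t;
```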

2 - Process payouts

  • We could insert new ones into _bronze_payouts if we wanted to, but we're not doing that right now... so skip.
  • We want to create update events for the bronze_prediction table, so we extract the id, slot, and payout, and add them to update_bronze_prediction (see the sketch below).
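
The payout side mirrors the trueval extraction, but it can target a specific prediction ID because payouts carry the user (again, names and the ID format are illustrative assumptions):

```sql
-- Extract (id, slot, payout) update events from the newly arrived payouts.
INSERT INTO _update_bronze_predictions
    (ID, slot_id, trueval, payout, last_event_timestamp)
SELECT
    pay.ID,                     -- e.g. "{contract}-{slot}-{user}" in this sketch
    pay.slot_id,
    NULL            AS trueval,
    pay.payout,
    pay."timestamp" AS last_event_timestamp
FROM _temp_pdr_payouts AS pay;
```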

3 - Reduce updates

  • We now have all update rows generated and can begin the swap/finalize process.
  • In the final step we "reduce", i.e. join _etl_bronze_predictions with _update_bronze_predictions, so that all rows get written to the final/live bronze_predictions table (see the sketch below).
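
A sketch of that reduce/finalize step, again with illustrative table and column names: the pending updates are grouped first, so the large historical table is scanned only once in the final pass.

```sql
-- Reduce slot-keyed updates (from truevals) and ID-keyed updates (from
-- payouts) to one row per key; both staging tables are small.
CREATE TEMP TABLE _slot_updates AS
SELECT slot_id,
       BOOL_OR(trueval)          AS trueval,
       MAX(last_event_timestamp) AS last_event_timestamp
FROM _update_bronze_predictions
WHERE ID IS NULL
GROUP BY slot_id;

CREATE TEMP TABLE _id_updates AS
SELECT ID,
       MAX(payout)               AS payout,
       MAX(last_event_timestamp) AS last_event_timestamp
FROM _update_bronze_predictions
WHERE ID IS NOT NULL
GROUP BY ID;

-- One pass over the large historical table, picking up both kinds of
-- update, then swap the result in as the live table.
CREATE OR REPLACE TABLE bronze_pdr_predictions AS
SELECT
    b.ID,
    b.slot_id,
    b.user_addr,
    b.stake,
    COALESCE(s.trueval, b.trueval) AS trueval,
    COALESCE(i.payout,  b.payout)  AS payout,
    GREATEST(
        COALESCE(s.last_event_timestamp, 0),
        COALESCE(i.last_event_timestamp, 0),
        COALESCE(b.last_event_timestamp, 0)
    ) AS last_event_timestamp
FROM _etl_bronze_predictions AS b
LEFT JOIN _slot_updates AS s ON s.slot_id = b.slot_id
LEFT JOIN _id_updates   AS i ON i.ID = b.ID;
```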

Although this takes a couple of extra steps, the overall number of rows scanned/computed/joined is far lower, increasing the overall performance of the workflow.

Most of this work should look like SQL queries and a swap logic update at the end of the ETL update logic.

1 - Extract prediction update events from trueval

[screenshot: 1_extract_truevals]

2 - Extract prediction update events from payout

[screenshot: 2_extract_payouts]

3 - Prepare prediction updates and merge to final table

[screenshot: 3_update_to_final]

All prediction events per source

Note that in the end, we should expect a smaller number of payouts relative to predictions made, and a lot of bronze_predictions with null payouts. But, 100% of all payouts should be registered in the bronze_predictions table.
[screenshot: 4_all_prediction_events]

@idiom-bytes (Member Author)

[Feedback Mustafa]
With reference to the code/design provided, here is what I explained to Mustafa after reviewing his proposal:

  • doing an onEventHandler approach will likely lead to more scans/compute than required
  • the way it's proposed, it looks to be doing more copies than needed

[Effective Processing of Events]
I have instead done a pseudo-implementation of the SQL queries + logic required to get this working.

  • It only reads from each raw table once (small, incoming tables).
  • It only joins with each bronze table once (large historical tables).
  • Rather than joining all update records individually, it groups/reduces them so there is only a single 1:1 join with the large historical table.

[Simplify Requirements even Further]
I have also emphasized how much simpler all of this can be, in delivering on the goal of a predictoor revenue dashboard, by not requiring the trueval table.

Trueval does not contain the user ID anywhere, so it cannot update the prediction table directly.
It would first need to update a slot, and we don't care about that at the moment.

Literally, all we need is to join payouts with predictions.
The rest can come later.
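
For the revenue goal, that reduces to something like the following (a hedged sketch; column names such as stake, payout, and user_addr, and the assumption that the payout ID matches the prediction ID, are illustrative):

```sql
-- Join payouts directly onto predictions; truevals are not needed to
-- compute predictoor revenue.
SELECT
    p.ID,
    p.user_addr,
    p.slot_id,
    p.stake,
    pay.payout,
    pay.payout - p.stake AS net_income   -- NULL while no payout has arrived yet
FROM bronze_pdr_predictions AS p
LEFT JOIN pdr_payouts AS pay
  ON pay.ID = p.ID;                      -- payout ID == prediction ID in this sketch
```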
