
[Lake][Incremental Pipeline] Use bronze_pdr_predictions to manage active checkpoint #982

Closed · 3 tasks done
idiom-bytes opened this issue May 1, 2024 · 8 comments
Labels: Type: Enhancement (New feature or request)

@idiom-bytes (Member) commented May 1, 2024

Background / motivation

Inferring where to resume the pipeline from is causing issues.

This isn't about "removing subscriptions" from the logic; we need to think about the system and what it's trying to do.

Rather than querying all tables to find the timestamp from where to resume the pipeline, just use the following for now:

  • raw_predictions for GQLDF
  • bronze_pdr_predictions for ETL

Macro

Let's break down what's happening here.

We're querying the data tables in DuckDB in order to understand how to resume the jobs pipeline incrementally.

Basically, resuming from where it left off.

I had mentioned in the past updating my_ppss.yaml such that st_ts is mutated with the last_run_timestamp and the pipeline enforces being incremental, rather than trying to do it through data inference (like the OHLCV data factory)... but this kind of breaks the pattern for how the yaml file is being used.

It also doesn't provide a way to track the ETL/workflow runs such that they can be rolled back in a systematic way.

So, let's do this:

  • document this (so it's clear how it's expected to work) and how we plan to solve it properly
  1. use the raw_predictions/bronze_pdr_predictions max timestamps to resume GQLDF/ETL.
  2. ship the DuckDB integration.
  3. start tracking the runs in DuckDB, and use these as the checkpoint for the next GQLDF/ETL run (see the sketch below).
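
A very rough sketch of what step 3 could look like, assuming a hypothetical _etl_runs table in DuckDB; the table name, columns, and file path are illustrative only, not an agreed schema:

```python
# Hypothetical sketch of step 3: record each GQLDF/ETL run in DuckDB so the
# next run (and any rollback) can key off it. Nothing here is final.
import duckdb

con = duckdb.connect("lake_data/duckdb.db")
con.execute(
    """
    CREATE TABLE IF NOT EXISTS _etl_runs (
        run_id  VARCHAR,
        job     VARCHAR,  -- 'gqldf' or 'etl'
        st_ms   BIGINT,   -- start of the processed window (ms)
        end_ms  BIGINT,   -- end of the processed window (ms)
        status  VARCHAR   -- 'ok' / 'failed', enables systematic rollback
    )
    """
)

# The next ETL run resumes from the last successfully processed window
row = con.execute(
    "SELECT MAX(end_ms) FROM _etl_runs WHERE job = 'etl' AND status = 'ok'"
).fetchone()
next_st_ms = row[0]  # None on a cold start -> fall back to ppss.lake_ss.st_ts
con.close()
```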

TODOs / DoD

Checkpoints & Incremental Pipeline

Update GQLDF and ETL logic to resume from their respective prediction tables (a rough sketch of the resulting time windows follows the bullets below).

To run GQLDF

  • We're going to use the st_ts from the last(raw_pdr_predictions) in DuckDB
  • We're going to use the end_ts from ppss.lake_ss.end_ts = now

To run ETL

  • We're going to use the st_ts from the last(bronze_pdr_predictions) in DuckDB
  • We're going to use the end_ts from the last(temp_raw_pdr_predictions) in DuckDB (it could also be ppss.lake_ss.end_ts = now)
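
A minimal sketch of these two windows, assuming a DuckDB file at lake_data/duckdb.db and a timestamp column (in ms) on each table; the path, table names, and fallback values are assumptions for illustration, not the actual pdr-backend code:

```python
# Sketch of the GQLDF/ETL checkpoint windows described above. The DB path,
# table names, and "timestamp" column are assumptions, not the real lake schema.
import time

import duckdb


def max_ts_ms(con: duckdb.DuckDBPyConnection, table: str) -> int | None:
    """Max timestamp (ms) in `table`, or None if the table is missing or empty."""
    tables = {r[0] for r in con.execute("SHOW TABLES").fetchall()}
    if table not in tables:
        return None
    return con.execute(f"SELECT MAX(timestamp) FROM {table}").fetchone()[0]


con = duckdb.connect("lake_data/duckdb.db", read_only=True)
ppss_st_ms = 1704067200000        # ppss.lake_ss.st_ts (01-01-2024), the cold-start fallback
now_ms = int(time.time() * 1000)  # ppss.lake_ss.end_ts = now

# GQLDF window: resume from the last raw prediction, fetch up to "now"
gqldf_st_ms = max_ts_ms(con, "raw_pdr_predictions") or ppss_st_ms
gqldf_end_ms = now_ms

# ETL window: resume from the last bronze prediction, stop at the last raw
# (temp) prediction that GQLDF just wrote
etl_st_ms = max_ts_ms(con, "bronze_pdr_predictions") or ppss_st_ms
etl_end_ms = max_ts_ms(con, "temp_raw_pdr_predictions") or now_ms
con.close()
```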

This keeps it simple (KISS) and lets us ship an incremental pipeline that works reliably and gives us the data quality we need.
We'll worry about implementing a job object to abstract this later.

This means your ppss can be:
st_ts: 01-01-2024
end_ts: now

...and it should just work

Tasks:

  • Update GQLDF to start/end from the correct spot
  • Update ETL to start/end from the correct spot
idiom-bytes added the Type: Enhancement label on May 1, 2024
@kdetry (Contributor) commented May 2, 2024

The problem we are facing is not related to the start time of the ETL process. Therefore, updating the ETL logic and running the process from the last timestamp of the bronze_predictions table is not a viable solution.

The actual issue is that the fin_ms value on the bronze_prediction (or bronze_slots) table is incorrect and is taken from the pdr_subscriptions table.

@KatunaNorbert (Member) commented:

Since we are not doing any kind of updates on the bronze tables, just inserting into them, I think the logic should be the following:
fin_ms should be fin_timestr from ppss.yaml, which should then be the same as the raw tables' value.

@kdetry (Contributor) commented May 2, 2024

If we set the value to be the same as the one in the ppss.yaml file and the value is set to "now", some truevals and payouts may not be ready yet. Therefore, we need to use the same logic but calculate only the necessary tables for each bronze table.

@KatunaNorbert (Member) commented May 2, 2024

If truevals, payouts, or other values are not ready, then we should do updates when the values become ready. Right now the bronze step is only doing inserts, and I don't think we should hold back rows if data is missing. We should have the same rows as the raw tables and do updates whenever the missing data becomes available.

@kdetry (Contributor) commented May 2, 2024

Updating also means checking the data (row by row), and we have to take the data from the DB and convert it to Polars again. That is what we are trying to avoid.

@idiom-bytes (Member, Author) commented May 2, 2024

If we set the value to be the same as the one in the ppss.yaml file and the value is set to "now", some truevals and payouts may not be ready yet. Therefore, we need to use the same logic but calculate only the necessary tables for each bronze table.

... this is exactly what we want... we only want to process events that happened within the time period that we're processing. Once we've processed all the events that took place, we don't have to touch that data any longer.

       Run 1   Run 2   Run 3
Time   1:00    2:00    3:00

What we need is to support updates to existing records, such that bronze_pdr_predictions is updated.
This happens after we finish processing all other tables... Example:

  1. write to raw tables -> inserts_to_temp_table
  2. write to bronze tables bronze_pdr_predictions.py -> inserts_to_temp_table
  3. write to other bronze tables bronze_pdr_truevals.py -> inserts_to_temp_table
  4. write to other bronze tables bronze_pdr_subscriptions.py -> inserts_to_temp_table
  5. write to other bronze tables bronze_pdr_payouts.py -> inserts_to_temp_table
  6. final update of bronze tables [post_process_update_bronze_tables] -> update the features on temp_bronze_pdr_predictions by selecting those features from the other temp tables
  7. insert from all temp_tables to final tables (write to live tables)
  8. dump temp tables
  9. end etl
Updating also means checking the data (row by row), and we have to take the data from the DB and convert it to Polars again. That is what we are trying to avoid.

We do it in a way where it's an upsert operation.

The new data coming in from a payout event should update the latest record in the predictions table, such that the "null" fields (or whatever is being mutated) are brought up to date.
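
A rough sketch of steps 6-8 from the flow above, expressed as DuckDB SQL run from Python; the temp-table names, the ID join key, and the updated columns (truevalue, payout) are assumptions for illustration:

```python
# Rough sketch of the "final update, then insert to live" steps above, as
# DuckDB SQL. Temp-table names, the ID join key, and the updated columns
# (truevalue, payout) are illustrative assumptions.
import duckdb

con = duckdb.connect("lake_data/duckdb.db")

# Step 6: bring the bronze predictions temp table up to date from the other
# temp tables (payouts here), before anything touches the live table. This is
# the upsert-like part: existing temp rows get their null fields filled in.
con.execute(
    """
    UPDATE _temp_bronze_pdr_predictions
    SET truevalue = py.truevalue,
        payout    = py.payout
    FROM _temp_pdr_payouts AS py
    WHERE _temp_bronze_pdr_predictions.ID = py.ID
    """
)

# Step 7: append the now-complete temp rows into the live bronze table
con.execute(
    "INSERT INTO bronze_pdr_predictions SELECT * FROM _temp_bronze_pdr_predictions"
)

# Step 8: dump (drop) the temp tables once the live tables are written
con.execute("DROP TABLE IF EXISTS _temp_bronze_pdr_predictions")
con.execute("DROP TABLE IF EXISTS _temp_pdr_payouts")
con.close()
```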

@idiom-bytes (Member, Author) commented:

Basic implementation of Raw + ETL checkpoints is now working.
#1077

@idiom-bytes (Member, Author) commented:

This has been implemented and completed.
