
[Lake][Incremental Pipeline] Use bronze_pdr_predictions to manage active checkpoint #982

Closed · 3 tasks done
idiom-bytes opened this issue May 1, 2024 · 8 comments
Labels: Type: Enhancement (New feature or request)

@idiom-bytes (Member) commented May 1, 2024

Background / motivation

Inferring where to resume the pipeline from is causing issues.

This isn't about "removing subscriptions" from the logic; we need to think about the system and what it's trying to do.

Rather than querying all tables to find the timestamp from where to resume the pipeline, just use the following for now:

  • raw_predictions for GQLDF
  • bronze_pdr_predictions for ETL

Macro

Let's break down what's happening here.

We're querying the data tables in DuckDB in order to understand how to resume the jobs pipeline incrementally.

Basically, resuming from where it left off.

I had mentioned in the past updating my_ppss.yaml such that st_ts is mutated with the last_run_timestamp and the pipeline enforces being incremental, rather than trying to do it through data inference (like the OHLCV data factory)... but this kind of breaks the pattern for how the yaml file is being used.

It also doesn't provide a way to track the ETL/workflow runs such that they can be rolled back in a systematic way.

So, let's do this:

  • document this (so it's clear how it's expected to work) and how we plan to solve it properly
  1. use the raw_predictions/bronze_pdr_predictions max timestamps to resume GQLDF/ETL.
  2. ship the DuckDB integration.
  3. start tracking the runs in DuckDB, and use these as the checkpoint for the next GQLDF/ETL run (see the sketch below).
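
A very rough sketch of what step 3 could look like, assuming a hypothetical _etl_runs table in DuckDB; the table name, columns, and file path are illustrative only, not an agreed schema:

```python
# Hypothetical sketch of step 3: record each GQLDF/ETL run in DuckDB so the
# next run (and any rollback) can key off it. Nothing here is final.
import duckdb

con = duckdb.connect("lake_data/duckdb.db")
con.execute(
    """
    CREATE TABLE IF NOT EXISTS _etl_runs (
        run_id  VARCHAR,
        job     VARCHAR,  -- 'gqldf' or 'etl'
        st_ms   BIGINT,   -- start of the processed window (ms)
        end_ms  BIGINT,   -- end of the processed window (ms)
        status  VARCHAR   -- 'ok' / 'failed', enables systematic rollback
    )
    """
)

# The next ETL run resumes from the last successfully processed window
row = con.execute(
    "SELECT MAX(end_ms) FROM _etl_runs WHERE job = 'etl' AND status = 'ok'"
).fetchone()
next_st_ms = row[0]  # None on a cold start -> fall back to ppss.lake_ss.st_ts
con.close()
```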

TODOs / DoD

Checkpoints & Incremental Pipeline

Update GQLDF and ETL logic to resume from their respective prediction tables (a rough sketch of the resulting time windows follows the bullets below).

To run GQLDF

  • We're going to use the st_ts from the last(raw_pdr_predictions) in DuckDB
  • We're going to use the end_ts from ppss.lake_ss.end_ts = now

To run ETL

  • We're going to use the st_ts from the last(bronze_pdr_predictions) in DuckDB
  • We're going to use the end_ts from the last(temp_raw_pdr_predictions) in DuckDB (it could also be ppss.lake_ss.end_ts = now)
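
A minimal sketch of these two windows, assuming a DuckDB file at lake_data/duckdb.db and a timestamp column (in ms) on each table; the path, table names, and fallback values are assumptions for illustration, not the actual pdr-backend code:

```python
# Sketch of the GQLDF/ETL checkpoint windows described above. The DB path,
# table names, and "timestamp" column are assumptions, not the real lake schema.
import time

import duckdb


def max_ts_ms(con: duckdb.DuckDBPyConnection, table: str) -> int | None:
    """Max timestamp (ms) in `table`, or None if the table is missing or empty."""
    tables = {r[0] for r in con.execute("SHOW TABLES").fetchall()}
    if table not in tables:
        return None
    return con.execute(f"SELECT MAX(timestamp) FROM {table}").fetchone()[0]


con = duckdb.connect("lake_data/duckdb.db", read_only=True)
ppss_st_ms = 1704067200000        # ppss.lake_ss.st_ts (01-01-2024), the cold-start fallback
now_ms = int(time.time() * 1000)  # ppss.lake_ss.end_ts = now

# GQLDF window: resume from the last raw prediction, fetch up to "now"
gqldf_st_ms = max_ts_ms(con, "raw_pdr_predictions") or ppss_st_ms
gqldf_end_ms = now_ms

# ETL window: resume from the last bronze prediction, stop at the last raw
# (temp) prediction that GQLDF just wrote
etl_st_ms = max_ts_ms(con, "bronze_pdr_predictions") or ppss_st_ms
etl_end_ms = max_ts_ms(con, "temp_raw_pdr_predictions") or now_ms
con.close()
```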

This keeps it simple (KISS) and lets us ship an incremental pipeline that works reliably and gives us the data quality we need.
We'll worry about implementing a job object to abstract this later.

This means your ppss can be:
st_ts: 01-01-2024
end_ts: now

...and it should just work

Tasks:

  • Update GQLDF to start/end from the correct spot
  • Update ETL to start/end from the correct spot
idiom-bytes added the Type: Enhancement label on May 1, 2024
@kdetry (Contributor) commented May 2, 2024

The problem we are facing is not related to the start time of the ETL process. Therefore, updating the ETL logic and running the process from the last timestamp of the bronze_predictions table is not a viable solution.

The actual issue is that the fin_ms value on the bronze_prediction (or bronze_slots) table is incorrect and is taken from the pdr_subscriptions table.

@KatunaNorbert (Member) commented:

Since we are not doing any kind of updates on the bronze tables, just inserting into them, I think the logic should be the following:
fin_ms should be fin_timestr from ppss.yaml, which should then be the same as the raw tables' value.

@kdetry (Contributor) commented May 2, 2024

If we set the value to be the same as the one in the ppss.yaml file and the value is set to "now", some truevals and payouts may not be ready yet. Therefore, we need to use the same logic but calculate only the necessary tables for each bronze table.

@KatunaNorbert (Member) commented May 2, 2024

If truevals, payouts, or other values are not ready, then we should do updates when the values become ready. Right now the bronze step is only doing inserts, and I don't think we should hold back rows if data is missing. We should have the same rows as the raw tables and do updates whenever the missing data becomes available.

@kdetry (Contributor) commented May 2, 2024

Updating also means checking the data (row by row), and we have to take the data from the DB and convert it to Polars again. That is what we are trying to avoid.

@idiom-bytes (Member, Author) commented May 2, 2024

If we set the value to be the same as the one in the ppss.yaml file and the value is set to "now", some truevals and payouts may not be ready yet. Therefore, we need to use the same logic but calculate only the necessary tables for each bronze table.

... this is exactly what we want... we only want to process events that happened within the time period that we're processing. Once we've processed all the events that took place, we don't have to touch that data any longer.

       Run 1   Run 2   Run 3
Time   1:00    2:00    3:00

What we need is to support updates to existing records, such that bronze_pdr_predictions is updated.
This happens after we finish processing all other tables... Example:

  1. write to raw tables -> inserts_to_temp_table
  2. write to bronze tables bronze_pdr_predictions.py -> inserts_to_temp_table
  3. write to other bronze tables bronze_pdr_truevals.py -> inserts_to_temp_table
  4. write to other bronze tables bronze_pdr_subscriptions.py -> inserts_to_temp_table
  5. write to other bronze tables bronze_pdr_payouts.py -> inserts_to_temp_table
  6. final update of bronze tables [post_process_update_bronze_tables] -> update the features on temp_bronze_pdr_predictions by selecting those features from the other temp tables
  7. insert from all temp_tables to final tables (write to live tables)
  8. dump temp tables
  9. end etl
Updating also means checking the data (row by row), and we have to take the data from the DB and convert it to Polars again. That is what we are trying to avoid.

We do it in a way where it's an upsert operation.

The new data coming in from a payout event should update the latest record in the predictions table, such that the "null" fields (or whatever is being mutated) are brought up to date.
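
A rough sketch of steps 6-8 from the flow above, expressed as DuckDB SQL run from Python; the temp-table names, the ID join key, and the updated columns (truevalue, payout) are assumptions for illustration:

```python
# Rough sketch of the "final update, then insert to live" steps above, as
# DuckDB SQL. Temp-table names, the ID join key, and the updated columns
# (truevalue, payout) are illustrative assumptions.
import duckdb

con = duckdb.connect("lake_data/duckdb.db")

# Step 6: bring the bronze predictions temp table up to date from the other
# temp tables (payouts here), before anything touches the live table. This is
# the upsert-like part: existing temp rows get their null fields filled in.
con.execute(
    """
    UPDATE _temp_bronze_pdr_predictions
    SET truevalue = py.truevalue,
        payout    = py.payout
    FROM _temp_pdr_payouts AS py
    WHERE _temp_bronze_pdr_predictions.ID = py.ID
    """
)

# Step 7: append the now-complete temp rows into the live bronze table
con.execute(
    "INSERT INTO bronze_pdr_predictions SELECT * FROM _temp_bronze_pdr_predictions"
)

# Step 8: dump (drop) the temp tables once the live tables are written
con.execute("DROP TABLE IF EXISTS _temp_bronze_pdr_predictions")
con.execute("DROP TABLE IF EXISTS _temp_pdr_payouts")
con.close()
```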

@idiom-bytes (Member, Author) commented:

Basic implementation of Raw + ETL checkpoints is now working.
#1077

@idiom-bytes (Member, Author) commented:

This has been implemented and completed.
