[Lake][Incremental Pipeline] Use bronze_pdr_predictions to manage active checkpoint #982
Comments
The problem we are facing is not related to the start time of the ETL process. Therefore, updating the ETL logic and running the process from the last timestamp of the bronze_predictions table is not a viable solution. The actual issue is that the fin_ms value on the bronze_predictions (or bronze_slots) table is incorrect: it is taken from the pdr_subscriptions table.
Since we are not doing any updates on the bronze tables, just inserts, I think the logic should be the following:
If we set the value to be the same as the one in the ppss.yaml file and the value is set to "now", some truevals and payouts may not be ready yet. Therefore, we need to use the same logic but calculate only the necessary tables for each bronze table.
If truevals, payouts, or other values are not ready, then we should do updates once the values become available. Right now the bronze step is only doing inserts, and I don't think we should hold back rows if data is missing. We should have the same rows as the raw tables and do updates whenever missing data becomes available.
Updating also means checking the data (row by row), and we have to take the data from the DB and convert it to polars again. That is exactly what we are trying to avoid.
... this is exactly what we want... we only want to process events that happened within the time period that we're processing. Once we've processed all the events that took place, we don't have to touch that data any longer.
What we need is to support updates to existing records, such that bronze_pdr_predictions is updated.
We do it in a way where it's an upsert operation. The new data coming in from the payout event should use the latest record from the predictions table, such that the "null" fields, or whatever is being mutated, are now up-to-date.
Basic implementation of Raw + ETL checkpoints is now working.
This has been implemented and completed. |
Background / motivation
Inferring where to resume the pipeline from is causing issues.
It's not about "removing subscriptions" from this logic; we need to think about the system and what it's trying to do.
Rather than querying all tables to find the timestamp from where to resume the pipeline, just use the following for now:
Macro
let's break down what's happening here
we're querying the data tables from duckdb, in order to understand how to resume the jobs pipeline incrementally
basically, resuming from where it left off
I had mentioned in the past updating my_ppss.yaml such that st_ts is mutated with the last_run_timestamp, so the pipeline enforces being incremental rather than trying to do it through data inference (like the OHLCV data factory)... but this kind of breaks the pattern for how the yaml file is being used.
It also doesn't provide a way to track the ETL/workflow runs, such that they can be rolled back in a systematic way.
so, let's do this:
TODOs / DoD
Checkpoints & Incremental Pipeline
Update GQLDF and ETL logic to resume from their equivalent prediction tables.
To run GQLDF
To run ETL
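The resume logic for both GQLDF and ETL then reduces to picking an effective start timestamp from the checkpoint and the configured st_ts. A minimal sketch; the helper name and millisecond units are illustrative, not the actual API:

```python
def resume_st_ts(checkpoint_ms, ppss_st_ms):
    """Effective start timestamp for an incremental run: the configured
    st_ts from ppss.yaml, or just past the checkpoint if we've already
    processed beyond it. (Hypothetical helper for illustration.)"""
    if checkpoint_ms is None:
        # First run: no checkpoint, honor the configured start
        return ppss_st_ms
    # Resume one ms past the last processed record, never before st_ts
    return max(ppss_st_ms, checkpoint_ms + 1)
```

GQLDF would feed in the checkpoint from its raw tables, ETL the one from bronze_pdr_predictions.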
This keeps things simple (KISS) and lets us ship an incremental pipeline that works reliably and gives us the data quality we need.
We'll worry about implementing a job object to abstract this later.
This means your ppss can be:

```yaml
st_ts: 01-01-2024
end_ts: now
```

...and it should just work.
Tasks: