[Lake][DuckDB] Verify lake functionality and behavior is working as expected #1000
Comments
Issue 1000! :)

Can we keep these tables so we have all the raw tables working? I don't see how these could slow us down.

@KatunaNorbert they have been slowing us down in testing, iteration, and many other things. Objective Before: we implemented them because we wanted to move many things in parallel. Objective Now: we want to pause them so we can verify things in order.
Issues:
Fetching the data on the …
Updates in the latest PR are working well. Basically, tables are starting and ending at the same time, reliably across all 4 initial tables (predictions, truevals, payouts, and bronze_predictions). The number of rows/records looks correct too.
I created tickets where we discovered functionality is missing, and am closing this ticket: we have been able to harden the lake end-to-end, and the core objectives of this ticket have been achieved.
Motivation
We need to improve the basic reliability and stability of the lake, and verify that the basic DuckDB behavior is working as expected.
We should verify things are working by keeping it simple and focusing on the `bronze_pdr_predictions` table. I am recommending that we ignore pdr-subscriptions, pdr-slots, and possibly other tables so we can validate that the lake is behaving as expected.
Verification - Inserting data into the lake and manipulating it
When you first start interacting with the lake, there will be a large fetch/update step that tries to build everything into the lake. As these records are processed, we begin inserting them into our DB. Use the `pdr lake update` command to start fetching data and fill the whole lake. Once the lake is built, it's very likely that many records will have `null` entries as they are initially inserted into the database. We are not worried about this for the moment.
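To make the insert behavior concrete, here is a minimal sketch, assuming a local DuckDB file named `lake.db` and a simplified, hypothetical `bronze_pdr_predictions` schema (the real schema in pdr-backend has more columns):

```python
import duckdb

con = duckdb.connect("lake.db")

# Hypothetical, simplified schema; the real bronze_pdr_predictions
# table in pdr-backend has more columns.
con.execute("""
    CREATE TABLE IF NOT EXISTS bronze_pdr_predictions (
        ID        TEXT,
        timestamp BIGINT,
        predvalue BOOLEAN,
        truevalue BOOLEAN,
        payout    DOUBLE
    )
""")

# Records are inserted as soon as they are fetched. Fields that depend
# on later events (truevalue, payout) start out as NULL and get filled
# in by later update runs.
con.execute(
    "INSERT INTO bronze_pdr_predictions VALUES (?, ?, ?, NULL, NULL)",
    ["0xabc...-1696000000", 1696000000, True],
)
```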
Test - Cutting off the lake (dropping)
Let's first consider how our lake works.
A certain amount of data and events arrive that need to be processed. Each time we do a run, we update a certain number of records.
Let's say we wanted to drop everything since Run 1. We would call our CLI drop command, and get rid of that data.
pdr lake drop 10000001 my_ppss.yaml sapphire-mainnet
This might be the equivalent of dropping all records from Run 1 -> End. This would include the data from [Run 2, Run 3]. The user would continue updating the lake by calling `pdr lake update`, which would refetch and rebuild [Run 2, Run 3], getting the system up-to-date, and then continuing on from there.
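Under the hood, a drop could be as simple as the sketch below. This is an illustration, not the actual pdr-backend implementation; it assumes the CLI argument is a cutoff timestamp and that every table shares a `timestamp` column, with the four table names taken from this ticket:

```python
import duckdb

# Table names assumed from this ticket (predictions, truevals,
# payouts, bronze_predictions).
TABLES = [
    "pdr_predictions",
    "pdr_truevals",
    "pdr_payouts",
    "bronze_pdr_predictions",
]

def drop_since(con: duckdb.DuckDBPyConnection, cutoff_ts: int) -> None:
    """Cut off the lake: delete every row at or after cutoff_ts from
    every table, so all tables end at the same height."""
    for table in TABLES:
        con.execute(f"DELETE FROM {table} WHERE timestamp >= ?", [cutoff_ts])

con = duckdb.connect("lake.db")
drop_since(con, 10000001)  # mirrors: pdr lake drop 10000001 ...
```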
Verifying
By dropping/cutting off part of the lake, all tables should end up with the same data cut-off (the same rows dropped), such that the data pipeline can resume from there and all tables can be updated/resumed from the same "height", as sketched below.
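One way to check this, reusing the assumed table names and `timestamp` column from the drop sketch above: after a drop, every table's maximum timestamp should sit at or below the same cut-off, and all tables should advance together after `pdr lake update`.

```python
import duckdb

TABLES = [  # assumed names, as in the drop sketch above
    "pdr_predictions",
    "pdr_truevals",
    "pdr_payouts",
    "bronze_pdr_predictions",
]

con = duckdb.connect("lake.db")

# Each table's newest timestamp ("height").
heights = {
    t: con.execute(f"SELECT MAX(timestamp) FROM {t}").fetchone()[0]
    for t in TABLES
}
print(heights)

# After a drop at cutoff 10000001, no table should exceed the cutoff.
assert all(h is None or h < 10000001 for h in heights.values())
```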
DoD
Testing Data Pipeline Behavior
We need to verify that the basic workflows for inserting data are working. You should be able to do this step-by-step and have the lake and tables working as expected: `pdr lake update` should just work. A quick sanity check is sketched below.
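As a sketch of such a sanity check (assuming the lake lives in a local `lake.db` DuckDB file): after `pdr lake update`, the expected tables should exist and each should contain rows.

```python
import duckdb

con = duckdb.connect("lake.db")

# List the tables the update step actually created.
tables = [row[0] for row in con.execute("SHOW TABLES").fetchall()]
print("tables:", tables)

# Every table produced by the update should be non-empty.
for t in tables:
    n = con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
    print(f"{t}: {n} rows")
    assert n > 0, f"{t} is empty after lake update"
```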
Core Components - Raw Table
Core Components - ETL Table