
[Lake][DuckDB] Verify lake functionality and behavior is working as expected #1000

Closed · 34 tasks done
idiom-bytes opened this issue May 6, 2024 · 8 comments
Labels: Type: Enhancement (New feature or request)

idiom-bytes commented May 6, 2024

Motivation

To verify that

  • The lake is working reliably, as expected
  • The basic flows and tables are working reliably, as expected
  • The commands and lake behavior are working as expected

We need to improve basic reliability and stability of the lake.
The basic duckdb behavior needs to be working as expected.

We should verify things are working by keeping it simple and focusing on the bronze_pdr_predictions table.
I am recommending that we ignore pdr-subscriptions, pdr-slots, and possibly other tables so we can validate that the lake is behaving as expected.


Verification - Inserting data into the lake and manipulating it

When you first start interacting with the lake, there will be a large fetch/update step that will try to build everything into the lake. As these records are processed, we begin inserting them into our DB.

  1. User should edit ppss so that st_ts and end_ts define their lake start & end time.
  2. User should run the lake update command to start fetching data and fill the whole lake.
  3. At any point, the user should be able to pause/cancel and resume w/o any errors.
  4. Show the user how to set end_ts to "Now" so the lake continues to update forever.

Once the lake is built, it's very likely that many records will have null entries as they are initially inserted into the database. We are not worried about this for the moment.
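
Once that initial build has run, a quick sanity check is to open the DuckDB file directly and look at row counts and timestamp ranges per raw table. A minimal sketch using the duckdb Python client; the database path is a placeholder, and the table names are the ones discussed in this issue.

import duckdb

# Placeholder path -- point this at the lake's actual DuckDB file.
con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)

for table in ["pdr_predictions", "pdr_payouts", "pdr_subscriptions"]:
    n_rows, min_ts, max_ts = con.execute(
        f"SELECT COUNT(*), MIN(timestamp), MAX(timestamp) FROM {table}"
    ).fetchone()
    print(f"{table}: {n_rows} rows, timestamps {min_ts} -> {max_ts}")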

Test - Cutting off the lake (dropping)

Let's first consider how our lake works.
A certain amount of data and events arrive that need to be processed. Each time we do a run, we update a certain number of records.

        Run 1   Run 2   Run 3
Time    1:00    2:00    3:00

Let's say we wanted to drop everything since Run 1. We would call our CLI drop command and get rid of that data.

pdr lake drop 10000001 my_ppss.yaml sapphire-mainnet

This might be the equivalent of dropping all records from Run 1 -> End,
which would include the data from [Run 2, Run 3].

The user would continue updating the lake by calling pdr lake update..., which would refetch and rebuild [Run 2, Run 3], getting the system up-to-date, and then continue on from there.
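
The CLI owns the actual drop logic, but conceptually the effect is a timestamp-bounded delete across the lake tables. A rough sketch of that equivalent effect (the path and table list are assumptions; the cutoff is the one passed to the command above):

import duckdb

CUTOFF_TS = 10000001  # the cutoff passed to `pdr lake drop` above

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db")  # placeholder path
# Assumption: the drop applies to raw and bronze tables alike
# (see the DoD item about the drop command further below).
for table in ["pdr_predictions", "pdr_payouts", "pdr_subscriptions",
              "bronze_pdr_predictions", "bronze_pdr_slots"]:
    con.execute(f"DELETE FROM {table} WHERE timestamp >= ?", [CUTOFF_TS])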

Verifying

By dropping/cutting off part of the lake, all tables should end up with the same data cut-off (rows dropped), like below, such that the data pipeline can resume from there and all tables can be updated/resumed from the same "height".

WITH 
max_pdr_predictions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_predictions
),
max_pdr_payouts AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_payouts
),
max_pdr_subscriptions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_subscriptions
),
max_bronze_predictions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM bronze_pdr_predictions
),
max_bronze_slots AS (
  SELECT MAX(timestamp) AS max_timestamp FROM bronze_pdr_slots
)
SELECT 
  'pdr_predictions' AS table_name, max_pdr_predictions.max_timestamp
FROM max_pdr_predictions
UNION ALL
SELECT 
  'pdr_payouts' AS table_name, max_pdr_payouts.max_timestamp
FROM max_pdr_payouts
UNION ALL
SELECT 
  'pdr_subscriptions' AS table_name, max_pdr_subscriptions.max_timestamp
FROM max_pdr_subscriptions
UNION ALL
SELECT 
  'bronze_pdr_predictions' AS table_name, max_bronze_predictions.max_timestamp
FROM max_bronze_predictions
UNION ALL
SELECT 
  'bronze_pdr_slots' AS table_name, max_bronze_slots.max_timestamp
FROM max_bronze_slots
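
To turn the check above into something we can run repeatedly, the same per-table MAX(timestamp) comparison can be scripted. A sketch only, equivalent in spirit to the query above; the path and cutoff are placeholders to adjust:

import duckdb

CUTOFF_TS = 10000001  # same cutoff used for `pdr lake drop`
TABLES = ["pdr_predictions", "pdr_payouts", "pdr_subscriptions",
          "bronze_pdr_predictions", "bronze_pdr_slots"]

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)  # placeholder path
for table in TABLES:
    max_ts = con.execute(f"SELECT MAX(timestamp) FROM {table}").fetchone()[0]
    # After a drop, no table should still hold rows at or past the cutoff.
    assert max_ts is None or max_ts < CUTOFF_TS, f"{table} has rows past the cutoff: {max_ts}"
    print(f"{table}: max timestamp = {max_ts}")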

DoD

  • It's hard to break the lake
  • The basic lake flow is working end-to-end as expected
  • Raw and Bronze tables are updating as expected
  • Data is fetching, yielding, pausing, resuming, as expected
  • You are able to manipulate the lake to get the results you expect
  • bronze-predictions table has null records in it and that's ok
  • other tables (bronze-slots) and queries (update bronze-predictions) are disabled for the moment
  • update the drop command to not use the raw or etl filter, and to remove data from all the tables

Testing Data Pipeline Behavior

We need to verify that the basic workflows for inserting data are working. You should be able to do this step-by-step and have the lake and tables working, as expected.

  • Changing data feeds (BTC/ETH 5m or 1h) should not impact how we fetch data from subgraph for lake. Lake fetches everything from subgraph, filtering is done at the end.
  • Pausing/resuming the pipeline shouldn't cause any issues. Resuming the pipeline with lake update should just work.
  • Setting up st_ts (2024-04-01) and end_ts (now) should cause all data to be updated correctly
  • Changing st_ts (2024-04-01) to earlier should not cause data to backfill
  • Changing end_ts (2024-05-01) to earlier than max(timestamp) should not cause data to drop or anything drastic. Build command will just stop doing anything (the lake is considered full/complete).
  • All data should be fetched correctly and we should have all basic tables up-to-date.
  • pdr-slots, pdr-subscriptions, and other tables should be removed from the main etl-flow for now
  • There will be null records inside of pdr-predictions and pdr-slots
  • If we drop lake tables, we should also drop etl tables for the same st_ts -> end_ts
  • If we use the drop CLI command, all tables should be updated but the CSVs should remain intact (see the sketch after this list)
  • The only way to fetch retroactively is to change ppss.yaml and drop the raw tables.
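
The "CSVs remain intact" expectation can be spot-checked by comparing the row count in the CSV files against the row count left in the corresponding table after a drop. A sketch with a hypothetical CSV glob; adjust it to wherever the lake actually writes its CSVs:

import duckdb

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)  # placeholder path

# Hypothetical CSV location -- adjust the glob to the lake's actual CSV directory.
csv_rows = con.execute(
    "SELECT COUNT(*) FROM read_csv_auto('lake_data/pdr_predictions/*.csv')"
).fetchone()[0]
db_rows = con.execute("SELECT COUNT(*) FROM pdr_predictions").fetchone()[0]

# After `pdr lake drop`, the table should have fewer rows while the CSVs keep them all.
print(f"csv rows: {csv_rows}, db rows after drop: {db_rows}")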

Core Components - Raw Table

  • checkpoint is identifying the right places for st_ts and end_ts
  • you can stop/cancel/resume/pause, and things resume correctly and reliably
  • the tables and records are being filled/appended correctly
  • inserting to duckdb starts after all of GQL + CSV has ended
  • inserting to duckdb cannot start/end part-way
  • raw tables are updated correctly
  • there are no duplicates in the raw table (see the duplicate-check sketch below)
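
A simple way to check the no-duplicates property is to group on what should be a unique key and flag anything that appears more than once. Sketch below; the ID column is an assumption, so substitute the raw table's real unique key:

import duckdb

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)  # placeholder path

# Assumption: each raw row carries a unique ID column; swap in the real key column(s).
dupes = con.execute(
    """
    SELECT ID, COUNT(*) AS n
    FROM pdr_predictions
    GROUP BY ID
    HAVING COUNT(*) > 1
    """
).fetchall()
assert not dupes, f"found {len(dupes)} duplicated IDs in pdr_predictions"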

Core Components - ETL Table

  • checkpoint is identifying the right places for st_ts and end_ts #1046
  • you can stop/cancel/resume/pause, and things resume correctly and reliably
  • the tables and records are being filled/appended correctly
  • inserting to bronze tables starts after all of GQL + CSV + Raw Tables has ended
  • inserting to bronze cannot start/end part-way
  • bronze predictions table is updated correctly
  • there are no gaps in the bronze_pdr_predictions table (see the gap-check sketch after this list)
  • there are no duplicates in the bronze table
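
For the no-gaps check on bronze_pdr_predictions, one option is a window-function pass that compares each slot to the previous slot per feed. Sketch only; the contract/slot column names and the 300-second (5m) slot spacing are assumptions to adjust to the real schema:

import duckdb

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)  # placeholder path

# Assumption: bronze_pdr_predictions has `contract` and `slot` columns, and slots are
# spaced 300 seconds apart for 5m feeds.
gaps = con.execute(
    """
    SELECT contract, prev_slot, slot
    FROM (
        SELECT contract, slot,
               LAG(slot) OVER (PARTITION BY contract ORDER BY slot) AS prev_slot
        FROM bronze_pdr_predictions
    ) AS t
    WHERE prev_slot IS NOT NULL AND slot - prev_slot > 300
    """
).fetchall()
print(f"{len(gaps)} suspicious gaps found")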

trentmc commented May 7, 2024

Issue 1000! :)

@KatunaNorbert

pdr-slots, pdr-subscriptions, and other tables should be removed from the main etl-flow for now

Can we keep these tables so we have all the raw tables working? I don't see how they could slow us down.

@idiom-bytes

@KatunaNorbert they have been slowing us down in the testing, iteration, and many other things.

Objective Before: We implemented them because we wanted to move many things in parallel.

Objective Now: We want to pause them now so we can verify things in-order.


idiom-bytes commented May 13, 2024

checkpoint is identifying the right places for st_ts and end_ts

Yes, I have reviewed the code end-to-end.

  • It looks to correctly calculate where to start and end
  • It looks at the CSV data to figure out where to resume from
  • It does not look to be creating any gaps
  • It looks like it's fetching all the way to the end

[Fetching GQL data from the right place]

  • it gets each table and figures out where it should start from the last CSV record
  • identifies the last_timestamp in the csv file
  • now it can resume from where it left off

[Preloading from CSV for SQL]

  • now it compares the CSV last_timestamp and the DB last_timestamp
  • if the CSV last_timestamp is greater than the DB last_timestamp, it will dump from CSV => DB temp table
  • now the temp-table is preloaded with everything it will need
  • now it starts fetching from GQL + adding to CSV & table
  • if the loop fails, then _prepare_temp_table() should fill the table w/ whatever records are needed before fetching more
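
To make the preload step concrete, here is a rough sketch of what the CSV => temp-table backfill amounts to, using pdr_predictions as an example. The CSV glob and the _temp_pdr_predictions name are assumptions; in the real code this logic lives inside _prepare_temp_table():

import duckdb

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db")  # placeholder path

db_last = con.execute(
    "SELECT MAX(timestamp) FROM pdr_predictions"
).fetchone()[0] or 0
csv_last = con.execute(
    "SELECT MAX(timestamp) FROM read_csv_auto('lake_data/pdr_predictions/*.csv')"
).fetchone()[0] or 0

if csv_last > db_last:
    # Copy only the rows the DB is missing into the temp table, so that GQL fetching
    # can resume from csv_last without leaving a gap in the data.
    con.execute(
        """
        INSERT INTO _temp_pdr_predictions
        SELECT * FROM read_csv_auto('lake_data/pdr_predictions/*.csv')
        WHERE timestamp > ?
        """,
        [db_last],
    )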

[Fetch all the way to the end]

  • Now SQL + raw data should fetch to the very end ppss.lake_ss.end_ts
  • Once all the data is fetched and inside _temp_tables, the data is moved to _live_tables

- [x] you can stop/cancel/resume/pause, and things resume correctly and reliably

Yes, I have reviewed the code end-to-end.

  • It looks to correctly calculate where to start and end
  • It looks at the CSV data to figure out where to resume fetching from
  • It copies data from CSV -> temp table before fetching new data, so the new data is ordered correctly
  • It does not look to be creating any gaps
  • It looks like it's fetching all the way to the end
  • If Run 1 Table A completed, Run 2 Table A will pick up a few extra records now that a bit of time has passed
  • Table A finishes updating before Table B starts
  • Table A resumes correctly, before Table B starts

- [x] the tables and records are being filled/appended correctly

  • I have reviewed Table A -> Table B -> Table C (failure/resuming/cancel/pausing/etc.) many times, and it's all working reliably and accurately

  • I have observed the log output to verify completeness and accuracy many times

  • there are no gaps in the csv files
    I haven't quite stress-tested this, but I see that things are starting/resuming correctly and believe it to be working as expected.

  • inserting to duckdb starts after all of GQL + CSV has ended
    It does... and data is also inserted before GQL + CSV resumes, such that there are no gaps in the data. _prepare_temp_table() does a great job of backfilling the data before GQL + CSV resumes fetching.

All the data from GQL is updated to temp_tables, and the whole job needs to complete successfully before rows are added to duckdb.

I believe this is working correctly

  • inserting to duckdb cannot start/end part-way
    Yes, just like above.

  • raw tables are updated correctly

  • there are no gaps in the raw table

I believe both of these to be correct


KatunaNorbert commented May 14, 2024

Issues:

  • raw update flow crashes at the fetching-slots step. Due to this, the data is not moved from temp tables to production tables because the flow is not completed #1036


  • If all the data gets deleted from the production raw table and there is data in the CSVs, then the update process breaks with an error #1038


  • THIS ONE WILL BE HANDLED LATER - If some CSV files or rows from CSV files get deleted, those values will be refetched and inserted into the corresponding raw production table regardless of whether the data already exists in the table, ending up with duplicated data #1042


KatunaNorbert commented May 20, 2024

Fetching the data on the sapphire testnet is not working due to a subgraph issue on the payout data query side, which is described in this issue: #768

@idiom-bytes

Updates in the latest PR are working well
#1077

Basically, tables are starting + ending at the same time, reliably across all 4 initial tables (predictions, truevals, payouts, and bronze_predictions). The number of rows/records looks correct too.

@idiom-bytes

I created tickets where we discovered functionality is missing, and I am closing this ticket as we have been able to harden the lake end-to-end and the core objectives of this ticket have been achieved.
