
[Lake][DuckDB] Verify lake functionality and behavior is working as expected #1000

Closed · 34 tasks done
idiom-bytes opened this issue May 6, 2024 · 8 comments
Labels: Type: Enhancement (New feature or request)

idiom-bytes commented May 6, 2024

Motivation

To verify that

  • The lake is working reliably, as expected
  • The basic flows and tables are working reliably, as expected
  • The commands and lake behavior are working as expected

We need to improve basic reliability and stability of the lake.
The basic duckdb behavior needs to be working as expected.

We should verify things are working by keeping it simple and focusing on the bronze_pdr_predictions table.
I am recommending that we ignore pdr-subscriptions, pdr-slots, and possibly other tables so we can validate that the lake is behaving as expected.


Verification - Inserting data into the lake and manipulating it

When you first start interacting with the lake, there will be a large fetch/update step that will try to build everything into the lake. As these records are processed, we begin inserting them into our DB.

  1. User should edit ppss so that st_ts and end_ts define their lake start & end time.
  2. User should run the lake update command to start fetching data and fill the whole lake.
  3. At any point, the user should be able to pause/cancel and resume w/o any errors.
  4. Show the user how to set end_ts to "Now" so the lake continues to update forever.

Once the lake is built, it's very likely that many records will have null entries as they are initially inserted into the database. We are not worried about this for the moment.
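
Once that initial build has run, a quick sanity check is to open the DuckDB file directly and look at row counts and timestamp ranges per raw table. A minimal sketch using the duckdb Python client; the database path is a placeholder, and the table names are the ones discussed in this issue.

import duckdb

# Placeholder path -- point this at the lake's actual DuckDB file.
con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)

for table in ["pdr_predictions", "pdr_payouts", "pdr_subscriptions"]:
    n_rows, min_ts, max_ts = con.execute(
        f"SELECT COUNT(*), MIN(timestamp), MAX(timestamp) FROM {table}"
    ).fetchone()
    print(f"{table}: {n_rows} rows, timestamps {min_ts} -> {max_ts}")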

Test - Cutting off the lake (dropping)

Let's first consider how our lake works.
A certain amount of data and events arrive that need to be processed. Each time we do a run, we update a certain number of records.

        Run 1   Run 2   Run 3
Time    1:00    2:00    3:00

Let's say we wanted to drop everything since Run 1. We would call our CLI drop command and get rid of that data.

pdr lake drop 10000001 my_ppss.yaml sapphire-mainnet

This might be the equivalent of dropping all records from Run 1 -> End,
which would include the data from [Run 2, Run 3].

The user would continue updating the lake by calling pdr lake update..., which would refetch and rebuild [Run 2, Run 3], getting the system up-to-date, and then continue on from there.
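
The CLI owns the actual drop logic, but conceptually the effect is a timestamp-bounded delete across the lake tables. A rough sketch of that equivalent effect (the path and table list are assumptions; the cutoff is the one passed to the command above):

import duckdb

CUTOFF_TS = 10000001  # the cutoff passed to `pdr lake drop` above

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db")  # placeholder path
# Assumption: the drop applies to raw and bronze tables alike
# (see the DoD item about the drop command further below).
for table in ["pdr_predictions", "pdr_payouts", "pdr_subscriptions",
              "bronze_pdr_predictions", "bronze_pdr_slots"]:
    con.execute(f"DELETE FROM {table} WHERE timestamp >= ?", [CUTOFF_TS])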

Verifying

By dropping/cutting off part of the lake, all tables should end up with the same data cut-off (rows dropped), like below, such that the data pipeline can resume from there and all tables can be updated/resumed from the same "height".

WITH 
max_pdr_predictions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_predictions
),
max_pdr_payouts AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_payouts
),
max_pdr_subscriptions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_subscriptions
),
max_bronze_predictions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM bronze_pdr_predictions
),
max_bronze_slots AS (
  SELECT MAX(timestamp) AS max_timestamp FROM bronze_pdr_slots
)
SELECT 
  'pdr_predictions' AS table_name, max_pdr_predictions.max_timestamp
FROM max_pdr_predictions
UNION ALL
SELECT 
  'pdr_payouts' AS table_name, max_pdr_payouts.max_timestamp
FROM max_pdr_payouts
UNION ALL
SELECT 
  'pdr_subscriptions' AS table_name, max_pdr_subscriptions.max_timestamp
FROM max_pdr_subscriptions
UNION ALL
SELECT 
  'bronze_pdr_predictions' AS table_name, max_bronze_predictions.max_timestamp
FROM max_bronze_predictions
UNION ALL
SELECT 
  'bronze_pdr_slots' AS table_name, max_bronze_slots.max_timestamp
FROM max_bronze_slots
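
To turn the check above into something we can run repeatedly, the same per-table MAX(timestamp) comparison can be scripted. A sketch only, equivalent in spirit to the query above; the path and cutoff are placeholders to adjust:

import duckdb

CUTOFF_TS = 10000001  # same cutoff used for `pdr lake drop`
TABLES = ["pdr_predictions", "pdr_payouts", "pdr_subscriptions",
          "bronze_pdr_predictions", "bronze_pdr_slots"]

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)  # placeholder path
for table in TABLES:
    max_ts = con.execute(f"SELECT MAX(timestamp) FROM {table}").fetchone()[0]
    # After a drop, no table should still hold rows at or past the cutoff.
    assert max_ts is None or max_ts < CUTOFF_TS, f"{table} has rows past the cutoff: {max_ts}"
    print(f"{table}: max timestamp = {max_ts}")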

DoD

  • It's hard to break the lake
  • The basic lake flow is working end-to-end as expected
  • Raw and Bronze tables are updating as expected
  • Data is fetching, yielding, pausing, resuming, as expected
  • You are able to manipulate the lake to get the results you expect
  • bronze-predictions table has null records in it and that's ok
  • other tables (bronze-slots) and queries (update bronze-predictions) are disabled for the moment
  • update the drop command to not use the raw or etl filter, and to remove data from all the tables

Testing Data Pipeline Behavior

We need to verify that the basic workflows for inserting data are working. You should be able to do this step-by-step and have the lake and tables working, as expected.

  • Changing data feeds (BTC/ETH 5m or 1h) should not impact how we fetch data from subgraph for lake. Lake fetches everything from subgraph, filtering is done at the end.
  • Pausing/resuming the pipeline shouldn't cause any issues. Resuming the pipeline with lake update should just work.
  • Setting up st_ts (2024-04-01) and end_ts (now) should cause all data to be updated correctly
  • Changing st_ts (2024-04-01) to earlier should not cause data to backfill
  • Changing end_ts (2024-05-01) to earlier than max(timestamp) should not cause data to drop or anything drastic. Build command will just stop doing anything (the lake is considered full/complete).
  • All data should be fetched correctly and we should have all basic tables up-to-date.
  • pdr-slots, pdr-subscriptions, and other tables should be removed from the main etl-flow for now
  • There will be null records inside of pdr-predictions and pdr-slots
  • If we drop lake tables, we should also drop etl tables for the same st_ts -> end_ts
  • If we use the drop CLI command, all tables should be updated but the CSVs should remain intact (see the sketch after this list)
  • The only way to fetch retroactively is to change ppss.yaml and drop the raw tables.
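
The "CSVs remain intact" expectation can be spot-checked by comparing the row count in the CSV files against the row count left in the corresponding table after a drop. A sketch with a hypothetical CSV glob; adjust it to wherever the lake actually writes its CSVs:

import duckdb

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)  # placeholder path

# Hypothetical CSV location -- adjust the glob to the lake's actual CSV directory.
csv_rows = con.execute(
    "SELECT COUNT(*) FROM read_csv_auto('lake_data/pdr_predictions/*.csv')"
).fetchone()[0]
db_rows = con.execute("SELECT COUNT(*) FROM pdr_predictions").fetchone()[0]

# After `pdr lake drop`, the table should have fewer rows while the CSVs keep them all.
print(f"csv rows: {csv_rows}, db rows after drop: {db_rows}")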

Core Components - Raw Table

  • checkpoint is identifying the right places for st_ts and end_ts
  • you can stop/cancel/resume/pause, and things resume correctly and reliably
  • the tables and records are being filled/appended correctly
  • inserting to duckdb starts after all of GQL + CSV has ended
  • inserting to duckdb cannot start/end part-way
  • raw tables are updated correctly
  • there are no duplicates in the raw table (see the duplicate-check sketch below)
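
A simple way to check the no-duplicates property is to group on what should be a unique key and flag anything that appears more than once. Sketch below; the ID column is an assumption, so substitute the raw table's real unique key:

import duckdb

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)  # placeholder path

# Assumption: each raw row carries a unique ID column; swap in the real key column(s).
dupes = con.execute(
    """
    SELECT ID, COUNT(*) AS n
    FROM pdr_predictions
    GROUP BY ID
    HAVING COUNT(*) > 1
    """
).fetchall()
assert not dupes, f"found {len(dupes)} duplicated IDs in pdr_predictions"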

Core Components - ETL Table

  • checkpoint is identifying the right places for st_ts and end_ts #1046
  • you can stop/cancel/resume/pause, and things resume correctly and reliably
  • the tables and records are being filled/appended correctly
  • inserting to bronze tables starts after all of GQL + CSV + Raw Tables has ended
  • inserting to bronze cannot start/end part-way
  • bronze predictions table is updated correctly
  • there are no gaps in the bronze_pdr_predictions table (see the gap-check sketch after this list)
  • there are no duplicates in the bronze table
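
For the no-gaps check on bronze_pdr_predictions, one option is a window-function pass that compares each slot to the previous slot per feed. Sketch only; the contract/slot column names and the 300-second (5m) slot spacing are assumptions to adjust to the real schema:

import duckdb

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db", read_only=True)  # placeholder path

# Assumption: bronze_pdr_predictions has `contract` and `slot` columns, and slots are
# spaced 300 seconds apart for 5m feeds.
gaps = con.execute(
    """
    SELECT contract, prev_slot, slot
    FROM (
        SELECT contract, slot,
               LAG(slot) OVER (PARTITION BY contract ORDER BY slot) AS prev_slot
        FROM bronze_pdr_predictions
    ) AS t
    WHERE prev_slot IS NOT NULL AND slot - prev_slot > 300
    """
).fetchall()
print(f"{len(gaps)} suspicious gaps found")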

trentmc commented May 7, 2024

Issue 1000! :)

@KatunaNorbert

pdr-slots, pdr-subscriptions, and other tables should be removed from the main etl-flow for now

Can we keep these tables so we have all the raw tables working? I don't see how they could slow us down.

@idiom-bytes

@KatunaNorbert they have been slowing us down in the testing, iteration, and many other things.

Objective Before: We implemented them because we wanted to move many things in parallel.

Objective Now: We want to pause them now so we can verify things in-order.


idiom-bytes commented May 13, 2024

checkpoint is identifying the right places for st_ts and end_ts

Yes, I have reviewed the code end-to-end.

  • It looks to correctly calculate where to start and end
  • It looks at the CSV data to figure out where to resume from
  • It does not look to be creating any gaps
  • It looks like it's fetching all the way to the end

[Fetching GQL data from the right place]

  • it gets each table and figures out where it should start from the last CSV record
  • identifies the last_timestamp in the csv file
  • now it can resume from where it left off

[Preloading from CSV for SQL]

  • now it compares the CSV last_timestamp and the DB last_timestamp
  • if the CSV last_timestamp is greater than the DB last_timestamp, it will dump from CSV => DB temp table
  • now the temp-table is preloaded with everything it will need
  • now it starts fetching from GQL + adding to CSV & table
  • if the loop fails, then _prepare_temp_table() should fill the table w/ whatever records are needed before fetching more
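
To make the preload step concrete, here is a rough sketch of what the CSV => temp-table backfill amounts to, using pdr_predictions as an example. The CSV glob and the _temp_pdr_predictions name are assumptions; in the real code this logic lives inside _prepare_temp_table():

import duckdb

con = duckdb.connect("lake_data/duckdb_pdr_analytics.db")  # placeholder path

db_last = con.execute(
    "SELECT MAX(timestamp) FROM pdr_predictions"
).fetchone()[0] or 0
csv_last = con.execute(
    "SELECT MAX(timestamp) FROM read_csv_auto('lake_data/pdr_predictions/*.csv')"
).fetchone()[0] or 0

if csv_last > db_last:
    # Copy only the rows the DB is missing into the temp table, so that GQL fetching
    # can resume from csv_last without leaving a gap in the data.
    con.execute(
        """
        INSERT INTO _temp_pdr_predictions
        SELECT * FROM read_csv_auto('lake_data/pdr_predictions/*.csv')
        WHERE timestamp > ?
        """,
        [db_last],
    )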

[Fetch all the way to the end]

  • Now SQL + raw data should fetch to the very end ppss.lake_ss.end_ts
  • Once all the data is fetched and inside _temp_tables, the data is moved to _live_tables

- [x] you can stop/cancel/resume/pause, and things resume correctly and reliably

Yes, I have reviewed the code end-to-end.

  • It looks to correctly calculate where to start and end
  • It looks at the CSV data to figure out where to resume fetching from
  • It copies data from CSV -> temp table before fetching new data, so the new data is ordered correctly
  • It does not look to be creating any gaps
  • It looks like it's fetching all the way to the end
  • If Run 1 Table A completed, Run 2 Table A will pick up a few extra records now that a bit of time has passed
  • Table A finishes updating before Table B starts
  • Table A resumes correctly, before Table B starts

- [x] the tables and records are being filled/appended correctly

  • I have reviewed Table A -> Table B -> Table C (failure/resuming/cancel/pausing/etc.) many times, and it's all working reliably and accurately

  • I have observed the log output to verify completeness and accuracy many times

  • there are no gaps in the csv files
    I haven't quite stress-tested this, but I see that things are starting/resuming correctly and believe it to be working as expected.

  • inserting to duckdb starts after all of GQL + CSV has ended
    It does... and data is also inserted before GQL + CSV resumes, such that there are no gaps in the data. _prepare_temp_table() does a great job of backfilling the data before GQL + CSV resumes fetching.

All the data from GQL is updated to temp_tables, and the whole job needs to complete successfully before rows are added to duckdb.

I believe this is working correctly

  • inserting to duckdb cannot start/end part-way
    Yes, just like above.

  • raw tables are updated correctly

  • there are no gaps in the raw table

I believe both of these to be correct


KatunaNorbert commented May 14, 2024

Issues:

  • raw update flow crashes at the fetching-slots step. Due to this, the data is not moved from temp tables to production tables because the flow is not completed #1036


  • If all the data gets deleted from the production raw table and there is data in the CSVs, then the update process breaks with an error #1038


  • THIS ONE WILL BE HANDLED LATER - If some CSV files or rows from CSV files get deleted, those values will be refetched and inserted into the corresponding raw production table regardless of whether the data already exists in the table, ending up with duplicated data #1042


KatunaNorbert commented May 20, 2024

Fetching the data on the sapphire testnet is not working due to a subgraph issue on the payout data query side, which is described in this issue: #768

@idiom-bytes

Updates in the latest PR are working well
#1077

Basically, tables are starting + ending at the same time, reliably across all 4 initial tables (predictions, truevals, payouts, and bronze_predictions). The number of rows/records looks correct too.

@idiom-bytes

I created tickets where we discovered functionality is missing, and I am closing this ticket as we have been able to harden the lake end-to-end and the core objectives of this ticket have been achieved.
