
Fix #982: active checkpoint system #993

Closed
wants to merge 13 commits into from

Conversation

@kdetry (Contributor) commented May 5, 2024

Fixes #982

Changes proposed in this PR:

  • the move query is changed
  • UNIQUE constraints are added
  • the start-end timestamp logic is changed for the ETL and GQLDF

@kdetry requested a review from idiom-bytes May 6, 2024 20:23
@KatunaNorbert (Member)

Deleted lake_data and ran the ETL, stopped it at the subscriptions fetching step, and ran into the following error when running it again:
[screenshot: Screenshot 2024-05-07 at 15 08 59]

@KatunaNorbert (Member)

I think I see what is happening: if the gql update is stopped, the temp table and csv are updated but the live table is not created. Then, when restarting the process, we take the start date from the live table, but that doesn't exist, so we take it from lake_ss. However, there is already data inside the csv and temp table, and the insert then gives an error.

@KatunaNorbert (Member)

A fix could be to add a try/except around self._do_subgraph_fetch so that whenever the process is stopped we catch the error and move the existing tables to production. This way the production tables reflect what's inside the temp tables and there are no conflicts.
I tested it and it seems to be working; I can push the changes if you agree with this approach.
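A minimal sketch of the proposed wrapper; _do_subgraph_fetch and the temp-to-live move are the names from this PR, while the standalone function, its arguments and the logger setup are purely illustrative:

    import logging

    logger = logging.getLogger(__name__)

    def fetch_then_promote(do_subgraph_fetch, move_from_temp_tables_to_live):
        """Wrap the fetch step; if it is interrupted, still promote the temp tables."""
        try:
            # fetch new events from the subgraph into the temp tables + csv
            do_subgraph_fetch()
        except Exception as e:
            # the fetch was stopped mid-way: move whatever already landed in the
            # temp tables to the live tables, so a restart does not hit
            # duplicate-insert errors
            logger.error("subgraph fetch interrupted: %s", e)
            move_from_temp_tables_to_live()

In practice both arguments would be the factory's bound methods (self._do_subgraph_fetch and the existing temp-to-live move).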

@idiom-bytes (Member) commented May 8, 2024

> A fix could be to add a try/except around self._do_subgraph_fetch so that whenever the process is stopped we catch the error and move the existing tables to production. This way the production tables reflect what's inside the temp tables and there are no conflicts.

We do not want to update the live table if there was an error...

We can simply break this down into 2 processes:

  1. Update data to local/csv
  2. Load from CSV to DuckDB

What should happen here is...

  1. GQL Data Factory should resume fetching from where it left off.
  2. When all fetching is completed, all data (from st_ts => to end_ts) is loaded from CSV onto the DuckDB tables.
  3. When all of this completes end-to-end the GQLDF is up-to-date.
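A rough sketch of step 2; only the csv-then-DuckDB ordering and the st_ts/end_ts window come from this thread, while the function, its signature and the "timestamp" column name are assumptions:

    import duckdb
    import polars as pl

    def load_csv_into_duckdb(csv_path: str, table_name: str, db_path: str,
                             st_ts: int, end_ts: int) -> None:
        """Step 2: after fetching has fully completed, load every row in
        [st_ts, end_ts] from the csv into the DuckDB table in one pass."""
        df = pl.read_csv(csv_path).filter(
            (pl.col("timestamp") >= st_ts) & (pl.col("timestamp") <= end_ts)
        )
        con = duckdb.connect(db_path)
        con.register("new_rows", df.to_arrow())  # expose the frame to SQL
        # create the table from the frame's schema if it is not there yet
        con.execute(
            f"CREATE TABLE IF NOT EXISTS {table_name} AS SELECT * FROM new_rows LIMIT 0"
        )
        con.execute(f"INSERT INTO {table_name} SELECT * FROM new_rows")
        con.close()

Step 1 (fetching) would only ever append to the csvs, so an interrupted run can simply resume fetching and re-run step 2 later.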

@@ -53,6 +88,29 @@ def create_table_if_not_exists(self, table_name: str, schema: SchemaDict):
empty_df = pl.DataFrame([], schema=schema)
self._create_and_fill_table(empty_df, table_name)

def _create_sql_table_schema(self, schema: SchemaDict) -> str:
Member

why does this have to be created now, as part of this PR?

        # create the permanent table
        self.execute_sql(
            f"CREATE TABLE {permanent_table_name} ({sql_table_schema})"
        )
Member

can we put all of this into a function?

Contributor Author

for column in temp_table_columns
if column != "ID"
]
)
Member

I don't think this is correct...

  1. I don't think the pds should handle the conflict. I.e. it should just handle merging from one to the other.
  2. You are trying to handle stuff ahead of time.

What should happen right now....

  1. Only new events are being processed... there should be no conflicts.
  2. There will be gaps in the data ("null points"); do not try to join/fix these at the moment.

What will happen in the future...

  1. Create SQL queries that update existing records

Do not try to do this right now...

f"INSERT INTO {permanent_table_name} SELECT * FROM {temp_table.fullname}"
f"""INSERT INTO {permanent_table_name}
SELECT * FROM {temp_table.fullname}
ON CONFLICT (ID) DO UPDATE SET {on_conflict_columns}"""
Member

Same here...

# can be added based on specific needs and compatibility.
}

def _get_sql_column_type(self, column_type) -> str:
Member

why does this have to be created now, as part of this PR?

Contributor Author

This logic was created because we were using auto-casting for datatypes in the database. However, we now require at least one unique column, and making a column unique by altering it does not work. Therefore, we must specify it while creating the table. As a solution, I created this logic to cast column types manually.

I am open to different suggestions.
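Roughly, the idea is the following; this is only a sketch of casting Polars dtypes to DuckDB column types so that UNIQUE (ID) can be declared at creation time, and the exact mapping/names in the PR may differ:

    import polars as pl

    # illustrative subset of a Polars-dtype -> DuckDB-type mapping
    _SQL_TYPES = [
        (pl.Utf8, "VARCHAR"),
        (pl.Int64, "BIGINT"),
        (pl.Float64, "DOUBLE"),
        (pl.Boolean, "BOOLEAN"),
    ]

    def _sql_type(dtype) -> str:
        for pl_dtype, sql_type in _SQL_TYPES:
            if dtype == pl_dtype:
                return sql_type
        return "VARCHAR"  # fall back for anything not listed

    def create_table_sql(table_name: str, schema: dict) -> str:
        """Build CREATE TABLE with the UNIQUE constraint on ID declared up front,
        since adding it afterwards via ALTER does not work."""
        cols = ", ".join(f"{name} {_sql_type(dtype)}" for name, dtype in schema.items())
        return f"CREATE TABLE {table_name} ({cols}, UNIQUE (ID))"

For example, create_table_sql("pdr_predictions", {"ID": pl.Utf8, "timestamp": pl.Int64}) produces a statement DuckDB accepts with ID enforced as unique.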

Member

Can we just use the id column? It should be unique, and every object has one at the moment.

@KatunaNorbert (Member)

> We do not want to update the live table if there was an error...
>
>   1. GQL Data Factory should resume fetching from where it left off.
>   2. When all fetching is completed, all data (from st_ts => to end_ts) is loaded from CSV onto the DuckDB tables.
>   3. When all of this completes end-to-end the GQLDF is up-to-date.

I agree with this. If we move to production and there is missing data, then we can't trust the production tables.

@kdetry (Contributor Author) commented May 8, 2024

I have reverted the UNIQUE feature on PDS, @idiom-bytes.

@idiom-bytes (Member) commented May 8, 2024

I believe what we did at the Data Fetching level is get st_ts and end_ts, then:

  1. Save new records to CSV + SQL raw tables
  2. Process new raw predictions rows => temp bronze prediction rows
  3. Move rows from temp tables => to live tables
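As a sketch of step 3 only; connection handling is simplified versus PersistentDataStore, and dropping the temp table afterwards is an assumption:

    import duckdb

    def move_temp_to_live(db_path: str, temp_table: str, live_table: str) -> None:
        """Step 3: promote a temp table to its live counterpart."""
        con = duckdb.connect(db_path)
        # only new events should be in the temp table, so a plain
        # INSERT ... SELECT is enough (no conflict handling); this assumes
        # the live table was already created with the same schema
        con.execute(f"INSERT INTO {live_table} SELECT * FROM {temp_table}")
        con.execute(f"DROP TABLE {temp_table}")
        con.close()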

            )
        except Exception as e:
            self._move_from_temp_tables_to_live()
            logger.error("Error on fetching data from %s: %s", table.table_name, e)
Member

I don't think this is correct

Member

Removed that part after adding it; realised we don't want this.

@KatunaNorbert (Member)

Tested it and identified an issue:
When running the gql update to fetch the data, then stopping it along the way (for example after fetching predictions, slots and trueval), then running it again, it fetches all the data again. I think it's looking at the live pdr_predictions table, which is only created at the end of the etl update; that part is fine, but if the fetching process is stopped we should take the existing data from the csvs and only fetch what is left, instead of fetching everything.
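Something along these lines, as a sketch only; the lake_dir/<table>/*.csv layout, the "timestamp" column and the fallback to the lake start date are assumptions, not the PR's actual helpers:

    import glob
    import os
    import polars as pl

    def resume_ts_from_csvs(lake_dir: str, table_name: str, lake_start_ts: int) -> int:
        """Return the timestamp to resume fetching from: the newest timestamp
        already saved to this table's csv files, else the lake start date."""
        paths = glob.glob(os.path.join(lake_dir, table_name, "*.csv"))
        saved = [pl.read_csv(p).get_column("timestamp").max() for p in paths]
        saved = [ts for ts in saved if ts is not None]  # skip empty csv files
        if not saved:
            return lake_start_ts
        return max(max(saved), lake_start_ts)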

@KatunaNorbert (Member) commented May 10, 2024

The resume part looks fine now, but when stopping the fetch-predictions process and restarting it at timestamp_x, that timestamp is going to be used for all the other tables instead of the start date from the lake.
[screenshot: Screenshot 2024-05-10 at 11 38 39]

@kdetry (Contributor Author) commented May 10, 2024

> The resume part looks fine now, but when stopping the fetch-predictions process and restarting it at timestamp_x, that timestamp is going to be used for all the other tables instead of the start date from the lake.

I tried to explain this situation while we were discussing the logic: we shouldn't rely on the pdr_predictions table for GQLDF; all the other tables should manage their own situation. Only the ETL side should look at pdr_predictions, and only if the bronze_pdr_predictions table does not have a record.

@idiom-bytes (Member) commented May 13, 2024

@kdetry @KatunaNorbert you tagged the wrong ticket... #982 is for incremental updates, which is not what we're doing.

> the start-end timestamp logic is changed for the ETL and GQLDF

Please read the readme, the epic, and ticket #1000.
Incremental updates are for AFTER we get the first build ready.

The only thing that we care about right now is putting raw-records into GQLDF.
ETL is not part of #1000.

csvds = CSVDataStore(ppss.lake_ss.lake_dir)
# Add some predictions
pds.drop_table(get_table_name("pdr_predictions", TableType.TEMP))
csvds.delete("pdr_predictions")
Member

I don't get this...

  1. pytest creates a different tmpdir for every test
  2. you don't need to delete anything, there is nothing there
  3. there is now a function to delete csvs which is only used by this test

@idiom-bytes (Member)

We've discussed the issues here, reviewed the implementations, and agreed that:

  1. Much of the work being done originally, and the order of events, were indeed correct.
  2. We need to look across all tables to get the right timestamp; we can't just use the predictions table (a sketch of this follows below).
  3. There were a lot of wrong assumptions in this implementation. For example, GQLDF SHOULD look at the CSV records for where to resume, not the DuckDB tables.
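Referring to point 2, a minimal sketch of looking across all tables, under the same per-table csv layout and "timestamp" column assumptions as the earlier sketch (the real checkpoint logic will live in GQLDF):

    import glob
    import os
    import polars as pl

    def overall_resume_ts(lake_dir: str, table_names: list, lake_start_ts: int) -> int:
        """Resume from the OLDEST per-table csv checkpoint so no table is left
        with a gap; a table with no csv data forces a restart from lake start."""
        checkpoints = []
        for name in table_names:
            paths = glob.glob(os.path.join(lake_dir, name, "*.csv"))
            saved = [pl.read_csv(p).get_column("timestamp").max() for p in paths]
            saved = [ts for ts in saved if ts is not None]
            if not saved:
                return lake_start_ts
            checkpoints.append(max(saved))
        return min(checkpoints) if checkpoints else lake_start_ts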

Closing this PR
