Remove rows from production tables if data is missing from CSVs #1044

Conversation

KatunaNorbert (Member):

Fixes #1042.

@KatunaNorbert marked this pull request as ready for review on May 15, 2024 10:30
PersistentDataStore(table.base_path), "etl", csv_last_timestamp
)
return

Contributor:

I suggest the following implementation (I'm writing it here, so it may contain syntax errors):

        db_max_timestamp = db_last_timestamp['max("timestamp")'][0]
        if db_max_timestamp is not None and (
            csv_last_timestamp is None or csv_last_timestamp < db_max_timestamp
        ):
            # The CSVs end before the DB table does: delete the DB rows (and the
            # downstream "etl" tables) that are newer than the last CSV timestamp.
            target_timestamp = 0 if csv_last_timestamp is None else csv_last_timestamp

            PersistentDataStore(table.base_path).execute_sql(
                f"DELETE FROM {table_name} WHERE timestamp >= {target_timestamp}"
            )
            drop_tables_from_st(
                PersistentDataStore(table.base_path), "etl", target_timestamp
            )
            return

Member:

To solve the problem, you added "magic" to GQLDF which it shouldn't have.

The GQLDF should not have any knowledge of the ETL.
The GQLDF should not be smart about downstream tables/workflows, or try to handle them.

Jobs should be self-contained and only be concerned with going from A->B: take something from [A] -> Do Stuff -> [B]

  1. As I described in our meeting and the ticket, GQLDF shouldn't have any knowledge of the ETL or other downstream systems.

When it comes to managing/deleting CSVs:

  1. We specifically didn't add interfaces to delete/manage the CSV records and made the code defensive so CSVs aren't messed with.
  2. If someone breaks stuff, there are clear protocols for getting things back up and running fast.
  3. Please do not add extra stuff that was not meant to be supported.
  4. If a user deletes random CSVs and wants help, there are ways to get things back up and running... but we should not solve these edge cases by adding code to GQLDF.

Closing this PR.

@idiom-bytes (Member) commented May 15, 2024:

The way to solve the original issue is to clamp the GQLDF output so it doesn't write records to DuckDB Raw Tables that already exist.

Jobs should be self-contained and only be concerned about going from A->B.
It takes an input [A] -> Does Stuff -> generates an output [B]

GQLDF shouldn't have any knowledge of ETL.
We have built commands that enforce this.
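
To make the "clamp" idea concrete, here is a minimal sketch of what that filtering step could look like. This is not the actual pdr-backend code: the helper name _clamp_to_new_records, the use of polars, and the "timestamp" column are assumptions for illustration only.

    from typing import Optional

    import polars as pl

    def _clamp_to_new_records(
        fetched_df: pl.DataFrame, db_max_timestamp: Optional[int]
    ) -> pl.DataFrame:
        """Keep only fetched rows strictly newer than what the raw table holds."""
        if db_max_timestamp is None:
            # Raw table is empty, so every fetched record is new
            return fetched_df
        return fetched_df.filter(pl.col("timestamp") > db_max_timestamp)

    # Usage: the raw table already ends at t=300, so only t=400/500 get appended
    fetched_df = pl.DataFrame({"timestamp": [100, 200, 300, 400, 500]})
    new_rows = _clamp_to_new_records(fetched_df, db_max_timestamp=300)
    print(new_rows["timestamp"].to_list())  # [400, 500]

With a clamp like this applied before the insert, GQLDF stays a self-contained [A] -> [B] job and never needs to delete or repair rows in downstream tables.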
