Remove rows from production tables if data is missing from CSVs #1044

Conversation

KatunaNorbert (Member):

Fixes #1042.

@KatunaNorbert marked this pull request as ready for review on May 15, 2024 10:30
PersistentDataStore(table.base_path), "etl", csv_last_timestamp
)
return

Contributor:

I suggest the following implementation (I'm writing it here, so it may contain syntax errors):

        db_max_timestamp = db_last_timestamp['max("timestamp")'][0]
        if db_max_timestamp is not None and (
            csv_last_timestamp is None or csv_last_timestamp < db_max_timestamp
        ):
            # The CSVs end before the DB table does: delete the DB rows (and the
            # downstream "etl" tables) that are newer than the last CSV timestamp.
            target_timestamp = 0 if csv_last_timestamp is None else csv_last_timestamp

            PersistentDataStore(table.base_path).execute_sql(
                f"DELETE FROM {table_name} WHERE timestamp >= {target_timestamp}"
            )
            drop_tables_from_st(
                PersistentDataStore(table.base_path), "etl", target_timestamp
            )
            return

Member:

To solve the problem, you added "magic" to GQLDF which it shouldn't have.

The GQLDF should not have any knowledge of the ETL.
The GQLDF should not be smart about downstream tables/workflows, or try to handle them.

Jobs should be self-contained and only be concerned with going from A->B: take something from [A] -> Do Stuff -> [B]

  1. As I described in our meeting and the ticket, GQLDF shouldn't have any knowledge of the ETL or other downstream systems.

When it comes to managing/deleting CSVs:

  1. We specifically didn't add interfaces to delete/manage the CSV records and made the code defensive so CSVs aren't messed with.
  2. If someone breaks stuff, there are clear protocols for getting things back up and running fast.
  3. Please do not add extra stuff that was not meant to be supported.
  4. If a user deletes random CSVs and wants help, there are ways to get things back up and running... but we should not solve these edge cases by adding code to GQLDF.

Closing this PR.

@idiom-bytes (Member) commented May 15, 2024:

The way to solve the original issue is to clamp the GQLDF output so it doesn't write records to DuckDB Raw Tables that already exist.

Jobs should be self-contained and only be concerned about going from A->B.
It takes an input [A] -> Does Stuff -> generates an output [B]

GQLDF shouldn't have any knowledge of ETL.
We have built commands that enforce this.
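
To make the "clamp" idea concrete, here is a minimal sketch of what that filtering step could look like. This is not the actual pdr-backend code: the helper name _clamp_to_new_records, the use of polars, and the "timestamp" column are assumptions for illustration only.

    from typing import Optional

    import polars as pl

    def _clamp_to_new_records(
        fetched_df: pl.DataFrame, db_max_timestamp: Optional[int]
    ) -> pl.DataFrame:
        """Keep only fetched rows strictly newer than what the raw table holds."""
        if db_max_timestamp is None:
            # Raw table is empty, so every fetched record is new
            return fetched_df
        return fetched_df.filter(pl.col("timestamp") > db_max_timestamp)

    # Usage: the raw table already ends at t=300, so only t=400/500 get appended
    fetched_df = pl.DataFrame({"timestamp": [100, 200, 300, 400, 500]})
    new_rows = _clamp_to_new_records(fetched_df, db_max_timestamp=300)
    print(new_rows["timestamp"].to_list())  # [400, 500]

With a clamp like this applied before the insert, GQLDF stays a self-contained [A] -> [B] job and never needs to delete or repair rows in downstream tables.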
