insert_or_update can have problems with multiple "issue"s #1427

Open
melange396 opened this issue May 8, 2024 · 0 comments
Labels
acquisition, bug, data quality, mysql

Comments

@melange396 (Collaborator)

insert_or_update() (which is really both insert_or_update_batch() and insert_or_update_bulk() right now; they should be resolved into one in #989) can have problems when given a set of rows that includes more than one issue. Multiple issue values are not inherently problematic; rather, the desired result may not be achieved when a single batch contains multiple rows for the same source/signal/time/geo that carry different issues.
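For concreteness, here is a minimal sketch of a batch that triggers the problem. The epimetric_load table name and its column list are assumptions based on the schema described in this issue, abbreviated for illustration, not verbatim from the codebase:

```sql
-- Two rows share the same composite key but carry different issues
-- (table and column names are assumed/abbreviated for illustration).
INSERT INTO epimetric_load
  (`source`, `signal`, geo_type, geo_value, time_type, time_value,
   issue, `value`, is_latest_issue)
VALUES
  ('src', 'sig', 'county', '01001', 'day', 20240501, 20240502, 1.0, 1),
  ('src', 'sig', 'county', '01001', 'day', 20240501, 20240503, 2.0, 1);
-- Only the issue=20240503 row should end up with is_latest_issue=1, but
-- fix_is_latest_issue_sql can leave the flag set on both rows.
```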

The problem stems from the fact that the SQL code in fix_is_latest_issue_sql implicitly assumes only one issue will be present for each (source, signal, geo_type, geo_value, time_type, time_value). It also presumes that most inserted rows will be a latest issue, which is why the is_latest_issue column defaults to 1 in insert_into_loader_sql. Note that "overwrites" happen not only when load.issue > latest.issue but ALSO when the two are equal, which lets us update/patch previously-written values that may have been incorrect (the SQL that moves the load table data to the latest and history tables then has an ON DUPLICATE clause that preserves the unique key epimetric_id for the same unique composite key tuple).
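To make the failure mode concrete, here is a hypothetical simplification of what fix_is_latest_issue_sql effectively does; this is a sketch under assumed table names (epimetric_load, epimetric_latest), not the repo's actual statement:

```sql
-- Sketch: clear is_latest_issue on a load row when the latest table already
-- holds a strictly newer issue for the same composite key.
UPDATE epimetric_load ld
  JOIN epimetric_latest lt
    USING (`source`, `signal`, geo_type, geo_value, time_type, time_value)
  SET ld.is_latest_issue = 0
  WHERE ld.issue < lt.issue;
-- Load rows are only compared against the latest table, never against each
-- other, so two load rows with different issues for the same key can both
-- keep is_latest_issue = 1.
```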

In practice, the problematic behavior does not occur, because the only place insert_or_update() is called (in csv_to_database.py:upload_archive()), it is handed rows returned by CsvImporter.load_csv(), which only uses a single issue value at a time. However, we might write something else in the future that does not account for this expected invariant, so we should take steps to resolve it now.

Non-compliant calls to insert_or_update() can still happen in test code via calls to _insert_rows(); this GitHub search provides a list of them.

Possible ways to address this:

  • add a constraint to the load table to enforce that no multiple-issue collisions can happen
    • this can be done simply by removing issue from the table's UNIQUE INDEX
    • this doesn't actually solve the problem, but it does prevent us from suffering the ill effects
  • fix the fix_is_latest_issue_sql calculation to work with multiple issues
    • perhaps by incorporating a GROUP BY
    • perhaps by preceding it with a SQL statement that does its own GROUP BY and sets is_latest_issue=0 for non-maximal issue rows (see the sketch below this list)
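
A minimal sketch of the pre-pass idea from that last bullet, again under assumed table/column names: it zeroes is_latest_issue on every load row whose issue is not the maximum for its key within the load table itself, so the existing fix_is_latest_issue_sql only ever sees one candidate issue per key.

```sql
-- Pre-pass sketch (hypothetical names): demote non-maximal issues within the
-- load table before fix_is_latest_issue_sql compares against the latest table.
UPDATE epimetric_load ld
  JOIN (
    SELECT `source`, `signal`, geo_type, geo_value, time_type, time_value,
           MAX(issue) AS max_issue
    FROM epimetric_load
    GROUP BY `source`, `signal`, geo_type, geo_value, time_type, time_value
  ) mx
    USING (`source`, `signal`, geo_type, geo_value, time_type, time_value)
  SET ld.is_latest_issue = 0
  WHERE ld.issue < mx.max_issue;
```

This variant leaves fix_is_latest_issue_sql itself unchanged, at the cost of one extra pass over the load table.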

Unit/integration testing for these conditions should be relatively easy, and some work on this appears to have already been done in #925.

@melange396 added the acquisition, bug, data quality, and mysql labels on May 8, 2024