Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue #1686] load data from legacy (Oracle) tables to staging tables #1852

Merged
merged 40 commits into from May 8, 2024

Conversation

jamesbursa
Copy link
Collaborator

@jamesbursa jamesbursa commented Apr 26, 2024

Summary

Fixes #1686

Time to review: 10 mins

Changes proposed

  • Add class LoadOracleDataTask which loops through each legacy table and loads the data into the corresponding staging table.
  • Add a module of functions src.data_migration.load.sql to generate sqlalchemy SQL expressions for the loads.
  • Add a script seed_local_legacy_tables.py to generate fake data in the local legacy tables.
  • Add unit tests.

Context for reviewers

This is a new batch task which loads data from the legacy (Oracle) tables to staging tables.

It is incremental and handles the following situations:

Situation Detection logic Action
New row in source table Primary key not present in destination table Insert into destination, set transformed_at=NULL
Updated row in source Primary key exists in both source & destination AND newer last_upd_date Update in destination, reset transformed_at=NULL
Deleted row in source Primary key does not exist in source, AND not is_deleted in destination Set is_deleted=TRUE, reset transformed_at=NULL
Unchanged Primary key exists in both source & destination AND last_upd_date is not newer No write to destination

Additional information

To run the code locally with fake data:

make setup-foreign-tables  # once
make seed-local-legacy-tables  # repeat to add & update data
poetry run python3 -m src.data_migration.load.load_oracle_data_task

api/src/data_migration/load/load_oracle_data_task.py Outdated Show resolved Hide resolved
api/src/data_migration/load/load_oracle_data_task.py Outdated Show resolved Hide resolved
Comment on lines +66 to +67
# COUNT has to be a separate query as INSERTs don't return a rowcount.
insert_count = self.db_session.query(select_sql.subquery()).count()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe as a follow-up if we want to improve performance, we could consider getting the count in the same query using RETURNING https://www.postgresql.org/docs/current/dml-returning.html

Did a little bit of testing locally - seems like its doable:

-- Sequence and defined type
CREATE SEQUENCE IF NOT EXISTS api.my_table_id_seq;

-- Table Definition
CREATE TABLE "api"."my_table" (
    "id" int4 NOT NULL DEFAULT nextval('api.my_table_id_seq'::regclass),
    "name" text,
    PRIMARY KEY ("id")

-- Actual query
with rows as (
INSERT INTO api.my_table (id, name)
		VALUES(5, 'bob'), (6, 'joe'), (7, 'steve')
	RETURNING
		id
	)
	select count(*) from rows

This gives 3 in the response.

api/src/data_migration/load/load_oracle_data_task.py Outdated Show resolved Hide resolved
Comment on lines +81 to +84
def get_max_opportunity_id(db_session):
max_opportunity_id = db_session.query(
sqlalchemy.func.max(foreign.opportunity.Topportunity.opportunity_id)
).scalar()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I think we should look into for the factories is a way to do this automatically - they mention being able to force the sequence counter based on a startup method in https://factoryboy.readthedocs.io/en/stable/recipes.html#forcing-the-sequence-counter

Basically what you're doing here / what I did in the db-seed-local script, but in the factory itself.

jamesbursa added a commit that referenced this pull request May 1, 2024
…s to staging tables (#1894)

## Summary
Part of #1686 (separate for easier review)

### Time to review: __2 mins__

## Changes proposed
- Add `created_at`, `updated_at`, and `deleted_at` columns to staging
tables
- Better logging of migrate queries

## Context for reviewers
These columns will provide helpful metadata for understanding or
troubleshooting the load process.

The `created_at` and `updated_at` columns are set automatically by
SQLAlchemy.

For `deleted_at`, I will update #1852 after this is merged.

## Additional information
Example from local testing:
![Screenshot 2024-05-01 at 11 31
47](https://github.com/HHS/simpler-grants-gov/assets/3811269/e9513de8-ceff-42c9-bccc-d2fcfc5fe7bf)
@jamesbursa jamesbursa marked this pull request as ready for review May 1, 2024 18:12
@github-actions github-actions bot added the ci/cd label May 1, 2024
Copy link
Collaborator

@chouinar chouinar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor suggestions, but looks good to me

.github/workflows/cd-api.yml Outdated Show resolved Hide resolved
api/src/data_migration/load/load_oracle_data_task.py Outdated Show resolved Hide resolved
api/src/data_migration/load/load_oracle_data_task.py Outdated Show resolved Hide resolved
api/src/data_migration/load/load_oracle_data_task.py Outdated Show resolved Hide resolved
api/src/data_migration/load/load_oracle_data_task.py Outdated Show resolved Hide resolved
chouinar
chouinar previously approved these changes May 7, 2024
Copy link
Collaborator

@chouinar chouinar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment on lines 72 to 77
try:
with self.db_session.begin():
self.load_data_for_table(table_name)
self.db_session.commit()
except Exception:
logger.exception("table load error", extra={"table": table_name})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit isn't necessary as it happens in the exit part of the begin method.

Suggested change
try:
with self.db_session.begin():
self.load_data_for_table(table_name)
self.db_session.commit()
except Exception:
logger.exception("table load error", extra={"table": table_name})
try:
with self.db_session.begin():
self.load_data_for_table(table_name)
except Exception:
logger.exception("table load error", extra={"table": table_name})

coilysiren
coilysiren previously approved these changes May 7, 2024
Copy link
Collaborator

@coilysiren coilysiren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rubber stamp to echo Michael's approval 🙏🏼

@jamesbursa jamesbursa dismissed stale reviews from chouinar and coilysiren via 59a2c4a May 7, 2024 22:48
@jamesbursa jamesbursa merged commit 03cf6e9 into main May 8, 2024
8 checks passed
@jamesbursa jamesbursa deleted the jamesbursa/1686-load-legacy-to-staging branch May 8, 2024 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task]: Add New Table Loading Statements
3 participants