Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split covid_hosp into daily & timeseries tables #1126

Open
wants to merge 42 commits into
base: dev
Choose a base branch
from

Conversation

rzats
Copy link
Contributor

@rzats rzats commented Apr 7, 2023

Prerequisites:

  • Unless it is a documentation hotfix it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

Summary

Currently, the covid_hosp_state_timeseries table stores data from two datasets at once - daily snapshots and timeseries data; the nature of this data is distinguished by the record_type column. This PR by:

  • Updating the state_daily DDL with two tables instead of one
  • Creating a migration SQL from one table into two
  • Modifying the timeseries endpoint to use two tables instead of one

@rzats rzats requested a review from melange396 April 7, 2023 13:44
@rzats rzats changed the title [WIP] Split covid_hosp into daily & timeseries tables Split covid_hosp into daily & timeseries tables Apr 7, 2023
@krivard
Copy link
Contributor

krivard commented Apr 7, 2023

Sonarcloud is overzealous

Copy link
Collaborator

@melange396 melange396 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nicely done! i think the suggestions i made in the .sql files might make sonarcloud chillax

src/ddl/covid_hosp.sql Show resolved Hide resolved
@@ -45,7 +45,7 @@ def __init__(self,
self.connection = connection
self.table_name = table_name
self.hhs_dataset_id = hhs_dataset_id
self.publication_col_name = "issue" if table_name == 'covid_hosp_state_timeseries' else \
self.publication_col_name = "issue" if table_name == 'covid_hosp_state_timeseries' or table_name == "covid_hosp_state_daily" else \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is getting a little gnarly, but we will fix it with the other yaml-schema changes so you can leave it for now

integrations/server/test_covid_hosp.py Show resolved Hide resolved
src/server/endpoints/covid_hosp_state_timeseries.py Outdated Show resolved Hide resolved
@krivard
Copy link
Contributor

krivard commented Apr 7, 2023

I've swapped the quality gate profile to one with a 20% duplication threshold. It should take effect on the next push to this branch (I attempted a manual re-run but it doesn't want to listen to me)

(the duplicated lines are from database.py constructors so wouldn't be affected by .sql file changes)

@rzats rzats force-pushed the rzatserkovnyi/covid-hosp-table-migration branch from d07299f to caa796c Compare April 12, 2023 17:38
@rzats rzats requested a review from melange396 April 12, 2023 17:58
@melange396
Copy link
Collaborator

oof, looks like that force-push lost some changes 😢

@rzats
Copy link
Contributor Author

rzats commented Apr 12, 2023

@melange396 that one's intended - I reverted a couple of changes because of #1126 (comment)

@melange396
Copy link
Collaborator

hmmm, this makes it look like you lost a change to integrations/server/test_covid_hosp.py and re-introduced a comment removed from src/ddl/covid_hosp.sql ¯\_(ツ)_/¯

@rzats
Copy link
Contributor Author

rzats commented Apr 12, 2023

@melange396 good point, re-added those two changes! (The tests actually pass just fine without the truncate table call, since this is the only test that uses these tables, but it's probably nice to have just in case.)

Copy link
Collaborator

@melange396 melange396 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more points...

src/server/endpoints/covid_hosp_state_timeseries.py Outdated Show resolved Hide resolved
src/ddl/covid_hosp.sql Show resolved Hide resolved
integrations/server/test_covid_hosp.py Outdated Show resolved Hide resolved
@rzats
Copy link
Contributor Author

rzats commented Apr 12, 2023

@melange396 applied the changes from this round as well!

@rzats rzats requested a review from melange396 April 12, 2023 22:04
@rzats rzats requested a review from melange396 April 19, 2023 16:53
@melange396
Copy link
Collaborator

this looks pretty good, though i havent given it a very deep inspection yet. in the meantime, you should include explain plans and timing for some "mocked" sample queries against real data (where we can use covid_hosp_state_timeseries twice, in place of the new table).

@rzats rzats force-pushed the rzatserkovnyi/covid-hosp-table-migration branch 2 times, most recently from 3734972 to 3893f62 Compare May 18, 2023 18:18
@rzats
Copy link
Contributor Author

rzats commented May 19, 2023

Requesting re-review after many iterations on the covid_hosp_state_timeseries query :)

The testing results of this query compared to the current one in dev can be found in this spreadsheet (tl;dr this outperforms dev significantly, in a certain branch by >300%). The optimizations made were:

  1. switching the "get greatest value per group" logic to a more well-performing approach (PARTITION -> INNER JOIN)
  2. incidentally fixing an error which may have caused the query to ignore the index intended for it (that index has the columns ordered as state -> date -> issue, but the query was previously grouping by date -> state -> issue)

@rzats rzats requested review from melange396 and krivard May 19, 2023 22:01
Copy link
Collaborator

@melange396 melange396 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

his is awesome. its much easier to understand (IMNSHO), even though we moved away from the one overloaded table and into two (getting rid of the WITH expressions is a big help). plus the performance increases speak for themselves....

all comments marked 'nit:' are elective, you can decide whether you want to do them or not.

src/server/endpoints/covid_hosp_state_timeseries.py Outdated Show resolved Hide resolved
src/ddl/covid_hosp.sql Outdated Show resolved Hide resolved
src/server/endpoints/covid_hosp_state_timeseries.py Outdated Show resolved Hide resolved
src/server/endpoints/covid_hosp_state_timeseries.py Outdated Show resolved Hide resolved
@sonarcloud
Copy link

sonarcloud bot commented Jun 2, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
2.3% 2.3% Duplication

@rzats
Copy link
Contributor Author

rzats commented Jun 2, 2023

@melange396 applied the fixes!

@rzats rzats requested a review from melange396 June 2, 2023 11:38
Copy link
Collaborator

@melange396 melange396 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

@melange396
Copy link
Collaborator

we probbly want to hold off on merging just yet, and first figure out whether to do this as a migration or as a re-load of all the data from the external sources

@rzats rzats mentioned this pull request Jul 14, 2023
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants