hhs latest data and version history have mismatches with upstream timeseries and archive data sets #1903
Comments
It looks like this is not an is_latest mismatch issue. I checked the entirety of HHS data with this query and it came back empty (coming back from a bad sinus infection, so please tell me if I made a dumb error).
Attempting to round out that test / making sure it's not that the latest table
So, for the latest data differing
For the
11-1 Meeting notes:
Next steps:
Note the
I forget whether or not we were thinking that the April job runner migration was compatible with later time_values not matching. I also floated the idea that maybe we were diffing against the wrong table, specifically an outdated table. That doesn't seem to check out. If we are diffing against old values, then we should be saying that there is a difference & generate an update row. Instead, top theories are that we are |
Latest working theory: The data point we were exploring in the meeting earlier today (see SQL below) had its value updated in the source data on May 27 (which presumably would've been processed on the 28th), but that update never made it into the covidcast tables... The exploratory SQL for reference:
Rough key observation from 2023-11-06 meeting: it appears that archive differ calculates the diff from one cached csv to another, not between an |
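To make the failure mode in that observation concrete, here is a minimal sketch of a cache-to-cache diff. It is not the actual archive differ code: the column names, the `load_rows` and `diff_snapshots` helpers, and all values are made up for illustration, and the "database" is just a dict. It assumes the cache gets advanced even when the resulting update rows never reach the database.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into {(geo_id, time_value): value} (hypothetical schema)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(r["geo_id"], r["time_value"]): r["value"] for r in reader}


def diff_snapshots(previous, current):
    """Return the rows of `current` that are new or changed relative to `previous`."""
    return {key: val for key, val in current.items() if previous.get(key) != val}


# Made-up data: the cached CSV and the database agree before the upstream revision.
cached_csv = "geo_id,time_value,value\nca,20230520,10\n"
new_csv = "geo_id,time_value,value\nca,20230520,12\nca,20230521,7\n"
database = {("ca", "20230520"): "10"}

# Day 1: the diff is computed cache-to-cache and the cache is advanced,
# but (by assumption) the update rows are never written to the database.
incoming = load_rows(new_csv)
day1_updates = diff_snapshots(load_rows(cached_csv), incoming)
cache = incoming

# Day 2: upstream is unchanged, so the cache-to-cache diff is empty
# and nothing prompts a repair of the stale database rows.
day2_updates = diff_snapshots(cache, incoming)

# Diffing against the database instead would still surface the missed rows.
database_vs_upstream = diff_snapshots(database, incoming)

print(day1_updates)
print(day2_updates)
print(database_vs_upstream)
```

Under these assumptions, a single missed load is never corrected by later runs, which would be consistent with an upstream revision that never shows up in the covidcast tables.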
Actual Behavior:
The `hhs` `confirmed_admissions_influenza_1d` data has mismatches with both the latest version of the upstream "timeseries" data set and with the latest version of the timeseries data archived in the archive data set.
The above differences do not seem fully explainable by a single mismatched version propagating to future versions; it seems that multiple mismatched versions must have been ingested. The `time_value` range of mismatches for the historical snapshot I tried was limited to `time_value`s close to the `as_of` date. The `time_value` range of mismatches for the latest-version snapshot, when I tried it yesterday (2023-10-12), extended back farther in time, but did not extend to the most recent `time_value`s.
Unexplored possibility: perhaps the historical version mismatch on 2023-04-25 (first date I tried) was due to delays in acquisition (#1889), and the latest/2023-10-12 version mismatch could be due to some other reason.
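For anyone wanting to repeat this kind of comparison, a rough sketch of the check follows. It is not the code used for the results in this issue: the `find_mismatches` and `time_value_range` helpers, the dict-based snapshot layout, and all numbers are hypothetical, and fetching the API and upstream snapshots is left out.

```python
from datetime import date


def find_mismatches(api, upstream):
    """List (key, api_value, upstream_value) for keys whose values differ or are missing."""
    keys = sorted(set(api) | set(upstream))
    return [(k, api.get(k), upstream.get(k)) for k in keys if api.get(k) != upstream.get(k)]


def time_value_range(mismatches):
    """Return (earliest, latest) mismatching time_value, or None if everything matches."""
    days = [time_value for (_geo, time_value), _api, _up in mismatches]
    return (min(days), max(days)) if days else None


# Made-up snapshots keyed by (geo_value, time_value).
api_snapshot = {
    ("ca", date(2023, 4, 20)): 30,
    ("ca", date(2023, 4, 21)): 28,
    ("ca", date(2023, 4, 22)): 25,
}
upstream_snapshot = {
    ("ca", date(2023, 4, 20)): 30,
    ("ca", date(2023, 4, 21)): 29,
    ("ca", date(2023, 4, 22)): 27,
}

mismatches = find_mismatches(api_snapshot, upstream_snapshot)
print(mismatches)
print(time_value_range(mismatches))
```

Comparing the resulting range against the `as_of` date is what distinguishes mismatches clustered near the snapshot date from ones reaching farther back in time.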
Mismatches on latest, run on 2023-10-12, after API update due to 2023-10-11 upstream update
Mismatches between API and upstream snapshots for one `as_of` date in the past
Expected behavior
I expected these to match, based on the covidcast hhs docs pointing to the `covid_hosp_state_timeseries` docs, which in turn point to the upstream "timeseries" data set.
Context
I was trying to recreate the FluSight baseline model's forecasts from last season, which use the "truth" data here, which are acquired from the healthdata.gov timeseries archive data set.