Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hhs latest data and version history have mismatches with upstream timeseries and archive data sets #1903

Open
brookslogan opened this issue Oct 14, 2023 · 6 comments
Assignees
Labels
data quality Missing data, weird data, broken data

Comments

@brookslogan
Copy link

brookslogan commented Oct 14, 2023

Actual Behavior:

  • When I query the latest covidcast hhs confirmed_admissions_influenza_1d data, it has mismatches with both the latest version of the upstream "timeseries" data set and with the latest version of the timeseries data archived in the archive data set.
  • When I query a historical snapshot of the data set, it has a mismatch with a snapshot from the archive data set.

The above differences do not seem fully explainable by a single mismatched version propagating to future versions; it seems that multiple mismatched versions must have been ingested. The time_value range of mismatches for the historical snapshot I tried were limited to time_values close to the as_of date, but the time_value range of mismatches for the latest-version snapshot when I tried it yesterday, 2023-10-12, extended back farther in time, but also did not extend to the most recent time_values.

Unexplored possibility: perhaps the historical version mismatch on 2023-04-25 (first date I tried) was due to delays in acquisition (#1889), and the latest/2023-10-12 version mismatch could be due to some other reason.

Mismatches on latest, run on 2023-10-12, after API update due to 2023-10-11 upstream update
library(tidyverse)
library(epidatr)
library(epiprocess)

disable_cache()
test_as_of <- Sys.Date() # 2023-10-12

epidata_api_as_of <- pub_covidcast(
  "hhs", "confirmed_admissions_influenza_1d", "state", "day",
  "*", "*", as_of = as.character(test_as_of)
)

socrata_archive_info <- RSocrata::read.socrata(url = "https://healthdata.gov/resource/qqte-vkut.json") %>%
  dplyr::arrange(update_date) %>%
  as_tibble()

socrata_archive_as_of <- socrata_archive_info %>%
  filter(update_date <= test_as_of) %>%
  slice_max(update_date) %>%
  pull(url) %>%
  # Stop if there were multiple updates published to resolve ambiguity. Except
  # there were supposed to have been multiple updates on 2023-10-11, one only
  # revising values, and another that added new rows. So maybe not all updates
  # are recorded in this archive?
  {stopifnot(length(.) == 1L); .} %>%
  read_csv()

socrata_timeseries <- RSocrata::read.socrata(url = "https://healthdata.gov/resource/g62h-syeh.json") %>%
    as_tibble()
# saveRDS(socrata_timeseries, "socrata_timeseries_2023-10-12-circa-1530PT.rds")
# socrata_timeseries <- readRDS("socrata_timeseries_2023-10-12-circa-1530PT.rds")


epidata_api_as_of_reformatted <- epidata_api_as_of %>%
  select(geo_value, time_value, admissions = value)

socrata_archive_as_of_reformatted =
  socrata_archive_as_of %>%
  transmute(geo_value = tolower(state),
            time_value = date - 1L,
            admissions = previous_day_admission_influenza_confirmed)

socrata_timeseries_reformatted =
  socrata_timeseries %>%
  transmute(
    geo_value = tolower(state),
    time_value = as.Date(date) - 1L,
    admissions = previous_day_admission_influenza_confirmed
  )

joined <-
  list(
    api = epidata_api_as_of_reformatted,
    sarchive = socrata_archive_as_of_reformatted,
    timeseries = socrata_timeseries_reformatted
  ) %>%
  map2(names(.),
       ~ .x %>%
         rename("admissions_{.y}" := admissions)
       ) %>%
  reduce(full_join, by = c("geo_value", "time_value")) %>%
  {.}

# mismatches between API as-of results and the Socrata/healthdata.gov archive
# data as-of results, for this admissions signal, ignoring differences between
# NA values & rows not existing:
mismatches <- joined %>%
  filter(is.na(admissions_api) != is.na(admissions_sarchive) |
           !is.na(admissions_api) &
           admissions_api != admissions_sarchive) %>%
  {.}

print(mismatches)
#> # A tibble: 128 × 5
#>    geo_value time_value admissions_api admissions_sarchive admissions_timeseries
#>    <chr>     <date>              <dbl>               <dbl> <chr>                
#>  1 ak        2022-12-14              8                   7 7                    
#>  2 ak        2022-12-15              4                   3 3                    
#>  3 ak        2022-12-16              9                   6 6                    
#>  4 ak        2022-12-21             12                  10 10                   
#>  5 ak        2022-12-23              5                   4 4                    
#>  6 ak        2022-12-24              4                   3 3                    
#>  7 ak        2022-12-25              4                   3 3                    
#>  8 ak        2022-12-26             12                   8 8                    
#>  9 ak        2022-12-27              7                   6 6                    
#> 10 ak        2022-12-28              9                   7 7                    
#> # ℹ 118 more rows
#> # ℹ Use `print(n = ...)` to see more rows

# When API and archive disagree on what the snapshot should look
# like, the timeseries data set matches the archive:
mismatches %>%
  summarize(mean(
    is.na(admissions_timeseries) != is.na(admissions_sarchive) |
      !is.na(admissions_timeseries) &
      admissions_timeseries != admissions_sarchive
  ))
#> # A tibble: 1 × 1
#>      `mean(...)`
#>            <dbl>
#>    1           0

# More broadly, there are no such mismatches between the archive data set
# snapshot and the timeseries:
joined %>%
  summarize(mean(
    is.na(admissions_timeseries) != is.na(admissions_sarchive) |
      !is.na(admissions_timeseries) &
      admissions_timeseries != admissions_sarchive
  ))
#> # A tibble: 1 × 1
#>      `mean(...)`
#>            <dbl>
#>    1           0


# Disagreements on non-missing values span a significant time period, but end on
# 2023-05-25. There aren't that many disagreements about whether a value was
# missing or not, but those that do pertain to the two days after the final
# time_value with disagreements between non-missing values.
mismatches %>%
  mutate(na_diff = is.na(admissions_api) != is.na(admissions_sarchive)) %>%
  group_by(na_diff) %>%
  summarize(min(time_value), max(time_value), n(),
            MdRAE = median(abs(admissions_api - admissions_sarchive)/(admissions_sarchive+1e-6)),
            Mdval = median(admissions_sarchive),
            Q3val = quantile(admissions_sarchive, 0.75),
            maxval = max(admissions_sarchive)
            )
#> # A tibble: 2 × 8
#>   na_diff `min(time_value)` `max(time_value)` `n()` MdRAE Mdval Q3val maxval
#>   <lgl>   <date>            <date>            <int> <dbl> <dbl> <dbl>  <dbl>
#> 1 FALSE   2022-12-14        2023-05-25          105  1.00     2     4     18
#> 2 TRUE    2023-05-26        2023-05-27           23 NA        0     0      2


# (While the values above seem to pertain to small values, most values in this
# data set appear to be small as well.)
quantile(socrata_archive_as_of_reformatted$admissions, na.rm = TRUE)
#> 0%  25%  50%  75% 100% 
#>  0    0    1    4  527 
Mismatches between API and upstream snapshots for one `as_of` date in the past
test_as_of <- as.Date("2023-04-25")

epidata_api_as_of <- pub_covidcast(
  "hhs", "confirmed_admissions_influenza_1d", "state", "day",
  "*", "*", as_of = as.character(test_as_of)
)

socrata_archive_as_of <- socrata_archive_info %>%
  filter(update_date <= test_as_of) %>%
  slice_max(update_date) %>%
  pull(url) %>%
  # Stop if there were multiple updates published to resolve ambiguity. Except
  # maybe not all of these updates are reliably recorded in the archive data
  # set, based on 2023-10-12?
  {stopifnot(length(.) == 1L); .} %>%
  read_csv()

epidata_api_as_of_reformatted <- epidata_api_as_of %>%
  select(geo_value, time_value, admissions = value)

socrata_archive_as_of_reformatted =
  socrata_archive_as_of %>%
  transmute(geo_value = tolower(state),
            time_value = date - 1L,
            admissions = previous_day_admission_influenza_confirmed)

joined <-
  list(
    api = epidata_api_as_of_reformatted,
    sarchive = socrata_archive_as_of_reformatted
  ) %>%
  map2(names(.),
       ~ .x %>%
         rename("admissions_{.y}" := admissions)
       ) %>%
  reduce(full_join, by = c("geo_value", "time_value")) %>%
  {.}

# mismatches between API as-of results and the Socrata/healthdata.gov archive
# data as-of results, for this admissions signal, ignoring differences between
# NA values & rows not existing:
mismatches <- joined %>%
  filter(is.na(admissions_api) != is.na(admissions_sarchive) |
           !is.na(admissions_api) &
           admissions_api != admissions_sarchive) %>%
  {.}

print(mismatches)
#> # A tibble: 54 × 4
#>    geo_value time_value admissions_api admissions_sarchive
#>    <chr>     <date>              <dbl>               <dbl>
#>  1 sd        2023-04-06              1                   0
#>  2 ak        2023-04-07              0                   1
#>  3 al        2023-04-07              0                   3
#>  4 co        2023-04-07              2                   0
#>  5 ct        2023-04-07              2                   3
#>  6 dc        2023-04-07              1                   0
#>  7 ia        2023-04-07              0                   1
#>  8 in        2023-04-07              4                   2
#>  9 ky        2023-04-07              3                   4
#> 10 ma        2023-04-07              1                   4
#> # ℹ 44 more rows
#> # ℹ Use `print(n = ...)` to see more rows

# The mismatch range for this version is only for some nonmissing values with
# time_values from 2023-04-06 to 2023-04-19, with no mismatches in missingness.
# Thus, it appears that later versions introduced mismatches not only for later
# `time_value`s, but also earlier ones (2022-12-14 to 2023-04-05).
mismatches %>%
  mutate(na_diff = is.na(admissions_api) != is.na(admissions_sarchive)) %>%
  group_by(na_diff) %>%
  summarize(min(time_value), max(time_value), n(),
            MdRAE = median(abs(admissions_api - admissions_sarchive)/(admissions_sarchive+1e-6)),
            Mdval = median(admissions_sarchive),
            Q3val = quantile(admissions_sarchive, 0.75),
            maxval = max(admissions_sarchive)
            )
#> # A tibble: 1 × 8
#>   na_diff `min(time_value)` `max(time_value)` `n()` MdRAE Mdval Q3val maxval
#>   <lgl>   <date>            <date>            <int> <dbl> <dbl> <dbl>  <dbl>
#> 1 FALSE   2023-04-06        2023-04-19           54  1.00     2  4.75     29


# (While the values above seem to pertain to small values, most values in this
# data set appear to be small as well.)
quantile(socrata_archive_as_of_reformatted$admissions, na.rm = TRUE)
#>  0%  25%  50%  75% 100% 
#>   0    0    1    4  527 

Expected behavior

I expected these to match, based on covidcast hhs docs pointing to covid_hosp_state_timeseries docs pointing to upstream "timeseries" data set.

Context

I was trying to recreate the FluSight baseline model's forecasts from last season, which use the "truth" data here, which are acquired from the healthdata.gov timeseries archive data set.

@brookslogan brookslogan added the data quality Missing data, weird data, broken data label Oct 14, 2023
@brookslogan brookslogan changed the title hhs latest data and version history have mismatches with upstream timeseries and archive repository hhs latest data and version history have mismatches with upstream timeseries and archive data sets Oct 14, 2023
@nolangormley
Copy link
Contributor

It looks like this is not an is_latest mismatch issue. I checked the entirety of HHS data with this query and it came back empty (coming back from a bad sinus infection, so please tell me if I made a dumb error).

select 
    * 
from epimetric_latest el 
join (
    select 
        geo_key_id,
        signal_key_id,
        time_type,
        time_value,
        max(issue) as max_issue 
        from epimetric_full 
        join geo_dim geo using (geo_key_id)
        join signal_dim sig using (signal_key_id)
        where 
            sig.source = 'hhs'
            and time_type = 'day' 
            and geo.geo_type = 'state' 
        group by 
            geo_key_id,
            signal_key_id,
            time_type,
            time_value
    ) latest_full 
using (geo_key_id, signal_key_id, time_type, time_value) 
where max_issue <> issue;

@brookslogan
Copy link
Author

Attempting to round out that test / making sure it's not that the latest table value column being corrupted somehow (and checking that it's not that the update/history data received some same-day version patch that the latest table didn't):

library(tidyverse)
library(epidatr)
library(epiprocess)

disable_cache()
test_as_of <- Sys.Date() # 2023-10-12

epidata_api_latest <- pub_covidcast(
  "hhs", "confirmed_admissions_influenza_1d", "state", "day",
  "*", "*"
)

# Check that this matches as-of queries:

epidata_api_as_of <- pub_covidcast(
  "hhs", "confirmed_admissions_influenza_1d", "state", "day",
  "*", "*", as_of = as.character(test_as_of + 10L)
)

# In case we special-case the above into an is-latest query, try to trick the
# API with an insane query that might not be special-cased to force an
# as-of-type query:
epidata_api_as_of2 <- pub_covidcast(
  "hhs", "confirmed_admissions_influenza_1d", "state", "day",
  "*", "*", as_of = as.character(test_as_of + 10L)
)

all.equal(epidata_api_as_of, epidata_api_latest)
#> [1] TRUE
all.equal(epidata_api_as_of2, epidata_api_latest)
#> [1] TRUE

So, for the latest data differing

  • maybe there is some bad update data stuck in the full table, or some missing update data that should be there. This hypothesis wouldn't explain why the pipeline wouldn't fix it now though, assuming that the pipeline downloads from the right place and processes the entire data set. (If acquisition code were to be reading it in correctly & processing it all, then it should have recorded a diff that fixed things so our latest queries would match the upstream source; every pipeline run that doesn't do this fix seems to suggest this isn't the current scenario.)
  • maybe there is an is_latest issue not in the covidcast-related table, but in the covid_hosp_state_timeseries-related table

For the as_of data differing

  • Still seems like it could be a separate problem.

@nolangormley
Copy link
Contributor

11-1 Meeting notes:

  • Upstream data source (Covid-hosp) has correct data but it is not propagating to the HHS tables
  • It's possible this could have been an issue with the cronicle migration (happening around april 2023)
  • Cronicle was running multiple acquisitions at the same time which could have caused problems
  • archive differ may not be working as expected as it may have data from CSVs that aren't processed

Next steps:

  • Meeting again next week to dig deeper

@brookslogan
Copy link
Author

brookslogan commented Nov 1, 2023

Note the time_value range of mismatches on latest, run on 2023-10-12:

#> # A tibble: 2 × 8
#>   na_diff `min(time_value)` `max(time_value)` `n()` MdRAE Mdval Q3val maxval
#>   <lgl>   <date>            <date>            <int> <dbl> <dbl> <dbl>  <dbl>
#> 1 FALSE   2022-12-14        2023-05-25          105  1.00     2     4     18
#> 2 TRUE    2023-05-26        2023-05-27           23 NA        0     0      2

I forget whether or not we were thinking that the April job runner migration was compatible with later time_values not matching.

I also floated the idea that maybe we were diffing against the wrong table, specifically an outdated table. That doesn't seem to check out. If we are diffing against old values, then we should be saying that there is a difference & generate an update row. Instead, top theories are that we are diffing only for a limited time_value range [think we checked and it was using the full time series], diffing against newer data than we're pushing diffs to, or some sequence of events that @minhkhul and @melange396 were considering.

@melange396
Copy link
Contributor

Latest working theory: The data point we were exploring in the meeting earlier today (see SQL below) had its value updated in the source data on May 27 (which presumably would've been processed on the 28th), but that update never made it into the covidcast tables... The hhs indicator job failed (during the "archive" step) on May 28 and 29 with a "No space left on device" error. Because of this (or perhaps because of the same root cause), acquisition jobs did not run on those days either. This seems likely to be the explanation, but we should find other mismatches to verify they had missing updates from the same time period.

exploratory SQL for reference:

select id, issue, state, date, record_type, geocoded_state, previous_day_admission_influenza_confirmed from epidata.covid_hosp_state_timeseries where state='AK' and date=20221217;

select issue, value from covid.epimetric_full_v where source='hhs' and `signal`='confirmed_admissions_influenza_1d' and geo_type='state' and geo_value='ak' and time_type='day' and time_value=20221216;

@brookslogan
Copy link
Author

brookslogan commented Nov 6, 2023

Rough key observation from 2023-11-06 meeting: it appears that archive differ calculates the diff from one cached csv to another, not between an as_of query on the database and a new csv. Thus, if the database gets off track, it may never be corrected, unless by some stroke of luck the upstream data provider happens to revise the corresponding rows again (unlikely in this case). So patching the covidcast hhs diff data in maybe a 30d window around May 28 might get the latest version of the data back on track.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data quality Missing data, weird data, broken data
Projects
None yet
Development

No branches or pull requests

3 participants