Allow Unknown DNC in Community Level #1367

Merged
merged 8 commits into from
Apr 28, 2023

Conversation

BrettBoval
Contributor

Situation

As of 1 April, the state of Iowa no longer provides positive case reports at the state or county level (see Iowa Department of Health and Human Services Press Release). Florida has suspended reporting "due to a technical issue", and it looks unlikely that they will restart.

Complication

Divergent from CDC

We deliberately had a different implementation than the CDC, which more prominently flagged places where we believed the case data should be treated with caution.

The current implementation from libs/metrics/community_levels.py#L20:

    # TODO(michael): The CDC footnotes say:
    #     If the number of cases in 7 days for a jurisdiction is missing, the
    #     7-day case rate is assigned to the “low” category. If both 7-day
    #     admissions and 7-day percentage inpatient beds indicators are N/A, the
    #     community burden category is assigned N/A.
    #
    # For now I'm allowing 1 hospital metric to be missing, but not allowing
    # cases to be missing since that is rare and usually indicates we're
    # blocking data or something. In that case, I'd rather have no community
    # level calculated. But we can revisit if it ends up being a problem.
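
To make the difference between the two policies concrete, here is a minimal sketch of only the missing-data handling. The function name `resolve_inputs` and the `default_missing_cases_to_zero` flag are hypothetical and are not taken from `community_levels.py`; the threshold logic that turns these inputs into a level is deliberately left out.

```python
from typing import Optional, Tuple

# (weekly case rate, weekly admissions rate, percent of inpatient beds)
Inputs = Tuple[float, Optional[float], Optional[float]]


def resolve_inputs(
    case_rate: Optional[float],
    admissions: Optional[float],
    bed_pct: Optional[float],
    default_missing_cases_to_zero: bool,
) -> Optional[Inputs]:
    """Return the inputs to grade, or None if no level should be calculated.

    Hypothetical helper: it only models how missing values are treated.
    """
    # Both policies: if both hospital metrics are missing, there is no level.
    if admissions is None and bed_pct is None:
        return None
    if case_rate is None:
        if not default_missing_cases_to_zero:
            # Policy described in the TODO: missing cases means no level.
            return None
        # CDC-style policy: missing cases fall into the lowest case category.
        case_rate = 0.0
    return (case_rate, admissions, bed_pct)
```

With the flag off, a location with hospital data but no case data gets no community level; with it on, the same location is graded as if its case rate were zero.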

Implementation Artifact in our Filtering

Our current combination of forward-filling heuristics and don't-grade-stale-data produces unexpected behaviour for the end user.

Specifically, we usually ingest cumulative data and then calculate the diffs to create a daily new timeseries. In the case where the cumulative number is the same day to day, it can be difficult to tell the difference between "the county did not report today" and "the county reported affirmatively that there were no new cases". On a scraper by scraper basis there may be some meta-text or context to discern between the two, but we currently don't have a full solution for this.
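
A small sketch of why the two cases are indistinguishable after differencing. The helper `daily_new` is hypothetical and uses plain Python in place of `np.diff`; the numbers are made up for illustration.

```python
from typing import List


def daily_new(cumulative: List[int]) -> List[int]:
    """Difference a cumulative series into daily new counts (like np.diff)."""
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]


# The county affirmatively reported "no new cases" on the last two days.
reported_zero = [100, 115, 115, 115]
# The county stopped reporting and the last known total was carried forward.
not_reported = [100, 115, 115, 115]

# Both cumulative series are identical, so the daily series are too.
print(daily_new(reported_zero))  # [15, 0, 0]
print(daily_new(not_reported))   # [15, 0, 0]
```

Any signal that separates the two cases has to come from outside the cumulative numbers, for example from scraper-level metadata.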

Our current solution has two heuristics that are staggered by one day, somewhat by chance.

  • We treat trailing zeros (i.e. all the most recent data is zeros) with suspicion. For the first x days of consecutive zeros, we use the last non-zero value as the latest value for above-the-fold metrics. So [15,0,0,0,0,0,0] returns 15 as the "latest" value.
  • Eventually, after x days of consecutive zeros, we give up waiting and let all those zeros propagate, turning Daily New Cases (DNC) to zero. So [15,0,...,0,0] returns 0 once the gap is long enough.
  • Separately, we have a different process that looks at all our datastreams and masks (likely too aggressively) any timeseries where the latest data point is more than y days stale. This turns the metric to Unknown, which bubbles up and turns the metric gray.

The complication is that x and y are both ~14 days, but one of them is calculated before applying np.diff, which shifts it by one day. So for the first ~13 days (I'm being deliberately vague on defining a day here) we return the last non-zero value in the series. Then, for one day, the "if y days stale, mask everything" rule returns Unknown. The following day the case timeseries is stale enough that we let the zeros forward fill, and it returns a zero.

The exact number of zeros is still to be confirmed.

  • t-1 day: [15,0,0,0,0,0,0,0,0,0,0,0,0,0] pipes out [15] from 13 days ago, which passes the stale test, so 15 is used for the risk level calculation
  • t-0 day: [15,0,0,0,0,0,0,0,0,0,0,0,0,0,0] pipes out [15] from 14 days ago, which fails the stale test, so None is returned for the risk level calculation
  • t+1 day: [15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] pipes out [0] from 1 day ago, which passes the stale test, so 0 is returned for the risk level calculation

This causes the map and the location page to transiently flip to gray and then to green.
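
The interaction can be reproduced with a toy model. This is a sketch under assumptions, not the pipeline code: `latest_value`, `MAX_TRAILING_ZEROS` and `STALE_AFTER_DAYS` are hypothetical names, both thresholds are set to 14 to match the illustration above, and the real values are still to be confirmed.

```python
from typing import List, Optional

MAX_TRAILING_ZEROS = 14  # assumed "x": how long we keep the last non-zero value
STALE_AFTER_DAYS = 14    # assumed "y": age at which a value is masked as stale


def latest_value(daily_new: List[int]) -> Optional[int]:
    """Toy model of the trailing-zero heuristic followed by the stale mask."""
    # Count the zeros at the end of the series.
    trailing = 0
    for value in reversed(daily_new):
        if value != 0:
            break
        trailing += 1
    if trailing == len(daily_new):
        return 0  # all zeros: there is no earlier value to fall back to

    if trailing <= MAX_TRAILING_ZEROS:
        # Heuristic 1: distrust trailing zeros, reuse the last non-zero value.
        value, age_days = daily_new[-1 - trailing], trailing
    else:
        # Heuristic 2: give up waiting and let the zeros through as fresh data.
        value, age_days = 0, 0

    # Separate process: mask anything that is too old.
    if age_days >= STALE_AFTER_DAYS:
        return None
    return value


for zeros in (13, 14, 15):
    print(zeros, latest_value([15] + [0] * zeros))
```

In this model, 13 trailing zeros yield 15, 14 yield None, and 15 yield 0, which is the gray-then-green flicker. Raising `STALE_AFTER_DAYS` above `MAX_TRAILING_ZEROS` removes the intermediate None in this model.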

Resolutions

Summary

  1. Update community risk level logic to handle unknown DNC as 0 (the lesser of two evils)
  2. Add the appropriate disclaimers to Iowa and Florida in our normal disclaimer code.
  3. Tweak the "turn this unknown after x days" threshold to something longer than the DNC ffill window.
  4. Explore special casing the DNC where we freeze it as Unknown starting from a specific day which overrides the "turn it unknown and hide it".

| | Current | Proposed | Ideal |
|---|---|---|---|
| US Map Color | Grey/Green | Green | Green w/ Flag |
| Community Risk Level | Grey/Green | Green | Green w/ Flag |
| DNC Latest Value | None/0 | None/0 | None |
| Timeseries | Masked/ffilled | Masked/ffilled | Frozen @ Last Reporting |

BrettBoval and others added 8 commits April 14, 2023 14:36
This test was passing incorrectly because the assert was wrong. Now it should be failing, which represents the broken state of this PR.
Adds test for default to 0 for stale DNC after 14 days.

We've had some off-by-1 uncertainty about what "14 days lookback before blocking" really means. Here's an updated test, which currently fails, that captures the expected behavior.

I propose that we either change the code to handle this correctly, or update our docs to say 15 whenever they used to say 14.
@BrettBoval BrettBoval merged commit df963dd into main Apr 28, 2023
5 checks passed
@BrettBoval BrettBoval deleted the accept-unknown-DNC-for-community-risk branch April 28, 2023 18:15
@smcclure17
Member

Action item from this PR #1368
