Skip to content

max-orhai/ppaml-cp7

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

74 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PPAML Challenge Problem 7, phase 3

The task in this phase of CP7 is to predict seasonal rates of Influenza-Like Illness (ILI or 'flu') in 60 distinct sub-populations of the continental US, ranging in size from the entire country to individual counties.

In addition to historical ILI rate data for each population, three different kinds of covariate data, representing flu-related tweets, vaccination claims, and weather, are provided for use in solutions. In a simulated forecast experiment, all four kinds of variables will be made available to solutions, one week of data at time, over the 32-week target season.

Contents:

  • Data sets
    • Population data
    • ILI rates
    • Tweet counts
    • Vaccinations
    • Weather
  • Problem statement
  • Evaluation protocol
    • Evaluator script

Data sets

Data is provided as a single JSON file containing constants that describe the target populations, and as a set of CSV files that contain time-series variable data for some of the populations. Each time series has weekly data following the MMWR week calendar. The official CDC flu season runs from MMWR week 40 through week 20 of the following year.

Each week is encoded as a fixed-point number: for example, the last week in 2015 is represented as "2015.52". In this document, "week w + integer k" means the week that begins k weeks after w: for example, 2015.50 + 10 = 2016.08.

Population data

The populations form a geographic containment hierarchy:

  • The United States is the root
  • HHS Regions 1 – 10 divide the US at the lowest resolution
  • An HHS region is composed of several states, identified with two-letter postal codes
  • A state contains counties, each of which has a four-digit FIPS code

For states and counties, demographics.json provides population count estimates and other demographic information from the US Census Bureau, under the "data" key. For counties, a list of FIPS codes of adjacent counties is also included. FIPS codes of the counties or states that constitute each population can be found using the hierarchy under the "indices" key.

ILI rates

The CDC has a voluntary flu surveillance program called ILINet. Each week, participating clinics submit counts of patients diagnosed with ILI to their state health department, along with their total patient counts. These counts form the basis for the published rates for each HHS region. Some state health departments also publish their weekly rates directly.

Filename populations source first week last week notes
USA-flu.csv Continental United States and 10 HHS Regions CDC ILINet 1997.40 2015.29 off-season data missing from early years
MS-flu.csv Mississippi and 9 Public Health Districts MS Department of Health 2012.48 2015.20 no off-season data
NC-flu.csv North Carolina NC Division of Public Health 2001.40 2015.20 includes diagnosed and total patient counts; no off-season data
NJ-flu.csv New Jersey and 21 counties NJ Department of Health 2005.39 2015.20 includes reported rates from long-term care facilities (.ltc), schools (.sch), and emergency clinics (.emr) in each county; off-season data only for 2009
RI-flu.csv Rhode Island RI Department of Health 2013.40 2015.20 includes rates for five age ranges; no off-season data
TN-flu.csv Tennessee and 13 Health Regions TN Department of Health 2009.32 2015.29 six regions are individual counties; includes off-season data
TX-flu.csv Texas TX Department of State Health Services 2005.40 2015.29 includes counts for four or five patient age ranges; includes off-season data starting in 2009

Some state health departments publish additional data which may be useful. The column headers in each CSV are of the form [POP].[VAR], with each VAR described in the table below.

Variable name populations meaning
%ILI MS, NC, RI, TN, TX, USA percentage of patients diagnosed with ILI
#ILI NC, TX number of patients diagnosed with ILI
#patients NC, TX total number of patients
#sites TX number of clinics reporting
ltc NJ percentage of ILI patients in long-term care facilities
sch NJ percentage of ILI patients in schools
emr NJ percentage of ILI patients in hospital emergency departments
age[H]-[L] RI, TX percentage of ILI patients between ages H and L, inclusive
note: The TX health dept reported ages in four bins before week 2009.40, and five bins afterward. For those later weeks, column TX.age25-64* contains data for ages 25 – 49.

Tweet counts

Geo-located tweets which included the words 'flu' or 'influenza' during a four-year period were aggregated to populations and MMWR weeks and counted to form a social media data set.

To support adjustments for the differences between the Twitter user base and the general US population, a small table of demographic information is provided. The column labeled % 2016 (US) shows percentages of Twitter users in various demographic categories among all US adults, while other columns show percentages among US adult internet users.

Filenames source
[POP]-tweets.csv GNIP Historical PowerTrack
twitter-demographics.csv Pew Research Social Media Updates 2016 and 2014
Variable name meaning
tc tweet counts

Medicare vaccination claims

Medicare recipients are eligible for subsidized flu vaccinations. The National Vaccine Program Office tracks the total number of eligible recipients for each county and flu season, for all ages as those 65 and older. The NVPO records the vaccinated percentage of those eligible on a weekly basis. These percentages are cumulative and thus non-decreasing over a flu season.

Filenames source
[POP]-vaccinations.csv US Department of Health & Human Services National Vaccine Program Office
Variable name meaning
all total number of eligible recipients
allV% percentage of eligible recipients vaccinated
65+ number of eligible recipients age 65 and over
65+V% percentage of eligible recipients age 65+ vaccinated

Weather data

In temperate climates like the continental US, flu epidemics are much more prevalent during cold weather. To encourage teams to explore this correlation, aggregated weather data are provided for each MMWR week and each population.

Filenames source
[POP]-weather.csv National Oceanic and Atmospheric Administration GHCN-Daily
Variable name meaning
Tmax mean daily high temperature, in degrees Celsius
Tmin mean daily low temperature, in degrees Celsius
prcp mean daily percipitation, in millimeters

Problem statement

Denoting ILI rate data for population p and week w as Ipw, and similarly for all covariates C, a forecaster for week n extending m weeks forward can be described as a function

Fp,n,m : { Ip(w-2) , Cpw | wn ;∀ p } → { Ipw | nwn + m }

Given data sets I and C, as described above, solutions will produce a set of forecasts { Fp,n,m(I, C) } where

Parameter value
p teams may choose any of the 60 populations, but must include
HHS Region 4, TN state, and Knox County (TN.D10 = FIPS 47093)
n each of the weeks 2015.40 ... 2016.20
m 0 ... remaining weeks in target season

Solutions may use both I and C data from any of the given populations to predict ILI rates for a specific population p. We are interested in how prediction accuracy for a given forecast period improves throughout the season, as new data is made available.

Each forecast will be evaluated against ground truth data { Ipw | nwn + m } and assigned a sum of squared errors (SSE) score s for each successive forecast period (week n, weeks n through n + 1, ... , weeks n through n + m).

Note that if m = 0 then the problem is a simulated nowcast rather than a forecast: the goal is to infer "current" ILI rates in the target populations, as of week n, from both current and historical covariate data as well as historical ILI data up to two weeks previous.

Example forecast

In chart above, n = 2014.40, m = 10, and p = HHS Region 4. Data points from three of the eight covariates are shown in warm colors. (The * after their names indicates that they have been multiplied by scalars to fit on the chart.) While I and C data prior to week 2013.32 are not shown, they are available to the forecaster function. For evaluation, solutions will target the flu season beginning in week 2015.40, but teams are encouraged to test their solutions on data from previous seasons.


Evaluation protocol

In concrete terms, ILI rates Ip(w-2) and covariates Cpw for all 60 populations p and weeks w, 2015.20 < wn ≤ 2016.20, will be represented in a single file named week-n.txt. This file will consist of concatenated CSV data of the same format as those provided, preceded by filenames, and followed by blank lines. The basic idea is that each line of data could be appended to the appropriate CSV to continue the time series. An example file is provided with data from the 2014–2015 season, covering weeks 2014.21 through 2014.42. Because the CDC and health departments only publish ILI rates after a two-week delay, while data from other sources are available sooner, week-n.txt will contain data from the start of the season up to and including week n for tweets, weather, and vaccination variables, but only up to week n - 2 for ILI variables.

For each evaluation week n, 2015.40 ≤ n ≤ 2016.20, solutions should read the week-n.txt file (as well as the contents of the present data directory) and produce a similar file forecast-n.txt, containing only lines with forecast ILI rates for weeks n through n + m for each target population [POP]-flu.csv.

Note that the first evaluation data file, week-2015.40.txt, will include off-season baseline data for the 20 weeks 2015.21 through 2015.40, for all populations and covariates where this off-season data is available. It will contain at most 18 weeks of data for ILI variables in the populations (TN, TX, and USA) which report off-season ILI rates, and less in the other populations, which do not. The second evaluation file, week-2015.41.txt, will have covariate data for 21 weeks (and ILI data for 19); the third for 22 weeks, and so on.

A solution should present a command line interface in the form of a shell script with arguments:

run.sh [CONFIG-FILE] [DATA-DIR] [WEEK-FILE] so that a command like
$ run.sh solution.conf ../data/ week-2015.40.txt writes the file forecast-2015.40.txt.

The configuration file should include some representation of forecast length m and populations p. At minimum, the output forecast file must include row n + 1 for column R04.%ILI of file USA-flu.csv and columns TN.%ILI, D10.%ILI of file TN-flu.csv. Solutions may save intermediate results between program runs to avoid recomputing models for each evaluation week.

Teams are encouraged (but not required) to submit solutions via GitHub, by forking this repository and adding program code outside the data directory.

Evaluator script

A Python script is included to calculate SSE scores of forecasts. It should run under both Python 2 and 3. If passed the -p flag, it can chart the forecast and ground-truth data together, using the standard matplotlib plotting package.

It requires a target and reference file, both containing CSV data over the same range of weeks. By passing -c [COLUMN], a specific column can be selected by name or (zero-based) index. If -c is omitted, the variable of interest is assumed to be in column 1.

About

Challenge Problem 7 for PPAML Program

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%