PPAML Challenge Problem 7, phase 3

The task in this phase of CP7 is to predict seasonal rates of Influenza-Like Illness (ILI or 'flu') in 60 distinct sub-populations of the continental US, ranging in size from the entire country to individual counties.

In addition to historical ILI rate data for each population, three different kinds of covariate data, representing flu-related tweets, vaccination claims, and weather, are provided for use in solutions. In a simulated forecast experiment, all four kinds of variables will be made available to solutions, one week of data at a time, over the 32-week target season.

Data sets

Data is provided as a single JSON file containing constants that describe the target populations, and as a set of CSV files that contain time-series variable data for some of the populations. Each time series has weekly data following the MMWR week calendar. The official CDC flu season runs from MMWR week 40 through week 20 of the following year.

Each week is encoded as a fixed-point number: for example, the last week in 2015 is represented as "2015.52". In this document, "week w + integer k" means the week that begins k weeks after w: for example, 2015.50 + 10 = 2016.08.
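The week arithmetic above can be sketched in Python as follows. This is a minimal illustration, not the repository's own implementation: the function names `mmwr_year_start` and `add_weeks` are hypothetical, and the code assumes the standard MMWR convention that weeks run Sunday through Saturday and that week 1 is the week containing January 4. Weeks are kept as strings, since reading "2015.40" as a float would drop the trailing zero.

```python
from datetime import date, timedelta


def mmwr_year_start(year):
    """Return the Sunday that begins MMWR week 1 of `year`.

    Assumes week 1 is the Sunday-to-Saturday week containing January 4.
    """
    jan4 = date(year, 1, 4)
    days_since_sunday = (jan4.weekday() + 1) % 7  # Monday=0 ... Sunday=6
    return jan4 - timedelta(days=days_since_sunday)


def add_weeks(week, k):
    """Add integer k (may be negative) to a week string such as '2015.50'."""
    year, number = (int(part) for part in week.split("."))
    day = mmwr_year_start(year) + timedelta(weeks=number - 1 + k)
    # Find the MMWR year whose first week starts on or before `day`.
    result_year = day.year + 1
    while mmwr_year_start(result_year) > day:
        result_year -= 1
    result_number = (day - mmwr_year_start(result_year)).days // 7 + 1
    return "%d.%02d" % (result_year, result_number)


print(add_weeks("2015.50", 10))  # 2016.08
print(add_weeks("2014.52", 1))   # 2014.53 (2014 has 53 MMWR weeks)
```

Going through calendar dates avoids hard-coding which years have 53 weeks.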

Population data

The populations form a geographic containment hierarchy:

The United States is the root
HHS Regions 1 – 10 divide the US at the lowest resolution
An HHS region is composed of several states, identified with two-letter postal codes
A state contains counties, each of which has a five-digit FIPS code

For states and counties, demographics.json provides population count estimates and other demographic information from the US Census Bureau, under the "data" key. For counties, a list of FIPS codes of adjacent counties is also included. FIPS codes of the counties or states that constitute each population can be found using the hierarchy under the "indices" key.

ILI rates

The CDC has a voluntary flu surveillance program called ILINet. Each week, participating clinics submit counts of patients diagnosed with ILI to their state health department, along with their total patient counts. These counts form the basis for the published rates for each HHS region. Some state health departments also publish their weekly rates directly.

Filename	populations	source	first week	last week	notes
`USA-flu.csv`	Continental United States and 10 HHS Regions	CDC ILINet	1997.40	2015.29	off-season data missing from early years
`MS-flu.csv`	Mississippi and 9 Public Health Districts	MS Department of Health	2012.48	2015.20	no off-season data
`NC-flu.csv`	North Carolina	NC Division of Public Health	2001.40	2015.20	includes diagnosed and total patient counts; no off-season data
`NJ-flu.csv`	New Jersey and 21 counties	NJ Department of Health	2005.39	2015.20	includes reported rates from long-term care facilities (`.ltc`), schools (`.sch`), and emergency clinics (`.emr`) in each county; off-season data only for 2009
`RI-flu.csv`	Rhode Island	RI Department of Health	2013.40	2015.20	includes rates for five age ranges; no off-season data
`TN-flu.csv`	Tennessee and 13 Health Regions	TN Department of Health	2009.32	2015.29	six regions are individual counties; includes off-season data
`TX-flu.csv`	Texas	TX Department of State Health Services	2005.40	2015.29	includes counts for four or five patient age ranges; includes off-season data starting in 2009

Some state health departments publish additional data which may be useful. The column headers in each CSV are of the form [POP].[VAR], with each VAR described in the table below.

Variable name	populations	meaning
`%ILI`	MS, NC, RI, TN, TX, USA	percentage of patients diagnosed with ILI
`#ILI`	NC, TX	number of patients diagnosed with ILI
`#patients`	NC, TX	total number of patients
`#sites`	TX	number of clinics reporting
`ltc`	NJ	percentage of ILI patients in long-term care facilities
`sch`	NJ	percentage of ILI patients in schools
`emr`	NJ	percentage of ILI patients in hospital emergency departments
`age[H]-[L]`	RI, TX	percentage of ILI patients between ages H and L, inclusive

Note: The TX health department reported ages in four bins before week 2009.40, and five bins afterward. For those later weeks, column `TX.age25-64*` contains data for ages 25 – 49.
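As a small illustration of the [POP].[VAR] header convention, the sketch below splits headers into population and variable names and groups variables by population. The helper names are hypothetical, and the header list in the example is made up from names mentioned in this document. The `ili_percent` helper assumes that `%ILI` is simply `#ILI` divided by `#patients`, expressed as a percentage, which follows from the variable descriptions but is not confirmed against the data.

```python
from collections import defaultdict


def split_header(header):
    """Split 'TN.%ILI' into ('TN', '%ILI') at the first dot."""
    pop, sep, var = header.partition(".")
    if not sep or not pop or not var:
        raise ValueError("not a [POP].[VAR] header: %r" % header)
    return pop, var


def group_by_population(headers):
    """Map each population code to the list of its variable names."""
    groups = defaultdict(list)
    for header in headers:
        pop, var = split_header(header)
        groups[pop].append(var)
    return dict(groups)


def ili_percent(n_ili, n_patients):
    """Assumed relationship: %ILI = 100 * #ILI / #patients."""
    return 100.0 * n_ili / n_patients


# Illustrative header list only; real files may differ.
print(group_by_population(["TX.#ILI", "TX.#patients", "TX.age25-64*"]))
print(ili_percent(5, 200))  # 2.5
```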

Tweet counts

Geo-located tweets which included the words 'flu' or 'influenza' during a four-year period were aggregated to populations and MMWR weeks and counted to form a social media data set.

To support adjustments for the differences between the Twitter user base and the general US population, a small table of demographic information is provided. The column labeled % 2016 (US) shows percentages of Twitter users in various demographic categories among all US adults, while other columns show percentages among US adult internet users.

Filenames	source
`[POP]-tweets.csv`	GNIP Historical PowerTrack
`twitter-demographics.csv`	Pew Research Social Media Updates 2016 and 2014

Variable name	meaning
`tc`	tweet counts

Medicare vaccination claims

Medicare recipients are eligible for subsidized flu vaccinations. The National Vaccine Program Office tracks the total number of eligible recipients for each county and flu season, for all ages as well as for those 65 and older. The NVPO records the vaccinated percentage of those eligible on a weekly basis. These percentages are cumulative and thus non-decreasing over a flu season.

Filenames	source
`[POP]-vaccinations.csv`	US Department of Health & Human Services National Vaccine Program Office

Variable name	meaning
`all`	total number of eligible recipients
`allV%`	percentage of eligible recipients vaccinated
`65+`	number of eligible recipients age 65 and over
`65+V%`	percentage of eligible recipients age 65+ vaccinated
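Because the vaccination percentages are cumulative, a model may prefer weekly gains or absolute counts. The following sketch shows one way to derive both. The function names are hypothetical, the numbers are invented for illustration, and the code assumes that coverage starts from zero at the beginning of each season.

```python
def weekly_increments(cumulative_pct):
    """Convert cumulative percentages into per-week percentage-point gains.

    Assumes the series covers one season and starts from zero coverage.
    """
    increments = []
    previous = 0.0
    for value in cumulative_pct:
        if value < previous:
            raise ValueError("cumulative series decreased")
        increments.append(value - previous)
        previous = value
    return increments


def vaccinated_count(eligible, pct):
    """Estimate the number vaccinated from an eligible count and a percentage."""
    return eligible * pct / 100.0


# Made-up values standing in for an `allV%` column and an `all` value.
print(weekly_increments([5.0, 12.5, 20.0]))  # [5.0, 7.5, 7.5]
print(vaccinated_count(2000, 12.5))          # 250.0
```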

Weather data

In temperate climates like the continental US, flu epidemics are much more prevalent during cold weather. To encourage teams to explore this correlation, aggregated weather data are provided for each MMWR week and each population.

Filenames	source
`[POP]-weather.csv`	National Oceanic and Atmospheric Administration GHCN-Daily

Variable name	meaning
`Tmax`	mean daily high temperature, in degrees Celsius
`Tmin`	mean daily low temperature, in degrees Celsius
`prcp`	mean daily precipitation, in millimeters
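One simple way to start exploring the temperature correlation is a Pearson coefficient between a weather column and an ILI column over the same weeks. The sketch below is only an illustration: the `pearson` helper is written out by hand, and the two short series are invented values, not taken from the provided data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / (norm_x * norm_y)


# Invented weekly values: falling `Tmin` alongside rising `%ILI`.
tmin = [12.0, 8.0, 3.0, -1.0, -4.0]
ili = [1.0, 1.5, 2.4, 3.1, 3.9]
r = pearson(tmin, ili)
print(r < 0)  # True: colder weeks pair with higher ILI in this toy series
```

In practice a lagged correlation (weather in week w against ILI in week w + k) may be more informative, since any effect of cold weather would not be instantaneous.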

Problem statement

Denoting ILI rate data for population p and week w as I_pw, and similarly for all covariates C, a forecaster for week n extending m weeks forward can be described as a function

F_p,n,m : { I_p(w-2) , C_pw | w ≤ n ;∀ p } → { I_pw | n ≤ w ≤ n + m }

Given data sets I and C, as described above, solutions will produce a set of forecasts { F_p,n,m(I, C) } where

Parameter	value
p	teams may choose any of the 60 populations, but must include HHS Region 4, TN state, and Knox County (TN.D10 = FIPS 47093)
n	each of the weeks 2015.40 ... 2016.20
m	0 ... remaining weeks in target season

Solutions may use both I and C data from any of the given populations to predict ILI rates for a specific population p. We are interested in how prediction accuracy for a given forecast period improves throughout the season, as new data is made available.

Each forecast will be evaluated against ground truth data { I_pw | n ≤ w ≤ n + m } and assigned a sum of squared errors (SSE) score s for each successive forecast period (week n, weeks n through n + 1, ... , weeks n through n + m).
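The per-period scoring described above amounts to a running sum of squared errors. The sketch below is a minimal illustration with a hypothetical function name and made-up numbers; it is not the included evaluator script.

```python
from itertools import accumulate


def cumulative_sse(forecast, truth):
    """Return SSE for weeks n, n..n+1, ..., n..n+m as a list of m + 1 scores."""
    if len(forecast) != len(truth):
        raise ValueError("forecast and truth must cover the same weeks")
    squared_errors = [(f - t) ** 2 for f, t in zip(forecast, truth)]
    return list(accumulate(squared_errors))


# Made-up forecast and ground truth for weeks n, n + 1, n + 2.
scores = cumulative_sse([2.0, 2.5, 3.0], [2.0, 3.0, 5.0])
print(scores)  # [0.0, 0.25, 4.25]
```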

Note that if m = 0 then the problem is a simulated nowcast rather than a forecast: the goal is to infer "current" ILI rates in the target populations, as of week n, from both current and historical covariate data as well as historical ILI data up to two weeks previous.
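To make the shape of the forecaster concrete, here is a deliberately naive persistence baseline: it ignores covariates and repeats the most recent observed ILI rate (from week n - 2) for every target week. It is offered only as a sketch of the input and output sizes, with a hypothetical function name and invented values, not as a recommended method.

```python
def persistence_forecast(ili_history, m):
    """Forecast weeks n .. n + m by repeating the last observed ILI rate.

    `ili_history` holds rates for weeks up to n - 2, oldest first.
    Returns m + 1 values; m = 0 yields a single nowcast for week n.
    """
    if not ili_history:
        raise ValueError("need at least one observed ILI rate")
    if m < 0:
        raise ValueError("m must be non-negative")
    return [ili_history[-1]] * (m + 1)


print(persistence_forecast([1.0, 1.4, 1.9], 0))  # [1.9]
print(persistence_forecast([1.0, 1.4, 1.9], 2))  # [1.9, 1.9, 1.9]
```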

In the chart above, n = 2014.40, m = 10, and p = HHS Region 4. Data points from three of the eight covariates are shown in warm colors. (The * after their names indicates that they have been multiplied by scalars to fit on the chart.) While I and C data prior to week 2013.32 are not shown, they are available to the forecaster function. For evaluation, solutions will target the flu season beginning in week 2015.40, but teams are encouraged to test their solutions on data from previous seasons.

Evaluation protocol

In concrete terms, ILI rates I_p(w-2) and covariates C_pw for all 60 populations p and weeks w, 2015.20 < w ≤ n ≤ 2016.20, will be represented in a single file named week-n.txt. This file will consist of concatenated CSV data of the same format as those provided, preceded by filenames, and followed by blank lines. The basic idea is that each line of data could be appended to the appropriate CSV to continue the time series. An example file is provided with data from the 2014–2015 season, covering weeks 2014.21 through 2014.42. Because the CDC and health departments only publish ILI rates after a two-week delay, while data from other sources are available sooner, week-n.txt will contain data from the start of the season up to and including week n for tweets, weather, and vaccination variables, but only up to week n - 2 for ILI variables.
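A reader for this kind of file might look like the sketch below. It assumes each section is a filename line, followed by CSV data lines, followed by at least one blank line; whether sections repeat the CSV header row is not stated here, so the parser simply returns every non-filename line as a row. The function name, the sample filenames, and the sample values are all invented for illustration; the provided example file is the authority on the exact layout.

```python
import csv


def parse_week_file(text):
    """Split concatenated CSV sections into {filename: list of rows}."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None            # a blank line ends the current section
        elif current is None:
            current = line.strip()    # first line of a section is the filename
            sections[current] = []
        else:
            sections[current].append(next(csv.reader([line])))
    return sections


# Invented sample for n = 2015.40: ILI rows stop at n - 2, weather rows at n.
sample = (
    "USA-flu.csv\n"
    "2015.37,1.1\n"
    "2015.38,1.2\n"
    "\n"
    "USA-weather.csv\n"
    "2015.39,27.5,16.0,3.1\n"
    "2015.40,26.0,15.2,2.4\n"
    "\n"
)
sections = parse_week_file(sample)
print(sections["USA-flu.csv"][-1])  # ['2015.38', '1.2']
```

Week values are left as strings so that "2015.40" is not collapsed to 2015.4.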

For each evaluation week n, 2015.40 ≤ n ≤ 2016.20, solutions should read the week-n.txt file (as well as the contents of the present data directory) and produce a similar file forecast-n.txt, containing only lines with forecast ILI rates for weeks n through n + m for each target population [POP]-flu.csv.

Note that the first evaluation data file, week-2015.40.txt, will include off-season baseline data for the 20 weeks 2015.21 through 2015.40, for all populations and covariates where this off-season data is available. It will contain at most 18 weeks of data for ILI variables in the populations (TN, TX, and USA) which report off-season ILI rates, and fewer in the other populations, which do not. The second evaluation file, week-2015.41.txt, will have covariate data for 21 weeks (and ILI data for 19); the third for 22 weeks, and so on.

A solution should present a command line interface in the form of a shell script with arguments:

    run.sh [CONFIG-FILE] [DATA-DIR] [WEEK-FILE]

so that a command like

    $ run.sh solution.conf ../data/ week-2015.40.txt

writes the file forecast-2015.40.txt.

The configuration file should include some representation of forecast length m and populations p. At minimum, the output forecast file must include row n + 1 for column R04.%ILI of file USA-flu.csv and columns TN.%ILI, D10.%ILI of file TN-flu.csv. Solutions may save intermediate results between program runs to avoid recomputing models for each evaluation week.
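If run.sh delegates to a Python program, the argument handling and output naming could be sketched as follows. The function names are hypothetical, and the sketch assumes the forecast file takes its week from the week file's name; where the forecast file is written, and the format of the configuration file, are left to each solution.

```python
import argparse
import os
import re


def forecast_name(week_file):
    """Map '.../week-2015.40.txt' to 'forecast-2015.40.txt'."""
    name = os.path.basename(week_file)
    match = re.fullmatch(r"week-(\d{4}\.\d{2})\.txt", name)
    if match is None:
        raise ValueError("unexpected week file name: %r" % name)
    return "forecast-" + match.group(1) + ".txt"


def parse_args(argv):
    """Parse the three positional arguments that run.sh passes along."""
    parser = argparse.ArgumentParser(prog="run.sh")
    parser.add_argument("config_file")
    parser.add_argument("data_dir")
    parser.add_argument("week_file")
    return parser.parse_args(argv)


# Example with the arguments from the command shown above.
args = parse_args(["solution.conf", "../data/", "week-2015.40.txt"])
print(forecast_name(args.week_file))  # forecast-2015.40.txt
```

A real entry point would pass `sys.argv[1:]` to `parse_args` instead of the literal list.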

Teams are encouraged (but not required) to submit solutions via GitHub, by forking this repository and adding program code outside the data directory.

Evaluator script

A Python script is included to calculate SSE scores of forecasts. It should run under both Python 2 and 3. If passed the -p flag, it can chart the forecast and ground-truth data together, using the standard matplotlib plotting package.

It requires a target and reference file, both containing CSV data over the same range of weeks. By passing -c [COLUMN], a specific column can be selected by name or (zero-based) index. If -c is omitted, the variable of interest is assumed to be in column 1.
