
June 13, 2017

USGS Model Analysis

In looking back at the USGS model's 2016 performance, we've noticed that the model performed better at some of the beaches with the highest E. coli rates:

| Beaches | True Positive Rate |
| --- | --- |
| 63rd, Calumet, Montrose, Rainbow, Rogers, South Shore | 18.5% |
| 12th, 31st, 57th, Albion, Foster, Howard, Jarvis, Juneway, Leone, North Avenue, Oak Street, Ohio, Osterman, 39th | 1.8% |

Compare those rates with the USGS model's overall 2016 True Positive Rate of 11%.
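For reference, a per-beach true positive rate like the ones above can be computed from a table of daily predictions and culture results. The sketch below is minimal and assumes a hypothetical CSV with beach, predicted_advisory, and actual_exceedance columns; it is not the project's actual file layout.

```python
import pandas as pd

# Hypothetical input: one row per beach-day with a boolean model prediction
# ("predicted_advisory") and a boolean culture-based outcome ("actual_exceedance").
df = pd.read_csv("usgs_2016_predictions.csv")  # assumed file name
df["predicted_advisory"] = df["predicted_advisory"].astype(bool)
df["actual_exceedance"] = df["actual_exceedance"].astype(bool)

# True positive rate per beach: of the days that actually exceeded the standard,
# what fraction did the model flag?
exceedance_days = df[df["actual_exceedance"]]
tpr_by_beach = exceedance_days.groupby("beach")["predicted_advisory"].mean()

print(tpr_by_beach.sort_values(ascending=False))
```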

May 30, 2017

Via @nicklucius and @callinosborn, here are the results of the reformulated models (see this thread for more info), using 2016 as the baseline:

Environmental Model

The so-called "day ahead" model developed by ChiHackNight, which uses E. coli data to predict levels on the morning of culture-based testing.

  • Predicts: 14 beaches
  • Precision: 52.8%

|              | Predicted False   | Predicted True  |
| ------------ | ----------------- | --------------- |
| Actual False | 1106 (98.5%) - TN | 17 (1.5%) - FP  |
| Actual True  | 118 (89.4%) - FN  | 14 (10.6%) - TP |

USGS Model (2016)

  • Predicts: 14 beaches
  • Precision: 52.8%

|              | Predicted False   | Predicted True  |
| ------------ | ----------------- | --------------- |
| Actual False | 1538 (98.2%) - TN | 28 (1.8%) - FP  |
| Actual True  | 203 (89.0%) - FN  | 25 (11.0%) - TP |

DNA Model 1

Uses DNA testing to predict E. coli levels at beaches where DNA testing is not available. Cross-validation is done by comparing projected levels with the culture-based results collected the same day (see the sketch after the table below). This model maintains the same level of false positives as the USGS model above.

  • Predicts: 14 beaches
  • Cross-validation with 10 folds on 2015-2016 data
  • Precision: 65.4%

|              | Predicted False   | Predicted True  |
| ------------ | ----------------- | --------------- |
| Actual False | 1538 (98.2%) - TN | 28 (1.8%) - FP  |
| Actual True  | 175 (76.8%) - FN  | 53 (23.2%) - TP |
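For context, the 10-fold cross-validation described above could be set up roughly as follows. This is a minimal sketch under assumptions: a hypothetical feature matrix X (DNA-based and other predictors), a boolean target y marking culture-based exceedances, and a placeholder classifier; it is not the project's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Hypothetical inputs (assumed file names and contents).
X = np.load("features_2015_2016.npy")
y = np.load("exceedances_2015_2016.npy").astype(bool)  # True = culture-based exceedance

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
tp = fp = fn = tn = 0
for train_idx, test_idx in cv.split(X, y):
    model = RandomForestClassifier(n_estimators=500, random_state=0)  # placeholder model
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx]).astype(bool)
    tp += np.sum(pred & y[test_idx])
    fp += np.sum(pred & ~y[test_idx])
    fn += np.sum(~pred & y[test_idx])
    tn += np.sum(~pred & ~y[test_idx])

print("precision:", tp / (tp + fp), "TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))
```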

DNA Model 2

This model allows more false positives in exchange for more true positives.

  • Predicts: 14 beaches
  • Cross-validation with 10 folds on 2015-2016 data
  • Precision: 65.4%

|              | Predicted False   | Predicted True  |
| ------------ | ----------------- | --------------- |
| Actual False | 1490 (95.1%) - TN | 76 (4.9%) - FP  |
| Actual True  | 145 (63.6%) - FN  | 83 (36.4%) - TP |

October 6, 2016

The results from the first run were not predicting at an acceptable rate. One likely problem is the choice of variables. The following variables were used in all 3 of the models:

Client.ID windVectorX_hour_-5 windVectorY_hour_-9 group_prior_mean windVectorY_hour_0 temperature_hour_4 temperature_hour_-5 temperature_hour_0 windVectorY_hour_4 accum_rain categorical_beach_grouping 12hrPressureChange windVectorX_hour_0 temperature_hour_-19 windVectorX_hour_4 temperature_hour_-14 windVectorX_hour_-14 previous_reading cloudCover_hour_-15 humidity_hour_4 windVectorX_hour_-9 windVectorY_hour_-19 windVectorY_hour_-5 Collection_Time windVectorX_hour_-19 pressure_hour_0 temperature_hour_-9 windVectorY_hour_-14 2_day_prior_Escherichia.coli 3_day_prior_Escherichia.coli 4_day_prior_Escherichia.coli 5_day_prior_Escherichia.coli 6_day_prior_Escherichia.coli 7_day_prior_Escherichia.coli 2_day_prior_temperatureMax 3_day_prior_temperatureMax 4_day_prior_temperatureMax 2_day_prior_windVectorX 2_day_prior_windVectorY 1_day_prior_pressure 2_day_prior_pressure 1_day_prior_dewPoint 2_day_prior_dewPoint trailing_average_daily_Escherichia.coli trailing_average_daily_temperatureMax trailing_average_daily_pressure trailing_average_daily_dewPoint trailing_average_hourly_temperature trailing_average_hourly_windVectorX trailing_average_hourly_windVectorY

In an attempt to cut down on the number of variables, and to reduce overfitting somewhat, the basics were looked at here. The following basic assessments were taken from the graphs (a sketch of this kind of grouping follows the list below):

  • The longer the swimming season goes on, the higher E. coli levels become, until about Aug. 10th; they decrease dramatically afterwards.
  • The 2006 and 2007 seasons seem abnormally high, especially 2007.
  • Mondays don't have as many high E. coli days as the rest of the weekdays.
  • The orientation of the beach seems to make a difference: north-facing beaches seem higher than beaches that face east.
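A minimal sketch of the kind of grouping behind those observations, assuming a hypothetical table of daily culture readings with Date, Beach, and Escherichia.coli columns (the column names are illustrative; the 235 CFU/100 mL advisory threshold is used as the cut-off for a "high" day):

```python
import pandas as pd

# Hypothetical daily culture readings; file and column names are illustrative.
df = pd.read_csv("daily_culture_readings.csv", parse_dates=["Date"])
high = df["Escherichia.coli"] > 235  # 235 CFU/100 mL advisory threshold

# Seasonal pattern: mean reading by day of year (peaks near early August per the notes).
by_day_of_year = df.groupby(df["Date"].dt.dayofyear)["Escherichia.coli"].mean()

# Year effect: share of high days per season.
by_year = high.groupby(df["Date"].dt.year).mean()

# Day-of-week effect: share of high days by weekday.
by_weekday = high.groupby(df["Date"].dt.day_name()).mean()

# Beach effect: share of high days by beach.
by_beach = high.groupby(df["Beach"]).mean()

print(by_year.sort_values(ascending=False).head())
```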

October 3, 2016

Running through the last models, it was noticed that cloudCover was not auto-populating from DarkSky.net, which caused rows containing NULL values to be unnecessarily dropped from the analysis. With that corrected, the following are the new baseline matrices. The cut-off points used to obtain the baseline confusion matrices for 2016 were RF = 100 and GBM = 1000.

2016 Preliminary Confusion Matrices:

  • Consensus Matrix = All 3 models predicting TRUE
  • Democratic Matrix = Any 2 of 3 models predicting TRUE (both voting rules are sketched below)
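Combined with the cut-off points above, the voting rules can be expressed roughly as below. This is a minimal sketch with hypothetical score arrays and file names; the "Singular" matrix further down is read here as at least one model predicting TRUE, which is an assumption.

```python
import numpy as np

# Hypothetical per-observation model outputs (not the project's actual files).
rf_score = np.load("rf_scores_2016.npy")               # continuous RF output
gbm_score = np.load("gbm_scores_2016.npy")             # continuous GBM output
svc_pred = np.load("svc_preds_2016.npy").astype(bool)  # SVC class predictions

# Apply the baseline cut-off points quoted above.
rf_pred = rf_score >= 100
gbm_pred = gbm_score >= 1000

votes = rf_pred.astype(int) + gbm_pred.astype(int) + svc_pred.astype(int)
consensus = votes == 3   # Consensus Matrix: all 3 models predict TRUE
democratic = votes >= 2  # Democratic Matrix: any 2 of the 3 models predict TRUE
singular = votes >= 1    # assumed reading of the Singular Matrix: at least 1 model predicts TRUE
```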

[Consensus / 33.3%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 1106          | 14 / 2.81%   |
| Actual True  | 129           | 7 / 5.14%    |

[Democratic / 15.2%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 809           | 329 / 28.9%  |
| Actual True  | 77            | 59 / 43.4%   |

[Singular / 12.4%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 383           | 755 / 66.3%  |
| Actual True  | 37            | 107 / 78.7%  |

[SVC Model / 8.9%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 872           | 266 / 23.4%  |
| Actual True  | 110           | 26 / 19.1%   |

[RF Model / 12.6%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 546           | 592 / 52.0%  |
| Actual True  | 51            | 85 / 62.5%   |

[GBM Model / 14.9%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 880           | 258 / 22.7%  |
| Actual True  | 74            | 62 / 44.9%   |

Combination Matrices:

The GBM model in the preliminary matrices seems to be performing at a higher rate than the other 2 models. This can be examined further by looking at the pairwise combinations of the 3 models:

[RF and GBM / 19.7%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 954           | 184 / 16.2%  |
| Actual True  | 91            | 45 / 33.1%   |

[SVC and GBM / 16.9%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 1079          | 59 / 5.2%    |
| Actual True  | 124           | 12 / 8.8%    |

[SVC and RF / 5.3%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 998           | 150 / 13.2%  |
| Actual True  | 120           | 16 / 11.7%   |

[Weighted Democratic / 19.2%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 927           | 211 / 18.5%  |
| Actual True  | 86            | 50 / 36.8%   |

September 27, 2016

There are now 3 models for prediction: Random Forest (RF), Gradient Boosting (GBM), and Support Vector Classifier (SVC). Using those 3 models, we have produced confusion matrices on the 2016 data to show what the 2016 season would have looked like had we been using the models. The cut-off points used to obtain the preliminary confusion matrices for 2016 were RF = 4.8 and GBM = 7.01.
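For orientation, the three model types can be fit with scikit-learn roughly as below. This is a minimal sketch under assumptions: hypothetical input files, RF and GBM treated as regressors whose continuous output (possibly a log-transformed E. coli level, which is an assumption) is thresholded at the cut-off points, and an SVC predicting the exceedance class directly. The project's actual features, targets, and hyperparameters are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVC

# Hypothetical inputs (assumed file names and contents).
X_train = np.load("X_train.npy")
X_test = np.load("X_test.npy")
y_level = np.load("y_train_level.npy")                  # continuous E. coli level (assumption)
y_exceed = np.load("y_train_exceed.npy").astype(bool)   # boolean advisory exceedance

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_level)
gbm = GradientBoostingRegressor(random_state=0).fit(X_train, y_level)
svc = SVC(kernel="rbf").fit(X_train, y_exceed)

rf_pred = rf.predict(X_test) >= 4.8     # cut-off point from the note above
gbm_pred = gbm.predict(X_test) >= 7.01  # cut-off point from the note above
svc_pred = svc.predict(X_test).astype(bool)
```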

The measures from the matrices that will be used to determine a desirable model are the false-positive rate (FPR), the true-positive rate (TPR), and precision. An example matrix with the measures is shown below, followed by the formulas and a short sketch of how they are computed:

[Matrix Name / Precision]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | TN            | FP / FPR     |
| Actual True  | FN            | TP / TPR     |
  • FPR = FP/(FP+TN)
  • TPR = TP/(TP+FN)
  • PRECISION = TP/(TP+FP)
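A minimal sketch of these three measures computed from the cells of a confusion matrix (the example values are illustrative only, not taken from any matrix in this note):

```python
def matrix_measures(tn, fp, fn, tp):
    """Compute FPR, TPR, and precision from the cells of a confusion matrix."""
    return {
        "FPR": fp / (fp + tn),
        "TPR": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

# Illustrative values only.
print(matrix_measures(tn=1000, fp=50, fn=80, tp=20))
```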

[2015 Matrix / 44.8%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 1302          | 16 / 2.31%   |
| Actual True  | 187           | 13 / 6.5%    |

In the end, the 2015 matrix is the standard that will hopefully be improved upon in the future.


2016 Preliminary Confusion Matrices:

  • Consensus Matrix = All 3 predicting TRUE
  • Democratic Matrix = Any 2 of 3 models predicting TRUE

[Consensus / 6.7%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 591           | 14 / 2.31%   |
| Actual True  | 38            | 1 / 2.56%    |

[Democratic / 9.9%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 541           | 64 / 10.6%   |
| Actual True  | 32            | 7 / 17.9%    |

[Singular / 7.0%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 286           | 319 / 52.7%  |
| Actual True  | 15            | 24 / 61.5%   |

[SVC Model / 5.2%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 329           | 276 / 45.6%  |
| Actual True  | 24            | 15 / 45.6%   |

[RF Model / 7.8%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 558           | 47 / 7.8%    |
| Actual True  | 35            | 4 / 10.2%    |

[GBM Model / 14.9%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 531           | 74 / 12.2%   |
| Actual True  | 26            | 13 / 33.3%   |

Combination Matrices:

The GBM model in the preliminary matrices seems to be performing at a higher rate than the other 2 models. This can be examined further by looking at the pairwise combinations of the 3 models:

[RF and GBM / 14.3%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 587           | 18 / 3.0%    |
| Actual True  | 36            | 3 / 7.7%     |

[SVC and GBM / 8.7%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 563           | 42 / 6.9%    |
| Actual True  | 35            | 4 / 10.3%    |

[SVC and RF / 5.3%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 573           | 36 / 5.3%    |
| Actual True  | 37            | 2 / 5.1%     |

The combination matrices demonstrate a couple of things. First, the GBM combines well with the other models, producing a higher TPR than FPR. Second, the RF and SVC do not combine well. With this in mind, a model that combines the GBM-containing combinations while staying away from the SVC/RF combination (the weighted democratic model) should perform better than the preliminary democratic model.
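One plausible reading of that rule, and an assumption on my part, is: predict TRUE when the GBM predicts TRUE and at least one of RF or SVC agrees (this is consistent with the weighted democratic counts reported below for this section). A minimal sketch, reusing the hypothetical boolean prediction arrays from the earlier sketches:

```python
import numpy as np

# Hypothetical boolean prediction arrays, as in the earlier sketches (assumed files).
rf_pred = np.load("rf_preds_2016.npy").astype(bool)
gbm_pred = np.load("gbm_preds_2016.npy").astype(bool)
svc_pred = np.load("svc_preds_2016.npy").astype(bool)
actual = np.load("actual_exceedances_2016.npy").astype(bool)

# Assumed "weighted democratic" rule: GBM must predict TRUE, plus at least one of RF/SVC.
weighted_democratic = gbm_pred & (rf_pred | svc_pred)

tp = np.sum(weighted_democratic & actual)
fp = np.sum(weighted_democratic & ~actual)
fn = np.sum(~weighted_democratic & actual)
print("precision:", tp / (tp + fp), "TPR:", tp / (tp + fn))
```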

[Weighted Democratic / 11.5%]

|              | Predict False | Predict True |
| ------------ | ------------- | ------------ |
| Actual False | 559           | 46 / 7.6%    |
| Actual True  | 33            | 6 / 15.4%    |

The weighted democratic model overall performs better than the preliminary democratic model.

In general, the precision is lacking in all of the matrices that have been researched. Increasing precision is the goal for the future models.