-
Notifications
You must be signed in to change notification settings - Fork 1
/
bayesian_nonparametric_ensemble.qmd
384 lines (308 loc) · 16 KB
/
bayesian_nonparametric_ensemble.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
---
title: "Bayesian Nonparametric Ensemble (`BNE`)"
subtitle: "SHARP Bayesian Modeling for Environmental Health Workshop"
author: "Jaime Benavides"
date: "August 15 2023"
format: html
---
```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(here)
library(tidyverse)
library(bayesplot)
library(hrbrthemes)
library(scico)
library(cowplot)
extrafont::loadfonts()
theme_set(theme_ipsum())
color_scheme_set(scheme = "viridis")
set.seed(2)
```
## Goal of this computing lab session
The goal of this lab is to take some concepts from the Introduction to Bayesian Nonparametric Ensemble (`BNE`) lecture and introduce you to the way `BNE` works.
## What's going to happen in this lab session?
During this lab session, we will:
1. Explore several air pollution concentration prediction models;
2. Understand how air pollution concentration models can disagree;
3. Understand what `BNE` does;
4. Explore output of `BNE`; and
5. Understand why `BNE` is useful for estimating uncertainty of exposures.
## Introduction
As seen in the lecture just now, `BNE` is a methodology which incorporates multiple exposure models (originally for air pollution) and generates uncertainty estimates by weighting ensemble members dynamically over space and time. As a Bayesian model, `BNE` generates a full posterior predictive distribution for predicted concentrations at each point in space and time, then estimates point-specific uncertainty which reflects, in part, base model disagreement, as well as inherent measurement uncertainty.
`BNE` code is currently written in MATLAB programming language (folder bne_v2_matlab) and translated to Octave open source programming language (folder bne_v2_octave) for this course.
In this lab, we fit an ensemble model integrating multiple existing prediction models and estimate uncertainty in PM$_{2.5}$ predictions.
Predictions refers to the predicted value at some location and time. Note that for `BNE`, prediction does not refer to future values, but values within our study period + location.
Here we will estimate 2015 annual PM$_{2.5}$ concentrations at 0.125°× 0.125° resolution across the contiguous US by combining seven well-validated prediction models with `BNE`.
We trained the model on measurements reported in the US Environmental Protection Agency's Air Quality System (AQS) database.
We estimated model uncertainty as the standard deviations of predictions’ location-specific posterior predictive distribution.
## Introducing `BNE` input model data
We will start by exploring annual average PM$_{2.5}$ concentrations from input base models, which are the individual prediction models that are incorporated into `BNE` to ultimately generate predictions.
These models, along with AQS monitor data, will be used to train `BNE`.
![Table: Input base prediction models included](images/table_input_base_models.png)
## Introducing `BNE` training data
Load training dataset containing input base models and AQS observations for PM$_{2.5}$ during the period 2010-2015.
To create the training dataset, we spatially joined each model to the EPA AQS dataset, using the prediction that was closest to each monitor (i.e., nearest neighbors).
```{r}
aqs_preds <- read_csv(here("data", "us", "training_cvfolds.csv"))
summary(aqs_preds)
```
## Load prediction dataset
To create the prediction dataset, we first created a reference grid, and then joined each prediction dataset by nearest neighbour.
This dataset also contains `BNE` results that will be explored in next steps.
```{r}
bne2015 <- read_csv(here("data", "us", "refGrid_pm25_comp_seb_vers_2_2015_2_0-5_2_0-5_2_0-0498_0-1353.csv"))
```
## Plotting and exploring input base models
Let's now plot each of the base models that feed into `BNE` and compare and contrast what they look like.
First, we need to load plotting function `plotOneParameterSpatial`, which will be used for plotting maps. This is a pre-made function which you can explore in your own time.
```{r}
source(here("labs", "bayesian_nonparametric_ensemble", "functions", "STR_1_plotOneParameterSpatial.R"))
```
What do each of the input base model predictions look like for the year 2015 over the continental United States?
```{r}
# generate value range and value breaks for the plot
varVec <- c(
0, bne2015$pred_av, bne2015$pred_cc, bne2015$pred_cm,
bne2015$pred_gs, bne2015$pred_js, bne2015$pred_me,
bne2015$pred_rk
)
varRange <- max(varVec) - min(varVec)
varScale <- c(
round(min(varVec), 2), round(min(varVec) + 0.25 * varRange, 2),
round(min(varVec) + 0.5 * varRange, 2), round(min(varVec) + 0.75 * varRange, 2),
round(max(varVec), 2)
)
# Assign coordinate reference system as needed on the plotting function
projCRS <- "+init=epsg:4326" # alternative epsg:9311 not currently working
# use function plotOneParameterSpatial
```
:::aside
See above Table for model descriptions
:::
Plot AV
```{r}
plotOneParameterSpatial(
dta = bne2015, parameterName = "pred_av",
mainTitle = "Prediction of AV Model", valueScale = varScale
)
```
Plot CACES
```{r}
plotOneParameterSpatial(
dta = bne2015, parameterName = "pred_cc",
mainTitle = "Prediction of CACES", valueScale = varScale
)
```
Plot CMAQ-Fusion
```{r}
plotOneParameterSpatial(
dta = bne2015, parameterName = "pred_cm",
mainTitle = "Prediction of CMAQ-Fusion", valueScale = varScale
)
```
Plot GBD
```{r}
plotOneParameterSpatial(
dta = bne2015, parameterName = "pred_gs",
mainTitle = "Prediction of GBD", valueScale = varScale
)
```
Plot JS
```{r}
plotOneParameterSpatial(
dta = bne2015, parameterName = "pred_js",
mainTitle = "Prediction of JS Model", valueScale = varScale
)
```
Plot MERRA
```{r}
plotOneParameterSpatial(
dta = bne2015, parameterName = "pred_me",
mainTitle = "Prediction of MERRA", valueScale = varScale
)
```
Plot RK
```{r}
plotOneParameterSpatial(
dta = bne2015, parameterName = "pred_rk",
mainTitle = "Prediction of RK Model", valueScale = varScale
)
```
How do input base model predictions compare with AQS observations? Let's generate the Root Mean Squared Error (RMSE) values by region of the continental United States
```{r}
aqs_preds_summ <- aqs_preds |>
mutate(
se_av = (pred_av - obs)^2,
se_cc = (pred_cc - obs)^2,
se_cm = (pred_cm - obs)^2,
se_gs = (pred_gs - obs)^2,
se_js = (pred_js - obs)^2,
se_me = (pred_me - obs)^2,
se_rk = (pred_rk - obs)^2
)
aqs_preds_summ_reg <- aqs_preds_summ |>
inner_join(readr::read_csv(here("data", "us", "key_state_region.csv")), by = "state") |>
inner_join(readr::read_csv(here("data", "us", "key_region_regionName.csv")), by = "region")
aqs_preds_summ_rmse <- aqs_preds_summ_reg |>
group_by(region_name) |>
summarize(
rmse_av = sqrt(mean(se_av)),
rmse_cc = sqrt(mean(se_cc)),
rmse_cm = sqrt(mean(se_cm)),
rmse_gs = sqrt(mean(se_gs)),
rmse_js = sqrt(mean(se_js)),
rmse_me = sqrt(mean(se_me)),
rmse_rk = sqrt(mean(se_rk))
) |>
pivot_longer(
cols = starts_with("rmse_"),
names_to = "model",
names_prefix = "rmse_",
values_to = "rmse",
values_drop_na = TRUE
)
p <- aqs_preds_summ_rmse |>
mutate(region_name = fct_relevel(
region_name,
"1: New England", "2: NY Area", "3: DelMarVa",
"4: Southeast", "5: Ohio Valley", "6: South Central",
"7: Midwest", "8: North Midwest", "9: Pacific Southwest", "10: Pacific Northwest"
)) |>
ggplot(aes(x = region_name, y = rmse, color = model)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Paired") +
scale_fill_brewer(palette = "Paired") +
labs(x = "US Region", y = "RMSE", color = "Model", shape = "Model") +
theme(
legend.title = element_text(size = 22),
legend.text = element_text(size = 18),
axis.title = element_text(size = 18),
axis.text = element_text(size = 15),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
lims(y = c(0, max(aqs_preds_summ_rmse$rmse)))
p
```
We can see that there is a lot of variation in RMSE, by region and by model.
## Running `BNE`
In this lab, we provide you with `BNE` source code both in `MATLAB` and `Octave`, an open source programming language.
In this part of the lab, we execute locally running_`BNE`_octave.qmd file running `BNE` on a subset of the contiguous US constrained to the South West area. You can run either the MATLAB or the Octave code in your computer using the scripts found at `BNE`_v2_matlab and/or `BNE`_v2_octave.
In order to run `BNE`, we need to:
1. Train `BNE` given input base model predictions and AQS observations;
2. Generate `BNE` predictions and uncertainty levels over an spatio-temporal context of interest.
`BNE` is an ensemble approach in which the weights vary smoothly over space and time.
In other words, for a particular point in space-time, the weight of an input model is informed by the model’s performance at nearby points.
This approach is particularly well-suited for modeling environmental factors that vary smoothly over space, such as temperature and air pollution, as individual models have been have accuracy that varies over time and space.
As a Bayesian model, `BNE` yields a Posterior Predictive Distribution (PPD) for each parameter of interest.
For example, if a model has relatively high performance in New England, but not in the South East region, `BNE` will assign high weights to that model in New England, but not in the South East.
`BNE` has adaptive weights that maximize predictive accuracy and, thus, minimize exposure measurement error.
Training `BNE` requires times & locations of ground-truth observations, the ground-truth observations, and the predictions at those times and locations of the input models.
Generating Predictions with `BNE` requires the PPD of the `BNE` parameters and a a tidy dataset of all of the input base model predictions.
The input base model predictions need to be at the same spatial scale (harmonized), so that the weights can be appropriately estimated.
## Explore `BNE` results
In this part of the lab, we explore the results of a `BNE` model run over the contiguous US for the year 2015 given the seven input base models previously introduced.
Model weights. In `BNE`, the weights of the ensemble members smoothly vary over time and space.
The model weights are required to sum to one, so the weighted model combination must stay within the range of the original predictions.
```{r}
varVec <- c(
0, bne2015$w_mean_av, bne2015$w_mean_cc, bne2015$w_mean_cm,
bne2015$w_mean_gs, bne2015$w_mean_js, bne2015$w_mean_me,
bne2015$w_mean_rk
)
varRange <- max(varVec) - min(varVec)
varScale <- c(
round(min(varVec), 2), round(min(varVec) + 0.25 * varRange, 2),
round(min(varVec) + 0.5 * varRange, 2), round(min(varVec) + 0.75 * varRange, 2),
round(max(varVec), 2)
)
plotOneParameterSpatial(
dta = bne2015, parameterName = "w_mean_av",
mainTitle = "Weight of AV Model", valueScale = varScale
)
plotOneParameterSpatial(
dta = bne2015, parameterName = "w_mean_cc",
mainTitle = "Weight of CACES", valueScale = varScale
)
plotOneParameterSpatial(
dta = bne2015, parameterName = "w_mean_cm",
mainTitle = "Weight of CMAQ-Fusion", valueScale = varScale
)
plotOneParameterSpatial(
dta = bne2015, parameterName = "w_mean_gs",
mainTitle = "Weight of GBD", valueScale = varScale
)
plotOneParameterSpatial(
dta = bne2015, parameterName = "w_mean_js",
mainTitle = "Weight of JS Model", valueScale = varScale
)
plotOneParameterSpatial(
dta = bne2015, parameterName = "w_mean_me",
mainTitle = "Weight of MERRA", valueScale = varScale
)
plotOneParameterSpatial(
dta = bne2015, parameterName = "w_mean_rk",
mainTitle = "Weight of RK Model", valueScale = varScale
)
```
## `BNE` Predicted concentration
`BNE` estimates a posterior predictive distribution (PPD) of estimated concentration for each point; we use the mean of the distribution as the point estimate of predicted concentration.
Once optimal weights are calculated, a residual process term is then estimated in a second stage to capture any systematic bias in the weighted model combination and capture uncertainty due to lack of measurements in a particular location.
The plot below shows the total `BNE` estimate summing both components: model ensemble and residual process.
```{r}
varVec <- c(bne2015$y_mean)
varRange <- max(varVec) - min(varVec)
varScale <- c(
round(min(varVec), 2), round(min(varVec) + 0.25 * varRange, 2),
round(min(varVec) + 0.5 * varRange, 2), round(min(varVec) + 0.75 * varRange, 2),
round(max(varVec), 2)
)
plotOneParameterSpatial(
dta = bne2015, parameterName = "y_mean",
mainTitle = "Predicted Concentration 2015", valueScale = varScale
)
```
## Predicted uncertainty levels.
One of the key outputs of `BNE` is the fact that it produces uncertainty of results. This has countless applications, including planning where to place additional monitors, to incorporating uncertainty into health models (as seen as a preview in the lecture).
How wide is the range of plausible values for the concentration at a particular location at a given time? The wider the range, the high the uncertainty, as there are more possible values for the concentration.
Here, we measure predictive uncertainty as the standard deviation (SD) of the PPD of the concentration.
Note that predictive uncertainty is given our evidence and our model; if we use a different model, then we may have a different amount of uncertainty.
A larger SD corresponds to a wider predictive distribution and greater uncertainty, i.e., at that point, `BNE` considers a wider range of concentrations to be plausible, given the base models and model structure.
```{r}
varVec <- c(bne2015$y_sd)
varRange <- max(varVec) - min(varVec)
varScale <- c(
round(min(varVec), 2), round(min(varVec) + 0.25 * varRange, 2),
round(min(varVec) + 0.5 * varRange, 2), round(min(varVec) + 0.75 * varRange, 2),
round(max(varVec), 2)
)
plotOneParameterSpatial(dta = bne2015, parameterName = "y_sd", extraPointObj = filter(aqs_preds, yyyy == 2015), mainTitle = "Predictive Uncertainty 2015", valueScale = varScale)
```
We also examine scaled uncertainty, in which we divided the SD by the predicted concentration. Scaled uncertainty may be more informative in some contexts where, for example, an SD of 1 µg/m3 is more problematic in an area with low concentrations (e.g. annual average of 8 µg/m3) than an area of high concentrations (e.g. annual average of 30 µg/m3).
```{r}
# 1a function to calculate scaled uncertainty
compute_scaled_y_sd <- function(bne) {
bne <- bne |>
mutate(y_sd_scaled = y_sd / y_mean)
}
bne2015 <- bne2015 |>
compute_scaled_y_sd()
varVec <- c(bne2015$y_sd_scaled)
varRange <- max(varVec) - min(varVec)
varScale <- c(
round(min(varVec), 2), round(min(varVec) + 0.25 * varRange, 2),
round(min(varVec) + 0.5 * varRange, 2), round(min(varVec) + 0.75 * varRange, 2),
round(max(varVec), 2)
)
plotOneParameterSpatial(
dta = bne2015, parameterName = "y_sd_scaled",
mainTitle = "Scaled Uncertainty 2015", valueScale = varScale
)
```
## Closing remarks
In this lab session, we have explored how to run `BNE` to estimate PM$_{2.5}$ predictions and uncertainty given seven input base models and AQS observations.
We first explored input base models' PM$_{2.5}$ predictions for 2015 and compared them against AQS observations over the contiguous US.
We continued by running `BNE` over south western US cities, a subset of the US initial data. We then analysed `BNE` output over the contiguous US.
Obtaining comprehensive uncertainties can be helpful in so many fields, from better inference in health models, to informing monitoring placement, to helping improve individual prediction models, to better understanding environmental justice issues etc. Reporting the weights and/or the uncertainty back to the teams developing these models can create a positive feedback loop where the models are improved, identifying high uncertainty areas can help with future monitoring station placement etc.
Many of the materials used in this lab have been developed during Sebastian's Rowland PhD. Sebastian's publication of this work is under review in a scientific journal.
Please do not distribute or publish works using the data provided in this lab.
`BNE`'s methodology has been recently published and the code can be applied to similar environmental issues.
Please let us know if you are interested in pursuing this avenue.