/
10-model-based.qmd
267 lines (225 loc) · 12.1 KB
/
10-model-based.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
# Model-based clustering {#sec-mclust}
\index{cluster analysis!model-based}
Model-based clustering @FR02 fits a multivariate normal mixture model to the data. It uses the EM algorithm to fit the parameters for the mean, variance-covariance of each population, and the mixing proportion. The variance-covariance matrix is re-parameterised using an eigen-decomposition
$$
\Sigma_k = \lambda_kD_kA_kD_k^\top, ~~~k=1, \dots, g ~~\mbox{(number of clusters)}
$$
\noindent resulting in several model choices, ranging from simple to complex, as shown in `r ifelse(knitr::is_html_output(), '@tbl-covariances-html', '@tbl-covariances-pdf')`.
```{r echo=knitr::is_html_output()}
#| label: mc-libraries
#| message: FALSE
#| code-summary: "Load libraries"
library(dplyr)
library(kableExtra)
library(ggplot2)
library(mclust)
library(mulgar)
library(patchwork)
library(colorspace)
library(tourr)
```
::: {.content-visible when-format="html"}
```{r eval=knitr::is_html_output()}
#| label: tbl-covariances-html
#| tbl-cap: "Parameterizations of the covariance matrix."
#| echo: FALSE
#| message: FALSE
readr::read_csv('misc/mclust-covariances-html.csv') %>%
knitr::kable(align = c('c', 'c', 'c', 'c', 'c', 'c')) %>%
kableExtra::kable_styling(full_width = FALSE)
```
:::
::: {.content-visible when-format="pdf"}
```{r eval=knitr::is_latex_output()}
#| label: tbl-covariances-pdf
#| tbl-cap: "Parameterizations of the covariance matrix."
#| echo: FALSE
#| message: FALSE
readr::read_csv('misc/mclust-covariances-latex.csv') %>%
knitr::kable(align = c('c', 'c', 'c', 'c', 'c', 'c'),
format="latex", booktabs = T,
escape = FALSE) %>%
kableExtra::kable_styling(full_width = FALSE)
```
:::
\noindent Note the distribution descriptions "spherical" and "ellipsoidal". These are descriptions of the shape of the variance-covariance for a multivariate normal distribution. A standard multivariate normal distribution has a variance-covariance matrix with zeros in the off-diagonal elements, which corresponds to spherically shaped data. When the variances (diagonals) are different or the variables are correlated, then the shape of data from a multivariate normal is ellipsoidal.
\index{Bayes Information Criterion (BIC)}
The models are typically scored using the Bayes Information Criterion (BIC), which is based on the log likelihood, number of variables, and number of mixture components. They should also be assessed using graphical methods, as we demonstrate using the penguins data.
## Examining the model in 2D
We start with two of the four real-valued variables (`bl`, `fl`) and the three `species`. The goal is to determine whether model-based methods can discover clusters that closely correspond to the three species. Based on the scatterplot in @fig-penguins-bl-fl we would expect it to do well, and suggest an elliptical variance-covariance of roughly equal sizes as the model.
```{r echo=knitr::is_html_output()}
#| label: fig-penguins-bl-fl
#| fig-cap: "Scatterplot of flipper length by bill length of the penguins data."
#| fig-width: 5
#| fig-height: 5
#| code-summary: "Code to make plot"
load("data/penguins_sub.rda")
ggplot(penguins_sub, aes(x=bl,
y=fl)) + #,
#colour=species)) +
geom_point() +
geom_density2d(colour="#3B99B1") +
theme_minimal() +
theme(aspect.ratio = 1)
```
To draw ellipses in any dimension, a reasonable procedure is to sample points uniformly on a sphere, and then transform this into a sphere using the inverse of the variance-covariance matrix. The `mulgar` function `mc_ellipse()` does this for each cluster in the fitted model.
```{r}
#| label: fig-penguins-bl-fl-mc
#| message: FALSE
#| code-fold: false
#| fig-width: 8
#| fig-height: 4
#| out-width: 100%
#| fig-cap: "Summary plots from model-based clustering: (a) BIC values for clusters 2-9 of top four models, (b) variance-covariance ellipses and cluster means (+) corresponding to the best model. The best model is three-cluster EVE, which has differently shaped variance-covariances albeit the same volume and orientation."
# Fit the model, plot BIC, construct and plot ellipses
penguins_BIC <- mclustBIC(penguins_sub[,c(1,3)])
ggmc <- ggmcbic(penguins_BIC, cl=2:9, top=4) +
scale_color_discrete_divergingx(palette = "Roma") +
ggtitle("(a)") +
theme_minimal()
penguins_mc <- Mclust(penguins_sub[,c(1,3)],
G=3,
modelNames = "EVE")
penguins_mce <- mc_ellipse(penguins_mc)
penguins_cl <- penguins_sub[,c(1,3)]
penguins_cl$cl <- factor(penguins_mc$classification)
ggell <- ggplot() +
geom_point(data=penguins_cl, aes(x=bl, y=fl,
colour=cl),
alpha=0.3) +
geom_point(data=penguins_mce$ell, aes(x=bl, y=fl,
colour=cl),
shape=16) +
geom_point(data=penguins_mce$mn, aes(x=bl, y=fl,
colour=cl),
shape=3, size=2) +
scale_color_discrete_divergingx(palette = "Zissou 1") +
theme_minimal() +
theme(aspect.ratio=1, legend.position="none") +
ggtitle("(b)")
ggmc + ggell + plot_layout(ncol=2)
```
@fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the `mclust` package output using the `ggmcbic()` and `mc_ellipse()` functions from the `mulgar` package.
## Examining the model in high dimensions
Now we will examine how model-based clustering will group the penguins data using all four variables. @fig-penguins-bic shows the summary of different models, of which we would choose the four-cluster VEE, if we strictly followed the BIC choice. `r ifelse(knitr::is_html_output(), '@fig-penguins-mc-html', '@fig-penguins-mc-pdf')` shows this model in a tour, and the best three-cluster model. You can see that the four-cluster result is inadequate, in various ways. One of the species (Chinstrap) does have a bimodal density, due to the two species, and we would expect that a four cluster solution might detect this. The tour shows that the three-cluster solution is the best match to the data shape.
```{r}
#| label: fig-penguins-bic
#| code-fold: false
#| fig-cap: "BIC values for the top models for 2-9 clusters on the penguins data. The interpretation is mixed: if one were to choose three clusters any of the variance-covariance models would be equally as good, but the very best model is the four-cluster VEE."
#| fig-width: 6
#| fig-height: 4
#| fig-align: "center"
penguins_BIC <- mclustBIC(penguins_sub[,1:4])
ggmc <- ggmcbic(penguins_BIC, cl=2:9, top=7) +
scale_color_discrete_divergingx(palette = "Roma") +
theme_minimal()
ggmc
```
```{r}
#| label: best-mclust
#| code-fold: false
penguins_mc <- Mclust(penguins_sub[,1:4],
G=4,
modelNames = "VEE")
penguins_mce <- mc_ellipse(penguins_mc)
penguins_cl <- penguins_sub
penguins_cl$cl <- factor(penguins_mc$classification)
penguins_mc_data <- penguins_cl %>%
select(bl:bm, cl) %>%
mutate(type = "data") %>%
bind_rows(bind_cols(penguins_mce$ell,
type=rep("ellipse",
nrow(penguins_mce$ell)))) %>%
mutate(type = factor(type))
```
```{r echo=knitr::is_html_output()}
#| eval: FALSE
#| label: simpler-mclust
#| code-summary: "Code to make animated gifs"
animate_xy(penguins_mc_data[,1:4],
col=penguins_mc_data$cl,
pch=c(4, 20 )[as.numeric(penguins_mc_data$type)],
axes="off")
load("data/penguins_tour_path.rda")
render_gif(penguins_mc_data[,1:4],
planned_tour(pt1),
display_xy(col=penguins_mc_data$cl,
pch=c(4, 20)[
as.numeric(penguins_mc_data$type)],
axes="off",
half_range = 0.7),
gif_file="gifs/penguins_best_mc.gif",
frames=500,
loop=FALSE)
penguins_mc <- Mclust(penguins_sub[,1:4],
G=3,
modelNames = "EEE")
penguins_mce <- mc_ellipse(penguins_mc)
penguins_cl <- penguins_sub
penguins_cl$cl <- factor(penguins_mc$classification)
penguins_mc_data <- penguins_cl %>%
select(bl:bm, cl) %>%
mutate(type = "data") %>%
bind_rows(bind_cols(penguins_mce$ell,
type=rep("ellipse",
nrow(penguins_mce$ell)))) %>%
mutate(type = factor(type))
animate_xy(penguins_mc_data[,1:4],
col=penguins_mc_data$cl,
pch=c(4, 20)[as.numeric(penguins_mc_data$type)],
axes="off")
# Save the animated gif
load("data/penguins_tour_path.rda")
render_gif(penguins_mc_data[,1:4],
planned_tour(pt1),
display_xy(col=penguins_mc_data$cl,
pch=c(4, 20)[
as.numeric(penguins_mc_data$type)],
axes="off",
half_range = 0.7),
gif_file="gifs/penguins_simpler_mc.gif",
frames=500,
loop=FALSE)
```
::: {.content-visible when-format="html"}
::: {#fig-penguins-mc-html layout-ncol=2}
![Four-cluster VEE](gifs/penguins_best_mc.gif){#fig-penguins-best_mc fig-alt="FIX ME" fig.align="center"}
![Three-cluster EEE](gifs/penguins_simpler_mc.gif){#fig-penguins-simpler_mc fig-alt="FIX ME" fig.align="center"}
Examining the model-based clustering results for the penguins data: (a) best model according to BIC value, (b) simpler three-cluster model. Dots are ellipse points, and "x" are data points. It is important to note that the three cluster solution fits the data better, even though it has a lower BIC.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-penguins-mc-pdf layout-ncol=2}
![Four-cluster VEE](images/penguins_best_mc_60.png){#fig-penguins-best-mc fig-alt="FIX ME" fig.align="center"}
![Three-cluster EEE](images/penguins_simpler_mc_60.png){#fig-penguins-simpler-mc fig-alt="FIX ME" fig.align="center"}
Examining the model-based clustering results for the penguins data: (a) best model according to BIC value, (b) simpler three-cluster model. Dots are ellipse points, and "x" are data points. It is important to note that the three cluster solution fits the data better, even though it has a lower BIC.
:::
:::
::: {.content-visible when-format="html"}
::: info
Using the tour to visualise the final choices of models with similarly high BIC values helps to choose which best fits the data. It may not be the one with the highest BIC value.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{Using the tour to visualise the final choices of models with similarly high BIC values helps to choose which best fits the data. It may not be the one with the highest BIC value.}
:::
```{r}
#| eval: FALSE
#| echo: FALSE
# Its surprising that the four cluster solution didn't
# pick up the sexes of the Chinstraps, where there is
# some bimodality. Instead it splits the well-separated
# group into two! Something to raise in the recap chapter.
penguins_cl %>%
count(species, cl) %>%
pivot_wider(names_from=cl,
values_from=n,
values_fill=0)
```
## Exercises {-}
1. Examine the three cluster EVE, VVE and VEE models with the tour, and explain whether these are distinguishably different from the EEE three cluster model.
2. Fit model-based clustering to the the `clusters`. Does it suggest the data has three clusters? Using the tour examine the best model model. How well does this fit the data?
3. Fit model-based clustering to the the `multicluster`. Does it suggest the data has six clusters? Using the tour examine the best model model. How well does this fit the data?
4. Fit model-based clustering to the `fake_trees` data. Does it suggest that the data has 10 clusters? If not, why do you think this is? Using the tour examine the best model model. How well does this fit the branching structure?
5. Try fitting model-based clustering to the `aflw` data? What is the best model? Is the solution related to offensive vs defensive vs mid-fielder skills?
6. Use model-based clustering on the challenge data sets, `c1`-`c7` from the `mulgar` package. Explain why or why not the best model fits the cluster structure or not.