/
6-intro-clust.qmd
362 lines (297 loc) · 18.5 KB
/
6-intro-clust.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
# Introduction to clustering
Unsupervised classification, or cluster analysis, organizes observations into similar groups. Cluster analysis is a commonly used, appealing, and conceptually intuitive statistical method. Some of its uses include market segmentation, where customers are grouped into clusters with similar attributes for targeted marketing; gene expression analysis, where genes with similar expression patterns are grouped together; and the creation of taxonomies for animals, insects, or plants. Clustering can be used as a way of reducing a massive amount of data because observations within a cluster can be summarized by its centre. Also, clustering effectively subsets the data thus simplifying analysis because observations in each cluster can be analyzed separately.
## What are clusters?
Organizing objects into groups is a common task to help make sense of the world around us. Perhaps this is why it is an appealing method of data analysis. However, cluster analysis is more complex than it initially appears. Many people imagine that it will produce neatly separated clusters like those in @fig-ideal-clusters(a), but it almost never does. Such ideal clusters are rarely encountered in real data, so we often need to modify our objective from *find the natural clusters in this data*. Instead, we need to organize the *cases into groups that are similar in some way*. Even though this may seem disappointing when compared with the ideal, it is still often an effective means of simplifying and understanding a dataset.
::: {.content-visible when-format="html"}
::: info
Knowing what shapes are in your data helps to decide on the best method and to diagnose the result. For example, if the clusters are elliptical model-based clustering is recommended.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{Knowing what shapes are in your data helps to decide on the best method and to diagnose the result. For example, if the clusters are elliptical model-based clustering is recommended.}
:::
```{r}
#| label: fig-ideal-clusters
#| echo: FALSE
#| fig-width: 5
#| fig-height: 5
#| out-width: "100%"
#| fig-cap: "Different structures in data impact cluster analysis. When there are well-separated groups (a), it is simple to group similar observations. Even when there are not, partitioning observations into groups may still be useful. There may be nuisance observations (b) or nuisance variables (c) that affect the interpoint distance calculations and distract the clustering algorithm, and there may oddly shaped clusters (d) which are hard to numerically describe."
#| message: false
stdd <- function(x) (x-mean(x))/sd(x)
set.seed(20230513)
d_sep_cl <- matrix(rnorm(99*2), ncol=2)
d_sep_cl[1:33,1] <-
d_sep_cl[1:33,1]+8
d_sep_cl[34:66,2] <-
d_sep_cl[34:66,2]+8
d_sep_cl[34:66,1] <-
d_sep_cl[34:66,1]+4
d_sep_cl <- data.frame(x1=stdd(d_sep_cl[,1]),
x2=stdd(d_sep_cl[,2]))
x <- (runif(20)-0.5)*4
y <- x
d_nuis_pts <- data.frame(x1 = stdd(c(rnorm(50, -3),
rnorm(50, 3), x)),
x2 = stdd(c(rnorm(50, -3),
rnorm(50, 3), y)))
d_nuis_vars <- matrix(rnorm(99*2), ncol=2)
d_nuis_vars[1:49,1] <- d_nuis_vars[1:49,1]+8
d_nuis_vars <-
data.frame(x1=stdd(d_nuis_vars[,1]),
x2=stdd(d_nuis_vars[,2]))
d_odd_shapes <- matrix(rnorm(99*2),ncol=2)
d_odd_shapes[1:66,2] <- (d_odd_shapes[1:66,1])^2-5 + rnorm(66)*0.6
d_odd_shapes[1:66,1] <- d_odd_shapes[1:66,1]*3
d_odd_shapes <-
data.frame(x1=stdd(d_odd_shapes[,1]),
x2=stdd(d_odd_shapes[,2]))
library(ggplot2)
library(patchwork)
p1 <- ggplot(d_sep_cl, aes(x=x1, y=x2)) +
geom_point(colour="#3B99B1", alpha=0.7) +
annotate("text", -2.5, 2.5, label="a") +
xlim(-2.8, 2.8) + ylim(-2.8, 2.8) +
theme(aspect.ratio=1,
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.background = element_rect("white"),
panel.border = element_rect("black", fill=NA,
linewidth = 0.5))
p2 <- ggplot(d_nuis_pts, aes(x=x1, y=x2)) +
geom_point(colour="#3B99B1", alpha=0.7) +
annotate("text", -2.5, 2.5, label="b") +
xlim(-2.8, 2.8) + ylim(-2.8, 2.8) +
theme(aspect.ratio=1,
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.background = element_rect("white"),
panel.border = element_rect("black", fill=NA,
linewidth = 0.5))
p3 <- ggplot(d_nuis_vars, aes(x=x1, y=x2)) +
geom_point(colour="#3B99B1", alpha=0.7) +
annotate("text", -2.5, 2.5, label="c") +
xlim(-2.8, 2.8) + ylim(-2.8, 2.8) +
theme(aspect.ratio=1,
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.background = element_rect("white"),
panel.border = element_rect("black", fill=NA,
linewidth = 0.5))
p4 <- ggplot(d_odd_shapes, aes(x=x1, y=x2)) +
geom_point(colour="#3B99B1", alpha=0.7) +
annotate("text", -2.5, 2.5, label="d") +
xlim(-2.8, 2.8) + ylim(-2.8, 2.8) +
theme(aspect.ratio=1,
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.background = element_rect("white"),
panel.border = element_rect("black", fill=NA,
linewidth = 0.5))
print(p1 + p2 + p3 + p4 + plot_layout(ncol=2))
```
At the heart of the clustering process is the work of discovering which variables are most important for defining the groups. It is often true that we only require a subset of the variables for finding clusters, whereas another subset (called *nuisance variables*) has no impact. In the bottom left plot of @fig-ideal-clusters, it is clear that the variable plotted horizontally is important for splitting this data into two clusters, whereas the variable plotted vertically is a nuisance variable. Nuisance is an apt term for these variables, because they can radically change the interpoint distances and impair the clustering
process. \index{cluster analysis!interpoint distance}
\index{cluster analysis!nuisance variable}
Dynamic graphical methods help us to find and understand the cluster structure in high dimensions. With the tools in our toolbox, primarily tours, along with linked scatterplots and parallel coordinate plots, we can see clusters in high-dimensional spaces. We can detect gaps between clusters, the shape and relative positions of clusters, and the presence of nuisance variables. We can even find unusually shaped clusters, like those in the bottom right plot in @fig-ideal-clusters. In simple
situations we can use graphics alone to group observations into clusters, using a "spin and brush" method. In more difficult data problems, we can assess and refine numerical solutions using graphics.\index{brushing!persistent}
\index{cluster analysis!spin-and-brush}
This part of the book discusses the use of interactive and dynamic graphics in the clustering of data. @sec-clust-bg introduces cluster analysis, focusing on interpoint distance measures. @sec-clust-graphics describes an example of a purely graphical approach to cluster analysis, the spin and brush method. In the example shown in that section, we were able to find simplifications of the data that had not been found using numerical clustering methods, and to find a variety of structures in high-dimensional space. @sec-hclust describes methods for reducing the interpoint distance matrix to an intercluster distance matrix using hierarchical algorithms, in @sec-kmeans shows the the $k$-means algorithm, @sec-mclust covers model-based clustering, and @sec-som describes clustering with self-organising maps. Each of these chapters shows how graphical tools can be used to assess the results of numerical methods. @sec-clust-compare summarizes these chapters and revisits the data analysis strategies used in the examples. Additional references that provide good companions to the material presented in these chapters are @VR02, @HOML, @hennig, @giordani, @kassambara, and the CRAN Task View [@ctv-clustering].
## The importance of defining similar {#sec-clust-bg}
Before we can begin finding groups of cases that are similar[^3], we need to decide how to define or measure whether they are close together or far apart. Consider a dataset with three cases $(a_1, a_2, a_3)$ and four variables $(V_1, V_2, V_3, V_4)$, described in matrix format as
[^3]: Both *similarity* and *dissimilarity* measures are used for defining how similar cases are. It can be confusing! They measure similar in opposite directions. With a dissimilarity measure, a smaller number means the cases are closer, as in a distance metric. A similarity measure usually ranges between 0 and 1, with 1 indicating that the cases are closer, for example, correlation.
\begin{align*}
X = \begin{bmatrix}
& {\color{grey} V_1} & {\color{grey} V_2} & {\color{grey} V_3} & {\color{grey} V_4} \\\hline
{\color{grey} a_1} | & x_{11} & x_{12} & x_{13} & x_{14} \\
{\color{grey} a_2} | & x_{21} & x_{22} & x_{23} & x_{24} \\
{\color{grey} a_3} | & x_{31} & x_{32} & x_{33} & x_{34}
\end{bmatrix}
= \begin{bmatrix}
& {\color{grey} V_1} & {\color{grey} V_2} & {\color{grey} V_3} & {\color{grey} V_4} \\\hline
{\color{grey} a_1} | & 7.3 & 7.6 & 7.7 & 8.0 \\
{\color{grey} a_2} | & 7.4 & 7.2 & 7.3 & 7.2 \\
{\color{grey} a_3} | & 4.1 & 4.6 & 4.6 & 4.8
\end{bmatrix}
\end{align*}
\noindent which is plotted in @fig-similarity1. The Euclidean distance between two cases (rows of the matrix) with $p$ elements is defined as
\begin{align*}
d_{\rm Euc}(a_i,a_j) &=& ||a_i-a_j|| %\\
% &=& \sqrt{(x_{i1}-x_{j1})^2+\dots + (x_{ip}-x_{jp})^2},
~~~~~~i,j=1,\dots, n,
\end{align*}
\noindent where $||x_i||=\sqrt{x_{i1}^2+x_{i2}^2+\dots +x_{ip}^2}$. For example, the Euclidean distance between cases 1 and 2 in the above data, is
\begin{align*}
d_{\rm Euc}(a_1,a_2) &= \sqrt{(7.3-7.4)^2+(7.6-7.2)^2+ (7.7-7.3)^2+(8.0-7.2)^2} \\
&= 1.0
\end{align*}
\index{cluster analysis!interpoint distance}
\noindent For the three cases, the interpoint Euclidean distance matrix is
::: {.content-visible when-format="html"}
::: {.hidden}
$$
\require{mathtools}
\definecolor{grey}{RGB}{192, 192, 192}
$$
:::
:::
\begin{align*}
d_{\rm Euc} = \begin{bmatrix}
& {\color{grey} a_1} & {\color{grey} a_2} & {\color{grey} a_3} \\\hline
{\color{grey} a_1} | & 0.0 & 1.0 & 6.3 \\
{\color{grey} a_2} | & 1.0 & 0.0 & 5.5 \\
{\color{grey} a_3} | & 6.3 & 5.5 & 0.0
\end{bmatrix}
\end{align*}
::: {#fig-similarity1 layout-ncol=2}
```{r echo=knitr::is_html_output()}
#| message: FALSE
#| warning: false
#| fig-width: 4
#| fig-height: 4
#| code-summary: "Code for plot"
x <- data.frame(V1 = c(7.3, 7.4, 4.1),
V2 = c(7.6, 7.2, 4.6),
V3 = c(7.7, 7.3, 4.6),
V4 = c(8.0, 7.2, 4.8),
point = factor(c("a1", "a2", "a3")))
library(GGally)
library(colorspace)
library(gridExtra)
pscat <- ggpairs(x, columns=1:4,
upper=list(continuous="points"),
diag=list(continuous="blankDiag"),
axisLabels="internal",
ggplot2::aes(colour=point)) +
scale_colour_discrete_divergingx(
palette = "Zissou 1", nmax=4) +
xlim(3.7, 8.5) + ylim(3.7, 8.5) +
theme_minimal() +
theme(aspect.ratio=1)
pscat
```
```{r echo=knitr::is_html_output()}
#| fig-width: 3
#| fig-height: 3.4
#| code-summary: "Code for plot"
ppar <- ggparcoord(x, columns=1:4,
groupColumn = 5,
scale = "globalminmax") +
scale_colour_discrete_divergingx(
palette = "Zissou 1", nmax=4) +
xlab("") + ylab("") +
theme_minimal() +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
legend.title = element_blank())
ppar
```
The scatterplot matrix (left) shows that cases $a_1$ and $a_2$ have similar values. The parallel coordinate plot (right) allows a comparison of other structure, which shows the similarity in the trend of the profiles on cases $a_1$ and $a_3$.
:::
\noindent Cases $a_1$ and $a_2$ are more similar to each other than they are to case $a_3$, because the Euclidean distance between cases $a_1$ and $a_2$ is much smaller than the distance between cases $a_1$ and $a_3$ and between cases $a_2$ and $a_3$.
There are many different ways to calculate similarity. Similarity measures based on correlation distance can be useful. It is typically used where similarity of structure or shape is more important than similarity in magnitude.
\index{parallel coordinate plot}
As an example, see the parallel coordinate plot of the sample data at the right of @fig-similarity1. Cases $a_1$ and $a_3$ are widely separated, but their shapes are similar (low, medium, medium, high). Case $a_2$, although
overlapping with case $a_1$, has a very different shape (high, medium, medium, low). The Pearson correlation between two cases, $\rho(a_i,a_j)$, is defined as
\begin{align*}
\rho(a_i,a_j) = \frac{(a_i-c_i)^\top(a_j-c_j)}
{\sqrt{(a_i-c_i)^\top(a_i-c_i)} \sqrt{(a_j-c_j)^\top(a_j-c_j)}}
\label{corc}
\end{align*}
\noindent Typically, $c_i, c_j$ are the sample means of each case, $\bar{a}_i,\bar{a}_j$. For these three observations, $c_1=\bar{a}_1=7.650, c_2=\bar{a}_2=7.275, c_3=\bar{a}_3=4.525$. An interesting geometric fact, is that if $c_i, c_j$ are set to be 0, as is commonly done, $\rho$ is a generalized correlation that describes the angle between the two data vectors. The correlation is then converted to a distance metric, with one possibility being as follows:
\begin{align*}
d_{\rm Cor}(a_i,a_j) = \sqrt{2(1-\rho(a_i,a_j))}
\end{align*}
This distance metric will treat cases that are strongly negatively correlated as the most distant. If you want to consider strong negative correlation as close, then you could take the absolute value of $\rho(a_i,a_j)$ in the above equation, and remove the multiplication by 2.
The interpoint distance matrix for the sample data using $d_{\rm Cor}$ and the Pearson correlation coefficient is
\begin{align*}
d_{\rm Cor} = \begin{bmatrix}
& {\color{grey} a_1} & {\color{grey} a_2} & {\color{grey} a_3} \\\hline
{\color{grey} a_1} | & 0.0 & 3.6 & 0.1 \\
{\color{grey} a_2} | & 3.6 & 0.0 & 3.8 \\
{\color{grey} a_3} | & 0.1 & 3.8 & 0.0
\end{bmatrix}
\end{align*}
\noindent By this metric, cases $a_1$ and $a_3$ are the most similar, because the correlation distance is smaller between these two cases than the other pairs of cases. \index{cluster analysis!interpoint distance}
Note that these interpoint distances differ dramatically from those for Euclidean distance. As a consequence, the way the cases would be clustered is also very different. Choosing the appropriate distance measure is an important part of a cluster analysis.
After a distance metric has been chosen and a cluster analysis has been performed, the analyst must evaluate the results, and this is actually a difficult task. A cluster analysis does not generate $p$-values or other numerical criteria, and the process tends to produce hypotheses rather than testing them. Even the most determined attempts to produce the "best" results using modeling and validation techniques may result in clusters that, although seemingly significant, are useless for practical purposes. As a result, cluster analysis is best thought of as an exploratory technique, and it can be quite useful despite the lack of formal validation because of its power in data simplification.
::: {.content-visible when-format="html"}
::: info
Defining an appropriate distance metric from the context of the problem is a most important decision. For example, if your variables are all numeric, and on the same scale then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{Defining an appropriate distance metric from the context of the problem is a most important decision. For example, if your variables are all numeric, and on the same scale then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance.}
:::
The context in which the data arises is the key to assessing the results. If the clusters can be characterized in a sensible manner, and they increase our knowledge of the data, then we are on the right track. To use an even more pragmatic criterion, if a company can gain an economic advantage by using a particular clustering method to carve up their customer database, then that is the method they should use.
## Exercises {-}
Use the following data to answer these questions:
```{r}
#| echo: false
library(mulgar)
vc <- matrix(c(1,0.8,0.8,0.8,1,0.8,0.8,0.8,1), ncol=3, byrow=TRUE)
set.seed(9044)
x <- data.frame(matrix(0, 4, 3))
x[1,] <- c(0.13, 0.21, 0.09)
x[2,] <- c(0.91, 0.95, 0.85)
x[3,] <- c(0.62, 0.73, 0.65)
x[4,] <- c(0.21, 0.92, 0.43)
rownames(x) <- paste0("a", 1:4)
colnames(x) <- paste0("x", 1:3)
x
```
1. Compute the Euclidean distance between cases `a1`, `a2`, `a3`, `a4`.
2. Compute the correlation distance (as defined above) between cases `a1`, `a2`, `a3`, `a4`.
3. Which two points have the (a) biggest (b) smallest Mahalanobis (statistical) distance, assuming that the covariance matrix is:
```{r}
#| echo: false
rownames(vc) <- paste0("x", 1:3)
colnames(vc) <- paste0("x", 1:3)
vc
```
(The base function `mahalanobis()` will calculate this in R. Technically this gives distance between each case and the mean vector.)
4. Is the ordering of distance between cases the same if Manhattan distance is used instead of Euclidean?
5. Compute the Chebychev distance between cases `a1`, `a2`, `a3`, `a4`.
6. Compute Bray-Curtis distance between cases `a1`, `a2`, `a3`, `a4`.
7. Make a plot of the data, and write a paragraph describing how the different distance metrics agree and disagree on how close or far the cases are from each other.
```{r}
#| eval: false
#| echo: false
# Answers
library(patchwork)
# Euclidean
dist(x)
# Mahalanobis
mahalanobis(x, center=FALSE, cov=vc)
# manhattan
dist(x, "manhattan")
# Chebychev
d_ch <- matrix(0, 4, 4)
for (i in 1:4) {
for (j in 1:4) {
if (i != j)
d_ch[i,j] <- abdiv::chebyshev(x[i,], x[j,])
}}
as.dist(d_ch)
# Bray-Curtis
vegdist(x, "bray")
# Plot
x <- data.frame(x)
p1 <- ggplot(x) + geom_text(aes(x=x1, y=x2,
label=rownames(x))) +
geom_abline(slope=1, intercept=0) +
theme(aspect.ratio=1)
p2 <- ggplot(x) + geom_text(aes(x=x1, y=x3,
label=rownames(x))) +
geom_abline(slope=1, intercept=0) +
theme(aspect.ratio=1)
p3 <- ggplot(x) + geom_text(aes(x=x2, y=x3,
label=rownames(x))) +
geom_abline(slope=1, intercept=0) +
theme(aspect.ratio=1)
p1+p2+p3+plot_layout(ncol=3)
```