-
Notifications
You must be signed in to change notification settings - Fork 0
/
5-nldr.qmd
279 lines (214 loc) · 14.1 KB
/
5-nldr.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
## Non-linear dimension reduction
## Explanation of NLDR methods
Non-linear dimension reduction (NLDR) aims to find a low-dimensional representation of the high-dimensional data that shows the main features of the data. In statistics, it dates back to the work of @Kr64a on multidimensional scaling (MDS). Some techniques only require an interpoint similarity or distance matrix as the main ingredient, rather than the full data. We'll focus on when the full data is available here, so we can also compare structure perceived using the tour on the high-dimensional space, relative to structure revealed in the low-dimensional embedding.
There are many methods available for generating non-linear low dimensional representations of the data. Classically, MDS minimises some function of the difference between two interpoint distance matrices, the distance between points in the high-dimensions, and in the low-dimensional representations.
$$
\mbox{Stress}_D(x_1, ..., x_n) = \left(\sum_{i, j=1; i\neq j}^n (d_{ij} - d_k(i,j))^2\right)^{1/2}
$$
where $D$ is an $n\times n$ matrix of distances $(d_{ij})$ between all pairs of points, and $d_k(i,j)$ is the distance between the points in the low-dimensional space. PCA is a special case of MDS. The result from PCA is a linear projection, but generally MDS can provide non-linear transformations to represent unusual high-dimensional patterns. A good resource for learning about MDS is @BG05.
\index{dimension reduction!t-SNE}
\index{dimension reduction!UMAP}
```{r echo=knitr::is_html_output()}
#| label: fig-nldr-clusters
#| fig-cap: "Two non-linear embeddings of the non-linear clusters data: (a) t-SNE, (b) UMAP. Both suggest four clusters, with two being non-linear in some form."
#| fig-alt: FIXME
#| fig-width: 8
#| fig-height: 4
#| message: false
#| code-summary: "Code to generate the 2D non-linear representation"
library(mulgar)
library(Rtsne)
library(uwot)
library(ggplot2)
library(patchwork)
set.seed(42)
cnl_tsne <- Rtsne(clusters_nonlin)
cnl_umap <- umap(clusters_nonlin)
n1 <- ggplot(as.data.frame(cnl_tsne$Y), aes(x=V1, y=V2)) +
geom_point() +
ggtitle("(a) t-SNE") +
theme_minimal() +
theme(aspect.ratio=1)
n2 <- ggplot(as.data.frame(cnl_umap), aes(x=V1, y=V2)) +
geom_point() +
ggtitle("(b) UMAP") +
theme_minimal() +
theme(aspect.ratio=1)
n1 + n2
```
Popular methods in current use for NLDR include t-SNE [@Maaten2008] and UMAP [@McInnes2018]. The approach of t-SNE is to compare interpoint distances with a standard probability distribution (eg $t$-distribution) to exaggerate local neighbourhood differences. UMAP compares the interpoint distances with what might be expected if the data was uniformly distributed in the high-dimensions.
@fig-nldr-clusters shows two NLDR views of the `clusters_nonlin` data set from the `mulgar` package. Both suggest that there are four clusters, and that some clusters are non-linearly shaped. They disagree on the type of non-linear pattern, where t-SNE represents one cluster as a wavy-shape and UMAP both have a simple parabolic shape.
```{r}
#| eval: false
#| echo: false
#| code-summary: "Code to create animated gif"
library(tourr)
render_gif(clusters_nonlin,
grand_tour(),
display_xy(),
gif_file = "gifs/clusters_nonlin.gif",
frames = 500,
width = 300,
height = 300)
```
::: {.content-visible when-format="html"}
![Grand tour of the nonlinear clusters data set, shows four clusters. Two are very small and spherical in shape. One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape.](gifs/clusters_nonlin.gif){#fig-clusters-nonlin-html}
:::
::: {.content-visible when-format="pdf"}
::: {#fig-clusters-nonlin-pdf layout-ncol=2}
![](images/clusters_nonlin_60.png){width=220}
![](images/clusters_nonlin_233.png){width=220}
Two frames from a grand tour of the nonlinear clusters data set, shows four clusters. Two are very small and spherical in shape. One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape. {{< fa play-circle >}}
:::
:::
The full 4D data is shown with a grand tour in `r ifelse(knitr::is_html_output(), '@fig-clusters-nonlin-html', '@fig-clusters-nonlin-pdf')`. The four clusters suggested by the NLDR methods can be seen. We also get a better sense of the relative size and proximity of the clusters. There are two small spherical clusters, one quite close to the end of the large sine wave cluster. The fourth cluster is relatively small, and has a slight curve, like a bent rod. The t-SNE representation is slightly more accurate than the UMAP representation. We would expect that the wavy cluster is the sine wave seen in the tour.
::: {.content-visible when-format="html"}
::: info
NLDR can provide useful low-dimensional summaries of high-dimensional structure but you need to check whether it is a sensible and accurate representation by comparing with what is perceived from a tour.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{NLDR can provide useful low-dimensional summaries of high-dimensional structure but you need to check whether it is a sensible and accurate representation by comparing with what is perceived from a tour.}
:::
## Assessing reliability of the NLDR representation
NLDR can produce useful low-dimensional summaries of structure in high-dimensional data, like those shown in @fig-nldr-clusters. However, there are numerous pitfalls. The fitting procedure can produce very different representations depending on the parameter choices, and even the random number seeding the fit. (You can check this by changing the `set.seed` in the code above, and by changing from the default parameters.) Also, it may not be possible to represent the high-dimensional structures faithfully in low dimensions. For these reasons, one needs to connect the NLDR view with a tour of the data, to help assess its usefulness and accuracy. For example, with this data, we would want to know which of the two curved clusters in the UMAP representation correspond to the sine wave cluster.
### Using `liminal`
\index{liminal}
@fig-liminal-clusters-nonlin shows how the NLDR plot can be linked to a tour view, using the `liminal` package, to better understand how well the structure of the data is represented. Here we learn that the smile in the UMAP embedding is the small bent rod cluster, and that the unibrow is the sine wave.
```{r}
#| message: FALSE
#| eval: false
#| code-fold: false
library(liminal)
umap_df <- data.frame(umapX = cnl_umap[, 1],
umapY = cnl_umap[, 2])
limn_tour_link(
umap_df,
clusters_nonlin,
cols = x1:x4
)
```
::: {#fig-liminal-clusters-nonlin layout-ncol=1}
![Smile matches bent rod.](images/liminal-clusters-nonlin1.png){#fig-smile}
![Unibrow matches sine wave.](images/liminal-clusters-nonlin2.png){#fig-unibrow}
Two screenshots from liminal showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster. The unibrow matches to the sine wave cluster.
:::
### Using `detourr`
\index{detourr}
@fig-detourr-clusters-nonlin shows how the linking is achieved using `detourr`. It uses a shared data object, as made possible by the `crosstalk` package, and the UMAP view is made interactive using `plotly`.
```{r}
#| message: false
#| eval: false
#| code-fold: false
library(detourr)
library(dplyr)
library(crosstalk)
library(plotly)
umap_df <- data.frame(umapX = cnl_umap[, 1],
umapY = cnl_umap[, 2])
cnl_df <- bind_cols(clusters_nonlin, umap_df)
shared_cnl <- SharedData$new(cnl_df)
detour_plot <- detour(shared_cnl, tour_aes(
projection = starts_with("x"))) |>
tour_path(grand_tour(2),
max_bases=50, fps = 60) |>
show_scatter(alpha = 0.7, axes = FALSE,
width = "100%", height = "450px")
umap_plot <- plot_ly(shared_cnl,
x = ~umapX,
y = ~umapY,
color = I("black"),
height = 450) %>%
highlight(on = "plotly_selected",
off = "plotly_doubleclick") %>%
add_trace(type = "scatter",
mode = "markers")
bscols(
detour_plot, umap_plot,
widths = c(5, 6)
)
```
![Screenshot from detourr showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster.](images/detourr-clusters-nonlin.png){#fig-detourr-clusters-nonlin}
## Example: `fake_trees`
\index{data!fake trees}
@fig-liminal-trees shows a more complex example, using the `fake_trees` data. We know that the 10D data has a main branch, and 9 branches (clusters) attached to it, based on our explorations in the earlier chapters. The t-SNE view, where points are coloured by the known branch ids, is very helpful for seeing the linear branch structure.
What we can't tell is that there is a main branch from which all of the others extend. We also can't tell which of the clusters corresponds to this branch. Linking the plot with a tour helps with this. Although, not shown in the sequence of snapshots in @fig-liminal-trees, the main branch is actually the dark blue cluster, which is separated into three pieces by t-SNE.
```{r}
#| message: false
#| eval: false
#| code-summary: "Code to run liminal on the fake trees data"
library(liminal)
library(Rtsne)
data(fake_trees)
set.seed(2020)
tsne <- Rtsne::Rtsne(
dplyr::select(fake_trees,
dplyr::starts_with("dim")))
tsne_df <- data.frame(tsneX = tsne$Y[, 1],
tsneY = tsne$Y[, 2])
limn_tour_link(
tsne_df,
fake_trees,
cols = dim1:dim10,
color = branches
)
```
::: {#fig-liminal-trees layout-ncol=1}
![Linked views of t-SNE dimension reduction with a tour of the fake trees data. The t-SNE view clearly shows ten 1D non-linear clusters, while the tour of the full 100 variables suggests a lot more variation in the data, and less difference between clusters. ](images/fake_trees1.png){#fig-trees1 width=300}
![Focus on the green cluster which is split by t-SNE. The shape as viewed in many linear projections shown by the tour shows that it is a single curved cluster. The split is an artifact of the t-SNE mapping.](images/fake_trees2.png){#fig-trees2 width=300}
![Focus on the purple cluster which splits the green cluster in the t-SNE view. The tour shows that these two clusters are distinct, but are close in one neighbourhood of the 100D space. The close proximity in the t-SNE view is reasonable, though.](images/fake_trees3.png){#fig-trees3 width=300}
Three snapshots of using the `liminal` linked views to explore how t-SNE has summarised the `fake_trees` data in 2D.
:::
::: {.content-visible when-format="html"}
::: insight
The t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches.
:::
:::
::: {.content-visible when-format="pdf"}
\insightbox{The t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches. }
:::
## Exercises {-}
1. This question uses the `penguins_sub` data
a. Generate a 2D representation using t-SNE. Plot the points mapping the colour to species. What is most surprising? (Hint: Are the three species represented by three distinct clusters?)
b. Re-do the t-SNE representation with different parameter choices, including using different random seeds. Are the results different each time, or do you think that they could be considered to be equivalent?
c. Use `liminal` or `detourr` to link the t-SNE representation to a tour of the penguins. Highlight the points that have been placed in an awkward position by t-SNE from others in their species. Watch them relative to the others in their species in the tour view, and think about whether there is any rationale for the awkward placement.
d. Try again using UMAP to make the 2D representation, and use `liminal` or `detourr` to link with a tour to explore the result.
2. Conduct your best t-SNE and UMAP representations of the `aflw` data. Compare and contrast what is learned relative to a tour or the principal component analysis.
```{r}
#| label: penguins-tsne
#| message: false
#| eval: false
#| echo: false
load("data/penguins_sub.rda")
set.seed(2022)
p_tsne <- Rtsne::Rtsne(penguins_sub[,2:5])
p_tsne_df <- data.frame(tsneX = p_tsne$Y[, 1], tsneY = p_tsne$Y[, 2])
limn_tour_link(
p_tsne_df,
penguins_sub,
cols = bl:bm,
color = species
)
# The t-SNE mapping of the penguins data inaccurately splits one of the clusters. The three clusters are clearly distinct when viewed with the tour.
```
## Project {-}
Gene expressions measured as scRNA-Seq of 2622 human peripheral blood mononuclear cells data is available from the `Seurat` R package [@seurat1, @seurat2, @seurat3, @seurat4]. The paper web site has code to extract and pre-process the data, which follow the tutorial at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html. The processed data, containing the first 50 PCs is provided with the book, as `pbmc_pca_50.rds`.
The original paper [@chen2023] used UMAP on the first 15 PCs to find a representation of the data to illustrate the clustering. They used the default settings of the `RunUMAP()` function in `Seurat`, without setting a seed.
Generate the t-SNE and UMAP representations of the first 9 PCs of data, using their default settings. They should be quite different. (We use 9 PCs because the scree plot in the data pre-processing suggests that 15 is too many.) Based on your examination of the data in a tour, which method yields the more accurate representation? Explain what the structure in the 2D is relative to that seen in the tour.
```{r}
#| label: pbmc
#| message: false
#| eval: false
#| echo: false
pbmc <- readRDS("data/pbmc_pca_50.rds")
# t-SNE
set.seed(1041)
p_tsne <- Rtsne::Rtsne(pbmc[,1:15])
p_tsne_df <- data.frame(tsneX = p_tsne$Y[, 1], tsneY = p_tsne$Y[, 2])
ggplot(p_tsne_df, aes(x=tsneX, y=tsneY)) + geom_point()
# UMAP
set.seed(1045)
p_umap <- uwot::umap(pbmc[,1:15])
p_umap_df <- data.frame(umapX = p_umap[, 1], umapY = p_umap[, 2])
ggplot(p_umap_df, aes(x=umapX, y=umapY)) + geom_point()
```