/
07-sle_cell_type_recount2_model.Rmd
469 lines (376 loc) · 15.3 KB
/
07-sle_cell_type_recount2_model.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
---
title: "Banchereau, et al. cell type analyses with recount2 model"
output:
html_notebook:
toc: true
toc_float: true
---
**J. Taroni 2018**
In `06-sle-wb_cell_type`, we examined how LVs from the SLE WB PLIER model
(`results/05/SLE-WB_PLIER_model.RDS`) compared to known patterns in the
[Banchereau, et al.](https://doi.org/10.1016/j.cell.2016.03.008) data set that
was included in the SLE WB compendium (and therefore, training).
Specifically, we asked:
* Are there neutrophil-associated LVs that are correlated with neutrophil
counts?
* Are there plasma cell-associated LVs that demonstrate differential expression
between disease activity groups (specifically, higher values in more severe
disease)?
The answer to both of these questions is yes.
This demonstrated the utility of PLIER for extracting cell type patterns from
heterogeneous (at least in terms of patient population and platform) gene
expression data from a single tissue (whole blood) that had been assembled into
a compendium ([`greenelab/rheum-plier-data/sle-wb`](https://github.com/greenelab/rheum-plier-data/tree/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb)).
SLE is not a rare condition -- though the molecular heterogeneity and
multi-tissue nature of the disease warrant sophisticated approaches to the
analysis of transcriptomic data.
(This SLE WB compendium is comprised of 1640 samples and is _incomplete_; we
picked 7 data sets based on the platform used and the availability of raw data.)
Thus, we can use data-intensive approaches (e.g., PLIER) that are inappropriate
for smaller data sets and, by extension, rare or understudied diseases.
Using a "shovel-ready" data set like [recount2](https://jhubiostatistics.shinyapps.io/recount/)
to train a PLIER model (see [`greenelab/rheum-plier-data/recount2`](https://github.com/greenelab/rheum-plier-data/tree/4be547553f24fecac9e2f5c2b469a17f9df253f0/recount2))
and applying this model to smaller data sets may "get around" the data-intensive
requirement, but more investigation is warranted.
In this notebook, we examine whether we can use the recount2 PLIER model
(`data/recount2_PLIER_data/recount_PLIER_model.RDS`) to analyze neutrophil and
plasma cell patterns in the Banchereau, et al. data.
If the recount2 model performance is similar to that of the SLE model
(examined in `06-sle-wb_cell_type`), that is evidence supporting the validity
of this transfer learning approach.
## Functions and directory set up
```{r}
`%>%` <- dplyr::`%>%`
library(PLIER)
# custom functions
source(file.path("util", "plier_util.R"))
```
```{r}
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "07")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "07")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```
## Load data
### recount2 model
```{r}
recount.plier <- readRDS(file.path("data", "recount2_PLIER_data",
"recount_PLIER_model.RDS"))
```
### SLE WB expression data
```{r}
symbol.file <-
file.path("data", "expression_data",
"SLE_WB_all_microarray_QN_zto_before_with_GeneSymbol.pcl")
sle.exprs.df <- readr::read_tsv(symbol.file, progress = FALSE)
# matrix with gene symbol as rownames
exprs.mat <- dplyr::select(sle.exprs.df, -EntrezID)
rownames(exprs.mat) <- exprs.mat$GeneSymbol
exprs.mat <- as.matrix(dplyr::select(exprs.mat, -GeneSymbol))
```
### SLE WB cell type LV dfs
From `06-sle-wb_cell_type`
```{r}
neutro.file <- file.path("results", "06",
"Banchereau_count_SLE_model_neutrophil_LV.tsv")
sle.neutro.df <- readr::read_tsv(neutro.file)
plasma.file <- file.path("results", "06",
"Banchereau_DA_group_SLE_model_plasma_cell_LVs.tsv")
sle.plasma.df <- readr::read_tsv(plasma.file)
```
## Cell type-associated LVs in the recount2 model
```{r}
# summary information from the recount2 PLIER model
recount.summary <- recount.plier$summary
```
### Neutrophil
Are there recount2 LVs associated with neutrophil gene sets?
```{r}
recount.summary %>%
dplyr::filter(grepl("Neutrophil", pathway),
FDR < 0.05)
```
The following latent variables look most promising: `LV524`, `LV603`, `LV985`
(`AUC > 0.75` for both `SVM Neutrophils` and `IRIS_Neutrophil-Resting`)
What other gene sets are associated with these LVs?
```{r}
recount.summary %>%
dplyr::filter(`LV index` %in% c(524, 603, 985),
FDR < 0.05) %>%
dplyr::arrange(`LV index`, desc(AUC))
```
Some notes on these results:
* `LV524` is also associated with the following monocyte-related gene sets:
`SVM Monocytes`, `DMAP_MONO1`, and `DMAP_MONO2`. Again, we would like something
that tells us about neutrophils _only_ if at all possible.
* `LV603` is associated with two other gene sets, [`PID_IL8CXCR2_PATHWAY`](http://software.broadinstitute.org/gsea/msigdb/cards/PID_IL8_CXCR2_PATHWAY.html) and
[`SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES`](http://software.broadinstitute.org/gsea/msigdb/cards/SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES.html).
It's worth noting that interleukin 8 (`IL8`) is a neutrophil chemoattractant,
and `SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES` [may just be related](http://software.broadinstitute.org/gsea/msigdb/compute_overlaps.jsp?geneSetName=SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES&collection=C2) to PI3K signaling & chemotaxis in general.
* `LV985` is also associated with `SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES` and
[`PID_FCER1PATHWAY`](http://software.broadinstitute.org/gsea/msigdb/cards/PID_FCER1_PATHWAY.html).
`FCER1` refers to Fc-epsilon receptor 1 which is mainly expressed on other
granulocytes.
#### U matrix
```{r}
PLIER::plotU(plierRes = recount.plier,
pval.cutoff = 1e-06,
indexCol = c(524, 603, 985),
top = 10)
```
```{r}
png(file.path(plot.dir, "recount2_neutrophil_Uplot.png"),
res = 300, width = 7, height = 7, units = "in")
PLIER::plotU(plierRes = recount.plier,
pval.cutoff = 1e-06,
indexCol = c(524, 603, 985),
top = 10)
dev.off()
```
### Plasma cell
Are there LVs associated with plasma cell gene sets in the recount2 model?
```{r}
recount.summary %>%
dplyr::filter(grepl("Plasma", pathway),
FDR < 0.05)
```
What other gene sets are associated with `LV951`?
```{r}
recount.summary %>%
dplyr::filter(`LV index` == 951)
```
`KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION` is not a significant
association (`FDR = 0.13`)
There were 2 LVs from the SLE WB PLIER model that were associated with plasma
cell gene sets.
Those LVs were also associated with other processes (e.g.,
`REACTOME_UNFOLDED_PROTEIN_RESPONSE`, `KEGG_HEDGEHOG_SIGNALING_PATHWAY`)
## Transform SLE WB data
```{r}
sle.recount.b <- GetNewDataB(exprs.mat = exprs.mat,
plier.model = recount.plier)
b.file <- file.path(results.dir, "SLE-WB_B_matrix_recount2_model.RDS")
saveRDS(sle.recount.b, file = b.file)
```
## Banchereau, et al. cell type results
### Neutrophil
```{r}
neutro.lv.df <- as.data.frame(cbind(colnames(sle.recount.b),
t(sle.recount.b[c(524, 603, 985), ])))
colnames(neutro.lv.df) <- c("Sample", "recount2_LV524", "recount2_LV603",
"recount2_LV985")
```
```{r}
# join with sle wb results & neutrophil counts
neutro.df <- dplyr::inner_join(sle.neutro.df, neutro.lv.df) %>%
dplyr::mutate(recount2_LV524 = as.numeric(as.character(recount2_LV524)),
recount2_LV603 = as.numeric(as.character(recount2_LV603)),
recount2_LV985 = as.numeric(as.character(recount2_LV985)))
head(neutro.df)
```
```{r}
neutro.out.file <- file.path(results.dir, "neutrophil_count_LV_both_models.tsv")
readr::write_tsv(neutro.df, path = neutro.out.file)
```
```{r}
# function for scatter plots, given lv (a string indicating the LV of interest)
# make a scatter plot where the LV is the x variable, neutrophil count is the
# y variable & fit a line with geom_smooth(method = "lm")
# also will annotate the plot with the supplied r-squared value (rsq arg)
# where the text is placed is automatically chosen from the x and y values
# needs to be used in this global environment
LVScatter <- function(lv, rsq) {
y.var <- "Neutrophil.Count"
# calculate where to put the r-squared value
x.range <- max(neutro.df[, lv]) - min(neutro.df[, lv])
x.coord <- min(neutro.df[, lv]) + (x.range * 0.15)
y.range <- max(neutro.df[, y.var]) - min(neutro.df[, y.var])
y.coord <- max(neutro.df[, y.var]) - (y.range * 0.15)
ggplot2::ggplot(neutro.df, ggplot2::aes_string(x = lv, y = y.var)) +
ggplot2::geom_point(alpha = 0.7) +
ggplot2::geom_smooth(method = "lm") +
ggplot2::theme_bw() +
ggplot2::labs(y = "Neutrophil Count") +
ggplot2::theme(legend.position = "none",
text = ggplot2::element_text(size = 15)) +
ggplot2::annotate("text", x = x.coord, y = y.coord,
label = paste("r-squared =", rsq))
}
```
#### LV524
```{r}
summary(lm(neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV524))
```
```{r}
LVScatter(lv = "recount2_LV524", rsq = 0.34)
```
#### LV603
```{r}
summary(lm(neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV603))
```
```{r}
LVScatter(lv = "recount2_LV603", rsq = 0.36)
```
#### LV985
```{r}
summary(lm(neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV985))
```
```{r}
LVScatter(lv = "recount2_LV985", rsq = 0.32)
```
The neutrophil-associated LVs from the recount2 model generally perform as well
as the neutrophil-associated LVs from the SLE WB PLIER model.
We'll plot the results from `SLE WB LV87` and `recount2 LV603` together because
these are both LVs that are not significantly associated with monocyte
signatures and have high AUC for the neutrophil gene sets.
```{r}
sle.plot <- LVScatter("LV87", rsq = 0.29) +
ggplot2::labs(title = "SLE WB PLIER model") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold"))
```
```{r}
recount.plot <- LVScatter("recount2_LV603", rsq = 0.36) +
ggplot2::labs(title = "recount2 PLIER model") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold"))
```
```{r}
pdf(file.path(plot.dir, "Neutrophil_LV_model_comparison.pdf"), width = 14,
height = 7)
gridExtra::grid.arrange(sle.plot, recount.plot, ncol = 2)
dev.off()
```
#### Correlation between LV values from different models
```{r}
summary(lm(neutro.df$LV87 ~ neutro.df$recount2_LV603))
```
```{r}
ggplot2::ggplot(neutro.df, ggplot2::aes(x = recount2_LV603,
y = LV87)) +
ggplot2::geom_point(alpha = 0.7) +
ggplot2::geom_smooth(method = "lm") +
ggplot2::theme_bw() +
ggplot2::labs(x = "recount2 LV603", y = "SLE WB LV87") +
ggplot2::theme(legend.position = "none",
text = ggplot2::element_text(size = 15)) +
ggplot2::annotate(geom = "text", x = -0.375, y = 1,
label = "r-squared = 0.87")
```
```{r}
plot.file <-
file.path(plot.dir, "Banchereau_neutrophil_count_LV87_v_recLV603_scatter.png")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())
```
#### Distribution of correlation values
There are 987 latent variables from the recount2 PLIER model.
Could any one of them be well-correlated with the Banchereau, et al.
neutrophil counts by chance?
```{r}
recount.b.t <- as.data.frame(t(sle.recount.b))
recount.b.t <- tibble::rownames_to_column(recount.b.t, "Sample")
head(recount.b.t)
```
```{r}
b.count.df <- dplyr::inner_join(x = neutro.df[, c("Sample",
"Neutrophil.Count")],
y = recount.b.t, by = "Sample")
```
```{r}
rsq.all.lv <- (cor(x = b.count.df$Neutrophil.Count,
y = b.count.df[, 3:ncol(b.count.df)])) ^ 2
rsq.df <- as.data.frame(matrix(rsq.all.lv, ncol = 1))
colnames(rsq.df) <- "r.squared"
```
```{r}
ggplot2::ggplot(rsq.df, ggplot2::aes(x = r.squared)) +
ggplot2::geom_density() +
ggplot2::geom_segment(mapping = ggplot2::aes(x = rsq.df[603, ],
xend = rsq.df[603, ],
y = 14, yend = 0),
arrow = grid::arrow(),
colour = "blue") +
ggplot2::theme_bw() +
ggplot2::theme(text = ggplot2::element_text(size = 13)) +
ggplot2::labs(x = "R-squared", subtitle = "Neutrophil Count ~ LV Value",
title = "Distribution of R-squared values") +
ggplot2::annotate(geom = "text", x = 0.36, y = 15, colour = "blue",
label = "LV603", size = 5)
```
```{r}
plot.file <-
file.path(plot.dir, "Banchereau_neutrophil_count_recount2_lv_rsq.png")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())
```
```{r}
which.max(rsq.all.lv)
```
`recount2 LV603` has the highest R-squared with the Banchereau, et al.
neutrophil counts.
### Plasma cell
```{r}
# get LV951 into data.frame form
plasma.lv.df <- as.data.frame(cbind(colnames(sle.recount.b),
sle.recount.b[951, ]))
colnames(plasma.lv.df) <- c("Sample", "recount2_LV951")
```
```{r}
# join with SLE WB results
plasma.df <- dplyr::inner_join(sle.plasma.df, plasma.lv.df, by = "Sample") %>%
dplyr::mutate(recount2_LV951 = as.numeric(as.character(recount2_LV951)))
head(plasma.df)
```
```{r}
plasma.file <- file.path(results.dir, "plasma_cell_LVs_both_models.tsv")
readr::write_tsv(plasma.df, plasma.file)
```
```{r}
plasma.df %>%
ggplot2::ggplot(ggplot2::aes(x = Disease.Activity,
y = recount2_LV951)) +
ggplot2::geom_boxplot(notch = TRUE) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Disease Activity") +
ggplot2::theme(text = ggplot2::element_text(size = 15))
```
```{r}
# check for statistical significance
# (pairwise t-test is consistent with original publication as far as I can tell)
pairwise.t.test(x = plasma.df$recount2_LV951,
g = plasma.df$Disease.Activity,
p.adjust.method = "bonferroni")
```
This is highly similar to what we saw in `06-sle-wb_cell_type`.
Now we'll plot the recount2 (`recount2_LV951`) and SLE WB model (`LV52`)
results side by side as we did above with the neutrophil scatterplots.
```{r}
sle.plot <- plasma.df %>%
ggplot2::ggplot(ggplot2::aes(x = Disease.Activity,
y = LV52)) +
ggplot2::geom_boxplot(notch = TRUE) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Disease Activity",
title = "SLE WB PLIER model") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold")) +
ggplot2::theme(text = ggplot2::element_text(size = 15))
```
```{r}
recount.plot <- plasma.df %>%
ggplot2::ggplot(ggplot2::aes(x = Disease.Activity,
y = recount2_LV951)) +
ggplot2::geom_boxplot(notch = TRUE) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Disease Activity",
title = "recount2 PLIER model") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold")) +
ggplot2::theme(text = ggplot2::element_text(size = 15))
```
```{r}
pdf(file.path(plot.dir, "Plasma_cell_LV_model_comparison.pdf"), width = 14,
height = 7)
gridExtra::grid.arrange(sle.plot, recount.plot, ncol = 2)
dev.off()
```