---
title: "Training in ade4 in R - Module I: Basic methods"
subtitle: "Multiple correspondence analysis"
author: "Stéphane Dray"
date: '`r Sys.Date()`'
output:
  xaringan::moon_reader:
    lib_dir: libs
    css: [custom.css, xaringan-themer.css]
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
    includes:
      before_body: preamble.tex
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(tidy=TRUE, tidy.opts=list(width.cutoff=50), fig.height= 4, fig.width=4, dev='svg', fig.align="center", fig.path = "fig/")
```
```{r xaringan-themer, include=FALSE}
library(xaringanthemer)
style_duo(primary_color = "#1F4257", secondary_color = "#F97B64", table_row_even_background_color = "#1F4257")
```
---
# Data structure
.left-column[
```{r, echo = FALSE}
knitr::include_graphics("img/onetable-Z.png")
```
]
.right-column[
* One table with *p* variables measured on *n* individuals
* All variables are **qualitative** (categorical)
* For instance:
  - sites $\times$ environmental variables (e.g., soil types)
  - species $\times$ traits (e.g., functional groups)
]
---
# Objectives
* Identify the main information contained in the table
  - Identify which *categories* are the most strongly associated
  - Identify the principal differences/similarities between individuals
---
# Data
We consider the `meaudret` data set:
```{r}
library(ade4)
data(meaudret)
names(meaudret)
dim(meaudret$env)
names(meaudret$env)
```
---
# Categorical variables
The data set contains an environmental table with 20 measurements of 9 environmental variables. For this example, each quantitative variable is cut into three equal-width classes (`low`, `med`, `hi`) to obtain categorical variables:
```{r}
env.categ <- apply(meaudret$env, 2, cut, breaks = 3, labels = c("low", "med", "hi"))
env.categ <- as.data.frame(env.categ, stringsAsFactors = TRUE)
head(env.categ, 3)
```
We want to know
* what are the main environmental gradients, i.e., which variables co-vary (if any)
* which samples have similar/different environmental conditions
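
Because `apply()` returns a character matrix, the factor levels created above end up in alphabetical order (`hi`, `low`, `med`). The chunk below is a minimal sketch of an alternative recoding, assuming the same three equal-width classes are wanted, that keeps the levels in increasing order. The name `env.ord` is hypothetical and is not used elsewhere in these slides.

```{r}
library(ade4)
data(meaudret)
# cut each column separately so that every result stays a factor
env.ord <- lapply(meaudret$env, cut, breaks = 3, labels = c("low", "med", "hi"))
env.ord <- as.data.frame(env.ord, row.names = row.names(meaudret$env))
# cut() keeps empty classes as levels: drop them before any MCA
env.ord <- droplevels(env.ord)
levels(env.ord[, 1])
```

The categories assigned to each sample are the same as before; only the order of the levels (and hence the order of boxes or columns in later outputs) changes.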
---
# Quantitative and categorical variables
* Some variables are recorded as categorical
* Quantitative variables can always be recoded as categorical ones
  - `r emo::ji("sad_but_relieved_face")` This loses information (different values are grouped into a single category)
  - `r emo::ji("smile")` It allows non-linear relationships to be detected (the order of the categories is not used)
```{r, out.width="40%", fig.dim = c(6,4)}
plot(meaudret$env[, 1] ~ env.categ[, 1], ylab = "Quantitative Temp",
     xlab = "Qualitative Temp")
```
---
# Disjunctive table
The original data table $\mathbf{Z}$ contains categorical information (words).
The first step is to recode it as a complete disjunctive table $\mathbf{X}_B$: a binary table with $n$ rows and $m$ columns (the total number of categories), in which a 1 indicates that an individual carries a category.
```{r, echo = FALSE}
knitr::include_graphics("img/disjunct.png")
```
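As an illustration, the disjunctive table of the recoded `meaudret` data can be built with `acm.disjonctif()` from ade4. This is only a sketch; the object name `disj` is hypothetical, and the recoding step is repeated so that the chunk runs on its own.

```{r}
library(ade4)
data(meaudret)
env.categ <- apply(meaudret$env, 2, cut, breaks = 3, labels = c("low", "med", "hi"))
env.categ <- as.data.frame(env.categ, stringsAsFactors = TRUE)
# one 0/1 column per category of each variable
disj <- acm.disjonctif(env.categ)
dim(disj)
# each individual carries exactly one category per variable
table(rowSums(disj))
```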
---
# Multiple correspondence analysis
* $\mathbf{X}=\mb{X}_B{\mb{D}_m}^{-1}-{\mb{1}_{n}}{\mb{1}_{m}}\tr$ is the transformed and centred disjunctive table
* $\mathbf{Q} = \frac{1}{p}\mb{D}_m$ where $\mb{D}_m=\mathrm{diag}(\mb{X}_B\tr\mb{D}\mb{1}_{n})$ contains the category frequencies
* $\mathbf{D} = \frac{1}{n}\mathbf{I}_n$ is the diagonal matrix of uniform row weights $\frac{1}{n}$
.column-left[
```{r, echo = FALSE}
knitr::include_graphics("img/onetable-Z.png")
```
]
.column-center[
.center[
`dudi.acm`
```{r, echo = FALSE}
knitr::include_graphics("img/arrow.png")
```
]
]
.column-right[
```{r, echo = FALSE}
knitr::include_graphics("img/pca-map.png")
```
]
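The three elements of the triplet can be rebuilt by hand and compared with what `dudi.acm` stores. The chunk below is a sketch under the default uniform row weights; the names `XB`, `freq` and `X` are hypothetical.

```{r}
library(ade4)
data(meaudret)
env.categ <- apply(meaudret$env, 2, cut, breaks = 3, labels = c("low", "med", "hi"))
env.categ <- as.data.frame(env.categ, stringsAsFactors = TRUE)
mca <- dudi.acm(env.categ, scannf = FALSE)
XB <- as.matrix(acm.disjonctif(env.categ))  # disjunctive table X_B
n <- nrow(XB)
p <- ncol(env.categ)
freq <- colMeans(XB)                        # diagonal of D_m
X <- sweep(XB, 2, freq, "/") - 1            # X_B D_m^{-1} - 1_n 1_m'
all.equal(unname(X), unname(as.matrix(mca$tab)))
all.equal(unname(freq / p), unname(mca$cw)) # diagonal of Q
all.equal(rep(1 / n, n), unname(mca$lw))    # diagonal of D
```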
---
# Maximized criteria
* For individuals
$$ Q(\mathbf{a})=\sqnorm{\mathbf{XQa}}{\mb{D}} = \sqnorm{\mathbf{X}\frac{1}{p}\mathbf{D}_m\mb{a}}{\frac{1}{n}\mb{I}_n}=\sqnorm{\frac{1}{p}\mathbf{X}_B\mb{a}}{\frac{1}{n}\mb{I}_n}=var\left (\frac{1}{p}\mb{X}_B\mb{a} \right ) = \lambda
$$
* For variables
$$\sqnorm{\mb{X}\tr\mb{Db}}{\mb{Q}} = \sqnorm{\mathbf{X}\tr\frac{1}{n}\mb{I}_n\mb{b}}{\frac{1}{p}\mb{D}_m}=\sqnorm{\frac{1}{n}\mb{D}_m^{-1}\mathbf{X}_B\tr\mb{b}}{\frac{1}{p}\mb{D}_m}$$
The vector $\frac{1}{n}\mb{D}_m^{-1}\mathbf{X}_B\tr\mb{b}$ contains means of $\mb{b}$ per category so that:
$$\sqnorm{\mathbf{X}\tr\frac{1}{n}\mb{I}_n\mb{b}}{\frac{1}{p}\mb{D}_m}=\frac{1}{p}\sum_{j=1}^{p}\eta^2(\mathbf{z}_j,\mathbf{b})$$
This quantity is the mean of correlation ratios computed for all variables.
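Both criteria can be checked numerically on the recoded `meaudret` data. The sketch below assumes the default uniform row weights and the two axes kept by default; for each axis, the variance of the row scores and the mean of the correlation ratios should both match the eigenvalue.

```{r}
library(ade4)
data(meaudret)
env.categ <- apply(meaudret$env, 2, cut, breaks = 3, labels = c("low", "med", "hi"))
env.categ <- as.data.frame(env.categ, stringsAsFactors = TRUE)
mca <- dudi.acm(env.categ, scannf = FALSE)
# weighted variance of the (centred) row scores on each kept axis
colSums(as.matrix(mca$li)^2 * mca$lw)
# mean of the correlation ratios over the p variables
colMeans(mca$cr)
# eigenvalues of the kept axes
mca$eig[1:mca$nf]
```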
---
# The `dudi.acm` function
## Arguments
```{r}
args(dudi.acm)
```
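* `df` is a `data.frame` containing the categorical data (all columns must be `factor`s)
* `row.w` is an optional vector of row weights (uniform by default)
* `scannf` and `nf` control the number of axes kept: with `scannf = TRUE` the eigenvalue barplot is displayed and the number is asked interactively, otherwise `nf` is used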
```{r}
mca.meau <- dudi.acm(env.categ, scannf = FALSE)
```
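As a usage sketch, the number of axes can be set directly with `nf`, and the eigenvalues can be plotted afterwards to judge whether that choice is reasonable. The object name `mca3` and the value `nf = 3` are arbitrary choices made for illustration.

```{r}
library(ade4)
data(meaudret)
env.categ <- apply(meaudret$env, 2, cut, breaks = 3, labels = c("low", "med", "hi"))
env.categ <- as.data.frame(env.categ, stringsAsFactors = TRUE)
# keep three axes without the interactive prompt
mca3 <- dudi.acm(env.categ, scannf = FALSE, nf = 3)
mca3$nf
dim(mca3$li)
# eigenvalue barplot, similar to what scannf = TRUE displays
barplot(mca3$eig)
```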
---
## Returned values
```{r}
names(mca.meau)
```
It returns an object of class `dudi` containing:
- `$eig`: eigenvalues ( $\mb{\Lambda}$ )
- `$cw`: column (i.e., category) weights ( $\frac{1}{p}\mb{D}_m$ )
- `$lw`: row weights ( $\mb{D}=\frac{1}{n}\mb{I}_n$ )
- `$tab`: transformed and centred disjunctive data table ( $\mb{X}$ )
- `$c1`: category loadings ( $\mb{A}$ )
- `$li`: row scores ( $\mb{L}=\frac{1}{p}\mathbf{X}_B\mb{A}$ )
- `$l1`: principal components ( $\mb{B}$ )
- `$co`: column scores ( $\mb{C}=\frac{1}{n}\mb{D}_m^{-1}\mathbf{X}_B\tr\mb{B}$ )
- `$cr`: correlation ratios between qualitative variables and axes
---
# Graphical representation and interpretation
In the first viewpoint, MCA positions categories by a normed score (`$c1`). A score for individuals (`$li`) is derived from this category score: an individual is located at the mean of the scores of the categories that it carries.
This second score provides an ordination of individuals with the highest possible dispersion (maximum variance).
In the second viewpoint, MCA finds normed coordinates for individuals (`$l1`) and positions each category at the mean of the scores of the individuals that carry it (`$co`). This maximises the mean, over all variables, of the between-category variance. In other words, it maximises the mean of the correlation ratios.
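The two averaging relations described above can be verified directly. The chunk below is a sketch assuming uniform row weights; `li.hand` and `co.hand` are hypothetical names for the scores recomputed by hand.

```{r}
library(ade4)
data(meaudret)
env.categ <- apply(meaudret$env, 2, cut, breaks = 3, labels = c("low", "med", "hi"))
env.categ <- as.data.frame(env.categ, stringsAsFactors = TRUE)
mca <- dudi.acm(env.categ, scannf = FALSE)
XB <- as.matrix(acm.disjonctif(env.categ))
p <- ncol(env.categ)
# first viewpoint: row score = mean of the normed scores of the p carried categories
li.hand <- XB %*% as.matrix(mca$c1) / p
all.equal(unname(li.hand), unname(as.matrix(mca$li)))
# second viewpoint: category score = mean of the normed scores of its individuals
co.hand <- crossprod(XB, as.matrix(mca$l1)) / colSums(XB)
all.equal(unname(co.hand), unname(as.matrix(mca$co)))
```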
---
## Graphical representations
.pull-left[
```{r}
s.label(mca.meau$co)
```
]
.pull-right[
```{r}
s.label(mca.meau$li)
```
]
---
## Optimal representation
.pull-left[
```{r}
plot(mca.meau, col = TRUE)
```
]
.pull-right[
```{r}
mca.meau$cr
```
]
---
# Inertia statistics
```{r}
summary(mca.meau)
```
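The shares of inertia reported by `summary()` can also be computed by hand. In this sketch, `m` and `p` are hypothetical names for the number of categories and of variables; with the weights used here, the total inertia of an MCA should equal $m/p - 1$.

```{r}
library(ade4)
data(meaudret)
env.categ <- apply(meaudret$env, 2, cut, breaks = 3, labels = c("low", "med", "hi"))
env.categ <- as.data.frame(env.categ, stringsAsFactors = TRUE)
mca <- dudi.acm(env.categ, scannf = FALSE)
m <- ncol(mca$tab)    # number of categories
p <- ncol(env.categ)  # number of variables
# total inertia compared with m / p - 1
c(sum(mca$eig), m / p - 1)
# share of inertia carried by each axis
round(mca$eig / sum(mca$eig), 3)
```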
---
# Mix of variables
If a table contains both quantitative and categorical variables, Hill and Smith analysis (HSA) can be applied.
This method is a compromise between PCA and MCA.
* If all the variables are quantitative, then the results of HSA are identical to those of PCA.
* If all the variables are qualitative, then the results are identical to those of MCA.
* If there is a mix of variables, then the analysis is an optimal combination of the properties of the two analyses (maximizing the squared correlations for quantitative variables and correlation ratios for categorical ones)
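
In ade4, this analysis is available as `dudi.hillsmith`. The chunk below is a sketch on a hypothetical mixed table (`env.mix`) built from three original and three recoded `meaudret` variables. For the all-categorical case it compares shares of inertia rather than raw eigenvalues, on the assumption that the two functions may scale the column weights differently.

```{r}
library(ade4)
data(meaudret)
env.categ <- apply(meaudret$env, 2, cut, breaks = 3, labels = c("low", "med", "hi"))
env.categ <- as.data.frame(env.categ, stringsAsFactors = TRUE)
# hypothetical mixed table: three quantitative and three categorical variables
env.mix <- data.frame(meaudret$env[, 1:3], env.categ[, 4:6])
hs.mix <- dudi.hillsmith(env.mix, scannf = FALSE)
hs.mix$eig[1:2]
# all-categorical input: shares of inertia compared with those of MCA
hs.cat <- dudi.hillsmith(env.categ, scannf = FALSE)
mca <- dudi.acm(env.categ, scannf = FALSE)
all.equal(hs.cat$eig / sum(hs.cat$eig), mca$eig / sum(mca$eig))
```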
---
# MCA in practice
.center[
[Go to practical 3](../../practical/session3/session3.html)
]