/
tutorial_PCA_correction.Rmd
397 lines (276 loc) · 12.2 KB
/
tutorial_PCA_correction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
---
title: 'Tutorial: PCA on State Budget'
author: 'MAP573 team'
date: "10/13/2020"
# bibliography: resources/MAP573.bib
---
# Preliminaries
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
rows.print = 5)
```
## Package requirements
We start by loading a couple of packages for data manipulation, dimension reduction and fancy representations.
```{r packages, message = FALSE, warning = FALSE}
library(tidyverse) # advanced data manipulation and vizualisation
library(knitr) # R notebook export and formatting
library(FactoMineR) # Factor analysis
library(factoextra) # Fancy plotting of FactoMineR outputs
library(corrplot) # Fancy plotting of matrices
library(GGally) # Easy-to-use ggplot2 extensions
theme_set(theme_bw()) # set default ggplot2 theme to black and white
```
## Importing the Data set
### Data description
We consider a data set inherited from the classical 'French school' of data analysis.
<!-- This data set is taken from @bouroche1980analyse and has been analyzed e.g. in @duby2006analyse. -->
The data matrix $\mathbf{X} = (x_{ij})_{i=1,\dots,n; j=1,\dots,p}$ contains the distribution of the French State budget between $p = 11$ different items between $n = 24$ years sampled between 1872 and 1971.
The values are given as a percentage of the overall budget and the different items are
- Government
- Agriculture
- Business and Industry
- Employment
- Housing
- Education
- Welfare
- Veterans
- Defence
- Debt Interest
- Others
### Data loading
The file `budget.csv` is available in comma separated format, with `;` as separator and `,` as decimal (SOME cumbersome French formatting).
**Load the data into a data frame and name the columns appropriately.**
<div class="hiddensolution">
The function `read_csv2` from the `readr` package, part of the [tidyverse](https://www.tidyverse.org/) can hande this automatically. We also specify the column names, corresponding to the budgetary items, plus the corresponding year, while loading the data set into `R`:
```{r data_loadings, message = FALSE}
state_budget <- readr::read_csv2("data/budget.csv",
col_names = c(
"Year",
"Governments",
"Agriculture",
"Business and Industry",
"Employment",
"Housing",
"Education",
"Welfare",
"Veterans",
"Defence",
"Debt Interest",
"Others"
))
```
</div>
# Basic descriptive analysis
## Data table summary
**Have a look at the head of the data table $\mathbf{X}$**
<div class="hiddensolution">
`kable` is used adapt the formatting to the type of output: HTML, screen, PDF
```{r data header}
state_budget %>% head() %>% knitr::kable()
```
The following function is useful to have a quick look at a data frame a check types of each variables:
```{r data glimpse}
glimpse(state_budget)
```
</div>
Each variable (i.e. budget item) takes it value in $[0,100]$ (proportion of the current budget). **Propose a plot to summarize the distribution of these proportions**
<div class="hiddensolution">
```{r boxplot}
state_budget %>%
dplyr::select(-Year) %>%
pivot_longer(everything()) %>%
ggplot() +
aes(x = name, y = value) + labs(x = "budget item", y = "value (proportion)") +
geom_boxplot() + geom_point(alpha = 0.5) +
theme(axis.text.x = element_text(angle = 90))
```
</div>
### Data transformation
**Create a categorical variable (factor) with a couple of level to regroup samples by time interval that you find relevant in history of 20th century. Check that these period are balanced.**
<div class="hiddensolution">
We identify 4 historical periods to regroup the 'Year' variable (the id of the sample) into four clusters that might be interesting for interpreting the the PCA. We amend our data frame by adding a column encoding this new descriptive variable (that will _not_ be used in the PCA, of course!).
```{r period}
state_budget <- state_budget %>%
mutate(Period = cut(Year,
breaks = c(-Inf, 1900, 1920, 1947, Inf), right = FALSE,
labels = c("<1900", "[1900, 1920)", "(1920, 1947]", ">= 1947"))
)
```
</div>
## Scatterplot/pairs plot
When only few variables are at play, plotting the scatterplot or (pairs-to-pairs plot) might help in finding obvious relations.
**Use the function `scatmat` from \{GGally\} to do that**
<div class="hiddensolution">
```{r scatterplot, fig.width = 8, fig.height = 8}
state_budget %>%
dplyr::select(-Year) %>% mutate_if(is_double, scale) %>%
GGally::scatmat(color = "Period")
```
</div>
When many more continuous variables are observed, a quick glance at the redondancy in information may be explored by representing the correlation matrix, optionally regrouped according to similarity (here, hierarchical clustering).
**Use the package \{corrplot\} to check correlations between variables. What do you conclude?**
<div class="hiddensolution">
```{r correlation matrix, fig.width = 8, fig.height = 8}
par(mfrow = c(1,2))
state_budget %>% dplyr::select(-Year, -Period) %>%
cor() %>% corrplot::corrplot()
state_budget %>% dplyr::select(-Year, -Period) %>%
cor() %>% corrplot::corrplot(method = "color", order = "hclust")
```
It seems that some variables carry the same information: dimension reduction might help. The data is probably living in a smaller space than $p=11$.
</div>
# Principal Component Anaysis
## Performing PCA
Before performing the PCA, we center (always!) and scale the continuous columns in the data table, in order to give the same weight to items with high or lower percentage.
**Form the matrix $\mathbf{X}$ of continuous scaled variables and perform the PCA with \{FactoMineR\}**
<div class="hiddensolution">
```{r PCA}
X <- state_budget %>%
mutate_if(is_double, scale) %>%
dplyr::select(-Year, -Period)
rownames(X) <- state_budget$Year
myPCA <- FactoMineR::PCA(
X = X, # the data on which PCA is performed
scale.unit = FALSE, # scaling has been made "manually"
graph = FALSE, # to make plot right now
ncp = 11 # keep all component
)
```
</div>
### Quick Recap on PCA
Recall that, essentially, a PCA and all important quantities is obtained by computing
the eigen decomposition of the empirical covariance matrix:
\begin{equation*}
\hat{\boldsymbol \Sigma} = \mathbf{X}^\top \mathbf{X} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top.
\end{equation*}
Equivalently -- and this is how it is computed in most program -- PCA is obtained by performing the _Singular Value Decomposition_ of the data matrix $\mathbf{X}$, i.e.,
\begin{equation*}
\mathbf{X} = \mathbf{V} \boldsymbol{\Lambda}^{1/2} \mathbf{U}^\top.
\end{equation*}
We use the following vocabulary:
- the weights, or rotation matrix, or loadings, is the orthogonal matrix $\mathbf{W} = \mathbf{U}$
- the score matrix, or principal components, is the matrix $\mathbf{F} = \mathbf{X}\mathbf{U} = \mathbf{V}\mathbf{\Lambda}^{1/2}$.
- the singular values are stack in the diagonal matrix $\mathbf{\Lambda}^{1/2}$.
- the eigen values $\lambda_j$ of $\boldsymbol\Sigma$ are the square of the singular values of $\mathbf{X}$.
All in all, a PCA is a matrix factorisation such that
\begin{equation*}
\mathbf{F} \mathbf{U}^\top = \mathbf{X}.
\end{equation*}
## Scree plot
The first diagnostic plot to make is a scree plot, which display the proportion of explained variances by the successive component. Recall that is related to the eigen values of the empirical covariance:
\begin{equation*}
\mathrm{percent}(\mathrm{var}_{j}) = \frac{\lambda_j}{\sum_{j=1}^p \lambda_j}.
\end{equation*}
**Make a scree plot. Comment!**
<div class="hiddensolution">
```{r variances}
factoextra::fviz_eig(myPCA, addlabels = TRUE, ylim = c(0,50))
```
Remark that the last 6 axes (> 50% of the number of variables) explain less than 10% of the total variance: this is an important argument toward dimension reduction.
</div>
## Variables study
All information about variables can be reached as follow:
```{r variables summary}
var <- get_pca_var(myPCA)
var
```
### Eigen vector / Weights (a.k.a. loadings)
The eigen vectors of the empirical covariance matrix give weights for obtaining the new variables as a linear combinaison of the original ones.
**Check that the right eigen vectors of $\mathbf{X}$, and the eigen vectors or $\boldsymbol\Sigma$ match with the loadings. Also check that it is orthogonal**
<div class="hiddensolution">
```{r eigenvectors}
U <- data.frame(myPCA$svd$V) ## eigen(cov(X))$vector
dimnames(U) <- dimnames(var$coord)
kable(U, digits = 3)
```
This is a _rotation matrix_, and thus an orthogonal matrix:
```{r orthogonality}
sum((t(U) - solve(U))^2)
```
</div>
The coordinate of the variables projected on the correlation circle are equal to $\mathbf{U} \sqrt{\mathbf{\Lambda}}$, the correlation between the new variables and the original ones. Indeed,
```{r coord var}
U <- as.matrix(U)
D <- diag(myPCA$svd$vs)
sum((U %*% D - var$coord)^2)
```
**Check the contribution of the original variables to the new axis. Comment**
<div class="hiddensolution">
The contribution of the original variable to the (e.g first two) components are given in percentage by the field `contrib`:
```{r contrib var}
fviz_contrib(myPCA, choice = "var", axes = 1:2)
```
</div>
### Correlation circle
The correlation circle gives a quick representation of how new and original variables are related together.
**Make the plot, comment! Also check the quality of the representation of the variables**
<div class="hiddensolution">
```{r circle}
kable(var$cor, digits = 3)
fviz_pca_var(myPCA, col.var = "cos2", axes = 1:2)
```
These quantities measure the correlation between the variabes in the new and the original bases.
The quality of the representation of the original variables by the new ones is measure with the (squared) cosine between them: a high cos2 indicates a good representation of the variable on the principal component.
```{r, out.width = "50%", fig.align='center', eval=FALSE, echo=FALSE}
corrplot(var$cos2, is.corr = FALSE, method = 'color')
fviz_cos2(myPCA, choice = "var", axes = 1:2)
```
</div>
## Indiviual Factor Map
All information about variables can ge reached as follow:
```{r individual}
ind <- get_pca_ind(myPCA)
ind
```
### The principal components / Scores
The _principal components_ are the the coordinate of the points in the new basis, after rotating the original data by the matrix of eigen vectors $U$.
**Compute them by yourself and check that it matches the coordinate of the individuals.**
<div class="hiddensolution">
```{r PC manual}
PC <- as.matrix(X) %*% U
kable(t(PC[, 1:2]), digits = 3)
```
This match the coordinate computed by `FactoMineR`:
```{r}
kable(t(ind$coord[, 1:2]), digits = 3)
```
</div>
The individual factor map is the representation of the principal components/scores in the specified axes.
**Use \{factoextra\} to represent the individuals in the new basis, on axes (1,2), (1,3), (2,3). Add some color related to the period. Comment**
<div class="hiddensolution">
Here, we change the size of the point according to the quality of their representation (measure with the cosine to the square, just like with variable factor map).
```{r ind map}
fviz_pca_ind (myPCA,
pointsize = "cos2",
pointshape = 21,
fill = "#E7B800",
repel = TRUE ,
axes = c(1, 2))
```
Let us enrich our plot by the Period, a qualitative variable that we wish related with the budget.
```{r Period}
par(mfrow = c(1,3))
fviz_pca_ind (myPCA,
geom.ind = "point",
col.ind = state_budget$Period,
palette = c("#00AFBB", "#E7B800", "#FC4E07","#2E9FDF", "#000000"),
legend.title = "Period",
axes = c(1, 2))
```
</div>
## Biplot
When there are only few variables (only tens), it is possible to give an unifying representation of individual and variable factor maps into a single plot, called _biplot_.
**Make the biplot and comment**
<div class="hiddensolution">
```{r biplot}
fviz_pca_biplot (myPCA,
geom.ind = "point",
col.ind = state_budget$Period,
palette = c("#00AFBB", "#E7B800", "#FC4E07","#2E9FDF", "#000000"),
legend.title = "Period",
axes = c(1, 2))
```
This plot is especially helpful to identifiy groups of individuals, how the new components are related to the original variables and finally which group individuals is carrying which part of the information/variables.
</div>
## References