
# Principal component analysis

Principal component analysis (PCA) is a method in the field of *multivariate statistics*. Such methods do not define any causality between variables by distinguishing between response and predictor variables. Instead, they look at the overall structure of data and attempt to identify patterns. PCA finds patterns that can be used to **reduce the number of variables** while minimizing the amount of information lost. This can be useful, e.g., when a regression model contains many correlated predictors.

See @navarro_learning_2018 section 15.2 and @crawley_r_2013 section 25.1.

## Intuition

The underlying goal of PCA in simple terms is to summarize many variables into one, a few, or several new variables. More precisely, **the aim is to determine a set of standardized linear combinations that explain a maximum amount of variation in data**. Sometimes only one linear combination is sought, namely the one that best separates observations. As such, PCA is a *dimensionality reduction* technique.

Each linear combination that summarizes part of the variance in data is called a *principal component* (*PC*). These linear combinations $\xi_j$ can be expressed as $\xi_j = b_{j1}x_1 + b_{j2}x_2 + \dots + b_{jp}x_p$, where $b_{jp}$ is the weight of variable $x_p$ in $\xi_j$, i.e. the $j$th PC. PCs are thus essentially **weighted sums of variables**. The ordering of PCs is fixed: **PCs are ordered decreasingly, starting from the PC that explains the most variation in data**. The maximum number of PCs we can determine is $\min(p, n - 1)$, where $p$ is the number of variables and $n$ the number of observations.
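
The weighted-sum view can be checked directly in R. The following is a minimal sketch, assuming the built-in `USArrests` data set (not part of the original text) as example data and standardized variables; it computes the first PC by hand from the weights and compares it with the scores returned by `prcomp()`.

```r
# USArrests ships with base R (datasets package): 50 observations, 4 variables.
# It is used here only as illustrative example data.
X <- scale(USArrests)                     # center and standardize each variable
pca <- prcomp(USArrests, scale. = TRUE)

# Weights b_11, ..., b_1p of the first PC
b1 <- pca$rotation[, 1]

# First PC as a weighted sum of the standardized variables
xi1 <- as.vector(X %*% b1)

# Should agree with the PC1 scores computed by prcomp()
all.equal(xi1, unname(pca$x[, 1]))

# Maximum number of PCs, min(p, n - 1), and the number actually returned
min(ncol(X), nrow(X) - 1)
ncol(pca$x)
```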

## Estimation of PCs

There are several ways to understand the estimation of PCs. An intuitive approach is to visualize a data cloud from which we incrementally derive the PCs. The **first PC explains the maximum amount of variation in the data cloud**. Thus, we choose the first PC so that (1) it follows the direction of largest variance in the data cloud, which (2) also makes it the line with the smallest sum of squared orthogonal distances to the data points. As such the idea is similar to least squares estimation, except that in **PCA we consider variation in all variables (not just the response)** and measure residuals orthogonally to the line rather than along the response axis. See the [relevant question on Stack Exchange](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues) for some visual explanations.

Establishing the direction of largest variation requires transforming data points to a new coordinate system. This involves the following steps: (1) centering the data points, (2) scaling the axes so that they are comparable, (3) rotating the axes into the directions of the PCs. In the last step the **first PC is rotated so as to maximize variation, whereas each following PC maximizes the remaining variation while being orthogonal to all preceding PCs**.
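
The three steps can be followed by hand. This is only a sketch under assumptions: the built-in `USArrests` data serve as example data, and for brevity the rotation matrix in step (3) is taken from `prcomp()` rather than derived from scratch.

```r
X <- as.matrix(USArrests)   # illustrative example data from base R

# (1) center: subtract each variable's mean
X_centered <- sweep(X, 2, colMeans(X))

# (2) scale: divide each variable by its standard deviation
X_scaled <- sweep(X_centered, 2, apply(X, 2, sd), "/")

# (3) rotate: project the points onto the PC directions
#     (data are already centered and scaled, so prcomp() only rotates)
V <- prcomp(X_scaled, center = FALSE)$rotation
scores <- X_scaled %*% V

# The PC directions are orthogonal to each other (identity matrix expected)
round(crossprod(V), 10)

# The variance along the new axes decreases from PC1 onwards
apply(scores, 2, var)

# The new variables are uncorrelated with each other
round(cor(scores), 10)
```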

Estimation of PCs requires **data to be centered**, i.e. for each variable $\mu = 0$. This is usually done by software. Another aspect to consider is the **scale of data**. If the software does its calculations on the correlation matrix, then scale is irrelevant because the correlation matrix is scale-invariant. However, if the covariance matrix is used, then variables with higher variance are given more importance (recall that PCA involves maximizing variation). Whether or not this is preferred depends on the data and the research problem.

Note that while @navarro_learning_2018[, 437] suggest using rotations, we should not rotate PCs. The basic principle of PCA is that the PCs are orthogonal and rotation would violate this principle.

In Jamovi: `Factor > Principal Component Analysis | Rotation: none`.  
In R: `prcomp(data, scale = TRUE)`
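
The effect of scale can be illustrated by running PCA on both the covariance and the correlation matrix. The sketch below assumes the built-in `USArrests` data as example data; there, `Assault` has by far the largest variance, so it is expected to dominate the first PC when variables are not standardized.

```r
# Variances differ by orders of magnitude across variables
sapply(USArrests, var)

pca_cov <- prcomp(USArrests)                 # covariance matrix (no scaling, the default)
pca_cor <- prcomp(USArrests, scale. = TRUE)  # correlation matrix (standardized variables)

# Weights of the first PC under each choice
round(cbind(covariance  = pca_cov$rotation[, 1],
            correlation = pca_cor$rotation[, 1]), 2)
```

Note that the full argument name in `prcomp()` is `scale.`; writing `scale = TRUE` also works because R matches argument names partially.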

## Calculation

The transformation of data points into a new coordinate system is done according to the *eigenvalues* (scaling) and *eigenvectors* (rotation) derived from data. There are two methods for obtaining these: (1) using the covariance or correlation matrix of the data or (2) using the data matrix itself. These methods are described below.

### Spectral decomposition

A data matrix $X : n \times p$ can be represented by its $p \times p$ correlation or covariance matrix $\Sigma$. This can be decomposed as $\Sigma = V D V^T$, where $D = diag(\lambda_1, \lambda_2, \dots, \lambda_p)$ is a diagonal matrix of eigenvalues of $\Sigma$ and the columns of matrix $V = (v_1, v_2, \dots, v_p)$ are the corresponding eigenvectors of $\Sigma$ (at most $\min(p, n - 1)$ of the eigenvalues are nonzero). Thus, matrices $D$ and $V$ provide the eigenvalues and eigenvectors. This can also be expressed as $\Sigma v_i = \lambda_i v_i$, where $\lambda_i$ is the eigenvalue that corresponds to eigenvector $v_i$ of the $i$th PC.
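
A minimal sketch of the spectral decomposition in R, assuming the built-in `USArrests` data as example data and the correlation matrix as $\Sigma$. It uses base R's `eigen()` and compares the result with `prcomp()`; the signs of eigenvectors are arbitrary, hence the comparison of absolute values.

```r
R <- cor(USArrests)      # correlation matrix, p x p (illustrative example data)
e <- eigen(R)            # spectral decomposition
V <- e$vectors           # eigenvectors as columns
D <- diag(e$values)      # eigenvalues on the diagonal, in decreasing order

# Reconstruct the matrix as V D V^T
all.equal(V %*% D %*% t(V), R, check.attributes = FALSE)

# Eigenvalue equation for the first PC: R v_1 = lambda_1 v_1
all.equal(as.vector(R %*% V[, 1]), e$values[1] * V[, 1])

# Eigenvalues equal the squared standard deviations reported by prcomp()
pca <- prcomp(USArrests, scale. = TRUE)
all.equal(e$values, pca$sdev^2)

# Eigenvectors agree with the weights from prcomp() up to sign
all.equal(abs(V), abs(unname(pca$rotation)))
```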

### Singular value decomposition

A data matrix $X : n \times p$ with $rank(X) = k$ can also be decomposed as $X = UDV^T$, where $U = (u_1, u_2, \dots, u_k)$ is an $n \times k$ matrix of eigenvectors of $XX^T$, $V = (v_1, v_2, \dots, v_k)$ is a $p \times k$ matrix of eigenvectors of $X^TX$, and $D = diag(d_1, d_2, \dots, d_k)$ is a diagonal matrix of singular values $d_i = \sqrt{\lambda_i}$, where $\lambda_1, \lambda_2, \dots, \lambda_k$ are the nonzero eigenvalues shared by $X^TX$ and $XX^T$. For a centered data matrix $X$, this decomposition thus gives us the eigenvalues (as $d_i^2$, or as $d_i^2 / (n - 1)$ on the scale of the covariance matrix) and the corresponding eigenvectors $v_1, v_2, \dots, v_k$ that define the PCs.
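
The singular value decomposition can be sketched with base R's `svd()`. The example assumes the built-in `USArrests` data, centered and standardized, so that the results are comparable with PCA on the correlation matrix. Absolute values are compared where the sign of a component is arbitrary.

```r
X <- scale(USArrests)    # centered and standardized data matrix (example data)
n <- nrow(X)
s <- svd(X)              # s$u, s$d and s$v hold U, the diagonal of D, and V

# Reconstruct the data matrix as U D V^T
all.equal(s$u %*% diag(s$d) %*% t(s$v), X, check.attributes = FALSE)

# Squared singular values divided by n - 1 are the eigenvalues
# of the correlation matrix
all.equal(s$d^2 / (n - 1), eigen(cor(USArrests))$values)

# Columns of V are the PC weights, and U D gives the PC scores
pca <- prcomp(USArrests, scale. = TRUE)
all.equal(abs(s$v), abs(unname(pca$rotation)))
all.equal(abs(s$u %*% diag(s$d)), abs(unname(pca$x)))
```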

## Assumptions

The most essential assumption of PCA concerns the type of variables: PCA should only be applied to variables that are measured on a **continuous scale**, i.e. on a ratio or interval scale. It is also common to include ordinal variables but these are then treated as continuous.

Because PCA involves a linear transformation of data, at least some **linear relationships** in data are assumed. Also, **notable outliers** may have high leverage in the transformation. [**Sphericity** and **sampling adequacy**][Factor analysis] may also be considered.

PCA has no assumptions on normal distribution of values or equality of variance among variables (although scales are important!).

## Number of components

Recall that the aim of PCA is to summarize variables into a smaller number of PCs. The number of PCs that are used is a free choice. Depending on our research problem, we might wish to obtain just one PC, a few PCs, or more. There are **several possible criteria that may help us choose the number of components**:

1. choose a threshold of variation in data that PCs should cumulatively explain, e.g. 80%; 
2. use PCs that have eigenvalue above average or above one; 
3. use PCs that are above the "elbow" in a scree plot.

In Jamovi: `Factor > Principal Component Analysis | Number of Components`.
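
In R, these criteria can be evaluated from the output of `prcomp()`. The following is a sketch assuming the built-in `USArrests` data as example data and an 80% threshold; the variable names `k_threshold` and `k_average` are illustrative. The criteria need not agree with each other, so the final choice remains a judgment call.

```r
pca <- prcomp(USArrests, scale. = TRUE)   # illustrative example data
eigenvalues <- pca$sdev^2
cum_prop <- cumsum(eigenvalues) / sum(eigenvalues)

# 1. smallest number of PCs that cumulatively explain at least 80% of variation
k_threshold <- which(cum_prop >= 0.80)[1]

# 2. number of PCs with an eigenvalue above the average
#    (the average is one when variables are standardized)
k_average <- sum(eigenvalues > mean(eigenvalues))

# 3. scree plot for visual inspection of the "elbow"
screeplot(pca, type = "lines", main = "Scree plot")

c(threshold = k_threshold, above_average = k_average)
```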

## Interpretation

### Variation explained

The proportion of total *variation explained* by the $j$th PC can be found from its eigenvalue: it is the eigenvalue's share of the sum of all eigenvalues, i.e. $\lambda_j / \sum_{i=1}^{p} \lambda_i$. When PCA is based on the correlation matrix (standardized variables), $\sum_{i=1}^{p} \lambda_i$ equals the number of variables, so the eigenvalue of a PC indicates how many variables' worth of variation it represents. The proportion of variation explained by the first several PCs is the cumulative sum of the proportions explained by the individual PCs.

This measure illustrates **how much information is preserved when summarizing data** into a particular number of PCs.

In Jamovi: `Factor > Principal Component Analysis | Component summary`.  
In R: `summary(prcomp(data, scale = TRUE))`
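
The proportions can also be computed by hand from the eigenvalues. A minimal sketch, assuming the built-in `USArrests` data as example data and standardized variables:

```r
pca <- prcomp(USArrests, scale. = TRUE)   # illustrative example data
eigenvalues <- pca$sdev^2                 # prcomp() reports standard deviations

# With standardized variables the eigenvalues sum to the number of variables
sum(eigenvalues)
ncol(USArrests)

# Proportion and cumulative proportion of variation explained
prop <- eigenvalues / sum(eigenvalues)
round(rbind(proportion = prop, cumulative = cumsum(prop)), 3)

# The same figures are part of the summary() output
summary(pca)$importance
```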

### Loadings

Each PC is defined by the influence that each variable has on it. As such, **each variable has a particular loading on each PC** which is why the values of this influence are called loadings. Loadings can also be understood as variable weights. 

When the linear combinations are calculated for each observation using loadings (weights) and initial variables, we obtain *principal component scores*. These are in essence the new summarized variables.

Loadings can be used to interpret PCs and to give them a theoretical meaning. This involves a lot of creativity. Keep in mind the coding of variables: if higher values of one variable indicate a "negative" result and higher values of other variables a "positive" result, then you might consider multiplying some variables by $-1$ for a simpler interpretation.

In Jamovi: `Factor > Principal Component Analysis | Rotation: none > Save | Component scores`.  
In R: `prcomp(data, scale = TRUE)$x`
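
In the object returned by `prcomp()`, the weights are stored in `$rotation` and the scores in `$x`. The sketch below, which assumes the built-in `USArrests` data as example data, also illustrates the effect of recoding: multiplying a variable by $-1$ leaves the magnitudes of all weights unchanged and only reverses the sign of that variable's weight relative to the others (the overall sign of a PC is arbitrary).

```r
pca <- prcomp(USArrests, scale. = TRUE)   # illustrative example data

# Weights: one row per variable, one column per PC
round(pca$rotation, 2)

# Scores: one row per observation, one column per PC
head(pca$x, 3)

# Reverse the coding of one variable and repeat the analysis
flipped <- transform(USArrests, UrbanPop = -UrbanPop)
pca_flipped <- prcomp(flipped, scale. = TRUE)

# Compare the weights of the first PC before and after recoding
round(cbind(original = pca$rotation[, 1],
            flipped  = pca_flipped$rotation[, 1]), 2)
```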

### Biplots

These plots concisely summarize the first two PCs by depicting observations on these two dimensions together with the loadings of each variable as arrows. The direction of an arrow indicates the direction of the loading and its length indicates the magnitude of the loading. The angle between two arrows approximates the correlation between the respective variables: a small angle suggests a strong positive correlation, a right angle little or no correlation, and opposite directions a negative correlation.

In Jamovi: `snowCluster > PCA plot | Biplot`.  
In R: `biplot(prcomp(data, scale = TRUE))`