diff --git a/1-intro.qmd b/1-intro.qmd index 7674668..b433bf1 100644 --- a/1-intro.qmd +++ b/1-intro.qmd @@ -5,9 +5,11 @@ High-dimensional data means that we have a large number of numeric features or v \index{variable}\index{feature} \index{projection} -![](images/shadow_puppets.png){width=450 fig-align="center" fig-env="figure*" fig-cap="Viewing high dimensions using low-dimensional displays is like playing shadow puppets, looking at the shadows to guess what the shape is." fig-alt="Three images, each with a hand or two hands, illustrating making shadows of a bird in flight, snail and dog."} +![Viewing high dimensions using low-dimensional displays is like playing shadow puppets, looking at the shadows to guess what the shape is.](images/shadow_puppets.png){#fig-shadow-puppets width=450 fig-alt="Three images, each with a hand or two hands, illustrating making shadows of a bird in flight, snail and dog."} +One approach to visualise high-dimensional data and models is by using linear projections, as done in a tour. You can think of projections of high-dimensional data like shadows (@fig-shadow-puppets). Unlike shadow puppets, though, the object stays fixed, and with multiple projections we can obtain a *view of the object from all sides*.
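The idea of a linear projection can be sketched in a few lines of R. This is a minimal illustration, with made-up data and basis vector (not from the book's examples):

```r
# A projection is a matrix multiplication: an (n x p) data matrix times a
# (p x d) orthonormal basis gives a d-dimensional "shadow" of the data.
set.seed(42)
X <- matrix(rnorm(200 * 2), ncol = 2)    # 200 points in 2D
a <- matrix(c(0.707, 0.707), ncol = 1)   # one 1D projection basis
shadow <- X %*% a                        # the 1D shadow of the 2D cloud
dim(shadow)                              # 200 x 1
```

A tour strings together many such bases, smoothly interpolated, so the shadow is seen from all sides.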
+ ## Getting familiar with tours @@ -24,7 +26,7 @@ s_p <- ggplot(simple_clusters, aes(x=x1, y=x2)) + annotate("text", x=2.0, y=2.2, label="(0.707, 0.707)", angle=45) + annotate("text", x=2.2, y=2.0, label="most clustered", angle=45) + geom_abline(intercept=0, slope=-1) + - annotate("text", x=-1.6, y=1.8, label="(-0.707, 0.707)", angle=-45) + + annotate("text", x=-1.6, y=1.8, label="(0.707, -0.707)", angle=-45) + annotate("text", x=-1.8, y=1.6, label="no clusters", angle=-45) + geom_abline(intercept=0, slope=0) + annotate("text", x=-1.6, y=0.15, label="(1, 0)") + @@ -104,7 +106,7 @@ How a tour can be used to explore high-dimensional data illustrated using (a) 2D ```{r fig-explain-1D-pdf, eval=knitr::is_latex_output()} #| echo: false -#| fig-cap: "How a tour can be used to explore high-dimensional data illustrated using (a) 2D data with two clusters and (b,c,d) 1D projections from a tour shown as a density plot. Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. When it is at `x1=-x2 (0.707, -0.707)` there is no clustering." +#| fig-cap: "How a tour can be used to explore high-dimensional data illustrated using (a) 2D data with two clusters and (b,c,d) 1D projections from a tour shown as a density plot. Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. When it is at `x1=-x2 (0.707, -0.707)` there is no clustering. 
{{< fa play-circle >}}" #| fig-width: 8 #| fig-height: 8 #| out-width: 100% @@ -228,7 +230,7 @@ How a tour can be used to explore high-dimensional data illustrated by showing a ```{r fig-explain-2D-pdf, eval=knitr::is_latex_output()} #| echo: false -#| fig-cap: "How a tour can be used to explore high-dimensional data illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b). Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object." +#| fig-cap: "How a tour can be used to explore high-dimensional data illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b). Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object. {{< fa play-circle >}}" #| fig-width: 8 #| fig-height: 8 #| out-width: 100% @@ -530,7 +532,7 @@ Two 5D datasets shown as tours of 2D projections. Can you see clusters of points ![Outliers](images/outlier-intro.png){#fig-tour-clusters width=200} -Frames from 2D tours on two 5D datasets, with clusters of points in (a) and two outliers with a plane in (b). This figure is best viewed in the HTML version of the book. +Frames from 2D tours on two 5D datasets, with clusters of points in (a) and two outliers with a plane in (b). This figure is best viewed in the HTML version of the book. {{< fa play-circle >}} ::: ::: @@ -674,13 +676,6 @@ render_gif(plane_outliers[,1:5], The movement of points gives further clues about the structure of the data in high dimensions. In the data with clustering, often we can see a group of points moving differently from the others. Because there are three clusters, you should see three distinct movement patterns.
It is similar with outliers, except these may be individual points moving alone, and different from all others. This can be seen in the static plot: one point (top left) has a movement pattern upwards whereas most of the other observations near it are moving down towards the right. ::: - - This type of visualisation is useful for many activities in dealing with high-dimensional data, including: @@ -711,13 +706,13 @@ With computer graphics, the capability of animating plots to show more than a si The methods in this book primarily emerge from @As85's grand tour method. The algorithm provided the first smooth and continuous sequence of low dimensional projections, and guaranteed that all possible low dimensional projections were likely to be shown. The algorithm was refined in @BA86b (and documented in detail in @BCAH05) to make it *efficiently* show all possible projections. Since then there have been numerous varieties of tour algorithms developed to focus on specific tasks in exploring high dimensional data, and these are documented in @tours2022. -This book is an evolution from @CS07. One of the difficulties in working on interactive and dynamic graphics research has been the rapid change in technology. Programming languages have changed a little (FORTRAN to C to java to python) but graphics toolkits and display devices have changed a lot! The tour software used in this book evolved from XGobi, which was written in C and used the X Window System, which was then rewritten in GGobi using gtk. The video library has engaging videos of these software systems There have been several other short-lived implementations, including orca [@orca], written in java, and cranvas [@cranvas], written in R with a back-end provided by wrapper functions to qt libraries. +This book is an evolution from @CS07. One of the difficulties in working on interactive and dynamic graphics research has been the rapid change in technology.
Programming languages have changed a little (FORTRAN to C to java to python) but graphics toolkits and display devices have changed a lot! The tour software used in this book evolved from XGobi, which was written in C and used the X Window System, which was then rewritten in GGobi using gtk. The video library has engaging videos of these software systems. There have been several other short-lived implementations, including orca [@orca], written in java, and cranvas [@cranvas], written in R with a back-end provided by wrapper functions to `qt` libraries. Although attempts were made with these ancestor systems to connect the data plots to a statistical analysis system, these were always limited. With the emergence of R, having graphics in the data analysis workflow has been much easier, albeit at the cost of the interactivity with graphics that matches the old systems. We are mostly using the R package `tourr` [@tourr] for examples in this book. It provides the machinery for running a tour, and has the flexibility that it can be ported, modified, and used as a regular element of data analysis. ## Exercises {-} -1. Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the `cube.solid.random` function of the `geozoo` package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations? +1. Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the `cube.solid.random()` function of the `geozoo` package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matches your expectations. 2. Use the `geozoo` package to generate samples from different shapes and use them to get a better understanding of how shapes appear in a grand tour.
You can start with exploring the conic spiral in 3D, a torus in 4D and points along the wire frame of a cube in 5D. 3. For each of the challenge data sets, `c1`, ..., `c7` from the `mulgar` package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships). @@ -733,31 +728,23 @@ cube3 <- cube.solid.random(3, 500)$points cube5 <- cube.solid.random(5, 500)$points cube10 <- cube.solid.random(10, 500)$points -animate(cube3) -animate(cube5) -animate(cube10) +animate_xy(cube3, axes="bottomleft") +animate_xy(cube5, axes="bottomleft") +animate_xy(cube10, axes="bottomleft") ``` +::: {.content-hidden when-format="pdf"} +::: {.hidden} +Answer 1. Each of the projections has a boxy shape, which gets less distinct as the dimension increases. + +As the dimension increases, the points tend to concentrate in the centre of the plot window, with a smattering of points at the edges. +::: +::: + ```{r} #| eval: false #| echo: false # Answer to Q3 -library(tourr) -library(mvtnorm) - -s1 <- diag(5) -s2 <- diag(5) -s2[3,4] <- 0.7 -s2[4,3] <- 0.7 -s3 <- s2 -s3[1,2] <- 0.7 -s3[2,1] <- 0.7 - -set.seed(1234) -d1 <- as.data.frame(rmvnorm(500, sigma = s1)) -d2 <- as.data.frame(rmvnorm(500, sigma = s2)) -d3 <- as.data.frame(rmvnorm(500, sigma = s3)) - library(mulgar) animate_xy(c1) render_gif(c1, diff --git a/1-intro_files/figure-html/fig-density-1.png b/1-intro_files/figure-html/fig-density-1.png new file mode 100644 index 0000000..cdd83e2 Binary files /dev/null and b/1-intro_files/figure-html/fig-density-1.png differ diff --git a/1-intro_files/figure-html/fig-dimension-cubes-1.png b/1-intro_files/figure-html/fig-dimension-cubes-1.png new file mode 100644 index 0000000..beed249 Binary files /dev/null and b/1-intro_files/figure-html/fig-dimension-cubes-1.png differ diff --git a/1-intro_files/figure-html/fig-example-structure-1.png b/1-intro_files/figure-html/fig-example-structure-1.png new file mode 100644 index 0000000..13c08b1 Binary files /dev/null
and b/1-intro_files/figure-html/fig-example-structure-1.png differ diff --git a/1-intro_files/figure-html/fig-explain-1D-data-1.png b/1-intro_files/figure-html/fig-explain-1D-data-1.png new file mode 100644 index 0000000..3cac76b Binary files /dev/null and b/1-intro_files/figure-html/fig-explain-1D-data-1.png differ diff --git a/1-intro_files/figure-html/fig-explain-2D-data-1.png b/1-intro_files/figure-html/fig-explain-2D-data-1.png new file mode 100644 index 0000000..da03b83 Binary files /dev/null and b/1-intro_files/figure-html/fig-explain-2D-data-1.png differ diff --git a/10-model-based.qmd b/10-model-based.qmd index 4c53018..923cfe3 100644 --- a/10-model-based.qmd +++ b/10-model-based.qmd @@ -2,13 +2,13 @@ \index{cluster analysis!model-based} -Model-based clustering @FR02 fits a multivariate normal mixture model to the data. It uses the EM algorithm to fit the parameters for the mean, variance--covariance of each population, and the mixing proportion. The variance-covariance matrix is re-parameterised using an eigen-decomposition +Model-based clustering [@FR02] fits a multivariate normal mixture model to the data. It uses the EM algorithm to fit the parameters for the mean, variance-covariance of each population, and the mixing proportion. The variance-covariance matrix is re-parameterised using an eigen-decomposition $$ \Sigma_k = \lambda_kD_kA_kD_k^\top, ~~~k=1, \dots, g ~~\mbox{(number of clusters)} $$ -\noindent resulting in several model choices, ranging from simple to complex, as shown in @tbl-covariances. +\noindent resulting in several model choices, ranging from simple to complex, as shown in `r ifelse(knitr::is_html_output(), '@tbl-covariances-html', '@tbl-covariances-pdf')`.
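The decomposition can be checked numerically with a small sketch; the values below are illustrative, not taken from a fitted model:

```r
# Construct a covariance matrix from Sigma = lambda * D %*% A %*% t(D):
# lambda sets the volume, A (diagonal) the shape, D (orthonormal) the orientation.
lambda <- 2
A <- diag(c(0.9, 0.1))                         # elongated ellipse shape
theta <- pi / 4
D <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), 2, 2)  # rotation = orientation
Sigma <- lambda * D %*% A %*% t(D)
Sigma                                          # symmetric, positive definite
```

Constraining $\lambda_k$, $D_k$ or $A_k$ to be equal or varying across clusters yields the family of models in the table.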
```{r echo=knitr::is_html_output()} #| label: mc-libraries @@ -24,15 +24,32 @@ library(colorspace) library(tourr) ``` -```{r} -#| label: tbl-covariances +::: {.content-visible when-format="html"} + +```{r eval=knitr::is_html_output()} +#| label: tbl-covariances-html #| tbl-cap: "Parameterizations of the covariance matrix." #| echo: FALSE #| message: FALSE -readr::read_csv('misc/mclust-covariances.csv') %>% +readr::read_csv('misc/mclust-covariances-html.csv') %>% knitr::kable(align = c('c', 'c', 'c', 'c', 'c', 'c')) %>% kableExtra::kable_styling(full_width = FALSE) ``` +::: + +::: {.content-visible when-format="pdf"} +```{r eval=knitr::is_latex_output()} +#| label: tbl-covariances-pdf +#| tbl-cap: "Parameterizations of the covariance matrix." +#| echo: FALSE +#| message: FALSE +readr::read_csv('misc/mclust-covariances-latex.csv') %>% + knitr::kable(align = c('c', 'c', 'c', 'c', 'c', 'c'), + format="latex", booktabs = T, + escape = FALSE) %>% + kableExtra::kable_styling(full_width = FALSE) +``` +::: \noindent Note the distribution descriptions "spherical" and "ellipsoidal". These are descriptions of the shape of the variance-covariance for a multivariate normal distribution. A standard multivariate normal distribution has a variance-covariance matrix with zeros in the off-diagonal elements, which corresponds to spherically shaped data. When the variances (diagonals) are different or the variables are correlated, then the shape of data from a multivariate normal is ellipsoidal. @@ -60,6 +77,9 @@ ggplot(penguins_sub, aes(x=bl, theme(aspect.ratio = 1) ``` +To draw ellipses in any dimension, a reasonable procedure is to sample points uniformly on a sphere, and then transform these points into an ellipsoid using the variance-covariance matrix. The `mulgar` function `mc_ellipse()` does this for each cluster in the fitted model.
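The sphere-to-ellipsoid construction can be sketched in 2D; this is a hedged illustration using a Cholesky factor, and `mc_ellipse()` may differ in its details:

```r
# Sample points on the unit circle and map them through the Cholesky
# factor of Sigma; the image has the ellipsoidal shape implied by Sigma.
Sigma <- matrix(c(1, 0.7, 0.7, 1), 2, 2)
theta <- seq(0, 2 * pi, length.out = 100)
sphere <- cbind(cos(theta), sin(theta))  # unit circle, a 2D "sphere"
ellipse <- sphere %*% chol(Sigma)        # stretched and rotated into an ellipse
# plot(ellipse, type = "l", asp = 1)     # uncomment to draw the ellipse
```

In higher dimensions the same idea applies, with points sampled uniformly on a $p$-dimensional sphere.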
+ + ```{r} #| label: fig-penguins-bl-fl-mc #| message: FALSE @@ -68,6 +88,7 @@ ggplot(penguins_sub, aes(x=bl, #| fig-height: 4 #| out-width: 100% #| fig-cap: "Summary plots from model-based clustering: (a) BIC values for clusters 2-9 of top four models, (b) variance-covariance ellipses and cluster means (+) corresponding to the best model. The best model is three-cluster EVE, which has differently shaped variance-covariances albeit the same volume and orientation." +# Fit the model, plot BIC, construct and plot ellipses penguins_BIC <- mclustBIC(penguins_sub[,c(1,3)]) ggmc <- ggmcbic(penguins_BIC, cl=2:9, top=4) + scale_color_discrete_divergingx(palette = "Roma") + @@ -96,7 +117,7 @@ ggell <- ggplot() + ggmc + ggell + plot_layout(ncol=2) ``` -@fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the `mclust` package output using the `ggmcbic` and `mc_ellipse` functions fro the `mulgar` package. +@fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the `mclust` package output using the `ggmcbic()` and `mc_ellipse()` functions from the `mulgar` package. 
## Examining the model in high dimensions diff --git a/12-summary-clust.qmd b/12-summary-clust.qmd index 6d06861..f2d5704 100644 --- a/12-summary-clust.qmd +++ b/12-summary-clust.qmd @@ -239,6 +239,41 @@ limn_tour_link( ![Highlighting the penguins where the methods disagree so we can see where these observations are located relative to the two clusters.](images/compare-clusters2.png){#fig-compare-clusters2} +Linking the confusion matrix with the tour can also be accomplished with `crosstalk` and `detourr`. + +```{r} +#| eval: false +#| echo: true +library(crosstalk) +library(plotly) +library(viridis) +p_cl_shared <- SharedData$new(penguins_cl) + +detour_plot <- detour(p_cl_shared, tour_aes( + projection = bl:bm, + colour = cl_w)) |> + tour_path(grand_tour(2), + max_bases=50, fps = 60) |> + show_scatter(alpha = 0.7, axes = FALSE, + width = "100%", height = "450px") + +conf_mat <- plot_ly(p_cl_shared, + x = ~cl_mc_j, + y = ~cl_w_j, + color = ~cl_w, + colors = viridis_pal(option = "D")(3), + height = 450) |> + highlight(on = "plotly_selected", + off = "plotly_doubleclick") %>% + add_trace(type = "scatter", + mode = "markers") + +bscols( + detour_plot, conf_mat, + widths = c(5, 6) + ) +``` + ## Exercises {-} 1. Compare the results of the four cluster model-based clustering with that of the four cluster Wards linkage clustering of the penguins data. @@ -248,7 +283,7 @@ limn_tour_link( ## Project {-} -Most of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox will be useful in comparing and contrasting different methods, understanding to what extend a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. To explore this we will use survey data that examines the risk taking behavior of tourists. 
The data was collected in Australia in 2015 [@risk-survey] and includes six types of risks (recreational, health, career, financial, safety and social) with responses on a scale from 1 (never) to 5 (very often). The data is available in `risk_MSA.rds` from the book web site. +Most of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox will be useful in comparing and contrasting different methods, understanding to what extent a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. To explore this we will use survey data that examines the risk taking behavior of tourists: this is the `risk_MSA` data; see the Appendix for details. 1. We first examine the data in a grand tour. Do you notice that each variable was measured on a discrete scale? 2. Next we explore different solutions from hierarchical clustering of the data. For comparison we will keep the number of clusters fixed to 6 and we will perform the hierarchical clustering with different combinations of distance functions (Manhattan distance and Euclidean distance) and linkage (single, complete and Ward linkage). Which combinations make sense based on what we know about the method and the data? diff --git a/13-intro-class.qmd b/13-intro-class.qmd index eb0ceaa..8026928 100644 --- a/13-intro-class.qmd +++ b/13-intro-class.qmd @@ -121,3 +121,25 @@ print(class1 + class2 + class3 + class4 + plot_layout(ncol=2)) ``` @fig-sup-example shows some 2D examples where the two classes are (a) linearly separable, (b) not completely separable but linearly different, (c) non-linearly separable and (d) not completely separable but with a non-linear difference. We can also see that in (a) only the horizontal variable would be important for the model because the two classes are completely separable in this direction.
Although the pattern in (c) shows separable classes, most models would have difficulty capturing the separation. It is for this reason that it is important to understand the boundary between classes produced by a fitted model. In each of (b), (c) and (d) it is likely that some observations would be misclassified. Identifying these cases, and inspecting where they are in the data space is important for understanding the model's future performance. + +## Exercises {-} + +1. For the penguins data, use the tour to decide if the species are separable, and if the boundaries between species are linear or non-linear. +2. Using just the variables `se`, `maxt`, `mint`, `log_dist_road`, and "accident" or "lightning" causes, use the tour to decide whether the two classes are separable, and whether the boundary might be linear or non-linear. + +```{r eval=FALSE} #| echo: false b_sub <- bushfires |> select(se, maxt, mint, log_dist_road, cause) |> filter(cause %in% c("accident", "lightning")) |> rename(ldr = log_dist_road) |> mutate(cause = factor(cause)) animate_xy(b_sub[,-5], col=b_sub$cause, rescale=TRUE) animate_xy(b_sub[,-5], guided_tour(lda_pp(b_sub$cause)), col=b_sub$cause, rescale=TRUE) ``` + +::: {.content-hidden} +Q1 answer: Gentoo and others are separable. Chinstrap and Adelie are not separable. All boundaries are linear. + +Q2 answer: Not separable, but boundary could be linear. +::: diff --git a/16-svm.qmd b/16-svm.qmd index 8c0d478..53fbfea 100644 --- a/16-svm.qmd +++ b/16-svm.qmd @@ -31,7 +31,7 @@ Non-linear SVM models are interesting to examine also. Mostly one would examine To illustrate the approach, we use two simple simulated data examples. Both have only two variables, and two classes. Explaining SVM is easier when there are just two groups. In the first data set the two classes have different covariance matrices, which will cause trouble for LDA, but SVM should see the gap between the two clusters and place the separating hyperplane in the middle of the gap.
In the second data set the two groups are concentric circles, with the inner one solid. A non-linear SVM should be fitted to this data, which should see the circular gap between the two classes. -Note that the `svm` function in the `e1071` package will automatically scale observations into the range $[0,1]$. To make it easier to examine the fitted model, it is best to scale your data first, and then fit the model. +Note that the `svm()` function in the `e1071` package will automatically scale observations into the range $[0,1]$. To make it easier to examine the fitted model, it is best to scale your data first, and then fit the model. ```{r echo=knitr::is_html_output()} #| code-summary: "Code to simulate data examples" @@ -164,7 +164,7 @@ s1 + geom_abline(intercept=df1_svm$rho/w[2], slope=-w[1]/w[2]) ``` -**Note that** care in scaling of data is important to get the intercept calculated exactly. We have standardised the data, and set the `scale=FALSE` parameter in the `svm` function. The slope calculation is quite robust to the data scaling. +**Note that** care in scaling of data is important to get the intercept calculated exactly. We have standardised the data, and set the `scale=FALSE` parameter in the `svm()` function. The slope calculation is quite robust to the data scaling. ::: {.content-visible when-format="html"} ::: info diff --git a/17-nn.qmd b/17-nn.qmd index d6b4c4a..1b2403b 100644 --- a/17-nn.qmd +++ b/17-nn.qmd @@ -1,29 +1,29 @@ # Neural networks and deep learning + \index{classification!neural networks} -Neural networks (NN) can be considered to be nested additive (or even ensemble) models where explanatory variables are combined, and transformed through an activation function like a logistic. These transformed combinations are added recursively to yield class predictions. They are considered to be black box models, but there is a growing demand for interpretability.
Although interpretability is possible, it can be unappealing to understand a complex model constructed to tackle a difficult classification task. Nevertheless, this is the motivation for the explanation of visualisation for NN models in this chapter. +Neural networks (NN) can be considered to be nested additive (or even ensemble) models where explanatory variables are combined, and transformed through an activation function like a logistic. These transformed combinations are added recursively to yield class predictions. They are considered to be black box models, but there is a growing demand for interpretability. Although interpretability is possible, it can be unappealing to understand a complex model constructed to tackle a difficult classification task. Nevertheless, this is the motivation for the explanation of visualisation for NN models in this chapter. In the simplest form, we might write the equation for a NN as $$ \hat{y} = f(x) = a_0+\sum_{h=1}^{s} w_{0h}\phi(a_h+\sum_{i=1}^{p} w_{ih}x_i) -$$ -where $s$ indicates the number of nodes in the hidden (middle layer), and $\phi$ is a choice of activation function. In a simple situation where $p=3$, $s=2$, and linear output layer, the model could be written as: +$$ where $s$ indicates the number of nodes in the hidden (middle layer), and $\phi$ is a choice of activation function. In a simple situation where $p=3$, $s=2$, and linear output layer, the model could be written as: $$ \begin{aligned} \hat{y} = a_0+ & w_{01}\phi(a_1+w_{11}x_1+w_{21}x_2+w_{31}x_3) +\\ & w_{02}\phi(a_2+w_{12}x_1+w_{22}x_2+w_{32}x_3) \end{aligned} -$$ -which is a combination of two (linear) models, each of which could be examined for their role in making predictions. +$$ which is a combination of two (linear) models, each of which could be examined for their role in making predictions. -In practice, a model may have many nodes, and several hidden layers, a variety of activation functions, and regularisation modifications. 
One should keep in mind the principle of parsimony is important when applying NNs, because it is tempting to make an overly complex, and thus over-parameterised, construction. Fitting NNs is still problematic. One would hope that fitting produces a stable result, whatever the starting seed the same parameter estimates are returned. However, this is not the case, and different, sometimes radically different, results are routinely obtained after each attempted fit [@wickham2015]. +In practice, a model may have many nodes, and several hidden layers, a variety of activation functions, and regularisation modifications. One should keep in mind that the principle of parsimony is important when applying NNs, because it is tempting to make an overly complex, and thus over-parameterised, construction. Fitting NNs is still problematic. One would hope that fitting produces a stable result: whatever the starting seed, the same parameter estimates are returned. However, this is not the case, and different, sometimes radically different, results are routinely obtained after each attempted fit [@wickham2015]. -For these examples we use the software `keras` [@keras] following the installation and tutorial details at [https://tensorflow.rstudio.com/tutorials/](https://tensorflow.rstudio.com/tutorials/). Because it is an interface to python it can be tricky to install. If this is a problem, the example code should be possible to convert to use `nnet` [@VR02] or `neuralnet` [@neuralnet]. We will use the penguins data to illustrate the fitting, because it makes it easier to understand the procedures and the fit. However, a NN is like using a jackhammer instead of a trowel to plant a seedling, more complicated than necessary to build a good classification model for this data. +For these examples we use the software `keras` [@keras] following the installation and tutorial details at <https://tensorflow.rstudio.com/tutorials/>. Because it is an interface to python it can be tricky to install.
If this is a problem, it should be possible to convert the example code to use `nnet` [@VR02] or `neuralnet` [@neuralnet]. We will use the penguins data to illustrate the fitting, because it makes it easier to understand the procedures and the fit. However, a NN is like using a jackhammer instead of a trowel to plant a seedling, more complicated than necessary to build a good classification model for this data. + +## Setting up the model -## Setting up the model \index{classification!ANN architecture} A first step is to decide how many nodes the NN architecture should have, and what activation function should be used. To make these decisions, ideally you already have some knowledge of the shapes of class clusters. For the penguins classification, we have seen that it contains three elliptically shaped clusters of roughly the same size. This suggests two nodes in the hidden layer would be sufficient to separate three clusters (@fig-nn-diagram). Because the shapes of the clusters are convex, using linear activation ("relu") will also be sufficient. The model specification is as follows: @@ -51,9 +51,9 @@ p_nn_model %>% compile( ) ``` -Note that `tensorflow::set_random_seed(211)` sets the seed for the model fitting so that we can obtain the same result to discuss later. It needs to be set before the model is defined in the code. The model will also be saved in order to diagnose and make predictions. +Note that `tensorflow::set_random_seed(211)` sets the seed for the model fitting so that we can obtain the same result to discuss later. It needs to be set before the model is defined in the code. The model will also be saved in order to diagnose and make predictions. -![Network architecture for the model on the penguins data. The round nodes indicate original or transformed variables, and each arrow connecting these is represented as one of the weights $w_{ih}$ in the definition.
The boxes indicate the additive constant entering the nodes, and the corresponding arrows represent the terms $a_h$. ](images/nn-diagram.png){#fig-nn-diagram align="center"} +![Network architecture for the model on the penguins data. The round nodes indicate original or transformed variables, and each arrow connecting these is represented as one of the weights $w_{ih}$ in the definition. The boxes indicate the additive constant entering the nodes, and the corresponding arrows represent the terms $a_h$.](images/nn-diagram.png){#fig-nn-diagram align="center"} ```{r eval=FALSE} #| echo: false @@ -93,9 +93,10 @@ p_nn_wgts <- keras::get_weights(p_nn_model, trainable=TRUE) ``` ## Checking the training/test split + \index{classification!training/test split} -Splitting the data into training and test is an essential way to protect against overfitting, for most classifiers, but especially so for the copiously parameterised NNs. The model specified for the penguins data with only two nodes is unlikely to be overfitted, but it is nevertheless good practice to use a training set for building and a test set for evaluation. +Splitting the data into training and test is an essential way to protect against overfitting, for most classifiers, but especially so for the copiously parameterised NNs. The model specified for the penguins data with only two nodes is unlikely to be overfitted, but it is nevertheless good practice to use a training set for building and a test set for evaluation. `r ifelse(knitr::is_html_output(), '@fig-p-split-html', '@fig-p-split-pdf')` shows the tour being used to examine the split into training and test samples for the penguins data. Using random sampling, particularly stratified by group, should result in the two sets being very similar, as can be seen here. It does happen that several observations in the test set are on the extremes of their class cluster, so it could be that the model makes errors in the neighbourhoods of these points.
@@ -107,6 +108,8 @@ library(dplyr) library(tidyr) library(rsample) library(ggbeeswarm) +library(tidymodels) +library(tourr) load("data/penguins_sub.rda") # from mulgar book @@ -130,16 +133,19 @@ p_split_check <- bind_rows( #| code-summary: "Code to run tours" animate_xy(p_split_check[,1:4], col=p_split_check$species, - pch=p_split_check$type) + pch=p_split_check$type, + shapeset=c(16,1)) animate_xy(p_split_check[,1:4], guided_tour(lda_pp(p_split_check$species)), col=p_split_check$species, - pch=p_split_check$type) + pch=p_split_check$type, + shapeset=c(16,1)) render_gif(p_split_check[,1:4], grand_tour(), display_xy( col=p_split_check$species, - pch=p_split_check$type, + pch=p_split_check$type, + shapeset=c(16,1), cex=1.5, axes="bottomleft"), gif_file="gifs/p_split.gif", @@ -150,7 +156,8 @@ render_gif(p_split_check[,1:4], guided_tour(lda_pp(p_split_check$species)), display_xy( col=p_split_check$species, - pch=p_split_check$type, + pch=p_split_check$type, + shapeset=c(16,1), cex=1.5, axes="bottomleft"), gif_file="gifs/p_split_guided.gif", @@ -160,31 +167,30 @@ render_gif(p_split_check[,1:4], ``` ::: {.content-visible when-format="html"} -::: {#fig-p-split-html layout-ncol=2} - -![Grand tour](gifs/p_split.gif){#fig-split-grand fig-alt="FIX ME" width=300} +::: {#fig-p-split-html layout-ncol="2"} +![Grand tour](gifs/p_split.gif){#fig-split-grand fig-alt="FIX ME" width="300"} -![Guided tour](gifs/p_split_guided.gif){#fig-split-guided fig-alt="FIX ME" width=300} +![Guided tour](gifs/p_split_guided.gif){#fig-split-guided fig-alt="FIX ME" width="300"} Evaluating the training/test split, where we expect that the two samples should roughly match. There are a few observations in the test set that are on the outer edges of the clusters, which will likely result in the model making an error in these regions. However, the two samples roughly match. 
::: ::: ::: {.content-visible when-format="pdf"} -::: {#fig-p-split-pdf layout-ncol=2} - -![Grand tour](images/p_split.png){#fig-split-grand fig-alt="FIX ME" width=220} +::: {#fig-p-split-pdf layout-ncol="2"} +![Grand tour](images/p_split.png){fig-alt="FIX ME" width="220"} -![Guided tour](images/p_split_guided.png){#fig-split-guided fig-alt="FIX ME" width=220} +![Guided tour](images/p_split_guided.png){fig-alt="FIX ME" width="220"} Evaluating the training/test split, where we expect that the two samples should roughly match. There are a few observations in the test set that are on the outer edges of the clusters, which will likely result in the model making an error in these regions. However, the two samples roughly match. ::: ::: ## Fit the model + \index{classification!Fitting a NN} -The data needs to be specially formatted for the model fitted using `keras`. The explanatory variables need to be provided as a `matrix`, and the categorical response needs to be separate, and specified as a `numeric` variable, beginning with 0. +The data needs to be specially formatted for the model fitted using `keras`. The explanatory variables need to be provided as a `matrix`, and the categorical response needs to be separate, and specified as a `numeric` variable, beginning with 0. ```{r} # Data needs to be matrix, and response needs to be numeric @@ -200,7 +206,7 @@ p_test_y <- p_test %>% pull(species) %>% as.numeric() p_test_y <- p_test_y-1 # Needs to be 0, 1, 2 ``` -The specified model is reasonably simple, four input variables, two nodes in the hidden layer and a three column binary matrix for output. This corresponds to 5+5+3+3+3=19 parameters. +The specified model is reasonably simple: four input variables, two nodes in the hidden layer and a three-column binary matrix for output. This corresponds to 5+5+3+3+3=19 parameters.
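To make the count explicit, the arithmetic can be restated as a small sketch (nothing here comes from the fitted model; it only encodes the 4-2-3 architecture, where each node estimates one weight per input plus a bias):

```r
# Parameters of a dense network with one hidden layer: each node
# takes one weight per input plus one bias (the additive constant).
n_input  <- 4  # bl, bd, fl, bm
n_hidden <- 2
n_output <- 3  # one column per species

hidden_params <- n_hidden * (n_input + 1)   # 5 + 5
output_params <- n_output * (n_hidden + 1)  # 3 + 3 + 3
hidden_params + output_params               # 19
```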
```{r echo=FALSE} #| message: false @@ -209,7 +215,6 @@ p_nn_model <- load_model_tf("data/penguins_cnn") p_nn_model ``` - ```{r eval=FALSE} #| message: false # Fit model @@ -229,17 +234,19 @@ plot(p_nn_fit) keras::get_weights(p_nn_model, trainable=TRUE) ``` -Because we set the random number seed we will get the same fit each time the code provided here is run. However, if the model is re-fit without setting the seed, you will see that there is a surprising amount of variability in the fits. Setting `epochs = 200` helps to usually get a good fit. One expects that `keras` is reasonably stable so one would not expect the huge array of fits as observed in @wickham2015 using `nnet`. That this can happen with the simple model used here reinforces the notion that fitting of NN models is fiddly, and great care needs to be taken to validate and diagnose the fit. +Because we set the random number seed we will get the same fit each time the code provided here is run. However, if the model is re-fit without setting the seed, you will see that there is a surprising amount of variability in the fits. Setting `epochs = 200` usually helps to get a good fit. One expects that `keras` is reasonably stable so one would not expect the huge array of fits as observed in @wickham2015 using `nnet`. That this can happen with the simple model used here reinforces the notion that fitting of NN models is fiddly, and great care needs to be taken to validate and diagnose the fit. ::: {.content-visible when-format="html"} ::: info -Fitting NN models is fiddly, and very different fitted models can result from restarts, parameter choices, and architecture. +Fitting NN models is fiddly, and very different fitted models can result from restarts, parameter choices, and architecture. ::: ::: ::: {.content-visible when-format="pdf"} +```{=tex} \infobox{Fitting NN models is fiddly, and very different fitted models can result from restarts, parameter choices, and architecture.
} +``` ::: ```{r echo=knitr::is_html_output()} @@ -252,7 +259,7 @@ library(colorspace) p_nn_model <- load_model_tf("data/penguins_cnn") ``` -The fitted model that we have chosen as the final one has reasonably small loss and high accuracy. Plots of loss and accuracy across epochs showing the change during fitting can be plotted, but we don't show them here, because they are generally not very interesting. +The fitted model that we have chosen as the final one has reasonably small loss and high accuracy. Plots of loss and accuracy across epochs showing the change during fitting can be plotted, but we don't show them here, because they are generally not very interesting. ```{r} p_nn_model %>% evaluate(p_test_x, p_test_y, verbose = 0) @@ -265,6 +272,7 @@ save_model_tf(p_nn_model, "data/penguins_cnn") ``` ## Extracting model components + \index{classification!hidden layers} ::: {.content-visible when-format="html"} @@ -274,8 +282,10 @@ View the individual node models to understand how they combine to produce the ov ::: ::: {.content-visible when-format="pdf"} +```{=tex} \infobox{View the individual node models to understand how they combine to produce the overall model. } +``` ::: Because nodes in the hidden layers of NNs are themselves (relatively simple regression) models, it can be interesting to examine these to understand how the model is making it's predictions. Although it's rarely easy, most software will allow the coefficients for the models at these nodes to be extracted. With the penguins NN model there are two nodes, so we can extract the coefficients and plot the resulting two linear combinations to examine the separation between classes. @@ -333,14 +343,15 @@ ggplot(p_all_m, aes(x=nn1, y=nn2, @fig-hidden-layer shows the data projected into the plane determined by the two linear combinations of the two nodes in the hidden layer. Training and test sets are indicated by empty and solid circles. 
The three species are clearly different but there is some overlap or confusion for a few penguins. The most interesting aspect to learn is that there is no big gap between the Gentoo and other species, which we know exists in the data. The model has not found this gap, and thus is likely to unfortunately and erroneously confuse some Gentoo penguins, particularly with Adelie. -What we have shown here is a process to use the models at the nodes of the hidden layer to produce a reduced dimensional space where the classes are best separated, at least as determined by the model. The process will work in higher dimensions also. +What we have shown here is a process to use the models at the nodes of the hidden layer to produce a reduced dimensional space where the classes are best separated, at least as determined by the model. The process will work in higher dimensions also. When there are more nodes in the hidden layer than the number of original variables it means that the space is extended to achieve useful classifications that need more complicated non-linear boundaries. The extra nodes describe the non-linearity. @wickham2015 provides a good illustration of this in 2D. The process of examining each of the node models can be useful for understanding this non-linear separation, also in high dimensions. ## Examining predictive probabilities + \index{classification!predictive probabilities} -When the predictive probabilities are returned by a model, as is done by this NN, we can use a ternary diagram for three class problems, or high-dimensional simplex when there are more classes to examine the strength of the classification. This done in the same way that was used for the votes matrix from a random forest in @sec-votes. +When the predictive probabilities are returned by a model, as is done by this NN, we can use a ternary diagram for three class problems, or high-dimensional simplex when there are more classes to examine the strength of the classification. 
This is done in the same way that was used for the votes matrix from a random forest in @sec-votes. ```{r} # Predict training and test set @@ -365,6 +376,10 @@ p_test_pred_cat <- factor( table(p_test$species, p_test_pred_cat) ``` +```{r echo=FALSE, eval=FALSE} +# predict() causes the problem, use p_nn_model(p_test_x) instead +``` + ```{r echo=knitr::is_html_output()} #| code-fold: true # Set up the data to make the ternary diagram @@ -421,7 +436,6 @@ ggplot() + theme(aspect.ratio=1, legend.position = "right") ``` - ::: {.content-visible when-format="html"} ::: info If the training and test sets look similar when plotted in the model space then the model is not suffering from over-fitting. @@ -429,15 +443,17 @@ If the training and test sets look similar when plotted in the model space then ::: ::: {.content-visible when-format="pdf"} +```{=tex} \infobox{If the training and test sets look similar when plotted in the model space then the model is not suffering from over-fitting. } +``` ::: ## Local explanations -\index{classification!local explanations} -\index{classification!XAI} -It especially important to be able to interpret or explain a model, even more so when the model is complex or black-box'y. A good resource for learning about the range of methods is @iml. Local explanations provide some information about variables that are important for making the prediction for a particular observation. The method that we use here is Shapley value, as computed using the `kernelshap` package [@kernelshap]. +\index{classification!local explanations} \index{classification!XAI} + +It is especially important to be able to interpret or explain a model, even more so when the model is complex or black-box'y. A good resource for learning about the range of methods is @iml. Local explanations provide some information about variables that are important for making the prediction for a particular observation.
The method that we use here is Shapley value, as computed using the `kernelshap` package [@kernelshap]. ```{r eval=FALSE} # Explanations @@ -454,11 +470,11 @@ p_exp_sv <- shapviz(p_explain) save(p_exp_sv, file="data/p_exp_sv.rda") ``` -A Shapley value for an observation indicates how the variable contributes to the model prediction for that observation, relative to other variables. It is an average, computed from the change in prediction when all combinations of presence or absence of other variables. In the computation, for each combination, the prediction is computed by substituting absent variables with their average value, like one might do when imputing missing values. +A Shapley value for an observation indicates how the variable contributes to the model prediction for that observation, relative to other variables. It is an average, computed from the change in prediction across all combinations of presence or absence of the other variables. In the computation, for each combination, the prediction is computed by substituting absent variables with their average value, like one might do when imputing missing values. -@fig-shapley-pcp shows the Shapley values for Gentoo observations (both training and test sets) in the penguins data, as a parallel coordinate plot. The values for the single misclassified Gentoo penguin (in the training set) is coloured orange. Overall, the Shapley values don't vary much on `bl`, `bd` and `fl` but they do on `bm`. The effect of other variables is seems to be only important for `bm`. +@fig-shapley-pcp shows the Shapley values for Gentoo observations (both training and test sets) in the penguins data, as a parallel coordinate plot. The values for the single misclassified Gentoo penguin (in the training set) are coloured orange. Overall, the Shapley values don't vary much on `bl`, `bd` and `fl` but they do on `bm`. The effect of other variables seems to be only important for `bm`.
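The computation described above can be made concrete with a small from-scratch sketch (a toy illustration only, not the `kernelshap` algorithm): enumerate every subset of the other variables, substitute the "absent" ones by their column means, and average the weighted change in prediction.

```r
# Toy exact Shapley values: average the change in prediction over all
# subsets of the other variables, with absent variables replaced by
# their column means. (p is small here; combn() would need special
# handling for p = 2.)
shapley <- function(f, x, X) {
  p <- length(x)
  xbar <- colMeans(X)
  phi <- numeric(p)
  for (j in seq_len(p)) {
    others <- setdiff(seq_len(p), j)
    subsets <- list(integer(0))
    for (k in seq_along(others))
      subsets <- c(subsets, combn(others, k, simplify = FALSE))
    for (S in subsets) {
      w <- factorial(length(S)) * factorial(p - length(S) - 1) /
        factorial(p)
      z_without <- xbar
      z_without[S] <- x[S]
      z_with <- z_without
      z_with[j] <- x[j]
      phi[j] <- phi[j] + w * (f(z_with) - f(z_without))
    }
  }
  phi
}

# For a linear model the Shapley value of variable j reduces to
# b_j * (x_j - mean(X_j)), which gives us an exact check.
set.seed(614)
X <- matrix(rnorm(400), ncol = 4)
b <- c(1, -2, 0.5, 0)
f <- function(z) sum(b * z)
shapley(f, X[1, ], X)
b * (X[1, ] - colMeans(X))  # should match
```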
-For the misclassified penguin, the effect of `bm` for all combinations of other variables leads to a decline in predicted value, thus less confidence in it being a Gentoo. In contrast, for this same penguin when considering the effect of `bl` the predicted value increases on average. +For the misclassified penguin, the effect of `bm` for all combinations of other variables leads to a decline in predicted value, thus less confidence in it being a Gentoo. In contrast, for this same penguin when considering the effect of `bl` the predicted value increases on average. ```{r echo=knitr::is_html_output()} #| code-fold: true @@ -520,7 +536,6 @@ p_exp_gentoo %>% If we examine the data [@fig-penguins-bl-bm-bd] the explanation makes some sense. The misclassified penguin has an unusually small value on `bm`. That the SHAP value for `bm` was quite different pointed to this being a potential issue with the model, particularly for this penguin. This penguin's prediction is negatively impacted by `bm` being in the model. - ```{r echo=knitr::is_html_output()} #| label: fig-penguins-bl-bm-bd #| code-fold: true @@ -588,7 +603,8 @@ sp1 + sp2 + plot_layout(ncol=2, guides = "collect") & ## Examining boundaries -@fig-penguins-lda-nn shows the boundaries for this NN model along with those of the LDA model. + +@fig-penguins-lda-nn shows the boundaries for this NN model along with those of the LDA model. 
```{r echo=knitr::is_html_output(), eval=FALSE} #| label: fig-penguins-nn-boundaries @@ -630,34 +646,29 @@ ggplot(p_grid_proj, aes(x=nn1, y=nn2, legend.title = element_blank()) ``` - ::: {.content-visible when-format="html"} +::: {#fig-penguins-lda-nn-html layout-ncol="2"} +![LDA model](gifs/penguins_lda_boundaries.gif){#fig-lda-boundary fig-alt="FIX ME" width="300"} -::: {#fig-penguins-lda-nn-html layout-ncol=2} - -![LDA model](gifs/penguins_lda_boundaries.gif){#fig-lda-boundary fig-alt="FIX ME" width=300} +![NN model](gifs/penguins_nn_boundaries.gif){#fig-tree-boundary fig-alt="FIX ME" width="300"} -![NN model](gifs/penguins_nn_boundaries.gif){#fig-tree-boundary fig-alt="FIX ME" width=300} - -Comparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour. +Comparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour. ::: ::: -::: {#fig-penguins-lda-nn layout-ncol=2} - -![LDA model](images/fig-lda-2D-boundaries-1.png){#fig-lda-boundary2 fig-alt="FIX ME" width=200} +::: {#fig-penguins-lda-nn layout-ncol="2"} +![LDA model](images/fig-lda-2D-boundaries-1.png){#fig-lda-boundary2 fig-alt="FIX ME" width="200"} -![NN model](images/penguins-nn-boundaries-1.png){#fig-nn-boundary fig-alt="FIX ME" width=290} +![NN model](images/penguins-nn-boundaries-1.png){#fig-nn-boundary fig-alt="FIX ME" width="290"} -Comparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour. +Comparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour. ::: -\index{tour!slice} +\index{tour!slice} ## Application to a large dataset -To see how these methods apply in the setting where we have a large number of variables, observations and classes we will look at a neural network that predicts the category for the fashion MNIST data. 
The code for designing and fitting the model is following the tutorial available from https://tensorflow.rstudio.com/tutorials/keras/classification and you can find additional information there. Below we only replicate the steps needed to build the model from scratch. We also note that a similar investigation was presented in @li2020visualizing, with a focus on investigating the model at different epochs during the training. -\index{data!fashion MNIST} +To see how these methods apply in the setting where we have a large number of variables, observations and classes, we will look at a neural network that predicts the category for the fashion MNIST data. The code for designing and fitting the model follows the tutorial available from https://tensorflow.rstudio.com/tutorials/keras/classification and you can find additional information there. Below we only replicate the steps needed to build the model from scratch. We also note that a similar investigation was presented in @li2020visualizing, with a focus on investigating the model at different epochs during the training. \index{data!fashion MNIST} The first step is to download and prepare the data. Here we scale the observations to range between zero and one, and we define the label names. @@ -779,7 +790,6 @@ cowplot::plot_grid(cowplot::plot_grid(p1, p2), legend_labels, Looking only at the first two principal components we note some clear differences from the transformation in the hidden layer. The observations seem to be more evenly spread in the input space, while in the activations space we notice grouping along specific directions. In particular the category "Bag" appears to be most different from all other classes, and the non-linear transformation in the activations space shows that they are clearly different from the shoe categories, while in the input space we could note some overlap in the linear projection.
To better identify differences between other groups we will use the tour on the first five principal components. - ```{r echo=knitr::is_html_output(), eval=FALSE} #| code-fold: true #| code-summary: "Code to run tours" @@ -813,23 +823,20 @@ render_gif(activations_pc[,1:5], ``` ::: {.content-visible when-format="html"} -::: {#fig-fashion-gt-html layout-ncol=2} - -![Input space](gifs/fashion_images_gt.gif){#fig-fashion-input fig-alt="FIX ME" width=200} +::: {#fig-fashion-gt-html layout-ncol="2"} +![Input space](gifs/fashion_images_gt.gif){#fig-fashion-input fig-alt="FIX ME" width="200"} -![Activations](gifs/fashion_activations_gt.gif){#fig-fashion-activation fig-alt="FIX ME" width=200} +![Activations](gifs/fashion_activations_gt.gif){#fig-fashion-activation fig-alt="FIX ME" width="200"} Comparison of the test observations in the first five principal components of the input space (left) and in the hidden layer activations (right). The activation function results in more clearly defined grouping of the different classes. ::: ::: - ::: {.content-visible when-format="pdf"} -::: {#fig-p-split-pdf layout-ncol=2} +::: {layout-ncol="2"} +![Input space](images/fashion_images_gt_36.png){fig-alt="FIX ME" width="200"} -![Input space](images/fashion_images_gt_36.png){#fig-fashion-input fig-alt="FIX ME" width=200} - -![Activations](images/fashion_activation_gt_126.png){#fig-fashion-activation fig-alt="FIX ME" width=200} +![Activations](images/fashion_activation_gt_126.png){fig-alt="FIX ME" width="200"} Comparison of the test observations in the first five principal components of the input space (left) and in the hidden layer activations (right). The activation function results in more clearly defined grouping of the different classes. ::: @@ -853,6 +860,7 @@ observed <- as.numeric(test_tags) -1 table(observed, predicted) ``` +Here the labels are used as 0 - T-shirt/top, 1 - Trouser, 2 - Pullover, 3 - Dress, 4 - Coat, 5 - Sandal, 6 - Shirt, 7 - Sneaker, 8 - Bag, 9 - Ankle boot. 
From this we see that the model mainly confuses certain categories with each other, and within expected groups (e.g. different types of shoes can be confused with each other, or different types of shirts). We can further investigate this by visualizing the full probability matrix for the test observations, to see which categories the model is uncertain about. ```{r echo=knitr::is_html_output(), eval=FALSE} @@ -862,7 +870,7 @@ From this we see that the model mainly confuses certain categories with each oth fashion_test_pred <- predict(model_fashion_mnist, test_images, verbose = 0) -# copying this from RF fake tree vote matrix +# this is the same code as was used in the RF chapter proj <- t(geozoo::f_helmert(10)[-1,]) f_nn_v_p <- as.matrix(fashion_test_pred) %*% proj colnames(f_nn_v_p) <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9") @@ -880,23 +888,64 @@ f_nn_v_p_s <- bind_rows(sp, f_nn_v_p) %>% "Pullover", "Shirt", "Coat"), class, "Other")) %>% - mutate(class = factor(class, levels=c("Other", - "T-shirt/top", + mutate(class = factor(class, levels=c("T-shirt/top", "Pullover", "Shirt", - "Coat"))) -# nicely shows confusion between certain classes is common + "Coat", + "Other"))) + animate_xy(f_nn_v_p_s[,1:9], col = f_nn_v_p_s$class, - axes = "off", pch = ".", + axes = "off", cex=0.2, edges = as.matrix(simp$edges), edges.width = 0.05, - palette = "Lajolla") + palette = "Viridis") +render_gif(f_nn_v_p_s[,1:9], + grand_tour(), + display_xy( + col=f_nn_v_p_s$class, + cex=0.2, + palette = "Viridis", + axes="off", + edges = as.matrix(simp$edges), + edges.width = 0.05), + gif_file="gifs/fashion_confusion_gt.gif", + frames=500, + loop=FALSE +) ``` -For this data using explainers like SHAP is not so interesting, since the individual pixel contribution to a prediction are typically not of interest. 
With image classification a next step might be to further investigate which part of the image is important for a prediction, and this can be visualized as a heat map placed over the original image. This is especially interesting in the case of difficult or misclassified images. This however is beyond the scope of this book. +::: {.content-visible when-format="html"} +::: {#fig-fashion-conf-gt-html} +![Input space](gifs/fashion_confusion_gt.gif){#fig-fashion-confusion fig-alt="FIX ME" width="400"} + +A tour of the confusion matrix for the fashion MNIST test observations, focusing on a subset of items. Often observations get confused between two of the classes; this appears as points falling along one of the edges, for example some Shirts look more like T-shirts/tops, while others get confused with Coats. We can also notice that a subset of three other classes not mapped to colors is very separate from this group. +::: +::: + +::: {.content-visible when-format="pdf"} +::: {#fig-fashion-confusion-split-pdf layout-ncol="3"} +![](images/fashion_confustion_gt_36.png){fig-alt="FIX ME" width="130"} + +![](images/fashion_confusion_gt_58.png){fig-alt="FIX ME" width="130"} + +![](images/fashion_confusion_gt_69.png){fig-alt="FIX ME" width="130"} + +![](images/fashion_confusion_gt_161.png){fig-alt="FIX ME" width="130"} + +![](images/fashion_confusion_gt_212.png){fig-alt="FIX ME" width="130"} + +![](images/fashion_confusion_gt_333.png){fig-alt="FIX ME" width="130"} + +A tour of the confusion matrix for the fashion MNIST test observations, focusing on a subset of items. Often observations get confused between two of the classes; this appears as points falling along one of the edges, for example some Shirts look more like T-shirts/tops, while others get confused with Coats. We can also notice that a subset of three other classes not mapped to colors is very separate from this group.
+::: +::: +The tour of the class probabilities shows that the model is often confused between two classes; this appears as points falling along one edge in the simplex. In particular for the highlighted categories we can notice some interesting patterns, where pairs of classes get confused with each other. We also see some three-way confusions; these are observations that fall on one surface triangle defined via three corners of the simplex, for example between Pullover, Shirt and Coat. +For this data using explainers like SHAP is not so interesting, since the individual pixel contributions to a prediction are typically not of interest. With image classification a next step might be to further investigate which part of the image is important for a prediction, and this can be visualized as a heat map placed over the original image. This is especially interesting in the case of difficult or misclassified images. This however is beyond the scope of this book. + +```{=html} - -## Exercises {-} - -1. The problem with the NN model fitted to the penguins is that the Gentoo are poorly classified, when they should be perfectly predictable due to the big gap between class clusters. Re-fit the NN to the penguins data, to find a better model that appropriately perfectly predicts Gentoo penguins. Support this by plotting the model (using the hidden layer), and the predictive probabilities as a ternary plot. Do the SHAP values also support that `bd` plays a stronger role in your best model? (`bd` is the main variable for distinguishing Gentoo's from the other species, particularly when used with `fl` or `bl`.) -2. For the fashion MNIST data we have seen that certain categories are more likely to be confused with each other. Select a subset of the data including only the categories Ankle boot, Sneaker and Sandal and see if you can reproduce the analysis of the penguins data in this chapter with this subset. -3.
XXX fake trees, can we think about the number of nodes and make it work with a simple NN similar to penguins data? -4. The sketches data could also be considered a classic image classification problem, and we have seen that we can get a reasonable accuracy with a random forest model. Because we only have a smaller number of observations (compared to the fashion MNIST data) when fitting a neural network we need to be very careful not to overfit the training data. Try fitting a flat neural network (similar to what we did for the fashion MNIST data) and check the test accuracy of the model. -5. Challenge: try to design a more accurate neural network for the sketches data. Here you can investigate using a convolutional neural network in combination with data augmentation. In addition, using batch normalization should improve the model performance. - -```{r} -#| eval: false -#| echo: false -library(mulgar) -library(keras) -library(liminal) - - -tree_branch <- keras::to_categorical(fake_trees$branches, num_classes = 10) -tree_input <- scale(fake_trees[,1:100]) - -ft_nn <- keras_model_sequential() -ft_nn %>% - layer_dense(units = 128, activation = 'relu', input_shape = c(100)) %>% - layer_dropout(0.2) %>% - layer_dense(units = 64, activation = 'relu') %>% - layer_dropout(0.2) %>% - layer_dense(units = 10, activation = 'softmax') -ft_nn %>% compile( - loss = 'categorical_crossentropy', - optimizer = optimizer_adam(learning_rate = 5*10E-7), - metrics = c('accuracy') -) -history_ft <- ft_nn %>% fit( - tree_input, tree_branch, - epochs = 100, batch_size = 128, - validation_split = 0.2 -) - - ``` +## Exercises {.unnumbered} + +1. The problem with the NN model fitted to the penguins is that the Gentoo are poorly classified, when they should be perfectly predictable due to the big gap between class clusters. Re-fit the NN to the penguins data, to find a better model that appropriately perfectly predicts Gentoo penguins. 
Support this by plotting the model (using the hidden layer), and the predictive probabilities as a ternary plot. Do the SHAP values also support that `bd` plays a stronger role in your best model? (`bd` is the main variable for distinguishing Gentoos from the other species, particularly when used with `fl` or `bl`.) +2. For the fashion MNIST data we have seen that certain categories are more likely to be confused with each other. Select a subset of the data including only the categories Ankle boot, Sneaker and Sandal and see if you can reproduce the analysis of the penguins data in this chapter with this subset. +3. Can you fit a neural network that can predict the class in the fake trees data? Because the data is noisy and we do not have that many observations, it can be easy to overfit the data. Once you find a setting that works, think about what aspects of the model might be interesting for visualization. What comparisons with a random forest model could be of interest? +4. The sketches data could also be considered a classic image classification problem, and we have seen that we can get a reasonable accuracy with a random forest model. Because we only have a smaller number of observations (compared to the fashion MNIST data) when fitting a neural network we need to be very careful not to overfit the training data. Try fitting a flat neural network (similar to what we did for the fashion MNIST data) and check the test accuracy of the model. +5. Challenge: try to design a more accurate neural network for the sketches data. Here you can investigate using a convolutional neural network in combination with data augmentation. In addition, using batch normalization should improve the model performance.
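For exercise 4, a flat (fully connected) network can be sketched with `keras` along the lines of the fashion MNIST model. Here `n_px` and `n_class` are placeholders for the number of pixel columns and categories in the sketches data, and the layer sizes are arbitrary starting points, not tuned values:

```r
library(keras)

# A flat network sketch: one hidden layer on flattened pixel input.
# n_px and n_class are placeholders to be adjusted to the data.
n_px    <- 784  # e.g. 28 x 28 images, flattened
n_class <- 6    # number of sketch categories

model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu",
              input_shape = c(n_px)) %>%
  layer_dropout(0.2) %>%  # some regularisation against overfitting
  layer_dense(units = n_class, activation = "softmax")

model %>% compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)
```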
```{r} #| eval: false @@ -1059,5 +1076,3 @@ history_cnn <- model_cnn %>% fit( sketches_img_test <- formatData(sketches_test, flat = FALSE) evaluate(model_cnn, x = sketches_img_test$x, y = sketches_img_test$y) ``` - - diff --git a/2-notation.qmd b/2-notation.qmd index 3b3783c..03e0ec9 100644 --- a/2-notation.qmd +++ b/2-notation.qmd @@ -11,15 +11,23 @@ X_{21} & X_{22} & \dots & X_{2p}\\ X_{n1} & X_{n2} & \dots & X_{np} \end{array} \right]_{n\times p} \end{eqnarray*} -where $X$ indicates the the $n\times p$ data matrix, $X_j$ indicates variable $j, j=1, \dots, p$ and $X_{ij}$ indicates the value $j^{th}$ variable of the $i^{th}$ observation. (It can be confusing to distinguish whether one is referring to the observation or a variable, because $X_i$ is used to indicate observation also. When this is done it is usually accompanied by qualifying words such as **observation** $X_3$, or **variable** $X_3$.) +where $X$ indicates the $n\times p$ data matrix, $X_j$ indicates variable $j, j=1, \dots, p$ and $X_{ij}$ indicates the value of the $j^{th}$ variable for the $i^{th}$ observation. (It can be confusing to distinguish whether one is referring to the observation or a variable, because $X_i$ is used to indicate observation also. When this is done it is usually accompanied by qualifying words such as **observation** $X_3$, or **variable** $X_3$.) +::: {.content-visible when-format="html"} ::: info Having notation is helpful for concise explanations of different methods, to explain how data is scaled, processed and projected for various tasks, and how different quantities are calculated from the data. +::: +::: + +::: {.content-visible when-format="pdf"} + +\infobox{Having notation is helpful for concise explanations of different methods, to explain how data is scaled, processed and projected for various tasks, and how different quantities are calculated from the data. 
} + ::: When there is a response variable(s), it is common to consider $X$ to be the predictors, and use $Y$ to indicate the response variable(s). $Y$ could be a matrix, also, and would be $n\times q$, where commonly $q=1$. $Y$ could be numeric or categorical, and this would change how it is handled with visualisation. -To make a low-dimensional projection (shadow) of the data, we need a projection matrix: +To make a low-dimensional projection (shadow) of the data onto $d$ dimensions ($d < p$), we need an orthonormal basis: \begin{eqnarray*} A_{p\times d} = \left[ \begin{array}{cccc} @@ -76,7 +84,7 @@ or an individual cell (value): X[3,2] ``` -To make a projection we need an orthonormal matrix: +To make the data projection we need an orthonormal matrix: ```{r} #| code-fold: false @@ -92,17 +100,17 @@ sum(A[,1]^2) sum(A[,1]*A[,2]) ``` -and make a projection using matrix multiplication: +and compute the projected data using matrix multiplication: ```{r} #| code-fold: false X %*% A ``` -The seemingly magical number `0.707` used above and to create the projection in `r ifelse(knitr::is_html_output(), '@fig-explain-1D-html', '@fig-explain-1D-pdf')` arises from normalising a vector with equal contributions from each variable, `(1, 1)`. Dividing by `sqrt(2)` gives `(0.707, 0.707)`. +The magical number `0.707` used above and to create the projection in `r ifelse(knitr::is_html_output(), '@fig-explain-1D-html', '@fig-explain-1D-pdf')` arises from normalising a vector with equal contributions from each variable, `(1, 1)`. Dividing by `sqrt(2)` gives `(0.707, 0.707)`. - +::: {.content-visible when-format="html"} ::: info The notation convention used throughout the book is: @@ -112,11 +120,25 @@ The notation convention used throughout the book is: `g =` number of groups, in classification
`X =` data matrix ::: +::: + +::: {.content-visible when-format="pdf"} + +\infobox{The notation convention used throughout the book is: +\begin{itemize} +\item n = number of observations +\item p = number of variables, dimension of data +\item d = dimension of the projection +\item g = number of groups, in classification +\item X = data matrix +\end{itemize} +} +::: ## Exercises {-} 1. Generate a matrix $A$ with $p=5$ (rows) and $d=2$ (columns), where each value is randomly drawn from a standard normal distribution. Extract the element at row 3 and column 1. -2. We will interpret $A$ as a projection matrix and therefore it needs to be orthonormalised. Use the function `tourr::orthonormalise` to do this, and explicitly check that each column is normalised and that the two columns are orthogonal now. Which dimensions contribute most to the projection for your $A$? +2. We will interpret $A$ as an orthonormal basis and therefore it needs to be checked for orthonormality, and if it fails, then to be orthonormalised. Use the function `tourr::is_orthonormal` to explicitly check that each column is normalised and that the two columns are orthogonal. If they are not, then use `tourr::orthonormalise` to make them so. For the fixed version of $A$, which dimensions contribute most to the projection, horizontally and vertically? 3. Use matrix multiplication to calculate the projection of the `mulgar::clusters` data onto the 2D plane defined by $A$. Make a scatterplot of the projected data. Can you identify clustering in this view? 
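The checking-and-fixing step from exercise 2 can be sketched with the `tourr` functions named above (the seed and matrix size are arbitrary choices):

```r
library(tourr)

# Start from a random p x d matrix; it is almost surely not orthonormal
set.seed(110)
A <- matrix(rnorm(5 * 2), ncol = 2)
is_orthonormal(A)

# orthonormalise() fixes this
A <- orthonormalise(A)
is_orthonormal(A)        # TRUE by construction

# Explicit checks: unit-length columns, zero dot product
sum(A[, 1]^2)            # 1
sum(A[, 2]^2)            # 1
sum(A[, 1] * A[, 2])     # 0 (up to floating point)
```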
diff --git a/3-intro-dimred.qmd b/3-intro-dimred.qmd index 9ecb01e..0106b25 100644 --- a/3-intro-dimred.qmd +++ b/3-intro-dimred.qmd @@ -1,8 +1,8 @@ # Dimension reduction overview {#sec-dimension-overview} -This chapter will focus on methods for reducing dimension, and how the tour[^tour-link] can be used to assist with the common methods such as principal component analysis (PCA), multidimensional scaling (MDS), t-stochastic neighbour embedding (t-SNE), and factor analysis. +This chapter sets up the concepts related to methods for reducing dimension such as principal component analysis (PCA) and t-stochastic neighbour embedding (t-SNE), and how the tour can be used to assist with these methods. -[^tour-link]: Note that the animated tours from this chapter can be viewed at [https://dicook.github.io/mulgar_book/3-intro-dimred.html](https://dicook.github.io/mulgar_book/3-intro-dimred.html). +## The meaning of dimension Dimension is perceived in a tour using the spread of points. When the points are spread far apart, then the data is filling the space. Conversely when the points "collapse" into a sub-region then the data is only partially filling the space, and some dimension reduction to reduce to this smaller dimensional space may be worthwhile. @@ -88,11 +88,13 @@ dp3 <- ggplot(df) + #| fig-width: 9 #| fig-height: 3 #| out-width: 100% -#| fig-cap: "Explanation of how dimension reduction is perceived in 2D, relative to variables: (a) Two variables with strong linear association. Both variables contribute to the association, as indicated by their axes extending out from the 'collapsed' direction of the points; (b) Two variables with no linear association. But x3 has less variation, so points collapse in this direction; (c) The situation in plot (b) does not arise in a tour because all variables are (usually) scaled. When an axes extends out of a direction where the points are collapsed, it means that this variable is partially responsible for the reduced dimension." 
+#| fig-cap: "Explanation of how dimension reduction is perceived in 2D, relative to variables: (a) Two variables with strong linear association. Both variables contribute to the association, as indicated by their axes extending out from the 'collapsed' direction of the points; (b) Two variables with no linear association. But x3 has less variation, so points collapse in this direction; (c) The situation in plot (b) does not arise in a tour because all variables are (usually) scaled. When an axis extends out of a direction where the points are collapsed, it means that this variable is partially responsible for the reduced dimension." #| fig-alt: "Three scatterplots: (a) points lie close to a straight line in the x=y direction, (b) points lie close to a horizontal line, (c) points spread out in the full plot region. There are no axis labels or scales." dp1 + dp2 + dp3 + plot_layout(ncol=3) ``` +## How to perceive the dimensionality using a tour + Now let's think about what this looks like with five variables. `r ifelse(knitr::is_html_output(), '@fig-dimension-html', '@fig-dimension-pdf')` shows a grand tour on five variables, with (a) data that is primarily 2D, (b) data that is primarily 3D and (c) fully 5D data. You can see that in both (a) and (b) the spread of points collapses in some projections, with it happening more in (a). In (c) the data is always spread out in the square, although it does seem to concentrate or pile in the centre. This piling is typical when projecting from high dimensions to low dimensions. The sage tour [@sagetour] makes a correction for this. ```{r echo=knitr::is_html_output()} @@ -153,7 +155,7 @@ Different dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting ![5D plane in 5D](images/cube5d.png){#fig-cube5 width=160} -Single frames from different dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. 
Notice that the 5D in 5D always fills out the box (although it does concentrate some in the middle which is typical when projecting from high to low dimensions). Also you can see that the 2D in 5D, concentrates into a line more than the 3D in 5D. This suggests that it is lower dimensional. (Animations can be viewed [here](https://dicook.github.io/mulgar_book/3-intro-dimred.html).) +Single frames from different dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. Notice that the 5D in 5D always fills out the box (although it does concentrate some in the middle which is typical when projecting from high to low dimensions). Also you can see that the 2D in 5D concentrates into a line more than the 3D in 5D. This suggests that it is lower dimensional. {{< fa play-circle >}} ::: ::: @@ -188,6 +190,7 @@ To make an example where not all variables contribute, we have added two additio #| code-fold: false #| fig-height: 3 #| fig-width: 6 +#| warning: false # Add two pure noise dimensions to the plane plane_noise <- plane plane_noise$x6 <- rnorm(100) @@ -281,7 +284,7 @@ Grand tour of the plane with two additional dimensions of pure noise. The collap ![](images/plane_noise2.png){width=200 fig-align="center"} -Two frames from a grand tour of the plane with two additional dimensions of pure noise. The collapsing of the points indicates that this is not fully 7D. This only happens when any of x1-x5 are contributing strongly (frame 49 x4, x5; frame 79 x1; frame 115 x2, x3). 
If x6 or x7 are contributing strongly the data is spread out fully (frames 27, 96). This tells us that x6 and x7 are not linearly associated, but other variables are. {{< fa play-circle >}} ::: ::: @@ -362,7 +365,7 @@ Examples of different types of dimensionality issues: outliers (a) and non-linea ![Non-linear relationship](images/plane_nonlin.png){#fig-nonlinear width=200} -Two frames from tours of examples of different types of dimensionality issues: outliers (a) and non-linearity (b). In (a) you can see two points far from the others in the projection. During a tour the two can be seen with different movement patterns -- moving faster and in different directions than other points. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour. +Two frames from tours of examples of different types of dimensionality issues: outliers (a) and non-linearity (b). In (a) you can see two points far from the others in the projection. During a tour the two can be seen with different movement patterns -- moving faster and in different directions than other points. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour. {{< fa play-circle >}} ::: ::: @@ -432,6 +435,27 @@ htmlwidgets::saveWidget(pn_tour, 2. Examine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the `mvtnorm` package to do the sampling, for example.) Examine the data using a grand tour. 
What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation? 3. The following code shows how to hide a point in a four-dimensional space, so that it is not visible in any of the plots of two variables. Generate both `d` and `d_r` and confirm that the point is visible in a scatterplot matrix of `d`, but not in the scatterplot matrix of `d_r`. Also confirm that it is visible in both data sets when you use a tour. +```{r} +#| eval: false +#| echo: false +# Answer to Q2 +library(tourr) +library(mvtnorm) + +s1 <- diag(5) +s2 <- diag(5) +s2[3,4] <- 0.7 +s2[4,3] <- 0.7 +s3 <- s2 +s3[1,2] <- -0.7 +s3[2,1] <- -0.7 + +set.seed(1234) +d1 <- as.data.frame(rmvnorm(500, sigma = s1)) +d2 <- as.data.frame(rmvnorm(500, sigma = s2)) +d3 <- as.data.frame(rmvnorm(500, sigma = s3)) +``` + ```{r eval=FALSE} library(tidyverse) library(tourr) diff --git a/4-pca.qmd b/4-pca.qmd index 7d47caf..655146a 100644 --- a/4-pca.qmd +++ b/4-pca.qmd @@ -31,15 +31,15 @@ PCA is not very effective when the distribution of the variables is highly skewe We would start by examining the data using a grand tour. The goal is to check whether there might be potential issues for PCA, such as skewness, outliers or clustering, or even non-linear dependencies. -We'll start be showing PCA on the simulated data from @sec-dimension-overview. The scree plots show that PCA supports that the data are 2D, 3D and 5D respectively. +We'll start by showing PCA on the simulated data from @sec-dimension-overview. The scree plots are produced using the `mulgar::ggscree()` function, and include a grey guideline to help decide how many PCs are sufficient. This guideline is generated by taking the median value of the eigenvalues generated by doing PCA on 100 samples from a standard multivariate normal distribution. 
Any values much lower than this line would indicate that those PCs are not contributing to the explanation of variation. For these three simulated examples, the scree plots illustrate that PCA supports that the data are 2D, 3D and 5D respectively. \index{dimension reduction!scree plot} -```{r} +```{r echo=knitr::is_html_output()} #| message: FALSE #| error: FALSE #| warning: FALSE -#| code-fold: false +#| code-summary: "Code to make scree plots" # Conduct PCA and make the scree plot for # the 2-, 3- and 5-D planar data library(dplyr) @@ -56,7 +56,6 @@ p_pca <- prcomp(plane) b_pca <- prcomp(box) c_pca <- prcomp(cube5d) p_scree <- ggscree(p_pca, q = 5) + theme_minimal() - b_scree <- ggscree(b_pca, q = 5) + theme_minimal() c_scree <- ggscree(c_pca, q = 5) + theme_minimal() ``` @@ -73,7 +72,7 @@ p_scree + b_scree + c_scree + plot_layout(ncol=3) ``` -The next step is to look at the coefficients for the selected number of PCs. @tbl-plane-pcs shows the coefficients for the first two PCs of the `plane` data. All five variables contribute, with `x1`, `x2`, `x3` contributing more to `PC1`, and `x4`, `x5` contributing more to `PC2`. @tbl-box-pcs shows the coefficients for the first three PCs. Variables `x1`, `x2`, `x3` contribute strongly to `PC1`, `PC2` has contributions from all variables except `x3` and variables `x4` and `x5` contribute strongly to `PC3`. +The next step is to look at the coefficients for the selected number of PCs. @tbl-plane-pcs shows the coefficients for the first two PCs of the `plane` data. All five variables contribute, with `x1`, `x2`, `x3` contributing more to `PC1`, and `x4`, `x5` contributing more to `PC2`. @tbl-box-pcs shows the coefficients for the first three PCs of the `box` data. Variables `x1`, `x2`, `x3` contribute strongly to `PC1`, `PC2` has contributions from all variables except `x3` and variables `x4` and `x5` contribute strongly to `PC3`. 
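The grey guideline idea described for `mulgar::ggscree()` above, median eigenvalues from repeated PCA on standard multivariate normal samples, can be sketched in a small bivariate case where the covariance eigenvalues have a closed form. This is a Python illustration of the idea only; the actual `ggscree()` implementation may differ in its details:

```python
import math
import random
import statistics

random.seed(42)

def cov2x2(xs, ys):
    # entries of the 2x2 sample variance-covariance matrix
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, syy, sxy

def eigvals2x2(sxx, syy, sxy):
    # closed-form eigenvalues of a symmetric 2x2 matrix
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return tr / 2.0 + disc, tr / 2.0 - disc

# Null reference: repeat PCA (eigenvalues of the sample covariance)
# on standard bivariate normal data, then take the median per PC
ev1, ev2 = [], []
for _ in range(100):
    xs = [random.gauss(0, 1) for _ in range(100)]
    ys = [random.gauss(0, 1) for _ in range(100)]
    l1, l2 = eigvals2x2(*cov2x2(xs, ys))
    ev1.append(l1)
    ev2.append(l2)

guide = [statistics.median(ev1), statistics.median(ev2)]
print(guide)  # both medians sit near 1 for pure noise data
```

Observed eigenvalues falling well below this null reference are the ones the guideline flags as not contributing to the explanation of variation.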
\index{dimension reduction!coefficients} \index{dimension reduction!principal components} @@ -102,11 +101,11 @@ b_pca$rotation[,1:3] %>% decimals = 2) ``` -In each of these simulated data sets, all five variables contributed to the dimension reduction. If we added two purely noise variables to the plane data, as done in @sec-dimension-overview, the scree plot would indicate that the data is now 4D, and we would get a different interpretation of the coefficients from the PCA. We see that `PC1` and `PC2` are approximately the same as before, with main variables being (`x1`, `x2`, `x3`) and (`x4`, `x5`) respectively. `PC3` and `PC4` are both `x6` and `x7`. +In each of these simulated data sets, all five variables contributed to the dimension reduction. If we added two purely noise variables to the plane data, as done in @sec-dimension-overview, the scree plot in @fig-plane-noise-scree would indicate that the data is now 4D, and we would get a different interpretation of the coefficients from the PCA, see @tbl-plane-noise-pcs. We see that `PC1` and `PC2` are approximately the same as before, with main variables being (`x1`, `x2`, `x3`) and (`x4`, `x5`) respectively. `PC3` and `PC4` are both `x6` and `x7`. ```{r echo=knitr::is_html_output()} #| label: fig-plane-noise-scree -#| fig-cap: Additional noise variables expands the data to 4D. +#| fig-cap: Additional noise variables expand the plane data to 4D. #| code-fold: false #| fig-width: 6 #| fig-height: 4 @@ -123,7 +122,7 @@ ggscree(pn_pca, q = 7) + theme_minimal() ``` ```{r echo=knitr::is_html_output()} #| label: tbl-plane-noise-pcs -#| tbl-cap: "Coefficients for the first four PCs for the box data." +#| tbl-cap: "Coefficients for PCs 1-4 of the plane plus noise data." 
#| code-summary: "Code to print PC coefficients" pn_pca$rotation[,1:4] %>% as_tibble(rownames="Variable") %>% @@ -200,7 +199,7 @@ Scree plot and tour of the `pisa` data, with 30 variables being the plausible sc ![Grand tour frame](images/pisa_gt_249.png){#fig-pisa-gt fig-alt="Selected linear projection of the pisa data from a grand tour. You can see strong linear dependence." fig.align="center"} -Scree plot and a frame from a tour of the `pisa` data, with 30 variables being the plausible scores for Australian students. In combination, these suggest that the data is effectively 1D. +Scree plot and a frame from a tour of the `pisa` data, with 30 variables being the plausible scores for Australian students. In combination, these suggest that the data is effectively 1D. {{< fa play-circle >}} ::: ::: @@ -224,7 +223,7 @@ The tour verifies that the `pisa` data is primarily 1D, indicating that a studen ### Example: aflw \index{data!aflw} -This data has player statistics for all the matches in the 2021 season. We would be interested to know which variables contain similar information, and thus might be combined into single variables. We would expect that many statistics to group into a few small sets, such as offensive and defensive skills. We might also expect that some of the statistics are skewed, most players have low values and just a handful of players are stellar. It is also possible that there are some extreme values. These are interesting features, but they will distract from the main purpose of grouping the statistics. Thus the tour is used to check for potential problems with the data prior to conducting PCA. +This data has player statistics for all the matches in the 2021 season. We would be interested to know which variables contain similar information, and thus might be combined into single variables. We would expect that many statistics group into a few small sets, such as offensive and defensive skills. 
We might also expect that some of the statistics are skewed, most players have low values and just a handful of players are stellar. It is also possible that there are some extreme values. These are interesting features, but they will distract from the main purpose of grouping the statistics. Thus the tour is used to check for potential problems with the data prior to conducting PCA. ```{r} #| label: pca-libraries @@ -273,7 +272,7 @@ Grand tour of the AFLW player statistics. Most player statistics concentrate nea ![](images/aflw_gt_329.png){width=228} -Two frames from a grand tour of the AFLW player statistics. Most player statistics concentrate near the centre, indicating most players are "average"! There are a few outliers appearing in different combinations of the skills, which one would expect to be the star players for particular skill sets. +Two frames from a grand tour of the AFLW player statistics. Most player statistics concentrate near the centre, indicating most players are "average"! There are a few outliers appearing in different combinations of the skills, which one would expect to be the star players for particular skill sets. {{< fa play-circle >}} ::: ::: @@ -281,10 +280,10 @@ No major surprises! There is a small amount of skewness, and there are no major Below we have the conventional summary of the PCA, a scree plot showing the reduction in variance to be explained when each additional PC is considered. It is also conventional to look at a table summarising the proportions of variance explained by PCs, but with almost 30 variables it is easier to make some decision on the number of PCs needed based on the scree plot. -```{r} +```{r echo=knitr::is_html_output()} #| label: fig-aflw-pca -#| fig-cap: "Scree plot showing decay in variance of PCs." -#| alt-text: "Scree plot showing variance vertically against PC number horizontally. 
Variance drops from close to 10 for PC 1 to about 1.2 for PC 4 then slowly decays through to PC 29" +#| fig-cap: "Scree plot showing decay in variance of PCs. There are sharp drops for the first four PCs, and then smaller declines." +#| alt-text: "Scree plot showing variance vertically against PC number horizontally. Variance drops from close to 10 for PC 1 to about 1.2 for PC 4 then slowly decays through to PC 29." #| fig-width: 6 #| fig-height: 4 #| out-width: 80% @@ -303,7 +302,7 @@ From the scree plot in @fig-aflw-pca, we see a sharp drop from one to two, two t ```{r echo=knitr::is_html_output()} #| label: tbl-aflw-pcs -#| tbl-cap: "Coefficients for the first four PCs." +#| tbl-cap: "Coefficients for the first four PCs. PC 2 contrasts with PC 1: the first has large coefficients primarily on field play statistics, and the second has large coefficients on the scoring statistics." #| code-summary: "Code to print PC coefficients" library(gt) aflw_pca$rotation[,1:4] %>% @@ -316,26 +315,20 @@ When there are as many variables as this, it can be hard to digest the combinations of variables most contributing to each PC. Rearranging the table by sorting on a selected PC can help. @tbl-aflw-pcs has been sorted according to the PC 1 coefficients. -PC 1 is primarily composed of `disposals`, `possessions`, `kicks`, `metres`, `uncontested`, `contested`, .... Actually almost all variables positively contribute, albeit in different amounts! It is quite common in PCA for the first PC to be a combination of all variables, although it might commonly be a closer to equal contribution, and it tells us that there is one main direction of variation in the data. For PC 1 in the `aflw` data, PCA is telling us that the primary variation is through a combination of skills, and this maps to basic football playing skills, where some skills (e.g. disposals, possessions, kicks, ...) are more important. 
+PC 1 is primarily composed of `disposals`, `possessions`, `kicks`, `metres`, `uncontested`, `contested`, ...: primarily the field play statistics! It is quite common in PCA for the first PC to be a combination of all variables, which suggests that there is one main direction of variation in the data. Here that is not quite the case. PCA suggests that the primary variation is through a combination of field skills, or basic football playing skills. -Thus the second PC might be the more interesting. PC 2 is primarily a combination of `shots`, `goals`, `marks_in50`, `accuracy`, and `behinds` contrasted against `rebounds_in50` and `intercepts`. The negative coefficients are primary offensive skills and the positive coefficients are defensive skills. This PC is reasonable measure of the offensive vs defensive skills of a player. +Thus the second PC contrasts with the first, because it is primarily a combination of `shots`, `goals`, `marks_in50`, `accuracy`, and `behinds` contrasted against `rebounds_in50` and `intercepts`. The positive coefficients are primary offensive skills and the negative coefficients are defensive skills. This PC is a reasonable measure of the offensive vs defensive skills of a player. \index{dimension reduction!interpretation} -We would continue to interpret each PC by examining large coefficients to help decide how many PCs are a suitable summary of the information in the data. Briefly, PC 3 is a measure of worth of the player because `time_pct` has a large coefficient, so players that are on the field longer will contribute strongly to this new variable. It also has large (and opposite) contributions from `clearances`, `tackles`, `contested_marks`. PC 4 appears to be related to aggressive play with `clangers`, `turnovers`, `bounces` and `frees_against` featuring. So all four PCs have useful information. 
(Note, if we had continued to examine large coefficients on PC 5 we would find that all variables already have had reasonably large coefficients on PC 1-4, which supports restricting attention to the first four.) +We could continue to interpret each PC by examining large coefficients to help decide how many PCs are a suitable summary of the information in the data. Briefly, PC 3 is mixed, but it is possibly a measure of a player's worth because `time_pct` has a large coefficient, so players that are on the field longer will contribute strongly to this new variable. It also has large (and opposite) contributions from `clearances`, `tackles`, `contested_marks`. PC 4 appears to be related to aggressive play with `clangers`, `turnovers`, `bounces` and `frees_against` featuring. All four PCs have useful information. -Ideally, when we tour the four PCs, we'd like to be able to stop and identify players. This involves creating a pre-computed animation, with additional mouse-over, made possible by `plotly`. This is only feasible with a small number of observations, like the `aflw` data, because all of the animation frames are constructed in a single object. This object gets large very quickly! +For deeper exploration, when we tour the four PCs, we'd like to be able to stop and identify players. This can be done by creating a pre-computed animation, with additional mouse-over, made using `plotly`. However, it is not size-efficient, and is only feasible with a small number of observations, because all of the animation frames, each containing the fully projected data, are composed into a single object, which gets large very quickly. ::: {.content-visible when-format="html"} The result is shown in @fig-aflw-pcatour. We can see that the shape of the four PCs is similar to that of all the variables, bunching of points in the centre with a lot of moderate outliers. 
::: - -::: {.content-visible when-format="pdf"} -The code to make this animation, and the interactive plot is in the online version of the book. -::: - - ```{r echo=knitr::is_html_output()} #| label: aflw-plotly #| eval: false @@ -420,7 +413,6 @@ animate_pca(aflw_pca$x[,1:5], #| message: false #| warning: false #| code-summary: "Code to generate interactive plot of frame 18" -library(plotly) load("data/aflw_pct.rda") aflw_pcti <- interpolate(aflw_pct, 0.1) f18 <- matrix(aflw_pcti[,,18], ncol=2) @@ -444,7 +436,7 @@ pg18 <- ggplot() + ::: {.content-visible when-format="html"} -```{r} +```{r eval=knitr::is_html_output()} #| label: fig-aflw-pcaplots-html #| message: false #| warning: false @@ -452,22 +444,22 @@ pg18 <- ggplot() + #| fig-width: 5 #| fig-height: 4 #| out-width: 80% +library(plotly) ggplotly(pg18, width=500, height=500) ``` - -For any particular frame, like 18 re-plotted in @fig-aflw-pcaplots-html, we can investigate further. Here there is a branching pattern, where the branch points in the direction of PC 1. Mouse-over the players at the tip of this branch and we find players like Alyce Parker, Brittany Bonnici, Dana Hooker, Kiara Bowers. If you look up the bios of these players you'll find they all have generally good player descriptions like "elite disposals", "powerful left foot", "hard-running midfielder", "best and fairest". ::: ::: {.content-visible when-format="pdf"} -```{r} + +```{r eval=knitr::is_latex_output()} #| label: fig-aflw-pcaplots-pdf #| echo: false #| message: false #| warning: false -#| fig-cap: "Frame 18 re-plotted so that players can be identified, ideally using on mouse-over. Here some points are labelled." +#| fig-cap: "Frame 18 re-plotted so that players can be identified. Here some players are labelled, but ideally this plot is interactive and any player can be identified. 
{{< fa play-circle >}}" #| fig-width: 5 #| fig-height: 4 -#| out-width: 80% +#| out-width: 60% p18_labels <- p18[["data_prj"]] %>% as_tibble() %>% filter(obs_labels %in% c("AhrensLauren", "LivingstoneStacey", "ParkerAlyce", "BonniciBrittany", "HookerDana", "BowersKiara")) pg18 + theme(aspect.ratio=1, legend.position="none") + geom_point(data=p18_labels, aes(x=P1, y=P2, @@ -478,11 +470,11 @@ pg18 + theme(aspect.ratio=1, legend.position="none") + colour=obs_labels), vjust = 1.2) ``` - -For any particular frame, like 18 re-plotted in @fig-aflw-pcaplots-pdf, we can investigate further. Here there is a branching pattern, where the branch points in the direction of PC 1. Mouse-over the players at the tip of this branch and we find players like Alyce Parker, Brittany Bonnici, Dana Hooker, Kiara Bowers. If you look up the bios of these players you'll find they all have generally good player descriptions like "elite disposals", "powerful left foot", "hard-running midfielder", "best and fairest". ::: -In the direction of PC 2, you'll find players like Lauren Ahrens, Stacey Livingstone who are star defenders. Players in this end of PC 1, have high scores on `intercepts` and `rebounds_in50`. +For any particular frame, like 18 re-plotted in `r ifelse(knitr::is_html_output(), '@fig-aflw-pcaplots-html', '@fig-aflw-pcaplots-pdf')`, we can investigate further. Here there is a branching pattern, where the branch points in the direction of PC 1. Mouse-over the players at the tip of this branch and we find players like Alyce Parker, Brittany Bonnici, Dana Hooker, Kiara Bowers. If you look up the bios of these players you'll find they all have generally good player descriptions like "elite disposals", "powerful left foot", "hard-running midfielder", "best and fairest". + +In the direction of PC 2, you'll find players like Lauren Ahrens and Stacey Livingstone, who are star defenders. Players at this end of PC 2 have high scores on `intercepts` and `rebounds_in50`. 
Another interesting frame for inspecting PC 2 is 59. PC 2 at one end has players with high goal scoring skills, and at the other, good defending skills. So mousing over the other end of PC 2 finds players like Gemma Houghton and Katie Brennan, who are known for their goal scoring. The branch pattern is an interesting one, because it tells us there is some combination of skills that is lacking among all players; primarily there appears to be some distinction between defenders' skills and general playing skills. It's not as simple as this because the branching is only visible when PC 1 and PC 2 are examined with PC 3. @@ -503,7 +495,7 @@ The tour verifies that PCA on the `aflw` data is complicated and doesn't capture When you choose a smaller number of PCs $(k)$ than the number of original variables, this is essentially producing a model for the data. The model is the lower dimensional $k$-D space. It is analogous to a linear regression model, except that the residuals from the model are $(p-k)$-D. -It is common to show the model, that is the data projected into the $k$-D model space. When $k=2$ this is called a "biplot". For the `plane` and `plane_noise` data the biplots are shown in @fig-plane-biplot. This is useful for checking which variables contribute most to the new principal component variables, and also to check for any problems that might have affected the fit, such as outliers, clusters or non-linearity. Interestingly, biplots are typically only made in 2D, even if the data should be summarised by more than two PCs. Occasionally you will see the biplot made for PC $j$ vs PC $k$ also. With the `pca_tour()` function in the `tourr` package you can view a $k$-D biplot. This will display the $k$ PCs with the axes displaying the original variables, and thus see their contribution to the PCs. +It is common to show the model, that is the data projected into the $k$-D model space. When $k=2$ this is called a "biplot". 
For the `plane` and `plane_noise` data the biplots are shown in @fig-plane-biplot. This is useful for checking which variables contribute most to the new principal component variables, and also to check for any problems that might have affected the fit, such as outliers, clusters or non-linearity. Interestingly, biplots are typically only made in 2D, even if the data should be summarised by more than two PCs. Occasionally you will see the biplot made for PC $j$ vs PC $k$ also. With the `pca_tour()` function in the `tourr` package you can view a $k$-D biplot. This will display the $k$ PCs with the axes displaying the original variables, and thus show their contribution to the PCs. ```{r echo=knitr::is_html_output()} @@ -602,17 +594,17 @@ PCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. ![Model for the 3D in 5D data.](images/box_model_13.png){#fig-box-model fig-alt="FIX ME." fig.align="center"} -PCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. +PCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. {{< fa play-circle >}} ::: ::: ### Example: pisa \index{data!pisa} -The model for the `pisa` data is a 1D vector, shown in `r ifelse(knitr::is_html_output(), '@fig-pisa-model-html', '@fig-pisa-model-pdf')`. +The model for the `pisa` data is a 1D vector, shown in `r ifelse(knitr::is_html_output(), '@fig-pisa-model-html', '@fig-pisa-model-pdf')`. In this example there is a good agreement between the model and the data. ```{r echo=knitr::is_html_output()} -#| eval: FALSE +#| eval: false #| code-summary: Code for model-in-the-data pisa_model <- pca_model(pisa_pca, d=1, s=2) @@ -646,7 +638,7 @@ PCA model of the `pisa` data. The 1D model captures the primary variation in the ![](images/pisa_model_17.png){width=300} -PCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model. 
+PCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model. {{< fa play-circle >}} ::: ::: @@ -656,7 +648,7 @@ The `pisa` data fits fairly closely to the 1D PCA model. The variance of points ::: ::: -::: {.content-visible when-format="html"} +::: {.content-visible when-format="pdf"} \insightbox{The `pisa` data fits fairly closely to the 1D PCA model. The variance of points away from the model is symmetric and relatively small. These suggest the 1D model is a reasonable summary of the test scores.} ::: @@ -666,7 +658,7 @@ It is less useful to examine the PCA model for the `aflw` data, because the main patterns that were of interest were the exceptional players. However, we will do it anyway! `r ifelse(knitr::is_html_output(), '@fig-aflw-model-html', '@fig-aflw-model-pdf')` shows the 4D PCA model overlain on the data. Even though the distribution of points is not as symmetric and balanced as the other examples, we can see that the cube structure mirrors the variation. We can see that the relationships between variables are not strictly linear, because the spread extends unevenly away from the box. ```{r echo=knitr::is_html_output()} -#| eval: FALSE +#| eval: false #| code-summary: Code for model-in-the-data aflw_model <- pca_model(aflw_pca, d=4, s=1) @@ -704,7 +696,7 @@ PCA model of the `aflw` data. The linear model is not ideal for this data, which has other patterns like outliers, and some branching. 
However, the model roughly captures the linear associations, and leaves unexplained variation in different directions. {{< fa play-circle >}} ::: ::: @@ -792,11 +784,11 @@ Examining the handling of outliers in the PCA of the planar data with noise vari ::: {.content-visible when-format="pdf"} ::: {#fig-p-o-pca-pdf fig-align="center" layout-ncol=2} -![Outliers clearly visible](images/plane_n_o_clr_181.png){#fig-plane-n-o-clr width=230} +![Outliers clearly visible](images/plane_n_o_clr_181.png){#fig-plane-n-o-clr width=210} -![Outliers not clearly visible in PC1-4](images/plane_n_o_pca_181.png){#fig-plane-n-o-pca width=230} +![Outliers not clearly visible in PC1-4](images/plane_n_o_pca_181.png){#fig-plane-n-o-pca width=210} -Examining the handling of outliers in the PCA of the planar data with noise variables and two outliers. PCA has lost these two extreme values. +Examining the handling of outliers in the PCA of the planar data with noise variables and two outliers. PCA has lost these two extreme values. {{< fa play-circle >}} ::: ::: @@ -860,11 +852,11 @@ Comparison of the full data and first three principal components. Non-linear rel ::: {.content-visible when-format="pdf"} ::: {#fig-plane-nonlin-pdf fig-align="center" layout-ncol=2} -![All five variables](images/plane_nonlin_61.png){#fig-nonlinear2 width=250} +![All five variables](images/plane_nonlin_61.png){#fig-nonlinear2 width=210} -![First three PCs](images/plane_nonlin_pca_129.png){#fig-plane-nonlin-pca width=250} +![First three PCs](images/plane_nonlin_pca_129.png){#fig-plane-nonlin-pca width=210} -Comparison of the full data and first three principal components. Non-linear relationships between several variables can be seen in a tour on all five variables. The first three principal components reveal a strong non-linear relationship. Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities. 
+Comparison of the full data and first three principal components. Non-linear relationships between several variables can be seen in a tour on all five variables. The first three principal components reveal a strong non-linear relationship. Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities. {{< fa play-circle >}} ::: ::: @@ -944,7 +936,7 @@ Linear dimension reduction can optimise for other criteria, and here we will exp 1. Start by looking up the documentation of `dobin::dobin`. How many parameters does the method depend on? 2. We first apply the function to the `plane_noise_outliers` data using default values for all parameters. -3. Recall that the outliers were added in rows 101 and 102 of the data. Make a scatter plots showing the projection onto the first, second and third component, using color to highlight the outliers. Are they visible as outliers with three components? +3. Recall that the outliers were added in rows 101 and 102 of the data. Make scatter plots showing the projections onto the first, second and third components found by `dobin`, using color to highlight the outliers. Are they visible as outliers with three components? 4. Adjust the `frac` parameter of the `dobin` function to `frac = 0.99` and repeat the graphical evaluation from point 3. How does it compare to the previous solution? ```{r} diff --git a/5-nldr.qmd b/5-nldr.qmd index c83f703..034d2c3 100644 --- a/5-nldr.qmd +++ b/5-nldr.qmd @@ -2,14 +2,19 @@ ## Explanation of NLDR methods -Non-linear dimension reduction (NLDR) aims to find a low-dimensional representation of the high-dimensional data that shows the main features of the data. In statistics, it dates back to @Kr64a's work on multidimensional scaling (MDS). Some techniques only require an interpoint similarity or distance matrix as the main ingredient, rather than the full data.
We'll focus on when the full data is available here, so we can also compare structure perceived using the tour on the high-dimensional space, relative to structure revealed in the low-dimensional embedding. +Non-linear dimension reduction (NLDR) aims to find a low-dimensional representation of the high-dimensional data that shows the main features of the data. In statistics, it dates back to the work of @Kr64a on multidimensional scaling (MDS). Some techniques only require an interpoint similarity or distance matrix as the main ingredient, rather than the full data. We'll focus here on the case where the full data is available, so we can also compare structure perceived using the tour on the high-dimensional space, relative to structure revealed in the low-dimensional embedding. -There are many methods available for generating non-linear low dimensional representations of the data. MDS is a classical technique that minimises the difference between two interpoint distance matrices, the distance between points in the high-dimensions, and in the low-dimensional representations. A good resource for learning about MDS is @BG05. +There are many methods available for generating non-linear low-dimensional representations of the data. Classically, MDS minimises some function of the difference between two interpoint distance matrices: the distances between points in the high-dimensional space, and in the low-dimensional representation. + +$$ +\mbox{Stress}_D(x_1, ..., x_n) = \left(\sum_{i, j=1; i\neq j}^n (d_{ij} - d_k(i,j))^2\right)^{1/2} +$$ +where $D$ is an $n\times n$ matrix of distances $(d_{ij})$ between all pairs of points, and $d_k(i,j)$ is the distance between the points in the low-dimensional space. PCA is a special case of MDS. The result from PCA is a linear projection, but generally MDS can provide non-linear transformations to represent unusual high-dimensional patterns. A good resource for learning about MDS is @BG05.
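To make the stress function concrete, here is a small sketch of how it could be computed in R. The helper `mds_stress()` is illustrative only, not a function from the packages used in the book; it assumes Euclidean distances and a configuration `Z` with the same number of rows as the data.

```r
# Sketch: stress of a low-dimensional configuration Z for data X.
# dist() returns the pairwise distances; as a full matrix the
# diagonal is zero and each pair (i, j), i != j, appears twice,
# matching the double sum in the stress formula.
mds_stress <- function(X, Z) {
  dD <- as.matrix(dist(X))  # d_ij, distances in the data space
  dk <- as.matrix(dist(Z))  # d_k(i, j), distances in the embedding
  sqrt(sum((dD - dk)^2))
}

set.seed(117)
X <- matrix(rnorm(30 * 5), ncol = 5)
Z <- cmdscale(dist(X), k = 2)  # classical MDS on Euclidean distances
mds_stress(X, Z)
```

A perfect embedding has stress zero, and classical MDS (`cmdscale()`) on Euclidean distances recovers the same configuration as the leading principal components, which is the sense in which PCA is a special case of MDS.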
\index{dimension reduction!t-SNE} \index{dimension reduction!UMAP} -```{r} +```{r echo=knitr::is_html_output()} #| label: fig-nldr-clusters #| fig-cap: "Two non-linear embeddings of the non-linear clusters data: (a) t-SNE, (b) UMAP. Both suggest four clusters, with two being non-linear in some form." #| fig-alt: FIXME @@ -38,10 +43,13 @@ n2 <- ggplot(as.data.frame(cnl_umap), aes(x=V1, y=V2)) + n1 + n2 ``` -@fig-nldr-clusters show two NLDR views of the `clusters_nonlin` data set from the `mulgar` package. Both suggest that there are four clusters, and that some clusters are non-linearly shaped. They disagree on the type of non-linear pattern, where t-SNE represents one cluster as a wavy-shape and UMAP both have a simple parabolic shape. Popular methods in current use include t-SNE [@Maaten2008], UMAP [@McInnes2018] and PHATE [@Moon2019]. +Popular methods in current use for NLDR include t-SNE [@Maaten2008] and UMAP [@McInnes2018]. The approach of t-SNE is to compare interpoint distances with a standard probability distribution (e.g. a $t$-distribution) to exaggerate local neighbourhood differences. UMAP compares the interpoint distances with what might be expected if the data were uniformly distributed in high dimensions. + +@fig-nldr-clusters shows two NLDR views of the `clusters_nonlin` data set from the `mulgar` package. Both suggest that there are four clusters, and that some clusters are non-linearly shaped. They disagree on the type of non-linear pattern: t-SNE represents one cluster as a wavy shape, while in the UMAP view both non-linear clusters have a simple parabolic shape.
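The two embeddings can be computed along these lines. This is a sketch assuming the `Rtsne` and `uwot` packages, using the `cnl_tsne` and `cnl_umap` names that appear in the plotting code; the seed and parameter values shown are arbitrary choices for illustration, not the book's settings.

```r
# Sketch: t-SNE and UMAP embeddings of the clusters_nonlin data.
# Both methods are stochastic, so results depend on the seed, and
# on tuning parameters such as perplexity (t-SNE) and n_neighbors
# (UMAP); changing these can change the layout substantially.
library(mulgar)  # provides clusters_nonlin

set.seed(36)
cnl_tsne <- Rtsne::Rtsne(as.matrix(clusters_nonlin),
                         perplexity = 30)$Y  # n x 2 matrix
cnl_umap <- uwot::umap(as.matrix(clusters_nonlin),
                       n_neighbors = 15)     # n x 2 matrix
```

Re-running with a different seed, or a different `perplexity`/`n_neighbors`, is a quick way to see how unstable the layouts can be.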
```{r} #| eval: false +#| echo: false #| code-summary: "Code to create animated gif" library(tourr) render_gif(clusters_nonlin, @@ -62,30 +70,35 @@ render_gif(clusters_nonlin, ::: {#fig-clusters-nonlin-pdf layout-ncol=2} -![](images/clusters_nonlin_60.png){width=250} +![](images/clusters_nonlin_60.png){width=220} -![](images/clusters_nonlin_233.png){width=250} +![](images/clusters_nonlin_233.png){width=220} -Two frames from a grand tour of the nonlinear clusters data set, shows four clusters. Two are very small and spherical in shape. One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape. +Two frames from a grand tour of the nonlinear clusters data set, showing four clusters. Two are very small and spherical in shape. One is large with a sine wave shape, and the other is fairly small with a bent rod shape. {{< fa play-circle >}} ::: ::: -The full 4D data is shown with a grand tour in `r ifelse(knitr::is_html_output(), '@fig-clusters-nonlin-html', '@fig-clusters-nonlin-pdf')` @. The four clusters suggested by the NLDR methods can be seen. We also get a better sense of the relative size and proximity of the clusters. There are two small spherical clusters, one quite close to the end of the large sine wave cluster. The fourth cluster is relatively small, and has a slight curve, like a bent rod. The t-SNE representation is slightly more accurate than the UMAP representation. We would expect that the wavy cluster is the sine wave seen in the tour. - +The full 4D data is shown with a grand tour in `r ifelse(knitr::is_html_output(), '@fig-clusters-nonlin-html', '@fig-clusters-nonlin-pdf')`. The four clusters suggested by the NLDR methods can be seen. We also get a better sense of the relative size and proximity of the clusters. There are two small spherical clusters, one quite close to the end of the large sine wave cluster. The fourth cluster is relatively small, and has a slight curve, like a bent rod.
The t-SNE representation is slightly more accurate than the UMAP representation. We would expect that the wavy cluster is the sine wave seen in the tour. +::: {.content-visible when-format="html"} ::: info NLDR can provide useful low-dimensional summaries of high-dimensional structure but you need to check whether it is a sensible and accurate representation by comparing with what is perceived from a tour. ::: +::: + +::: {.content-visible when-format="pdf"} +\infobox{NLDR can provide useful low-dimensional summaries of high-dimensional structure but you need to check whether it is a sensible and accurate representation by comparing with what is perceived from a tour.} +::: ## Assessing reliability of the NLDR representation -NLDR can produce useful low-dimensional summaries of structure in high-dimensional data, like those shown in @fig-nldr-clusters. However, there are numerous pitfalls. The fitting procedure can produce very different representations depending on the parameter choices, and even the random number seeding the fit. (You can check this by changing the `set.seed` in the code above, and by changing from the default parameters.) Also, it may not be possible to represent the high-dimensional structures faithfully low dimensions. For these reasons, one needs to connect the NLDR view with a tour of the data, to help assess its usefulness and accuracy. For example, with this data, we would want to know which of the two curved clusters in the UMAP representation correspond to the sine wave cluster. +NLDR can produce useful low-dimensional summaries of structure in high-dimensional data, like those shown in @fig-nldr-clusters. However, there are numerous pitfalls. The fitting procedure can produce very different representations depending on the parameter choices, and even the random number seeding the fit. (You can check this by changing the `set.seed` in the code above, and by changing from the default parameters.) 
Also, it may not be possible to represent the high-dimensional structures faithfully in low dimensions. For these reasons, one needs to connect the NLDR view with a tour of the data, to help assess its usefulness and accuracy. For example, with this data, we would want to know which of the two curved clusters in the UMAP representation corresponds to the sine wave cluster. ### Using `liminal` \index{liminal} -@fig-liminal-clusters-nonlin shows how the NLDR plot can be linked to a tour view, using the `liminal` package, to better understand how well the structure of the data is represented. Here we see learn that the smile in the UMAP embedding is the small bent rod cluster, and that the unibrow is the sine wave. +@fig-liminal-clusters-nonlin shows how the NLDR plot can be linked to a tour view, using the `liminal` package, to better understand how well the structure of the data is represented. Here we learn that the smile in the UMAP embedding is the small bent rod cluster, and that the unibrow is the sine wave. ```{r}
The split is an artifact of the t-SNE mapping.](images/fake_trees2.png){#fig-trees2} +![Focus on the green cluster which is split by t-SNE. The shape as viewed in many linear projections shown by the tour shows that it is a single curved cluster. The split is an artifact of the t-SNE mapping.](images/fake_trees2.png){#fig-trees2 width=300} -![Focus on the purple cluster which splits the green cluster in the t-SNE view. The tour shows that these two clusters are distinct, but are close in one neighbourhood of the 100D space. The close proximity in the t-SNE view is reasonable, though.](images/fake_trees3.png){#fig-trees3} +![Focus on the purple cluster which splits the green cluster in the t-SNE view. The tour shows that these two clusters are distinct, but are close in one neighbourhood of the 100D space. The close proximity in the t-SNE view is reasonable, though.](images/fake_trees3.png){#fig-trees3 width=300} Three snapshots of using the `liminal` linked views to explore how t-SNE has summarised the `fake_trees` data in 2D. ::: - +::: {.content-visible when-format="html"} ::: insight The t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches. ::: +::: +::: {.content-visible when-format="pdf"} +\insightbox{The t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches. } +::: ## Exercises {-} -1. Using the `penguins_sub` data generate a 2D representation using t-SNE. Plot the points mapping the colour to species. What is most surprising? (Hint: Are the three species represented by three distinct clusters?) -2. Re-do the t-SNE representation with different parameter choices. Are the results different each time, or could they be considered to be equivalent? -3. 
Use `liminal` or `detourr` to link the t-SNE representation to a tour of the penguins. Highlight the points that have been placed in an awkward position by t-SNE from others in their species. Watch them relative to the others in their species in the tour view, and think about whether there is any rationale for the awkward placement. -4. Use UMAP to make the 2D representation, and use `liminal` or `detourr` to link with a tour to explore the result. -5. Conduct your best t-SNE and UMAP representations of the `aflw` data. Compare and contrast what is learned relative to a tour on the principal component analysis. +1. This question uses the `penguins_sub` data + +a. Generate a 2D representation using t-SNE. Plot the points mapping the colour to species. What is most surprising? (Hint: Are the three species represented by three distinct clusters?) +b. Re-do the t-SNE representation with different parameter choices, including using different random seeds. Are the results different each time, or do you think that they could be considered to be equivalent? +c. Use `liminal` or `detourr` to link the t-SNE representation to a tour of the penguins. Highlight the points that have been placed in an awkward position by t-SNE from others in their species. Watch them relative to the others in their species in the tour view, and think about whether there is any rationale for the awkward placement. +d. Try again using UMAP to make the 2D representation, and use `liminal` or `detourr` to link with a tour to explore the result. +2. Conduct your best t-SNE and UMAP representations of the `aflw` data. Compare and contrast what is learned relative to a tour or the principal component analysis. ```{r} #| label: penguins-tsne @@ -227,9 +246,34 @@ limn_tour_link( cols = bl:bm, color = species ) +# The t-SNE mapping of the penguins data inaccurately splits one of the clusters. The three clusters are clearly distinct when viewed with the tour. 
``` +## Project {-} + +Gene expression measured by scRNA-Seq for 2622 human peripheral blood mononuclear cells is available from the `Seurat` R package [@seurat1; @seurat2; @seurat3; @seurat4]. The paper web site has code to extract and pre-process the data, which follows the tutorial at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html. The processed data, containing the first 50 PCs, is provided with the book as `pbmc_pca_50.rds`. + +The original paper [@chen2023] used UMAP on the first 15 PCs to find a representation of the data to illustrate the clustering. They used the default settings of the `RunUMAP()` function in `Seurat`, without setting a seed. + +Generate the t-SNE and UMAP representations of the first 9 PCs of the data, using their default settings. They should be quite different. (We use 9 PCs because the scree plot in the data pre-processing suggests that 15 is too many.) Based on your examination of the data in a tour, which method yields the more accurate representation? Explain how the structure in the 2D representation relates to that seen in the tour. + + +```{r} +#| label: pbmc +#| message: false +#| eval: false +#| echo: false +pbmc <- readRDS("data/pbmc_pca_50.rds") - \ No newline at end of file +# t-SNE +set.seed(1041) +p_tsne <- Rtsne::Rtsne(pbmc[,1:9]) +p_tsne_df <- data.frame(tsneX = p_tsne$Y[, 1], tsneY = p_tsne$Y[, 2]) +ggplot(p_tsne_df, aes(x=tsneX, y=tsneY)) + geom_point() + +# UMAP +set.seed(1045) +p_umap <- uwot::umap(pbmc[,1:9]) +p_umap_df <- data.frame(umapX = p_umap[, 1], umapY = p_umap[, 2]) +ggplot(p_umap_df, aes(x=umapX, y=umapY)) + geom_point() +``` diff --git a/6-intro-clust.qmd b/6-intro-clust.qmd index a877e80..ee772ef 100644 --- a/6-intro-clust.qmd +++ b/6-intro-clust.qmd @@ -114,7 +114,7 @@ Dynamic graphical methods help us to find and understand the cluster structure i
In more difficult data problems, we can assess and refine numerical solutions using graphics.\index{brushing!persistent} \index{cluster analysis!spin-and-brush} -This part of the book discusses the use of interactive and dynamic graphics in the clustering of data. @sec-clust-bg introduces cluster analysis, focusing on interpoint distance measures. @sec-clust-graphics describes an example of a purely graphical approach to cluster analysis, the spin and brush method. In the example shown in that section, we were able to find simplifications of the data that had not been found using numerical clustering methods, and to find a variety of structures in high-dimensional space. @sec-hclust describes methods for reducing the interpoint distance matrix to an intercluster distance matrix using hierarchical algorithms, @sec-mclust covers model-based clustering, and @sec-som described clustering with self-organising maps. Each of these chapters shows how graphical tools can be used to assess the results of numerical methods. @sec-clust-compare summarizes the chapter and revisits the data analysis strategies used in the examples. Additional references that provide good companions to the material presented in these chapters are @VR02, @HOML, @hennig, @giordani, @kassambara, and the CRAN Task View [@ctv-clustering]. @sec-clust-compare summarizes the chapter and revisits the data analysis strategies used in the examples. +This part of the book discusses the use of interactive and dynamic graphics in the clustering of data. @sec-clust-bg introduces cluster analysis, focusing on interpoint distance measures. @sec-clust-graphics describes an example of a purely graphical approach to cluster analysis, the spin and brush method. In the example shown in that section, we were able to find simplifications of the data that had not been found using numerical clustering methods, and to find a variety of structures in high-dimensional space. 
@sec-hclust describes methods for reducing the interpoint distance matrix to an intercluster distance matrix using hierarchical algorithms, @sec-kmeans shows the $k$-means algorithm, @sec-mclust covers model-based clustering, and @sec-som describes clustering with self-organising maps. Each of these chapters shows how graphical tools can be used to assess the results of numerical methods. @sec-clust-compare summarizes these chapters and revisits the data analysis strategies used in the examples. Additional references that provide good companions to the material presented in these chapters are @VR02, @HOML, @hennig, @giordani, @kassambara, and the CRAN Task View [@ctv-clustering]. ## The importance of defining similar {#sec-clust-bg} @@ -238,7 +238,7 @@ overlapping with case $a_1$, has a very different shape (high, medium, medium, l \begin{align*} \rho(a_i,a_j) = \frac{(a_i-c_i)^\top(a_j-c_j)} -{\sqrt(a_i-c_i)^\top(a_i-c_i) \sqrt(a_j-c_j)^\top(a_j-c_j)} +{\sqrt{(a_i-c_i)^\top(a_i-c_i)} \sqrt{(a_j-c_j)^\top(a_j-c_j)}} \label{corc} \end{align*} @@ -313,7 +313,7 @@ colnames(vc) <- paste0("x", 1:3) vc ``` -(The function `mahalanobis` will calculate this in R. Technically this gives distance between each case and the mean vector.) +(The base function `mahalanobis()` will calculate this in R. Technically this gives the distance between each case and the mean vector.) 4. Is the ordering of distance between cases the same if Manhattan distance is used instead of Euclidean? diff --git a/7-spin-and-brush.qmd b/7-spin-and-brush.qmd index e21bdd8..898db9e 100644 --- a/7-spin-and-brush.qmd +++ b/7-spin-and-brush.qmd @@ -16,122 +16,6 @@ It will not work very well when there are no distinct clusters and the purpose o With a complex problem where there are many clusters, one can work sequentially, and remove each cluster after it is brushed, to de-clutter the display, in order to find more clusters.
- Spin-and-brush is best achieved using a fully interactive graphics system like in the `detourr` package, where the results can be saved for further analysis. The code is very easy, and then all the controls are interactive. @@ -180,28 +64,6 @@ table(penguins_sub$species, detourr_penguins$colour) It's quite close! All but two of the 119 Gentoo penguins were identified as a cluster (labelled as "3e9eb6" from the chosen light blue hex colour), and all but three of the 146 Adelie penguins were identified as a cluster (labelled as "000000" which is the unbrushed black group). Most of the Chinstrap penguins were also recovered (labelled as "f5191c" for the red hex colour). - -```{r} -#| eval: false -#| echo: false -# remotes::install_github("pfh/langevitour") -# remotes::install_github("plotly/plotly.R") -library(langevitour) -library(crosstalk) -shared <- SharedData$new(penguins_sub) - -langevitourWidget <- langevitour( - penguins_sub[,1:4], - link=shared, - pointSize=2, - width=700, height=700) - -library(liminal) -limn_tour(fake_trees, dim1:dim10) -``` - - - ## Exercises {-} 1. Use the spin-and-brush approach to identify the three clusters in the `mulgar::clusters` data set. @@ -218,12 +80,13 @@ library(detourr) strt <- tourr::basis_random(10, 2) detour(multicluster, tour_aes(projection = -group)) |> - tour_path(grand_tour(2), start=strt, fps = 60) |> - show_scatter(alpha = 0.7, axes = FALSE) - + tour_path(grand_tour(2), + start=strt, fps = 60) |> + show_scatter(alpha = 0.7, + axes = FALSE) ``` -3. Use the spin-and-brush technique to identify the branches of the `fake_trees` data. The result should look something like this: +4. Use the spin-and-brush technique to identify the branches of the `fake_trees` data.
The result should look something like this: ![Example solution after spin-and-brush on fake trees data.](images/fake_trees_sb.png){#fig-fake-trees-sb fig-alt="Projection where some clusters extend in different directions, with point colors indicating the user-identified clusters."} @@ -243,9 +106,16 @@ ft_pca <- prcomp(fake_trees[,1:100], ggscree(ft_pca) detour(as.data.frame(ft_pca$x[,1:10]), tour_aes(projection = PC1:PC10)) |> - tour_path(grand_tour(2), fps = 60, max_bases=50) |> - show_scatter(alpha = 0.7, axes = FALSE) + tour_path(grand_tour(2), + fps = 60, + max_bases=50) |> + show_scatter(alpha = 0.7, + axes = FALSE) ft_sb <- read_csv("data/fake_trees_sb.csv") table(fake_trees$branches, ft_sb$colour) ``` + +## Project {-} + +This exercise continues from the project in Chapter 5, to check your choice of NLDR representation. Using your best NLDR representation, cluster the data into as many clusters or clumps as you can see. Save the clusters. Now use spin-and-brush in `detourr` to colour as many clusters in the high dimensions as you can find. Save your clusters. How closely do these two approaches agree? diff --git a/8-hierarchical.qmd b/8-hierarchical.qmd index 13919f3..8f96739 100644 --- a/8-hierarchical.qmd +++ b/8-hierarchical.qmd @@ -23,8 +23,8 @@ Here we will take a look at hierarchical clustering, using Ward's linkage, on the 1. Plot the data to check for presence of clusters and their shape. 2. Compute the hierarchical clustering. -3. Plot the dendrogram to help decide on an appropriate number of clusters, using the `dendro_data` function from the `ggdendro` package. -4. Show the dendrogram overlaid on the data, calculated by the `hierfly` function in `mulgar`. +3. Plot the dendrogram to help decide on an appropriate number of clusters, using the `dendro_data()` function from the `ggdendro` package. +4. Show the dendrogram overlaid on the data, calculated by the `hierfly()` function in `mulgar`. 5.
Plot the clustering result by colouring points in the plot of the data. \index{cluster analysis!dendrogram} diff --git a/A1-toolbox.qmd b/A1-toolbox.qmd index 2e36849..b4516dc 100644 --- a/A1-toolbox.qmd +++ b/A1-toolbox.qmd @@ -60,7 +60,7 @@ Scaling of multivariate data is really important in many ways. It affects most m It is generally useful to standardise your data to have mean 0 and variance-covariance equal to the identity matrix before using the tour. We use the tour to discover associations between variables. Characteristics of single variables should be examined and understood before embarking on looking for high-dimensional structure. -The `rescale` parameter in the `animate()` function will scale all variables to range between 0 and 1, prior to starting the tour. This will force all to have the same range. It is the default, and without this data with different ranges across variable may have some strange patterns. If you have already scaled the data yourself, even if using a different scaling such as using standardised variables you should set `rescale=FALSE`. +The `rescale` parameter in the `animate()` function will scale all variables to range between 0 and 1, prior to starting the tour. This will force all variables to have the same range. It is not the default, and without it, data with different ranges across variables may show some strange patterns, so you should generally set `rescale=TRUE`. If you have already scaled the data yourself, even if using a different scaling such as standardising the variables, then the default `rescale=FALSE` is best. A more severe transformation that can be useful prior to starting a tour is to **sphere** the data. This is also an option in the `animate()` function, but is `FALSE` by default. Sphering is the same as conducting a principal component analysis, and using the principal components as the variables. It removes all linear association between variables!
This can be especially useful if you want to focus on finding non-linear associations, including clusters, and outliers. @@ -82,10 +82,11 @@ Versions and elements of tours can be saved for publication using a variety of f ### Understanding your tour path -`r ifelse(knitr::is_html_output(), '@fig-tour-paths-html', '@fig-tour-paths-pdf')` shows tour paths on 3D data spaces. For 1D projections the space of all possible projections is a $p$-dimensional sphere [@fig-tourpaths1d]. For 2D projections the space of all possible projections is a $p\times 2$-dimensional torus [@fig-tourpaths2d]! The geometry is elegant. +`r ifelse(knitr::is_html_output(), '@fig-tour-paths-html', '@fig-tour-paths-pdf')` shows tour paths on 3D data spaces. For 1D projections the space of all possible projections is a $p$-dimensional sphere (@fig-tourpaths1d). For 2D projections the space of all possible projections is a $p\times 2$-dimensional torus (@fig-tourpaths2d)! The geometry is elegant. In these figures, the space is represented by the light colour, and is constructed by simulating a large number of random projections. The two darker colours indicate paths generated by a grand tour and a guided tour. The grand tour will cover the full space of all possible projections if allowed to run for some time. The guided tour will quickly converge to an optimal projection, so will cover only a small part of the overall space. -```{r} + +```{r echo=knitr::is_html_output()} #| code-summary: "Load libraries" #| message: false library(ferrn) @@ -177,9 +178,9 @@ render_gif(p2[,1:6], ::: {#fig-tour-paths-html layout-ncol=2} -![1D tour paths](gifs/tour_paths1d.gif){#fig-tourpaths1d width=40%} +![1D tour paths](gifs/tour_paths1d.gif){#fig-tourpaths1d width=48%} -![2D tour paths](gifs/tour_paths2d.gif){#fig-tourpaths2d width=40%} +![2D tour paths](gifs/tour_paths2d.gif){#fig-tourpaths2d width=48%} Grand and guided tour paths of 1D and 2D projections of 3D data. 
The light points represent the space of all 1D and 2D projections respectively. You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection. ::: @@ -191,11 +192,11 @@ Grand and guided tour paths of 1D and 2D projections of 3D data. The light point ::: {#fig-tour-paths-pdf layout-ncol=2} -![1D tour paths](images/tour_paths1d.png){#fig-tourpaths1d width=40%} +![1D tour paths](images/tour_paths1d.png){#fig-tourpaths1d width=210} -![2D tour paths](images/tour_paths2d.png){#fig-tourpaths2d width=40%} +![2D tour paths](images/tour_paths2d.png){#fig-tourpaths2d width=210} -Grand and guided tour paths of 1D and 2D projections of 3D data. The light points represent the space of all 1D and 2D projections respectively. You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection. +Grand and guided tour paths of 1D and 2D projections of 3D data. The light points represent the space of all 1D and 2D projections respectively. You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection. {{< fa play-circle >}} ::: ::: @@ -206,7 +207,7 @@ Grand and guided tour paths of 1D and 2D projections of 3D data. The light point Tour methods are for numerical data, particularly real-valued measurements. If your data is numerical, but discrete the data can look artificially clustered. `r ifelse(knitr::is_html_output(), '@fig-discrete-data-html', '@fig-discrete-data-pdf')` shows an example. The data is numeric but discrete, so it is ok to examine it in a tour. In this example, there will be overplotting of observations and the artificial clustering (plot a). It can be helpful to jitter observations, by adding a small amount of noise (plot b). 
Jittering helps to remove the artificial clustering, but preserves the main pattern, which is the strong linear association. Generally, jittering is a useful tool for working with discrete data, so that you can focus on examining the multivariate association. If the data is categorical, with no natural ordering of categories, the tour is not advised. -```{r} +```{r echo=knitr::is_html_output()} #| eval: false #| code-summary: "Discrete data code" set.seed(430) @@ -256,18 +257,18 @@ Discrete data can look like clusters, which is misleading. Adding a small amount
{{< fa play-circle >}} ::: ::: ### Missing values -```{r} +```{r echo=knitr::is_html_output()} #| code-summary: "Code to handle missing values" library(naniar) library(ggplot2) @@ -315,7 +316,7 @@ ob_p_mean <- ob_nomiss_mean %>% theme(aspect.ratio=1, legend.position = "None") ``` -```{r} +```{r echo=knitr::is_html_output()} #| eval: false #| code-summary: "Code to make animation" animate_xy(ob_nomiss_below[,1:3], col=ob_nomiss$anymiss) @@ -336,7 +337,7 @@ render_gif(ob_nomiss_mean[,1:3], ``` Missing values can also pose a problem for high-dimensional visualisation, but they shouldn't just be ignored or removed. Methods used in 2D to display missings, such as placing them below the complete data as done in the `naniar` package [@naniar], don't translate well to high dimensions. -`r ifelse(knitr::is_html_output(), '@fig-missings-html', '@fig-missings-pdf')` illustrates this. It leads to artificial clustering of observations [@fig-below-highD]. It is better to impute the values, and mark them with colour when plotting. The cases are then included in the visualisation so we can assess the multivariate relationships, and also obtain some sense of how these cases should be handled, or imputed. In the example in @fig-imputed-highD we imputed the values simply, using the mean of the complete cases. We can see this is not an ideal approach for imputation for this data because some of the imputed values are outside the domain of the complete cases. +`r ifelse(knitr::is_html_output(), '@fig-missings-html', '@fig-missings-pdf')` illustrates this. It leads to artificial clustering of observations (@fig-below-highD). It is better to impute the values, and mark them with colour when plotting. The cases are then included in the visualisation so we can assess the multivariate relationships, and also obtain some sense of how these cases should be handled, or imputed. In the example in @fig-imputed-highD we imputed the values simply, using the mean of the complete cases.
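The impute-and-flag idea can be sketched minimally as follows (an illustrative Python sketch, not the book's R/`naniar` code; the `impute_mean` helper and the toy column are hypothetical):

```python
def impute_mean(column):
    """Replace missing (None) entries with the mean of the observed values.

    Returns the imputed column plus a flag vector marking which entries
    were originally missing, to be mapped to colour in a plot.
    """
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in column]
    was_missing = [v is None for v in column]
    return imputed, was_missing

# Toy column with two missing entries; the mean of the observed
# values (2.0, 3.5, 4.0) is 9.5/3.
col = [2.0, None, 3.5, 4.0, None]
imputed, flag = impute_mean(col)
```

Keeping the flag alongside the imputed values is what allows the imputed cases to be coloured in the display, so their placement relative to the complete cases can be assessed.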
We can see that mean imputation is not ideal for this data, because some of the imputed values are outside the domain of the complete cases. ::: {.content-visible when-format="html"} @@ -372,16 +373,16 @@ Ways to visualise missings for 2D don't transfer to higher dimensions. When the ::: {#fig-missings-pdf layout-ncol=2} -![Missings below in 2-D](images/fig-missings-below-2D-pdf-1.png){#fig-below-2D-pdf width=40%} +![Missings below in 2-D](images/fig-missings-below-2D-pdf-1.png){#fig-below-2D-pdf width=210} -![Missings below in high-D](images/missing_values1.png){#fig-below-highD width=40%} +![Missings below in high-D](images/missing_values1.png){#fig-below-highD width=210} -![Missings imputed in 2-D](images/fig-missings-mean-2D-pdf-1.png){#fig-mean-2D-pdf width=40%} +![Missings imputed in 2-D](images/fig-missings-mean-2D-pdf-1.png){#fig-mean-2D-pdf width=210} -![Missings imputed in high-D](images/missing_values2.png){#fig-imputed-highD width=40%} +![Missings imputed in high-D](images/missing_values2.png){#fig-imputed-highD width=210} -Ways to visualise missings for 2D don't transfer to higher dimensions. When the missings are set at 10% below the complete cases it appears to be clustered data when viewed in a tour (b). It is better to impute the value, and use colour to indicate that it is originally a missing value (d). +Ways to visualise missings for 2D don't transfer to higher dimensions. When the missings are set at 10% below the complete cases it appears to be clustered data when viewed in a tour (b). It is better to impute the value, and use colour to indicate that it is originally a missing value (d). {{< fa play-circle >}} ::: ::: @@ -399,7 +400,7 @@ There are tours available in various software packages. For most examples we use - [liminal](https://sa-lee.github.io/liminal/): to combine tours with (non-linear) dimension reduction algorithms.
- [detourr](https://casperhart.github.io/detourr/): animations for {tourr} using `htmlwidgets` for performance and portability. -- [langevitour](https://logarithmic.net/langevitour/): HTML widget that randomly tours projections of a high-dimensional dataset with an animated scatterplot. +- [langevitour](https://logarithmic.net/langevitour/): HTML widget that shows tours of projections of a high-dimensional dataset with an animated scatterplot. - [woylier](https://numbats.github.io/woylier/): alternative method for generating a tour path by interpolating between d-D frames in p-D space rather than d-D planes. - [spinifex](https://nspyrison.github.io/spinifex/): manual control of dynamic projections of numeric multivariate data. - [ferrn](https://huizezhang-sherry.github.io/ferrn/): extracts key components in the data object collected by the guided tour optimisation, and produces diagnostic plots. diff --git a/A2-data.qmd b/A2-data.qmd index 215ca9c..6f34e2f 100644 --- a/A2-data.qmd +++ b/A2-data.qmd @@ -42,10 +42,10 @@ datasets |> dplyr::select(Name, Description, Analysis) |> knitr::kable(format="latex", booktabs = T) |> kable_paper(full_width = F) |> - kable_styling() |> + kable_styling(font_size = 8) |> column_spec(1, width = "2cm") |> column_spec(2, width = "5cm") |> - column_spec(3, width = "5cm") + column_spec(3, width = "4cm") ``` ::: @@ -146,30 +146,6 @@ Considerable pre-processing was done to produce these data sets.
The original da ## Palmer penguins -```{r} -#| label: penguins -#| message: FALSE -#| eval: FALSE -library(palmerpenguins) -penguins <- penguins %>% - na.omit() # 11 observations out of 344 removed -# use only vars of interest, and standardise -# them for easier interpretation -penguins_sub <- penguins %>% - select(bill_length_mm, - bill_depth_mm, - flipper_length_mm, - body_mass_g, - species, - sex) %>% - mutate(across(where(is.numeric), ~ scale(.)[,1])) %>% - rename(bl = bill_length_mm, - bd = bill_depth_mm, - fl = flipper_length_mm, - bm = body_mass_g) -save(penguins_sub, file="data/penguins_sub.rda") -``` - ### Description {-} This data measures four physical characteristics of three species of penguins. @@ -250,7 +226,7 @@ The data was reduced to country and the plausible scores, and filtered to the tw ### Description {-} -This data is a subset of images from https://quickdraw.withgoogle.com. The subset was created using the quickdraw R package at https://huizezhang-sherry.github.io/quickdraw/. It has 6 different groups: banana, boomerang, cactus, flip flops, kangaroo. Each image is 28x28 pixels. The `sketches_train` data would be used to train a classification model, and the unlabelled `sketches_test` can be used for prediction. +This data is a subset of images from https://quickdraw.withgoogle.com. The subset was created using the quickdraw R package at https://huizezhang-sherry.github.io/quickdraw/. It has 6 different groups: banana, boomerang, cactus, crab, flip flops, kangaroo. Each image is 28x28 pixels. The `sketches_train` data would be used to train a classification model, and the unlabelled `sketches_test` can be used for prediction. ### Variables {-} @@ -323,7 +299,7 @@ The primary goal is to find the different clusters.
### Source {-} -This data is originally from http://ifs.tuwien.ac.at/dm/download/multiChallenge-matrix.txt, and provided as a challenge for non-linear dimension reduction.It was used as an example in Lee, Laa, Cook (2023) https://doi.org/10.52933/jdssv.v2i3. +This data is originally from http://ifs.tuwien.ac.at/dm/download/multiChallenge-matrix.txt, and provided as a challenge for non-linear dimension reduction. It was used as an example in Lee, Laa, Cook (2023) https://doi.org/10.52933/jdssv.v2i3. ## `clusters`, `clusters_nonlin`, `simple_clusters` @@ -364,7 +340,7 @@ data("box") ### Description {-} -This data has a various number of numeric variables. +This data has a varying number of numeric variables. ### Variables {-} @@ -382,15 +358,9 @@ Simulated using the code in the `simulate.R` file of the `data-raw` directory of ## `c1` - `c7` -```{r} -library(mulgar) -data("c1") -# Load others similarly -``` - ### Description {-} -This data has a various number of numeric variables, and a variety of cluster shapes. +This data has a varying number of numeric variables, and a variety of cluster shapes. ### Variables {-} @@ -427,6 +397,53 @@ Primarily this data is useful as an example for neural network modeling, followi +The data is available from https://github.com/zalandoresearch/fashion-mnist. + +## `risk_MSA` + +### Description {-} + +The data was collected in Australia in 2015 [@risk-survey] and includes six types of risks (recreational, health, career, financial, safety and social) with responses on a scale from 1 (never) to 5 (very often). + +### Variables {-} + +```{r} +#| label: data-risk +#| echo: FALSE +#| message: FALSE +#| warning: FALSE +library(MSA) +data("risk", package = "MSA") +glimpse(as_tibble(risk)) +``` + +### Purpose {-} + +This data is useful for the demonstration of clustering methods; it was also used in @msabook. + +### Source {-} + +The data is available from https://homepage.boku.ac.at/leisch/MSA/.
+ +## Peripheral Blood Mononuclear Cells + +### Description {-} + +The data was described in @chen2023, and is available through the R package `Seurat` (@seurat1, @seurat2, @seurat3, @seurat4). Here the data has been pre-processed following the tutorial at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html, and the first 50 PCs are made available in the data file `pbmc_pca_50.rds`, which is read into R using the `readRDS()` function. + +### Variables {-} + +|Name | Description | +|:---------|:----------------| +|`PC_1`-`PC_50` | Principal component scores | + + +### Purpose {-} + +The purpose is to understand the clustering of cell types, relative to clustering in the gene expression. Here, our aim is to determine if the low-dimensional representation provided by NLDR is an accurate representation of the clustering, as understood from using the tour on the PCs. We ignore the cell type labels, and focus on the geometric shapes of clumps and clusters in the high dimensions. + +### Source {-} + +The data can be downloaded and pre-processed following https://satijalab.org/seurat/articles/pbmc3k_tutorial.html. + \n\n\nThis type of visualisation is useful for many activities in dealing with high-dimensional data, including: \n\n- exploring high-dimensional data.\n- detecting if the data lives in a lower dimensional space than the number of variables.\n- checking assumptions required for multivariate models to be applicable.\n- checking for potential problems in modeling such as multicollinearity among predictors.\n- checking assumptions required for probabilities calculated for statistical hypothesis testing to be valid.\n- diagnosing the fit of multivariate models.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nWith a tour we slowly rotate the viewing direction; this allows us to see many individual projections and to track movement patterns.
Look for interesting structures such as clusters or outlying points.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\\infobox{With a tour we slowly rotate the viewing direction; this allows us to see many individual projections and to track movement patterns. Look for interesting structures such as clusters or outlying points.}\n\n:::\n\n## A little history\n\nViewing high-dimensional data based on low-dimensional projections can probably be traced back to the early work on principal component analysis by @pearson-pca and @hotelling-pca, which was extended to known classes as part of discriminant analysis by @fisher1936. \n\nWith computer graphics, the capability of animating plots to show more than a single best projection became possible. The video library [@ASA23] is the best place to experience the earliest work. Kruskal's 1962 animation of multidimensional scaling showed the process of finding a good 2D representation of high dimensional data, although the views are not projections. Chang's 1970 video shows her rotating a high dimensional point cloud along coordinate axes to find a special projection where all the numbers align. The classic video that must be watched is PRIM9 [@PRIM9-video], where a variety of interactive and dynamic tools are used together to explore high dimensional physics data, documented in @tukey. \n\nThe methods in this book primarily emerge from @As85's grand tour method. The algorithm provided the first smooth and continuous sequence of low dimensional projections, and guaranteed that all possible low dimensional projections were likely to be shown. The algorithm was refined in @BA86b (and documented in detail in @BCAH05) to make it *efficiently* show all possible projections. Since then there have been numerous varieties of tour algorithms developed to focus on specific tasks in exploring high dimensional data, and these are documented in @tours2022. \n\nThis book is an evolution from @CS07.
One of the difficulties in working on interactive and dynamic graphics research has been the rapid change in technology. Programming languages have changed a little (Fortran to C to Java to Python) but graphics toolkits and display devices have changed a lot! The tour software used in this book evolved from XGobi, which was written in C and used the X Window System, and was then rewritten as GGobi using GTK. The video library has engaging videos of these software systems. There have been several other short-lived implementations, including orca [@orca], written in Java, and cranvas [@cranvas], written in R with a back-end provided by wrapper functions to Qt libraries. \n\nAlthough attempts were made with these ancestor systems to connect the data plots to a statistical analysis system, these were always limited. With the emergence of R, having graphics in the data analysis workflow has been much easier, albeit at the cost of the interactivity with graphics that matches the old systems. We are mostly using the R package `tourr` [@tourr] for examples in this book. It provides the machinery for running a tour, and has the flexibility that it can be ported, modified, and used as a regular element of data analysis.\n\n## Exercises {-}\n\n1. Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the `cube.solid.random` function of the `geozoo` package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations. \n2. Use the `geozoo` package to generate samples from different shapes and use them to get a better understanding of how shapes appear in a grand tour. You can start with exploring the conic spiral in 3D, a torus in 4D and points along the wire frame of a cube in 5D.\n3.
For each of the challenge data sets, `c1`, ..., `c7` from the `mulgar` package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships). \n\n \n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "1-intro_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/1-intro/execute-results/tex.json b/_freeze/1-intro/execute-results/tex.json index 1b1d7e9..2f11b8c 100644 --- a/_freeze/1-intro/execute-results/tex.json +++ b/_freeze/1-intro/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "8598cc5d59c1e0faaaf128435d4ecfcf", + "hash": "7be10c3eb27bee030be46b742f6b7ced", "result": { "engine": "knitr", - "markdown": "# Picturing high dimensions {#intro}\n\nHigh-dimensional data means that we have a large number of numeric features or variables, which can be considered as dimensions in a mathematical space. The variables can be different types, such as categorical or temporal, but the handling of these variables involves different techniques. 
\n\\index{dimensionality}\n\\index{variable}\\index{feature}\n\\index{projection}\n\n![](images/shadow_puppets.png){width=450 fig-align=\"center\" fig-env=\"figure*\" fig-cap=\"Viewing high dimensions using low-dimensional displays is like playing shadow puppets, looking at the shadows to guess what the shape is.\" fig-alt=\"Three images, each with a hand or two hands, illustrating making shadows of a bird in flight, snail and dog.\"}\n\n\n\n## Getting familiar with tours\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-explain-1D-html layout=\"[[40, 60]]\"}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![2D data](1-intro_files/figure-pdf/fig-explain-1D-data-1.pdf){#fig-explain-1D-data fig-alt='Plot shows 2D scatterplot, with lines indicating three 1D projection vectors, and their coefficients. ' width=100%}\n:::\n:::\n\n\n\n![1D grand tour of the 2D data](gifs/explain_1d.gif){#fig-explain-1D-tour width=290 fig-alt=\"The animation shows a sequence of 1D projections of the 2D data.\"}\n\nHow a tour can be used to explore high-dimensional data illustrated using (a) 2D data with two clusters and (b) a tour of 1D projections shown as a density plot. Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. When it is at `x1=-x2 (0.707, -0.707)` there is no clustering.\n:::\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![How a tour can be used to explore high-dimensional data illustrated using (a) 2D data with two clusters and (b,c,d) 1D projections from a tour shown as a density plot. Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. 
When it is at `x1=-x2 (0.707, -0.707)` there is no clustering.](1-intro_files/figure-pdf/fig-explain-1D-pdf-1.pdf){#fig-explain-1D-pdf fig-env='figure*' width=100%}\n:::\n:::\n\n\n\n@fig-explain-1D-pdf illustrates a tour for 2D data and 1D projections. The (grand) tour will generate all possible 1D projections of the data, and display them with a univariate plot like a histogram or density plot. For this data (the `simple_clusters` data), depending on the projection, the distribution might be clustered into two groups (bimodal), or there might be no clusters (unimodal). In this example, all projections are generated by rotating a line around the centre of the plot. Clustering can be seen in many of the projections, with the strongest being when the contribution of both variables is equal, and the projection is `(0.707, 0.707)` or `(-0.707, -0.707)`. (If you are curious about the number `0.707`, read the last section of this chapter.)\n\n\\index{projection!1D}\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-explain-2D-html layout=\"[[57, 43]]\"}\n\n![2D tour of 3D data](gifs/explain_2d.gif){#fig-explain-2D-tour fig-alt=\"The animation shows a sequence of scatterplots of 2D projections of a 3D torus.\"}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A projection revealing the hole](1-intro_files/figure-pdf/fig-explain-2D-data-1.pdf){#fig-explain-2D-data fig-alt='A scatterplot of a single 2D projection where the donut hole is visible.' width=80%}\n:::\n:::\n\n\n\nHow a tour can be used to explore high-dimensional data illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b).
Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object.\n:::\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![How a tour can be used to explore high-dimensional data illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b). Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object.](1-intro_files/figure-pdf/fig-explain-2D-pdf-1.pdf){#fig-explain-2D-pdf fig-env='figure*' width=100%}\n:::\n:::\n\n\n\n@fig-explain-2D-pdf illustrates a tour for 3D data using 2D projections. The data are points on the surface of a donut shape. By showing the projections using a scatterplot, the donut looks transparent and we can see through the data. The donut shape can be inferred from watching many 2D projections but some are more revealing than others. The projection shown in (b) is where the hole in the donut is clearly visible.\n\\index{projection!2D}\n\n\n## What's different about space beyond 2D?\n\nThe term \"high-dimensional\" in this book refers to the dimensionality of the Euclidean space. @fig-dimension-cubes shows a way to imagine this. It shows a sequence of cube wireframes, ranging from one-dimensional (1D) through to five-dimensional (5D), where beyond 2D is a linear projection of the cube. As the dimension increases, a new orthogonal axis is added. For cubes, this is achieved by doubling the cube: a 2D cube consists of two 1D cubes, a 3D cube consists of two 2D cubes, and so forth. This is a great way to think about the space being examined by the visual methods, and also all of the machine learning methods mentioned in this book. \n\n\\index{dimensionality}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Space can be considered to be a high-dimensional cube.
Here we have pictured a sequence of increasing dimension cubes, from 1D to 5D, as wireframes; it can be seen that as the dimension increases by one, the cube doubles.](1-intro_files/figure-pdf/fig-dimension-cubes-1.pdf){#fig-dimension-cubes fig-alt='Wireframe diagrams show 1D, 2D, 3D, 4D and 5D cubes. Half of each cube is coloured orange to show how a new dimension expands from the previous one.' width=80%}\n:::\n:::\n\n\n\nInterestingly, the struggle with imagining high dimensions this way is described in a novel published in 1884 [@Ab1884] [^4]. Yes, more than 100 years ago! This is a story about characters living in a 2D world, being visited by an alien 3D character. It is also a social satire, serving the reader strong messages about gender inequity, although this provides the means to explain more intricacies in perceiving dimensions. There have been several movies made based on the book in recent decades (e.g. @Ma65, @JT07). Although purchasing the movies may be prohibitive, watching the trailers available for free online is sufficient to gain enough geometric intuition on the nature of understanding high-dimensional spaces while living in a low-dimensional world. \n\n[^4]: Thanks to Barret Schloerke for directing co-author Cook to this history when he was an undergraduate student and we were starting the [geozoo](http://schloerke.com/geozoo/) project.\n\nWhen we look at high-dimensional spaces from a low-dimensional space, we meet the \"curse of dimensionality\", a term introduced by @BellmanRichard1961 to express the difficulty of doing optimization in high dimensions because of the exponential growth in space as dimension increases. A way to imagine this is to look at the cubes in @fig-dimension-cubes: As you go from 1D to 2D, 2D to 3D, the space expands a lot, and imagine how vast space might get as more dimensions are added[^5].
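A little arithmetic makes this concrete. The following illustrative sketch (Python; not from the book) computes the fraction of a unit hypercube's volume that lies within 0.1 of its boundary:

```python
# Fraction of a unit hypercube's volume within 0.1 of the boundary:
# the inner cube [0.1, 0.9]^d has volume 0.8**d, so the boundary
# "shell" holds 1 - 0.8**d of the total volume.
for d in (1, 2, 3, 10, 100):
    shell = 1 - 0.8 ** d
    print(f"d = {d:3d}: shell fraction = {shell:.4f}")
```

Already at d = 10 nearly 90% of the volume lies in this thin shell, and by d = 100 essentially all of it does.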
The volume of the space grows exponentially with dimension, which makes it infeasible to sample enough points -- any sample will cover the space less densely as dimension increases. The effect is that most points will be far from the sample mean, on the edge of the sample space.\n\n\\index{dimensionality!curse of}\n\n[^5]: \"Space is big. Really big. You might think it's a long way to the pharmacy, but that’s peanuts to space.\" from Douglas Adams' [Hitchhiker's Guide to the Galaxy](https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy#Stage_shows) always springs to mind when thinking about high dimensions!\n\nFor visualisation, the curse manifests in an opposite manner. Projecting from high to low dimensions creates a crowding or piling of points near the centre of the distribution. This was noted by @diaconis1984. @fig-density illustrates this phenomenon. As dimension increases, the points crowd the centre, even with as few as ten dimensions. This is something that we may need to correct for when exploring high dimensions with low-dimensional projections.\n\n\\index{dimensionality!crowding}\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Illustration of data crowding in the low-dimensional projection as dimension increases, here from 3, 10, 100. Colour shows the number of points in each hexagon bin (pink is large, navy is small). As dimension increases the points concentrate near the centre.](1-intro_files/figure-pdf/fig-density-1.pdf){#fig-density fig-align='center' fig-alt='Three hexagon binned plots. The plot on the left is relatively uniform in colour, and looks like a disk, and the plot on the right has a high concentration of pink hexagons in the center, and rings of green and navy blue around the outside. The middle plot is in between the two patterns.' width=95%}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n@fig-tour-intro-html shows 2D tours of two different 5D data sets.
One has clusters (a) and the other has two outliers and a plane (b). Can you see these? One difference when viewing data with more than three dimensions in 2D projections is that the points seem to shrink towards the centre, and then expand out again. This is the effect of dimensionality, with different variance or spread in some directions.\n\n::: {#fig-tour-intro-html layout-ncol=2}\n\n![Clusters](gifs/clusters-intro.gif){#fig-tour-clusters width=250 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see points moving in three different movement patterns, and sometimes they separate into clusters.\"}\n\n![Outliers](gifs/outlier-intro.gif){#fig-tour-outliers width=250 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see most points lie in a flat planar shape, and two points can be seen to move differently from the others and separate from the rest of the points in some projections.\"}\n\nTwo 5D datasets shown as tours of 2D projections. Can you see clusters of points in (a) and two outliers with a plane in (b)?\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n@fig-tour-intro-pdf shows 2D tours of two different 5D data sets. One has clusters (a) and the other has two outliers and a plane (b). Can you see these? One difference when viewing data with more than three dimensions in 2D projections is that the points seem to shrink towards the centre, and then expand out again. This is the effect of dimensionality, with different variance or spread in some directions.\n\n::: {#fig-tour-intro-pdf layout-ncol=2}\n![Clusters](images/clusters-intro.png){#fig-tour-clusters width=200}\n\n![Outliers](images/outlier-intro.png){#fig-tour-outliers width=200}\n\nFrames from 2D tours on two 5D datasets, with clusters of points in (a) and two outliers with a plane in (b).
This figure is best viewed in the HTML version of the book.\n:::\n\n:::\n\n## What can you learn?\n\nThere are two ways of detecting structure in tours:\n\n- patterns in a single low-dimensional projection\n- movement patterns\n\nwith the latter being especially useful when displaying the projected data as a scatterplot. @fig-example-structure shows examples of patterns we typically look for when making a scatterplot of data. These include clustering, linear and non-linear association, outliers, and barriers, where there is a sharp edge beyond which no observations are seen. Not shown, but it might also be possible to observe multiple modes, or density of observations, L-shapes, discreteness or uneven spread of points. The tour is especially useful if these patterns are only visible in combinations of variables. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example structures that might be visible in a 2D projection that imply presence of structure in high dimensions. These include clusters, linear and non-linear association, outliers and barriers.](1-intro_files/figure-pdf/fig-example-structure-1.pdf){#fig-example-structure fig-alt='Four scatterplots showing different types of patterns you might expect to see. Plot (a) has three elliptical clusters of points, roughly lying horizontal, making a geese flying pattern. Plot (b) has a nonlinear pattern looking like a horseshoe. Plot (c) has a strong negative linear association and a single outlier in the top right. Plot (d) has points lying only in the bottom triangle.' width=100%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\n@fig-trails illustrates how the movement patterns of points, when using scatterplots to display 2D projections, indicate clustering (a, b) and outliers (c, d). \n\n::: {#fig-trails layout-ncol=2 fig-align=\"center\"}\n\n![Clustering](images/trails-clusters.png){#fig-clusters-trails-static fig-alt=\"Frame from the animations shown earlier annotated to mark clustering movement.
Movement pattern is indicated by a point and a line.\"}\n\n![Outliers](images/trails-outlier.png){#fig-outlier-trails-static fig-alt=\"Frame from the animations shown earlier annotated to mark outliers movement. Movement pattern is indicated by a point and a line.\"}\n\nThe movement of points gives further clues about the structure of the data in high dimensions. In the data with clustering, often we can see a group of points moving differently from the others. Because there are three clusters, you should see three distinct movement patterns. It is similar with outliers, except these may be individual points moving alone, and different from all others. This can be seen in the static plot: one point (top left) has a movement pattern upwards, whereas most of the other observations near it are moving down towards the right. \n:::\n\n\n\n\nThis type of visualisation is useful for many activities in dealing with high-dimensional data, including: \n\n- exploring high-dimensional data.\n- detecting if the data lives in a lower dimensional space than the number of variables.\n- checking assumptions required for multivariate models to be applicable.\n- checking for potential problems in modeling such as multicollinearity among predictors.\n- checking assumptions required for probabilities calculated for statistical hypothesis testing to be valid.\n- diagnosing the fit of multivariate models.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nWith a tour we slowly rotate the viewing direction; this allows us to see many individual projections and to track movement patterns.
Look for interesting structures such as clusters or outlying points.}\n\n:::\n\n## A little history\n\nViewing high-dimensional data based on low-dimensional projections can probably be traced back to the early work on principal component analysis by @pearson-pca and @hotelling-pca, which was extended to known classes as part of discriminant analysis by @fisher1936. \n\nWith computer graphics, the capability of animating plots to show more than a single best projection became possible. The video library [@ASA23] is the best place to experience the earliest work. Kruskal's 1962 animation of multidimensional scaling showed the process of finding a good 2D representation of high dimensional data, although the views are not projections. Chang's 1970 video shows her rotating a high dimensional point cloud along coordinate axes to find a special projection where all the numbers align. The classic video that must be watched is PRIM9 [@PRIM9-video], where a variety of interactive and dynamic tools are used together to explore high dimensional physics data, documented in @tukey. \n\nThe methods in this book primarily emerge from @As85's grand tour method. The algorithm provided the first smooth and continuous sequence of low dimensional projections, and was constructed so that all possible low dimensional projections were likely to be shown. The algorithm was refined in @BA86b (and documented in detail in @BCAH05) to make it *efficiently* show all possible projections. Since then there have been numerous varieties of tour algorithms developed to focus on specific tasks in exploring high dimensional data, and these are documented in @tours2022. \n\nThis book is an evolution from @CS07. One of the difficulties in working on interactive and dynamic graphics research has been the rapid change in technology. Programming languages have changed a little (FORTRAN to C to Java to Python) but graphics toolkits and display devices have changed a lot! 
The tour software used in this book evolved from XGobi, which was written in C and used the X Window System, and which was later rewritten as GGobi using GTK. The video library has engaging videos of these software systems. There have been several other short-lived implementations, including orca [@orca], written in Java, and cranvas [@cranvas], written in R with a back-end provided by wrapper functions to Qt libraries. \n\nAlthough attempts were made with these ancestor systems to connect the data plots to a statistical analysis system, these were always limited. With the emergence of R, having graphics in the data analysis workflow has been much easier, albeit at the cost of the interactivity with graphics that matches the old systems. We are mostly using the R package `tourr` [@tourr] for examples in this book. It provides the machinery for running a tour, and has the flexibility that it can be ported, modified, and used as a regular element of data analysis.\n\n## Exercises {-}\n\n1. Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the `cube.solid.random` function of the `geozoo` package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matches your expectations. \n2. Use the `geozoo` package to generate samples from different shapes and use them to get a better understanding of how shapes appear in a grand tour. You can start with exploring the conic spiral in 3D, a torus in 4D and points along the wire frame of a cube in 5D.\n3. For each of the challenge data sets, `c1`, ..., `c7` from the `mulgar` package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships). 
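A sketch of a starting point for exercise 1 follows. To keep it self-contained, `cube_points()` is a hypothetical base-R stand-in for `cube.solid.random()` (which, assuming the `geozoo` API, returns the sample in its `$points` element); the tour call is left as a comment since it opens an interactive display.

```r
# Hypothetical stand-in for geozoo::cube.solid.random(p, n)$points:
# n points uniformly distributed in a p-dimensional solid cube
cube_points <- function(p, n = 500) matrix(runif(n * p), ncol = p)

set.seed(2023)
cubes <- lapply(c(3, 5, 10), cube_points)

# Distance from each point to the cube centre (0.5, ..., 0.5).
# Even the closest point is far from the centre when p is large,
# yet 2D projections pile the points up in the middle.
sapply(cubes, function(x) min(sqrt(rowSums(sweep(x, 2, 0.5)^2))))

# To view each sample in a grand tour (requires the tourr package):
# tourr::animate_xy(cubes[[1]])
```

Watching the three tours, the 3D sample keeps a boxy outline, while the 10D sample looks rounder, with points concentrating in the centre of most projections.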
\n\n \n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n", + "markdown": "# Picturing high dimensions {#intro}\n\nHigh-dimensional data means that we have a large number of numeric features or variables, which can be considered as dimensions in a mathematical space. The variables can be different types, such as categorical or temporal, but the handling of these variables involves different techniques. \n\index{dimensionality}\n\index{variable}\index{feature}\n\index{projection}\n\n![Viewing high dimensions using low-dimensional displays is like playing shadow puppets, looking at the shadows to guess what the shape is.](images/shadow_puppets.png){#fig-shadow-puppets width=450 fig-alt="Three images, each with a hand or two hands, illustrating making shadows of a bird in flight, snail and dog."}\n\n\nOne approach to visualise high dimensional data and models is by using linear projections, as done in a tour. You can think of projections of high-dimensional data like shadows (@fig-shadow-puppets). Unlike shadow puppets, though, the object stays fixed, and with multiple projections we can obtain a *view of the object from all sides*. \n\n\n## Getting familiar with tours\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format="html"}\n::: {#fig-explain-1D-html layout="[[40, 60]]"}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![2D data](1-intro_files/figure-pdf/fig-explain-1D-data-1.pdf){#fig-explain-1D-data fig-alt='Plot shows 2D scatterplot, with lines indicating three 1D projection vectors, and their coefficients. ' width=100%}\n:::\n:::\n\n\n\n![1D grand tour of the 2D data](gifs/explain_1d.gif){#fig-explain-1D-tour width=290 fig-alt="The animation shows a sequence of 1D projections of the 2D data."}\n\nHow a tour can be used to explore high-dimensional data, illustrated using (a) 2D data with two clusters and (b) a tour of 1D projections shown as a density plot. 
Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. When it is at `x1=-x2 (0.707, -0.707)` there is no clustering.\n:::\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![How a tour can be used to explore high-dimensional data, illustrated using (a) 2D data with two clusters and (b,c,d) 1D projections from a tour shown as density plots. Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. When it is at `x1=-x2 (0.707, -0.707)` there is no clustering. {{< fa play-circle >}}](1-intro_files/figure-pdf/fig-explain-1D-pdf-1.pdf){#fig-explain-1D-pdf fig-env='figure*' width=100%}\n:::\n:::\n\n\n\n@fig-explain-1D-pdf illustrates a tour for 2D data and 1D projections. The (grand) tour will generate all possible 1D projections of the data, and display them with a univariate plot like a histogram or density plot. For this data (the `simple_clusters` data), depending on the projection, the distribution might be clustered into two groups (bimodal), or there might be no clusters (unimodal). In this example, all projections are generated by rotating a line around the centre of the plot. Clustering can be seen in many of the projections, with the strongest being when the contribution of both variables is equal, and the projection is `(0.707, 0.707)` or `(-0.707, -0.707)`. 
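The projection arithmetic can be checked directly: projecting 2D points onto a unit vector is a matrix multiplication. This sketch uses made-up cluster centres rather than the actual `simple_clusters` sample:

```r
# Two made-up clusters in 2D, centred at (-1,-1) and (1,1)
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = -1), ncol = 2),
           matrix(rnorm(100, mean =  1), ncol = 2))
cl <- rep(1:2, each = 50)

a1 <- c(0.707, 0.707)    # direction of x1 = x2
a2 <- c(0.707, -0.707)   # direction of x1 = -x2

p1 <- x %*% a1  # 1D projection: bimodal, clusters separate
p2 <- x %*% a2  # 1D projection: unimodal, clusters overlap

# Gap between the projected cluster means: large for a1, near 0 for a2
abs(diff(tapply(p1, cl, mean)))
abs(diff(tapply(p2, cl, mean)))
```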
(If you are curious about the number `0.707`, read the last section of this chapter.)\n\n\index{projection!1D}\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format="html"}\n::: {#fig-explain-2D-html layout="[[57, 43]]"}\n\n![2D tour of 3D data](gifs/explain_2d.gif){#fig-explain-2D-tour fig-alt="The animation shows a sequence of scatterplots of 2D projections of a 3D torus."}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A projection revealing the hole](1-intro_files/figure-pdf/fig-explain-2D-data-1.pdf){#fig-explain-2D-data fig-alt='A scatterplot of a single 2D projection where the donut hole is visible.' width=80%}\n:::\n:::\n\n\n\nHow a tour can be used to explore high-dimensional data, illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b). Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object.\n:::\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![How a tour can be used to explore high-dimensional data, illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b). Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object. {{< fa play-circle >}}](1-intro_files/figure-pdf/fig-explain-2D-pdf-1.pdf){#fig-explain-2D-pdf fig-env='figure*' width=100%}\n:::\n:::\n\n\n\n@fig-explain-2D-pdf illustrates a tour for 3D data using 2D projections. The data are points on the surface of a donut shape. By showing the projections using a scatterplot, the donut looks transparent and we can see through the data. The donut shape can be inferred from watching many 2D projections, but some are more revealing than others. 
The projection shown in (b) is where the hole in the donut is clearly visible.\n\index{projection!2D}\n\n\n## What's different about space beyond 2D?\n\nThe term \"high-dimensional\" in this book refers to the dimensionality of the Euclidean space. @fig-dimension-cubes shows a way to imagine this. It shows a sequence of cube wireframes, ranging from one-dimensional (1D) through to five-dimensional (5D), where beyond 2D is a linear projection of the cube. As the dimension increases, a new orthogonal axis is added. For cubes, this is achieved by doubling the cube: a 2D cube consists of two 1D cubes, a 3D cube consists of two 2D cubes, and so forth. This is a great way to think about the space being examined by the visual methods, and also all of the machine learning methods mentioned, in this book. \n\n\index{dimensionality}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Space can be considered to be a high-dimensional cube. Here we have pictured a sequence of increasing dimension cubes, from 1D to 5D, as wireframes; it can be seen that as the dimension increases by one, the cube doubles.](1-intro_files/figure-pdf/fig-dimension-cubes-1.pdf){#fig-dimension-cubes fig-alt='Wireframe diagrams show 1D, 2D, 3D, 4D and 5D cubes. Half of each cube is coloured orange to show how a new dimension expands from the previous one.' width=80%}\n:::\n:::\n\n\n\nInterestingly, the struggle with imagining high dimensions this way is described in a novel published in 1884 [@Ab1884] [^4]. Yes, more than 100 years ago! This is a story about characters living in a 2D world, being visited by an alien 3D character. It is also a social satire, serving the reader strong messages about gender inequity, although this provides the means to explain more intricacies in perceiving dimensions. There have been several movies made based on the book in recent decades (e.g. @Ma65, @JT07). 
Although purchasing the movies may be prohibitive, watching the trailers available for free online is sufficient to gain enough geometric intuition on the nature of understanding high-dimensional spaces while living in a low-dimensional world. \n\n[^4]: Thanks to Barret Schloerke for directing co-author Cook to this history when he was an undergraduate student and we were starting the [geozoo](http://schloerke.com/geozoo/) project.\n\nWhen we look at high-dimensional spaces from a low-dimensional space, we meet the \"curse of dimensionality\", a term introduced by @BellmanRichard1961 to express the difficulty of doing optimization in high dimensions because of the exponential growth in space as dimension increases. A way to imagine this is to look at the cubes in @fig-dimension-cubes: As you go from 1D to 2D, 2D to 3D, the space expands a lot, and imagine how vast space might get as more dimensions are added[^5]. The volume of the space grows exponentially with dimension, which makes it infeasible to sample enough points -- any sample will cover the space less densely as dimension increases. The effect is that most points will be far from the sample mean, on the edge of the sample space.\n\n\index{dimensionality!curse of}\n\n[^5]: \"Space is big. Really big. You might think it's a long way to the pharmacy, but that’s peanuts to space.\" from Douglas Adams' [Hitchhiker's Guide to the Galaxy](https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy#Stage_shows) always springs to mind when thinking about high dimensions!\n\nFor visualisation, the curse manifests in an opposite manner. Projecting from high to low dimensions creates a crowding or piling of points near the center of the distribution. This was noted by @diaconis1984. @fig-density illustrates this phenomenon. As dimension increases, the points crowd the centre, even with as few as ten dimensions. 
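A quick simulation sketches the crowding effect (illustrative only, not the data behind @fig-density): sample uniformly inside a p-dimensional ball and project onto the first two coordinates, which is an orthogonal projection.

```r
# Uniform sample inside a p-dimensional unit ball
set.seed(1)
ball_points <- function(p, n = 2000) {
  x <- matrix(rnorm(n * p), ncol = p)   # random directions
  r <- runif(n)^(1 / p)                 # radii giving uniform density
  x * (r / sqrt(rowSums(x^2)))          # scale each row to radius r
}

# Median radius of the 2D projection shrinks as p grows:
# the projected points pile up near the centre
for (p in c(3, 10, 100)) {
  xy <- ball_points(p)[, 1:2]
  cat("p =", p, " median projected radius:",
      round(median(sqrt(rowSums(xy^2))), 2), "\n")
}
```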
This is something that we may need to correct for when exploring high dimensions with low-dimensional projections.\n\n\index{dimensionality!crowding}\n\n\n\n::: {.cell layout-align="center"}\n::: {.cell-output-display}\n![Illustration of data crowding in the low-dimensional projection as dimension increases, here from 3, 10, 100. Colour shows the number of points in each hexagon bin (pink is large, navy is small). As dimension increases the points concentrate near the centre.](1-intro_files/figure-pdf/fig-density-1.pdf){#fig-density fig-align='center' fig-alt='Three hexagon binned plots. The plot on the left is relatively uniform in colour, and looks like a disk, and the plot on the right has a high concentration of pink hexagons in the center, and rings of green and navy blue around the outside. The middle plot is in between the two patterns.' width=95%}\n:::\n:::\n\n\n\n::: {.content-visible when-format="html"}\n\n@fig-tour-intro-html shows 2D tours of two different 5D data sets. One has clusters (a) and the other has two outliers and a plane (b). Can you see these? One difference when viewing data with more than three dimensions in 2D projections is that the points seem to shrink towards the centre, and then expand out again. This is the effect of dimensionality, with different variance or spread in some directions.\n\n::: {#fig-tour-intro-html layout-ncol=2}\n\n![Clusters](gifs/clusters-intro.gif){#fig-tour-clusters width=250 fig-alt="Animation of sequences of 2D projections shown as scatterplots. You can see points moving in three different movement patterns, and sometimes they separate into clusters."}\n\n![Outliers](gifs/outlier-intro.gif){#fig-tour-outliers width=250 fig-alt="Animation of sequences of 2D projections shown as scatterplots. 
You can see most points lie in a flat planar shape, and two points can be seen to move differently from the others and separate from the rest of the points in some projections."}\n\nTwo 5D datasets shown as tours of 2D projections. Can you see clusters of points in (a) and two outliers with a plane in (b)?\n:::\n:::\n\n::: {.content-visible when-format="pdf"}\n@fig-tour-intro-pdf shows 2D tours of two different 5D data sets. One has clusters (a) and the other has two outliers and a plane (b). Can you see these? One difference when viewing data with more than three dimensions in 2D projections is that the points seem to shrink towards the centre, and then expand out again. This is the effect of dimensionality, with different variance or spread in some directions.\n\n::: {#fig-tour-intro-pdf layout-ncol=2}\n![Clusters](images/clusters-intro.png){#fig-tour-clusters width=200}\n\n![Outliers](images/outlier-intro.png){#fig-tour-outliers width=200}\n\nFrames from 2D tours on two 5D datasets, with clusters of points in (a) and two outliers with a plane in (b). This figure is best viewed in the HTML version of the book. {{< fa play-circle >}}\n:::\n\n:::\n\n## What can you learn?\n\nThere are two ways of detecting structure in tours:\n\n- patterns in a single low-dimensional projection\n- movement patterns\n\nwith the latter being especially useful when displaying the projected data as a scatterplot. @fig-example-structure shows examples of patterns we typically look for when making a scatterplot of data. These include clustering, linear and non-linear association, outliers, and barriers, where there is a sharp edge beyond which no observations are seen. Not shown, but also potentially visible, are multiple modes, differences in density, L-shapes, discreteness, or uneven spread of points. The tour is especially useful if these patterns are only visible in combinations of variables. 
\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example structures that might be visible in a 2D projection that imply presence of structure in high dimensions. These include clusters, linear and non-linear association, outliers and barriers.](1-intro_files/figure-pdf/fig-example-structure-1.pdf){#fig-example-structure fig-alt='Four scatterplots showing different types of patterns you might expect to see. Plot (a) has three elliptical clusters of points, roughly lying horizontal, making a geese flying pattern. Plot (b) has a nonlinear pattern looking like a horseshoe. Plot (c) has a strong negative linear association and a single outlier in the top right. Plot (d) has points lying only in the bottom triangle.' width=100%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\n@fig-trails illustrates how the movement patterns of points, when scatterplots are used to display 2D projections, indicate clustering (a) and outliers (b). \n\n::: {#fig-trails layout-ncol=2 fig-align="center"}\n\n![Clustering](images/trails-clusters.png){#fig-clusters-trails-static fig-alt="Frame from the animations shown earlier annotated to mark clustering movement. Movement pattern is indicated by a point and a line."}\n\n![Outliers](images/trails-outlier.png){#fig-outlier-trails-static fig-alt="Frame from the animations shown earlier annotated to mark outlier movement. Movement pattern is indicated by a point and a line."}\n\nThe movement of points gives further clues about the structure of the data in high dimensions. In the data with clustering, often we can see a group of points moving differently from the others. Because there are three clusters, you should see three distinct movement patterns. It is similar with outliers, except these may be individual points moving alone, and differently from all others. This can be seen in the static plot: one point (top left) is moving upwards, whereas most of the other observations near it are moving down towards the right. 
\n:::\n\n\nThis type of visualisation is useful for many activities in dealing with high-dimensional data, including: \n\n- exploring high-dimensional data.\n- detecting if the data lives in a lower dimensional space than the number of variables.\n- checking assumptions required for multivariate models to be applicable.\n- checking for potential problems in modeling, such as multicollinearity among predictors.\n- checking assumptions required for probabilities calculated for statistical hypothesis testing to be valid.\n- diagnosing the fit of multivariate models.\n\n::: {.content-visible when-format="html"}\n::: info\nWith a tour we slowly rotate the viewing direction, which allows us to see many individual projections and to track movement patterns. Look for interesting structures such as clusters or outlying points.\n:::\n:::\n\n::: {.content-visible when-format="pdf"}\n\n\infobox{With a tour we slowly rotate the viewing direction, which allows us to see many individual projections and to track movement patterns. Look for interesting structures such as clusters or outlying points.}\n\n:::\n\n## A little history\n\nViewing high-dimensional data based on low-dimensional projections can probably be traced back to the early work on principal component analysis by @pearson-pca and @hotelling-pca, which was extended to known classes as part of discriminant analysis by @fisher1936. \n\nWith computer graphics, the capability of animating plots to show more than a single best projection became possible. The video library [@ASA23] is the best place to experience the earliest work. Kruskal's 1962 animation of multidimensional scaling showed the process of finding a good 2D representation of high dimensional data, although the views are not projections. Chang's 1970 video shows her rotating a high dimensional point cloud along coordinate axes to find a special projection where all the numbers align. 
The classic video that must be watched is PRIM9 [@PRIM9-video], where a variety of interactive and dynamic tools are used together to explore high dimensional physics data, documented in @tukey. \n\nThe methods in this book primarily emerge from @As85's grand tour method. The algorithm provided the first smooth and continuous sequence of low dimensional projections, and was constructed so that all possible low dimensional projections were likely to be shown. The algorithm was refined in @BA86b (and documented in detail in @BCAH05) to make it *efficiently* show all possible projections. Since then there have been numerous varieties of tour algorithms developed to focus on specific tasks in exploring high dimensional data, and these are documented in @tours2022. \n\nThis book is an evolution from @CS07. One of the difficulties in working on interactive and dynamic graphics research has been the rapid change in technology. Programming languages have changed a little (FORTRAN to C to Java to Python) but graphics toolkits and display devices have changed a lot! The tour software used in this book evolved from XGobi, which was written in C and used the X Window System, and which was later rewritten as GGobi using GTK. The video library has engaging videos of these software systems. There have been several other short-lived implementations, including orca [@orca], written in Java, and cranvas [@cranvas], written in R with a back-end provided by wrapper functions to `qt` libraries. \n\nAlthough attempts were made with these ancestor systems to connect the data plots to a statistical analysis system, these were always limited. With the emergence of R, having graphics in the data analysis workflow has been much easier, albeit at the cost of the interactivity with graphics that matches the old systems. We are mostly using the R package `tourr` [@tourr] for examples in this book. 
It provides the machinery for running a tour, and has the flexibility that it can be ported, modified, and used as a regular element of data analysis.\n\n## Exercises {-}\n\n1. Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the `cube.solid.random()` function of the `geozoo` package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matches your expectations. \n2. Use the `geozoo` package to generate samples from different shapes and use them to get a better understanding of how shapes appear in a grand tour. You can start with exploring the conic spiral in 3D, a torus in 4D and points along the wire frame of a cube in 5D.\n3. For each of the challenge data sets, `c1`, ..., `c7` from the `mulgar` package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships). \n\n \n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-hidden when-format="pdf"}\n::: {.hidden}\nAnswer 1. Each of the projections has a boxy shape, which gets less distinct as the dimension increases. \n\nAs the dimension increases, the points tend to concentrate in the centre of the plot window, with a smattering of points at the edges. 
\n:::\n:::\n\n\n\n::: {.cell}\n\n:::\n", "supporting": [ "1-intro_files/figure-pdf" ], diff --git a/_freeze/1-intro/figure-html/fig-density-1.png b/_freeze/1-intro/figure-html/fig-density-1.png new file mode 100644 index 0000000..cdd83e2 Binary files /dev/null and b/_freeze/1-intro/figure-html/fig-density-1.png differ diff --git a/_freeze/1-intro/figure-html/fig-dimension-cubes-1.png b/_freeze/1-intro/figure-html/fig-dimension-cubes-1.png new file mode 100644 index 0000000..beed249 Binary files /dev/null and b/_freeze/1-intro/figure-html/fig-dimension-cubes-1.png differ diff --git a/_freeze/1-intro/figure-html/fig-example-structure-1.png b/_freeze/1-intro/figure-html/fig-example-structure-1.png new file mode 100644 index 0000000..13c08b1 Binary files /dev/null and b/_freeze/1-intro/figure-html/fig-example-structure-1.png differ diff --git a/_freeze/1-intro/figure-html/fig-explain-1D-data-1.png b/_freeze/1-intro/figure-html/fig-explain-1D-data-1.png new file mode 100644 index 0000000..3cac76b Binary files /dev/null and b/_freeze/1-intro/figure-html/fig-explain-1D-data-1.png differ diff --git a/_freeze/1-intro/figure-html/fig-explain-2D-data-1.png b/_freeze/1-intro/figure-html/fig-explain-2D-data-1.png new file mode 100644 index 0000000..da03b83 Binary files /dev/null and b/_freeze/1-intro/figure-html/fig-explain-2D-data-1.png differ diff --git a/_freeze/1-intro/figure-pdf/fig-density-1.pdf b/_freeze/1-intro/figure-pdf/fig-density-1.pdf index d53b5b3..137452b 100644 Binary files a/_freeze/1-intro/figure-pdf/fig-density-1.pdf and b/_freeze/1-intro/figure-pdf/fig-density-1.pdf differ diff --git a/_freeze/1-intro/figure-pdf/fig-dimension-cubes-1.pdf b/_freeze/1-intro/figure-pdf/fig-dimension-cubes-1.pdf index 6403640..94663d0 100644 Binary files a/_freeze/1-intro/figure-pdf/fig-dimension-cubes-1.pdf and b/_freeze/1-intro/figure-pdf/fig-dimension-cubes-1.pdf differ diff --git a/_freeze/1-intro/figure-pdf/fig-example-structure-1.pdf 
b/_freeze/1-intro/figure-pdf/fig-example-structure-1.pdf index 2a057c2..2b5cad3 100644 Binary files a/_freeze/1-intro/figure-pdf/fig-example-structure-1.pdf and b/_freeze/1-intro/figure-pdf/fig-example-structure-1.pdf differ diff --git a/_freeze/1-intro/figure-pdf/fig-explain-1D-data-1.pdf b/_freeze/1-intro/figure-pdf/fig-explain-1D-data-1.pdf index b5c3d9b..0872cf8 100644 Binary files a/_freeze/1-intro/figure-pdf/fig-explain-1D-data-1.pdf and b/_freeze/1-intro/figure-pdf/fig-explain-1D-data-1.pdf differ diff --git a/_freeze/1-intro/figure-pdf/fig-explain-1D-pdf-1.pdf b/_freeze/1-intro/figure-pdf/fig-explain-1D-pdf-1.pdf index 9ab4937..84b0fbf 100644 Binary files a/_freeze/1-intro/figure-pdf/fig-explain-1D-pdf-1.pdf and b/_freeze/1-intro/figure-pdf/fig-explain-1D-pdf-1.pdf differ diff --git a/_freeze/1-intro/figure-pdf/fig-explain-2D-data-1.pdf b/_freeze/1-intro/figure-pdf/fig-explain-2D-data-1.pdf index 9c76ccf..c683050 100644 Binary files a/_freeze/1-intro/figure-pdf/fig-explain-2D-data-1.pdf and b/_freeze/1-intro/figure-pdf/fig-explain-2D-data-1.pdf differ diff --git a/_freeze/1-intro/figure-pdf/fig-explain-2D-pdf-1.pdf b/_freeze/1-intro/figure-pdf/fig-explain-2D-pdf-1.pdf index effd247..de2cd27 100644 Binary files a/_freeze/1-intro/figure-pdf/fig-explain-2D-pdf-1.pdf and b/_freeze/1-intro/figure-pdf/fig-explain-2D-pdf-1.pdf differ diff --git a/_freeze/10-model-based/execute-results/html.json b/_freeze/10-model-based/execute-results/html.json new file mode 100644 index 0000000..4c5d510 --- /dev/null +++ b/_freeze/10-model-based/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "328b115f3710e3985793a3626229439c", + "result": { + "engine": "knitr", + "markdown": "# Model-based clustering {#sec-mclust}\n\n\\index{cluster analysis!model-based} \n\nModel-based clustering @FR02 fits a multivariate normal mixture model to the data. It uses the EM algorithm to fit the parameters for the mean, variance--covariance of each population, and the mixing proportion. 
The variance-covariance matrix is re-parametrized using an eigen-decomposition\n\n$$\n\\Sigma_k = \\lambda_kD_kA_kD_k^\\top, ~~~k=1, \\dots, g ~~\\mbox{(number of clusters)}\n$$\n\n\\noindent resulting in several model choices, ranging from simple to complex, as shown in @tbl-covariances.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Load libraries\"}\nlibrary(dplyr)\nlibrary(kableExtra)\nlibrary(ggplot2)\nlibrary(mclust)\nlibrary(mulgar)\nlibrary(patchwork)\nlibrary(colorspace)\nlibrary(tourr)\n```\n:::\n\n::: {#tbl-covariances .cell tbl-cap='Parameterizations of the covariance matrix.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Model Sigma Family Volume Shape Orientation
EII $$\\lambda I$$ Spherical Equal Equal NA
VII $$\\lambda_k I$$ Spherical Variable Equal NA
EEI $$\\lambda A$$ Diagonal Equal Equal Coordinate axes
VEI $$\\lambda_kA$$ Diagonal Variable Equal Coordinate axes
EVI $$\\lambda A_k$$ Diagonal Equal Variable Coordinate axes
VVI $$\\lambda_k A_k$$ Diagonal Variable Variable Coordinate axes
EEE $$\\lambda DAD^\\top$$ Ellipsoidal Equal Equal Equal
EVE $$\\lambda DA_kD^\\top$$ Ellipsoidal Equal Variable Equal
VEE $$\\lambda_k DAD^\\top$$ Ellipsoidal Variable Equal Equal
VVE $$\\lambda_k DA_kD^\\top$$ Ellipsoidal Variable Variable Equal
EEV $$\\lambda D_kAD_k^\\top$$ Ellipsoidal Equal Equal Variable
VEV $$\\lambda_k D_kAD_k^\\top$$ Ellipsoidal Variable Equal Variable
EVV $$\\lambda D_kA_kD_k^\\top$$ Ellipsoidal Equal Variable Variable
VVV $$\\lambda_k D_kA_kD_k^\\top$$ Ellipsoidal Variable Variable Variable
\n\n`````\n:::\n:::\n\n\n\noindent Note the distribution descriptions \"spherical\" and \"ellipsoidal\". These are descriptions of the shape of the variance-covariance for a multivariate normal distribution. A standard multivariate normal distribution has a variance-covariance matrix with zeros in the off-diagonal elements and equal values in the diagonal elements, which corresponds to spherically shaped data. When the variances (diagonals) are different or the variables are correlated, then the shape of data from a multivariate normal is ellipsoidal.\n\n\index{Bayes Information Criterion (BIC)}\n\nThe models are typically scored using the Bayes Information Criterion (BIC), which is based on the log likelihood, number of variables, and number of mixture components. They should also be assessed using graphical methods, as we demonstrate using the penguins data. \n\n## Examining the model in 2D\n\nWe start with two of the four real-valued variables (`bl`, `fl`) and the three `species`. The goal is to determine whether model-based methods can discover clusters that closely correspond to the three species. 
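The decomposition in @tbl-covariances can be checked numerically for any positive-definite variance-covariance matrix: the volume is $\lambda = |\Sigma|^{1/p}$, the shape matrix $A$ holds the eigenvalues scaled so that $\det(A)=1$, and $D$ holds the eigenvectors. A small sketch, using a made-up $2\times 2$ $\Sigma$ (not fitted to any data):

```r
# Made-up 2x2 positive-definite variance-covariance matrix
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 0.5), nrow = 2)
e <- eigen(Sigma)
D      <- e$vectors                 # orientation
lambda <- prod(e$values)^(1 / 2)    # volume: det(Sigma)^(1/p), p = 2
A      <- diag(e$values / lambda)   # shape, with det(A) = 1

# Reconstruct Sigma = lambda * D A D^T
max(abs(lambda * D %*% A %*% t(D) - Sigma))  # effectively zero
```

Constraining which of $\lambda$, $A$ and $D$ are shared across the clusters generates the family of models in the table.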
Based on the scatterplot in @fig-penguins-bl-fl we would expect it to do well, and the plot suggests a model with elliptical variance-covariances of roughly equal size.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plot\"}\nload(\"data/penguins_sub.rda\")\nggplot(penguins_sub, aes(x=bl, \n y=fl)) + #, \n #colour=species)) +\n geom_point() +\n geom_density2d(colour=\"#3B99B1\") +\n theme_minimal() +\n theme(aspect.ratio = 1)\n```\n\n::: {.cell-output-display}\n![Scatterplot of flipper length by bill length of the penguins data.](10-model-based_files/figure-html/fig-penguins-bl-fl-1.png){#fig-penguins-bl-fl width=480}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_BIC <- mclustBIC(penguins_sub[,c(1,3)])\nggmc <- ggmcbic(penguins_BIC, cl=2:9, top=4) + \n scale_color_discrete_divergingx(palette = \"Roma\") +\n ggtitle(\"(a)\") +\n theme_minimal() \npenguins_mc <- Mclust(penguins_sub[,c(1,3)], \n G=3, \n modelNames = \"EVE\")\npenguins_mce <- mc_ellipse(penguins_mc)\npenguins_cl <- penguins_sub[,c(1,3)]\npenguins_cl$cl <- factor(penguins_mc$classification)\nggell <- ggplot() +\n geom_point(data=penguins_cl, aes(x=bl, y=fl,\n colour=cl),\n alpha=0.3) +\n geom_point(data=penguins_mce$ell, aes(x=bl, y=fl,\n colour=cl),\n shape=16) +\n geom_point(data=penguins_mce$mn, aes(x=bl, y=fl,\n colour=cl),\n shape=3, size=2) +\n scale_color_discrete_divergingx(palette = \"Zissou 1\") +\n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\") +\n ggtitle(\"(b)\")\nggmc + ggell + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Summary plots from model-based clustering: (a) BIC values for clusters 2-9 of top four models, (b) variance-covariance ellipses and cluster means (+) corresponding to the best model. 
The best model is three-cluster EVE, which has differently shaped variance-covariances albeit with the same volume and orientation.](10-model-based_files/figure-html/fig-penguins-bl-fl-mc-1.png){#fig-penguins-bl-fl-mc width=768}\n:::\n:::\n\n\n@fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the `mclust` package output using the `ggmcbic` and `mc_ellipse` functions from the `mulgar` package.\n\n## Extending to higher dimensions\n\nNow we will examine how model-based clustering will group the penguins data using all four variables. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_BIC <- mclustBIC(penguins_sub[,1:4])\nggmc <- ggmcbic(penguins_BIC, cl=2:9, top=7) + \n scale_color_discrete_divergingx(palette = \"Roma\") +\n theme_minimal() \nggmc\n```\n\n::: {.cell-output-display}\n![BIC values for the top models for 2-9 clusters on the penguins data. 
The interpretation is mixed: if one were to choose three clusters any of the variance-covariance models would be equally as good, but the very best model is the four-cluster VEE.](10-model-based_files/figure-html/fig-penguins-bic-1.png){#fig-penguins-bic fig-align='center' width=576}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_mc <- Mclust(penguins_sub[,1:4], \n G=4, \n modelNames = \"VEE\")\npenguins_mce <- mc_ellipse(penguins_mc)\npenguins_cl <- penguins_sub\npenguins_cl$cl <- factor(penguins_mc$classification)\n\npenguins_mc_data <- penguins_cl %>%\n select(bl:bm, cl) %>%\n mutate(type = \"data\") %>%\n bind_rows(bind_cols(penguins_mce$ell,\n type=rep(\"ellipse\",\n nrow(penguins_mce$ell)))) %>%\n mutate(type = factor(type))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make animated gifs\"}\nanimate_xy(penguins_mc_data[,1:4],\n col=penguins_mc_data$cl,\n pch=c(4, 20 )[as.numeric(penguins_mc_data$type)], \n axes=\"off\")\n\nload(\"data/penguins_tour_path.rda\")\nrender_gif(penguins_mc_data[,1:4], \n planned_tour(pt1), \n display_xy(col=penguins_mc_data$cl,\n pch=c(4, 20)[\n as.numeric(penguins_mc_data$type)], \n axes=\"off\",\n half_range = 0.7),\n gif_file=\"gifs/penguins_best_mc.gif\",\n frames=500,\n loop=FALSE)\n\npenguins_mc <- Mclust(penguins_sub[,1:4], \n G=3, \n modelNames = \"EEE\")\npenguins_mce <- mc_ellipse(penguins_mc)\npenguins_cl <- penguins_sub\npenguins_cl$cl <- factor(penguins_mc$classification)\n\npenguins_mc_data <- penguins_cl %>%\n select(bl:bm, cl) %>%\n mutate(type = \"data\") %>%\n bind_rows(bind_cols(penguins_mce$ell,\n type=rep(\"ellipse\",\n nrow(penguins_mce$ell)))) %>%\n mutate(type = factor(type))\n\nanimate_xy(penguins_mc_data[,1:4],\n col=penguins_mc_data$cl,\n pch=c(4, 20)[as.numeric(penguins_mc_data$type)], \n axes=\"off\")\n\n# Save the animated gif\nload(\"data/penguins_tour_path.rda\")\nrender_gif(penguins_mc_data[,1:4], \n planned_tour(pt1), \n 
display_xy(col=penguins_mc_data$cl,\n pch=c(4, 20)[\n as.numeric(penguins_mc_data$type)], \n axes=\"off\",\n half_range = 0.7),\n gif_file=\"gifs/penguins_simpler_mc.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n::: {#fig-penguins-mc layout-ncol=2}\n::: {.content-visible when-format=\"html\"}\n![Best model: four-cluster VEE](gifs/penguins_best_mc.gif){#fig-penguins-best_mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n:::\n\n::: {.content-visible when-format=\"html\"}\n![Three-cluster EEE](gifs/penguins_simpler_mc.gif){#fig-penguins-simpler_mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n![Best model: four-cluster VEE](images/penguins_best_mc_60.png){#fig-penguins-best_mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n![Three-cluster EEE](images/penguins_simpler_mc_60.png){#fig-penguins-simpler_mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n:::\n\nExamining the model-based clustering results for the penguins data: (a) best model according to BIC value, (b) simpler three-cluster model. Dots are ellipse points, and \"x\" are data points. It is important to note that the three cluster solution fits the data better, even though it has a lower BIC. \n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nUsing the tour to visualise the final choices of models with similarly high BIC values helps to choose which best fits the data. It may not be the one with the highest BIC value. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Using the tour to visualise the final choices of models with similarly high BIC values helps to choose which best fits the data. It may not be the one with the highest BIC value.}\n:::\n\n\n::: {.cell}\n\n:::\n\n\n## Exercises {-}\n\n1. Examine the three cluster EVE, VVE and VEE models with the tour, and explain whether these are distinguishably different from the EEE three cluster model.\n2. 
Fit model-based clustering to the `clusters` data. Does it suggest the data has three clusters? Using the tour examine the best model. How well does this fit the data?\n3. Fit model-based clustering to the `multicluster` data. Does it suggest the data has six clusters? Using the tour examine the best model. How well does this fit the data?\n4. Fit model-based clustering to the `fake_trees` data. Does it suggest that the data has 10 clusters? If not, why do you think this is? Using the tour examine the best model. How well does this fit the branching structure?\n5. Try fitting model-based clustering to the `aflw` data. What is the best model? Is the solution related to offensive vs defensive vs mid-fielder skills?\n6. Use model-based clustering on the challenge data sets, `c1`-`c7` from the `mulgar` package. Explain why the best model does or does not fit the cluster structure.\n", + "supporting": [ + "10-model-based_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/10-model-based/execute-results/tex.json b/_freeze/10-model-based/execute-results/tex.json index ccc689f..1b88582 100644 --- a/_freeze/10-model-based/execute-results/tex.json +++ b/_freeze/10-model-based/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "406f8f0959e454d38e6b14120b705ded", + "hash": "be4e72f557013e8af87477036281525c", "result": { "engine": "knitr", - "markdown": "# Model-based clustering {#sec-mclust}\n\n\\index{cluster analysis!model-based} \n\nModel-based clustering @FR02 fits a multivariate normal mixture model to the data. It uses the EM algorithm to fit the parameters for the mean, variance--covariance of each population, and the mixing proportion. 
The variance-covariance matrix is re-parameterised using an eigen-decomposition\n\n$$\n\\Sigma_k = \\lambda_kD_kA_kD_k^\\top, ~~~k=1, \\dots, g ~~\\mbox{(number of clusters)}\n$$\n\n\\noindent resulting in several model choices, ranging from simple to complex, as shown in @tbl-covariances.\n\n\n\n::: {.cell}\n\n:::\n\n::: {#tbl-covariances .cell tbl-cap='Parameterizations of the covariance matrix.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{c|c|c|c|c|c}\n\\hline\nModel & Sigma & Family & Volume & Shape & Orientation\\\\\n\\hline\nEII & \\$\\$\\textbackslash{}lambda I\\$\\$ & Spherical & Equal & Equal & NA\\\\\n\\hline\nVII & \\$\\$\\textbackslash{}lambda\\_k I\\$\\$ & Spherical & Variable & Equal & NA\\\\\n\\hline\nEEI & \\$\\$\\textbackslash{}lambda A\\$\\$ & Diagonal & Equal & Equal & Coordinate axes\\\\\n\\hline\nVEI & \\$\\$\\textbackslash{}lambda\\_kA\\$\\$ & Diagonal & Variable & Equal & Coordinate axes\\\\\n\\hline\nEVI & \\$\\$\\textbackslash{}lambda A\\_k\\$\\$ & Diagonal & Equal & Variable & Coordinate axes\\\\\n\\hline\nVVI & \\$\\$\\textbackslash{}lambda\\_k A\\_k\\$\\$ & Diagonal & Variable & Variable & Coordinate axes\\\\\n\\hline\nEEE & \\$\\$\\textbackslash{}lambda DAD\\textasciicircum{}\\textbackslash{}top\\$\\$ & Diagonal & Equal & Equal & Equal\\\\\n\\hline\nEVE & \\$\\$\\textbackslash{}lambda DA\\_kD\\textasciicircum{}\\textbackslash{}top\\$\\$ & Ellipsoidal & Equal & Variable & Equal\\\\\n\\hline\nVEE & \\$\\$\\textbackslash{}lambda\\_k DAD\\textasciicircum{}\\textbackslash{}top\\$\\$ & Ellipsoidal & Variable & Equal & Equal\\\\\n\\hline\nVVE & \\$\\$\\textbackslash{}lambda\\_k DA\\_kD\\textasciicircum{}\\textbackslash{}top\\$\\$ & Ellipsoidal & Variable & Equal & Equal\\\\\n\\hline\nEEV & \\$\\$\\textbackslash{}lambda D\\_kAD\\_k\\textasciicircum{}\\textbackslash{}top\\$\\$ & Ellipsoidal & Equal & Variable & Variable\\\\\n\\hline\nVEV & \\$\\$\\textbackslash{}lambda\\_k 
D\\_kAD\\_k\\textasciicircum{}\\textbackslash{}top\\$\\$ & Ellipsoidal & Variable & Variable & Variable\\\\\n\\hline\nEVV & \\$\\$\\textbackslash{}lambda D\\_kA\\_kD\\_k\\textasciicircum{}\\textbackslash{}top\\$\\$ & Ellipsoidal & Equal & Variable & Variable\\\\\n\\hline\nVVV & \\$\\$\\textbackslash{}lambda\\_k D\\_kA\\_kD\\_k\\textasciicircum{}\\textbackslash{}top\\$\\$ & Ellipsoidal & Variable & Variable & Variable\\\\\n\\hline\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n\n\n\\noindent Note the distribution descriptions \"spherical\" and \"ellipsoidal\". These are descriptions of the shape of the variance-covariance for a multivariate normal distribution. A standard multivariate normal distribution has a variance-covariance matrix with zeros in the off-diagonal elements, which corresponds to spherically shaped data. When the variances (diagonals) are different or the variables are correlated, then the shape of data from a multivariate normal is ellipsoidal.\n\n\\index{Bayes Information Criterion (BIC)}\n\nThe models are typically scored using the Bayes Information Criterion (BIC), which is based on the log likelihood, number of variables, and number of mixture components. They should also be assessed using graphical methods, as we demonstrate using the penguins data. \n\n## Examining the model in 2D\n\nWe start with two of the four real-valued variables (`bl`, `fl`) and the three `species`. The goal is to determine whether model-based methods can discover clusters that closely correspond to the three species. 
Based on the scatterplot in @fig-penguins-bl-fl we would expect it to do well, and suggest an elliptical variance-covariance of roughly equal sizes as the model.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Scatterplot of flipper length by bill length of the penguins data.](10-model-based_files/figure-pdf/fig-penguins-bl-fl-1.pdf){#fig-penguins-bl-fl width=80%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_BIC <- mclustBIC(penguins_sub[,c(1,3)])\nggmc <- ggmcbic(penguins_BIC, cl=2:9, top=4) + \n scale_color_discrete_divergingx(palette = \"Roma\") +\n ggtitle(\"(a)\") +\n theme_minimal() \npenguins_mc <- Mclust(penguins_sub[,c(1,3)], \n G=3, \n modelNames = \"EVE\")\npenguins_mce <- mc_ellipse(penguins_mc)\npenguins_cl <- penguins_sub[,c(1,3)]\npenguins_cl$cl <- factor(penguins_mc$classification)\nggell <- ggplot() +\n geom_point(data=penguins_cl, aes(x=bl, y=fl,\n colour=cl),\n alpha=0.3) +\n geom_point(data=penguins_mce$ell, aes(x=bl, y=fl,\n colour=cl),\n shape=16) +\n geom_point(data=penguins_mce$mn, aes(x=bl, y=fl,\n colour=cl),\n shape=3, size=2) +\n scale_color_discrete_divergingx(palette = \"Zissou 1\") +\n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\") +\n ggtitle(\"(b)\")\nggmc + ggell + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Summary plots from model-based clustering: (a) BIC values for clusters 2-9 of top four models, (b) variance-covariance ellipses and cluster means (+) corresponding to the best model. The best model is three-cluster EVE, which has differently shaped variance-covariances albeit the same volume and orientation.](10-model-based_files/figure-pdf/fig-penguins-bl-fl-mc-1.pdf){#fig-penguins-bl-fl-mc fig-pos='H' width=100%}\n:::\n:::\n\n\n\n@fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. 
The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the `mclust` package output using the `ggmcbic` and `mc_ellipse` functions fro the `mulgar` package.\n\n## Examining the model in high dimensions\n\nNow we will examine how model-based clustering will group the penguins data using all four variables. @fig-penguins-bic shows the summary of different models, of which we would choose the four-cluster VEE, if we strictly followed the BIC choice. @fig-penguins-mc-pdf shows this model in a tour, and the best three-cluster model. You can see that the four-cluster result is inadequate, in various ways. One of the species (Chinstrap) does have a bimodal density, due to the two species, and we would expect that a four cluster solution might detect this. The tour shows that the three-cluster solution is the best match to the data shape.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_BIC <- mclustBIC(penguins_sub[,1:4])\nggmc <- ggmcbic(penguins_BIC, cl=2:9, top=7) + \n scale_color_discrete_divergingx(palette = \"Roma\") +\n theme_minimal() \nggmc\n```\n\n::: {.cell-output-display}\n![BIC values for the top models for 2-9 clusters on the penguins data. 
The interpretation is mixed: if one were to choose three clusters any of the variance-covariance models would be equally as good, but the very best model is the four-cluster VEE.](10-model-based_files/figure-pdf/fig-penguins-bic-1.pdf){#fig-penguins-bic fig-align='center' fig-pos='H' width=80%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_mc <- Mclust(penguins_sub[,1:4], \n G=4, \n modelNames = \"VEE\")\npenguins_mce <- mc_ellipse(penguins_mc)\npenguins_cl <- penguins_sub\npenguins_cl$cl <- factor(penguins_mc$classification)\n\npenguins_mc_data <- penguins_cl %>%\n select(bl:bm, cl) %>%\n mutate(type = \"data\") %>%\n bind_rows(bind_cols(penguins_mce$ell,\n type=rep(\"ellipse\",\n nrow(penguins_mce$ell)))) %>%\n mutate(type = factor(type))\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-mc-html layout-ncol=2}\n\n![Four-cluster VEE](gifs/penguins_best_mc.gif){#fig-penguins-best_mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n\n![Three-cluster EEE](gifs/penguins_simpler_mc.gif){#fig-penguins-simpler_mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n\nExamining the model-based clustering results for the penguins data: (a) best model according to BIC value, (b) simpler three-cluster model. Dots are ellipse points, and \"x\" are data points. It is important to note that the three cluster solution fits the data better, even though it has a lower BIC. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-penguins-mc-pdf layout-ncol=2}\n\n![Four-cluster VEE](images/penguins_best_mc_60.png){#fig-penguins-best-mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n\n![Three-cluster EEE](images/penguins_simpler_mc_60.png){#fig-penguins-simpler-mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n\nExamining the model-based clustering results for the penguins data: (a) best model according to BIC value, (b) simpler three-cluster model. Dots are ellipse points, and \"x\" are data points. 
It is important to note that the three cluster solution fits the data better, even though it has a lower BIC. \n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nUsing the tour to visualise the final choices of models with similarly high BIC values helps to choose which best fits the data. It may not be the one with the highest BIC value. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Using the tour to visualise the final choices of models with similarly high BIC values helps to choose which best fits the data. It may not be the one with the highest BIC value.}\n:::\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Exercises {-}\n\n1. Examine the three cluster EVE, VVE and VEE models with the tour, and explain whether these are distinguishably different from the EEE three cluster model.\n2. Fit model-based clustering to the the `clusters`. Does it suggest the data has three clusters? Using the tour examine the best model model. How well does this fit the data?\n3. Fit model-based clustering to the the `multicluster`. Does it suggest the data has six clusters? Using the tour examine the best model model. How well does this fit the data?\n4. Fit model-based clustering to the `fake_trees` data. Does it suggest that the data has 10 clusters? If not, why do you think this is? Using the tour examine the best model model. How well does this fit the branching structure?\n5. Try fitting model-based clustering to the `aflw` data? What is the best model? Is the solution related to offensive vs defensive vs mid-fielder skills?\n6. Use model-based clustering on the challenge data sets, `c1`-`c7` from the `mulgar` package. Explain why or why not the best model fits the cluster structure or not.\n", + "markdown": "# Model-based clustering {#sec-mclust}\n\n\\index{cluster analysis!model-based} \n\nModel-based clustering @FR02 fits a multivariate normal mixture model to the data. 
It uses the EM algorithm to fit the parameters for the mean, variance-covariance of each population, and the mixing proportion. The variance-covariance matrix is re-parameterised using an eigen-decomposition\n\n$$\n\\Sigma_k = \\lambda_kD_kA_kD_k^\\top, ~~~k=1, \\dots, g ~~\\mbox{(number of clusters)}\n$$\n\n\\noindent resulting in several model choices, ranging from simple to complex, as shown in @tbl-covariances-pdf.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n::: {#tbl-covariances-html .cell tbl-cap='Parameterizations of the covariance matrix.'}\n\n:::\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n::: {#tbl-covariances-pdf .cell tbl-cap='Parameterizations of the covariance matrix.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{cccccc}\n\\toprule\nModel & Sigma & Family & Volume & Shape & Orientation\\\\\n\\midrule\nEII & $\\lambda I$ & Spherical & Equal & Equal & NA\\\\\nVII & $\\lambda_k I$ & Spherical & Variable & Equal & NA\\\\\nEEI & $\\lambda A$ & Diagonal & Equal & Equal & Coordinate axes\\\\\nVEI & $\\lambda_kA$ & Diagonal & Variable & Equal & Coordinate axes\\\\\nEVI & $\\lambda A_k$ & Diagonal & Equal & Variable & Coordinate axes\\\\\n\\addlinespace\nVVI & $\\lambda_k A_k$ & Diagonal & Variable & Variable & Coordinate axes\\\\\nEEE & $\\lambda DAD^\\top$ & Ellipsoidal & Equal & Equal & Equal\\\\\nEVE & $\\lambda DA_kD^\\top$ & Ellipsoidal & Equal & Variable & Equal\\\\\nVEE & $\\lambda_k DAD^\\top$ & Ellipsoidal & Variable & Equal & Equal\\\\\nVVE & $\\lambda_k DA_kD^\\top$ & Ellipsoidal & Variable & Variable & Equal\\\\\n\\addlinespace\nEEV & $\\lambda D_kAD_k^\\top$ & Ellipsoidal & Equal & Equal & Variable\\\\\nVEV & $\\lambda_k D_kAD_k^\\top$ & Ellipsoidal & Variable & Equal & Variable\\\\\nEVV & $\\lambda D_kA_kD_k^\\top$ & Ellipsoidal & Equal & Variable & Variable\\\\\nVVV & $\\lambda_k D_kA_kD_k^\\top$ & Ellipsoidal & Variable & Variable & 
Variable\\\n\bottomrule\n\end{tabular}\n\end{table}\n\n\n:::\n:::\n\n\n:::\n\n\\noindent Note the distribution descriptions \"spherical\" and \"ellipsoidal\". These are descriptions of the shape of the variance-covariance for a multivariate normal distribution. A standard multivariate normal distribution has a variance-covariance matrix with zeros in the off-diagonal elements, which corresponds to spherically shaped data. When the variances (diagonals) are different or the variables are correlated, then the shape of data from a multivariate normal is ellipsoidal.\n\n\\index{Bayes Information Criterion (BIC)}\n\nThe models are typically scored using the Bayes Information Criterion (BIC), which is based on the log likelihood, number of variables, and number of mixture components. They should also be assessed using graphical methods, as we demonstrate using the penguins data. \n\n## Examining the model in 2D\n\nWe start with two of the four real-valued variables (`bl`, `fl`) and the three `species`. The goal is to determine whether model-based methods can discover clusters that closely correspond to the three species. Based on the scatterplot in @fig-penguins-bl-fl we would expect it to do well, and suggest an elliptical variance-covariance of roughly equal sizes as the model.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Scatterplot of flipper length by bill length of the penguins data.](10-model-based_files/figure-pdf/fig-penguins-bl-fl-1.pdf){#fig-penguins-bl-fl width=80%}\n:::\n:::\n\n\n\nTo draw ellipses in any dimension, a reasonable procedure is to sample points uniformly on a sphere, and then transform these into an ellipse using the variance-covariance matrix. 
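\n\nFor example, the idea can be sketched in 2D (this is only an illustration, not the actual implementation, and the variance-covariance matrix here is made up):\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Sample points and project them onto the unit circle\nset.seed(1110)\nsph <- matrix(rnorm(500*2), ncol=2)\nsph <- sph/sqrt(rowSums(sph^2))\n# Map the circle onto the ellipse defined by an example\n# variance-covariance matrix, using its Cholesky factor\nvc <- matrix(c(1, 0.8, 0.8, 1), ncol=2)\nell <- sph %*% chol(vc)\n```\n:::\n\n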
The `mulgar` function `mc_ellipse()` does this for each cluster in the fitted model.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Fit the model, plot BIC, construct and plot ellipses\npenguins_BIC <- mclustBIC(penguins_sub[,c(1,3)])\nggmc <- ggmcbic(penguins_BIC, cl=2:9, top=4) + \n scale_color_discrete_divergingx(palette = \"Roma\") +\n ggtitle(\"(a)\") +\n theme_minimal() \npenguins_mc <- Mclust(penguins_sub[,c(1,3)], \n G=3, \n modelNames = \"EVE\")\npenguins_mce <- mc_ellipse(penguins_mc)\npenguins_cl <- penguins_sub[,c(1,3)]\npenguins_cl$cl <- factor(penguins_mc$classification)\nggell <- ggplot() +\n geom_point(data=penguins_cl, aes(x=bl, y=fl,\n colour=cl),\n alpha=0.3) +\n geom_point(data=penguins_mce$ell, aes(x=bl, y=fl,\n colour=cl),\n shape=16) +\n geom_point(data=penguins_mce$mn, aes(x=bl, y=fl,\n colour=cl),\n shape=3, size=2) +\n scale_color_discrete_divergingx(palette = \"Zissou 1\") +\n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\") +\n ggtitle(\"(b)\")\nggmc + ggell + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Summary plots from model-based clustering: (a) BIC values for clusters 2-9 of top four models, (b) variance-covariance ellipses and cluster means (+) corresponding to the best model. The best model is three-cluster EVE, which has differently shaped variance-covariances albeit the same volume and orientation.](10-model-based_files/figure-pdf/fig-penguins-bl-fl-mc-1.pdf){#fig-penguins-bl-fl-mc fig-pos='H' width=100%}\n:::\n:::\n\n\n\n@fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. 
These plots are made from the `mclust` package output using the `ggmcbic()` and `mc_ellipse()` functions from the `mulgar` package.\n\n## Examining the model in high dimensions\n\nNow we will examine how model-based clustering will group the penguins data using all four variables. @fig-penguins-bic shows the summary of different models, of which we would choose the four-cluster VEE, if we strictly followed the BIC choice. @fig-penguins-mc-pdf shows this model in a tour, and the best three-cluster model. You can see that the four-cluster result is inadequate, in various ways. One of the species (Chinstrap) does have a bimodal density, due to the two sexes, and we would expect that a four cluster solution might detect this. The tour shows that the three-cluster solution is the best match to the data shape.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_BIC <- mclustBIC(penguins_sub[,1:4])\nggmc <- ggmcbic(penguins_BIC, cl=2:9, top=7) + \n scale_color_discrete_divergingx(palette = \"Roma\") +\n theme_minimal() \nggmc\n```\n\n::: {.cell-output-display}\n![BIC values for the top models for 2-9 clusters on the penguins data. 
The interpretation is mixed: if one were to choose three clusters any of the variance-covariance models would be equally as good, but the very best model is the four-cluster VEE.](10-model-based_files/figure-pdf/fig-penguins-bic-1.pdf){#fig-penguins-bic fig-align='center' fig-pos='H' width=80%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_mc <- Mclust(penguins_sub[,1:4], \n G=4, \n modelNames = \"VEE\")\npenguins_mce <- mc_ellipse(penguins_mc)\npenguins_cl <- penguins_sub\npenguins_cl$cl <- factor(penguins_mc$classification)\n\npenguins_mc_data <- penguins_cl %>%\n select(bl:bm, cl) %>%\n mutate(type = \"data\") %>%\n bind_rows(bind_cols(penguins_mce$ell,\n type=rep(\"ellipse\",\n nrow(penguins_mce$ell)))) %>%\n mutate(type = factor(type))\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-mc-html layout-ncol=2}\n\n![Four-cluster VEE](gifs/penguins_best_mc.gif){#fig-penguins-best_mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n\n![Three-cluster EEE](gifs/penguins_simpler_mc.gif){#fig-penguins-simpler_mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n\nExamining the model-based clustering results for the penguins data: (a) best model according to BIC value, (b) simpler three-cluster model. Dots are ellipse points, and \"x\" are data points. It is important to note that the three cluster solution fits the data better, even though it has a lower BIC. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-penguins-mc-pdf layout-ncol=2}\n\n![Four-cluster VEE](images/penguins_best_mc_60.png){#fig-penguins-best-mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n\n![Three-cluster EEE](images/penguins_simpler_mc_60.png){#fig-penguins-simpler-mc fig-alt=\"FIX ME\" fig.align=\"center\"}\n\nExamining the model-based clustering results for the penguins data: (a) best model according to BIC value, (b) simpler three-cluster model. Dots are ellipse points, and \"x\" are data points. 
It is important to note that the three cluster solution fits the data better, even though it has a lower BIC. \n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nUsing the tour to visualise the final choices of models with similarly high BIC values helps to choose which best fits the data. It may not be the one with the highest BIC value. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Using the tour to visualise the final choices of models with similarly high BIC values helps to choose which best fits the data. It may not be the one with the highest BIC value.}\n:::\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Exercises {-}\n\n1. Examine the three cluster EVE, VVE and VEE models with the tour, and explain whether these are distinguishably different from the EEE three cluster model.\n2. Fit model-based clustering to the `clusters` data. Does it suggest the data has three clusters? Using the tour examine the best model. How well does this fit the data?\n3. Fit model-based clustering to the `multicluster` data. Does it suggest the data has six clusters? Using the tour examine the best model. How well does this fit the data?\n4. Fit model-based clustering to the `fake_trees` data. Does it suggest that the data has 10 clusters? If not, why do you think this is? Using the tour examine the best model. How well does this fit the branching structure?\n5. Try fitting model-based clustering to the `aflw` data. What is the best model? Is the solution related to offensive vs defensive vs mid-fielder skills?\n6. Use model-based clustering on the challenge data sets, `c1`-`c7` from the `mulgar` package. 
Explain why the best model does or does not fit the cluster structure.\n", "supporting": [ "10-model-based_files/figure-pdf" ], diff --git a/_freeze/10-model-based/figure-html/fig-penguins-bic-1.png b/_freeze/10-model-based/figure-html/fig-penguins-bic-1.png new file mode 100644 index 0000000..d4df9e6 Binary files /dev/null and b/_freeze/10-model-based/figure-html/fig-penguins-bic-1.png differ diff --git a/_freeze/10-model-based/figure-html/fig-penguins-bl-fl-1.png b/_freeze/10-model-based/figure-html/fig-penguins-bl-fl-1.png new file mode 100644 index 0000000..3ec4ebd Binary files /dev/null and b/_freeze/10-model-based/figure-html/fig-penguins-bl-fl-1.png differ diff --git a/_freeze/10-model-based/figure-html/fig-penguins-bl-fl-mc-1.png b/_freeze/10-model-based/figure-html/fig-penguins-bl-fl-mc-1.png new file mode 100644 index 0000000..026f1a8 Binary files /dev/null and b/_freeze/10-model-based/figure-html/fig-penguins-bl-fl-mc-1.png differ diff --git a/_freeze/10-model-based/figure-pdf/fig-penguins-bic-1.pdf b/_freeze/10-model-based/figure-pdf/fig-penguins-bic-1.pdf index 4d6995c..334c531 100644 Binary files a/_freeze/10-model-based/figure-pdf/fig-penguins-bic-1.pdf and b/_freeze/10-model-based/figure-pdf/fig-penguins-bic-1.pdf differ diff --git a/_freeze/10-model-based/figure-pdf/fig-penguins-bl-fl-1.pdf b/_freeze/10-model-based/figure-pdf/fig-penguins-bl-fl-1.pdf index 1298947..f00155e 100644 Binary files a/_freeze/10-model-based/figure-pdf/fig-penguins-bl-fl-1.pdf and b/_freeze/10-model-based/figure-pdf/fig-penguins-bl-fl-1.pdf differ diff --git a/_freeze/10-model-based/figure-pdf/fig-penguins-bl-fl-mc-1.pdf b/_freeze/10-model-based/figure-pdf/fig-penguins-bl-fl-mc-1.pdf index 306ac93..6a96666 100644 Binary files a/_freeze/10-model-based/figure-pdf/fig-penguins-bl-fl-mc-1.pdf and b/_freeze/10-model-based/figure-pdf/fig-penguins-bl-fl-mc-1.pdf differ diff --git a/_freeze/11-som/execute-results/html.json b/_freeze/11-som/execute-results/html.json new file mode 
100644 index 0000000..1f5089e --- /dev/null +++ b/_freeze/11-som/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "fa129833eb67d371c86b67643558bed8", + "result": { + "engine": "knitr", + "markdown": "## Self-organizing maps {#sec-som}\n\\index{cluster analysis!self-organizing maps (SOM)}\n\nA self-organizing map (SOM) @Ko01 is constructed using a constrained $k$-means algorithm. A 1D or 2D net is stretched through the data. The knots in the net form the cluster means, and the points closest to a knot are considered to belong to that cluster. The similarity of nodes (and their corresponding clusters) is defined as proportional to their distance from one another on the net. Unlike $k$-means, one would normally choose a largish net, with more nodes than expected clusters. A well-separated cluster in the data would likely be split across multiple nodes in the net. Where nodes of the net are empty of points, we would interpret this as a gap in the original data. \n\n::: {.content-visible when-format=\"html\"}\n::: info\nA self-organising map is like a fisherwoman's net: as the net is pulled in, it catches the fish near knots in the net. We would examine the net in \n\n- 2D to extract the fish.\n- high-dimensions to see how it was woven through the space to catch fish.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{A self-organising map is like a fisherwoman's net: as the net is pulled in, it catches the fish near knots in the net. We would examine the net in \n\\begin{itemize}\n\\item 2D to extract the fish.\n\\item high-dimensions to see how it was woven through the space to catch fish.\n\\end{itemize}}\n:::\n\n\n@fig-penguins-som-html illustrates how the SOM fits the penguins data. SOM is not ideal for clustered data where there are gaps. It is better suited for data that lies on a non-linear low-dimensional manifold. To model data like the penguins, the first step is to set up a net that will cover more than the three clusters. 
Here we have chosen to use a $5\times 5$ rectangular grid. (There is an option for a hexagonal grid, which would make for a better tiled 2D map, but this is not useful for viewing the model in high dimensions.) Like $k$-means clustering, the fitted model can change substantially depending on the initialisation, so setting a seed will ensure a consistent result. We have also initialised the positions of the knots using PCA, which stretches the net in the main two directions of variance of the data, generally giving better results. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(kohonen)\nlibrary(aweSOM)\nlibrary(mulgar)\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(colorspace)\nload(\"data/penguins_sub.rda\")\n\nset.seed(947)\np_grid <- kohonen::somgrid(xdim = 5, ydim = 5,\n topo = 'rectangular')\np_init <- somInit(as.matrix(penguins_sub[,1:4]), 5, 5)\np_som <- som(as.matrix(penguins_sub[,1:4]), \n rlen=2000,\n grid = p_grid,\n init = p_init)\n```\n:::\n\n\nThe resulting model object is used to construct an object containing the original data, the 2D map, the map in $p$-D, with edges, and segments to connect points to represent the net, using the `som_model()` function from `mulgar`. The 2D map shows a configuration of the data in 2D which best displays the clusters, much like how a PCA or LDA plot would be used. 
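\n\nBefore plotting, it can be worth checking how many observations are assigned to each knot, because empty knots indicate gaps in the data. A quick check on the fit above (the `unit.classif` element of a `kohonen` fit records the winning knot for each observation):\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Tabulate observations per knot of the 5 x 5 net;\n# zero counts correspond to gaps in the data\nknot_counts <- table(factor(p_som$unit.classif,\n levels = 1:25))\nwhich(knot_counts == 0)\n```\n:::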
\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\np_som_df_net <- som_model(p_som)\np_som_data <- p_som_df_net$data %>% \n mutate(species = penguins_sub$species)\np_som_map_p <- ggplot() +\n geom_segment(data=p_som_df_net$edges_s, \n aes(x=x, xend=xend, y=y, \n yend=yend)) +\n geom_point(data=p_som_data, \n aes(x=map1, y=map2, \n colour=species), \n size=3, alpha=0.5) +\n xlab(\"map 1\") + ylab(\"map 2\") +\n scale_color_discrete_divergingx(\n palette=\"Zissou 1\") +\n theme_minimal() +\n theme(aspect.ratio = 1, \n legend.position = \"bottom\",\n legend.title = element_blank(),\n axis.text = element_blank())\n```\n:::\n\n\nThe object can also be modified into the pieces needed to show the net in $p$-D. You need the data, points marking the net, and edges indicating which points to connect to draw the net.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(tourr)\n\n# Set up data\np_som_map <- p_som_df_net$net %>%\n mutate(species = \"0\", type=\"net\")\np_som_data <- p_som_data %>% \n select(bl:bm, species) %>%\n mutate(type=\"data\", \n species = as.character(species)) \np_som_map_data <- bind_rows(p_som_map, p_som_data)\np_som_map_data$type <- factor(p_som_map_data$type,\n levels=c(\"net\", \"data\"))\np_som_map_data$species <- factor(p_som_map_data$species,\n levels=c(\"0\",\"Adelie\",\"Chinstrap\",\"Gentoo\"))\np_pch <- c(46, 16)[as.numeric(p_som_map_data$type)]\np_col <- c(\"black\", hcl.colors(3, \"Zissou 1\"))[as.numeric(p_som_map_data$species)]\nanimate_xy(p_som_map_data[,1:4],\n col=p_col, \n pch=p_pch,\n edges=as.matrix(p_som_df_net$edges), \n edges.col = \"black\",\n axes=\"bottomleft\")\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-som-html layout-ncol=2}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![2D map](11-som_files/figure-html/fig-p-som2-html-1.png){#fig-p-som2-html width=672}\n:::\n:::\n\n\n![Map in 4D](gifs/p_som.gif){#fig-p-som1 
fig-alt=\"\"}\n:::\n\nExamining the SOM map views in 2D and with the data in 4D. Points are coloured by species, which was not used for the modeling. The 2D map shows that the `map 2` direction is primarily distinguishing the Gentoo from the others, and `map 1` is imperfectly distinguishing the Chinstrap from Adelie. The map in the data space shows how it is woven into the shape of the data. \n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-penguins-som-pdf layout-ncol=2}\n\n\n\n![2D map](images/fig-p-som2-1.png){#fig-p-som2 fig-alt=\"\"}\n\n![Map in 4D](images/p_som_47.png){#fig-p-som1 fig-alt=\"\"}\n:::\n\nExamining the SOM map views in 2D and with the data in 4D. Points are coloured by species, which was not used for the modeling. The 2D map shows that the `map 2` direction is primarily distinguishing the Gentoo from the others, and `map 1` is imperfectly distinguishing the Chinstrap from Adelie. The map in the data space shows how it is woven into the shape of the data. \n:::\n\nThe SOM fit for the penguins, with a $5\\times 5$ grid, clusters the data into 25 groups. This doesn't work as a clustering technique on its own, given that the data has three clusters corresponding to the three species of penguins. Using species to colour the points helps to see what SOM has done. It has used about seven nodes to capture the separated Gentoo group. These are mostly in the `map 2` direction, which means that this direction (like a direction in PCA) is useful for distinguishing the Gentoo penguins from the others. The other two species are mixed on the map, but roughly spread out in the direction of `map 1`. \n\n## Exercises {.unnumbered}\n\n1. Fit an SOM to the first four PCs of the `aflw` data. Examine the best solution (you choose the size of the net), and describe how the map lays out the data. Does it show offensive vs defensive vs midfield players? Or does it tend to show high skills vs low skills?\n2. 
Fit an SOM to the first 10 PCs of the `fake_trees` data, using your choice of net size. How well does the map show the branching structure?\n3. Examine a range of SOM nets fitted to the first 10 PCs of the `fake_trees` data in the 10D space using a tour. Set the values of `rlen` to be 5, 50, 500. How does the net change with this parameter?\n4. Plot the distances output for the SOM fit to the penguins data. Mark the observations that have the 5 biggest distances, and show these in a tour. These are the observations where the net has fitted least well, and may be outliers.\n5. Use SOM on the challenge data sets, `c1`-`c7` from the `mulgar` package. What is the best choice of number of knots for each? Explain why the model does or does not fit the cluster structure of each.\n\n", + "supporting": [ + "11-som_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/11-som/figure-html/fig-p-som2-html-1.png b/_freeze/11-som/figure-html/fig-p-som2-html-1.png new file mode 100644 index 0000000..b647e94 Binary files /dev/null and b/_freeze/11-som/figure-html/fig-p-som2-html-1.png differ diff --git a/_freeze/11-som/figure-pdf/fig-p-som2-html-1.pdf b/_freeze/11-som/figure-pdf/fig-p-som2-html-1.pdf new file mode 100644 index 0000000..884854b Binary files /dev/null and b/_freeze/11-som/figure-pdf/fig-p-som2-html-1.pdf differ diff --git a/_freeze/11-som/figure-pdf/fig-p-som2-pdf-1.pdf b/_freeze/11-som/figure-pdf/fig-p-som2-pdf-1.pdf new file mode 100644 index 0000000..6ef0643 Binary files /dev/null and b/_freeze/11-som/figure-pdf/fig-p-som2-pdf-1.pdf differ diff --git a/_freeze/12-summary-clust/execute-results/html.json b/_freeze/12-summary-clust/execute-results/html.json new file mode 100644 index 0000000..db785a0 --- /dev/null +++ b/_freeze/12-summary-clust/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": 
"8a4c53e3b955f99a8640a0c3695d504a", + "result": { + "engine": "knitr", + "markdown": "# Summarising and comparing clustering results {#sec-clust-compare}\n\n\\index{cluster analysis!confusion table}\n\n\n\n## Summarising results\n\nThe key elements for summarising cluster results are the centres of the clusters and the within-cluster variability of the observations. Adding cluster means to any plot, including tour plots, is easy. You add the additional rows, or a new data set, and set the point shape to be distinct. \n\nSummarising the variability is difficult. For model-based clustering, the shape of the clusters is assumed to be elliptical, so $p$-dimensional ellipses can be used to show the solution, as done in @sec-mclust. Generally, it is common to plot a convex hull of the clusters, as in @fig-penguins-chull-html. This can also be done in high-dimensions, using the R package `cxhull` to compute the $p$-D convex hull.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Load libraries\"}\nlibrary(mclust) \nlibrary(tidyr)\nlibrary(dplyr)\nlibrary(gt)\nlibrary(cxhull)\nlibrary(ggplot2)\nlibrary(colorspace)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to do clustering\"}\nlibrary(mclust) \nlibrary(tidyr)\nlibrary(dplyr)\nlibrary(gt)\nlibrary(cxhull)\nlibrary(ggplot2)\nlibrary(colorspace)\nload(\"data/penguins_sub.rda\")\np_dist <- dist(penguins_sub[,1:4])\np_hcw <- hclust(p_dist, method=\"ward.D2\")\n\np_cl <- data.frame(cl_w = cutree(p_hcw, 3))\n\npenguins_mc <- Mclust(penguins_sub[,1:4], \n G=3, \n modelNames = \"EEE\")\np_cl <- p_cl %>% \n mutate(cl_mc = penguins_mc$classification)\n\np_cl <- p_cl %>% \n mutate(cl_w_j = jitter(cl_w),\n cl_mc_j = jitter(cl_mc))\n\n# Arranging by cluster id is important to define edges \npenguins_cl <- penguins_sub %>%\n mutate(cl_w = p_cl$cl_w,\n cl_mc = p_cl$cl_mc) %>%\n arrange(cl_w)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for convex hulls in 2D\"}\n# Penguins in 2D\n# Duplicate 
observations need to be removed for convex hull calculation\npsub <- penguins_cl %>%\n select(bl, bd) \ndup <- duplicated(psub)\npsub <- penguins_cl %>%\n select(bl, bd, cl_w) %>%\n filter(!dup) %>%\n arrange(cl_w)\n\nncl <- psub %>%\n count(cl_w) %>%\n arrange(cl_w) %>%\n mutate(cumn = cumsum(n))\nphull <- NULL\nfor (i in unique(psub$cl_w)) {\n x <- psub %>%\n dplyr::filter(cl_w == i) %>%\n select(bl, bd) \n ph <- cxhull(as.matrix(x))$edges\n if (i > 1) {\n ph <- ph + ncl$cumn[i-1]\n }\n ph <- cbind(ph, rep(i, nrow(ph)))\n phull <- rbind(phull, ph)\n}\nphull <- as.data.frame(phull)\ncolnames(phull) <- c(\"from\", \"to\", \"cl_w\") \nphull_segs <- data.frame(x = psub$bl[phull$from],\n y = psub$bd[phull$from],\n xend = psub$bl[phull$to],\n yend = psub$bd[phull$to],\n cl_w = phull$cl_w)\nphull_segs$cl_w <- factor(phull$cl_w) \npsub$cl_w <- factor(psub$cl_w)\np_chull2D <- ggplot() +\n geom_point(data=psub, aes(x=bl, y=bd, \n colour=cl_w)) + \n geom_segment(data=phull_segs, aes(x=x, xend=xend,\n y=y, yend=yend,\n colour=cl_w)) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\") +\n theme_minimal() +\n theme(aspect.ratio = 1)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate pD convex hull and view in tour\"}\nncl <- penguins_cl %>%\n count(cl_w) %>%\n arrange(cl_w) %>%\n mutate(cumn = cumsum(n))\nphull <- NULL\nfor (i in unique(penguins_cl$cl_w)) {\n x <- penguins_cl %>%\n dplyr::filter(cl_w == i) \n ph <- cxhull(as.matrix(x[,1:4]))$edges\n if (i > 1) {\n ph <- ph + ncl$cumn[i-1]\n }\n ph <- cbind(ph, rep(i, nrow(ph)))\n phull <- rbind(phull, ph)\n}\nphull <- as.data.frame(phull)\ncolnames(phull) <- c(\"from\", \"to\", \"cl_w\") \nphull$cl_w <- factor(phull$cl_w)\npenguins_cl$cl_w <- factor(penguins_cl$cl_w)\n\nanimate_xy(penguins_cl[,1:4], col=penguins_cl$cl_w,\n edges=as.matrix(phull[,1:2]), edges.col=phull$cl_w)\nrender_gif(penguins_cl[,1:4], \n tour_path = grand_tour(),\n display = display_xy(col=penguins_cl$cl_w,\n 
edges=as.matrix(phull[,1:2]),\n edges.col=phull$cl_w),\n gif_file = \"gifs/penguins_chull.gif\",\n frames = 500, \n width = 400,\n height = 400)\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-chull-html layout-ncol=2}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![2D](12-summary-clust_files/figure-html/fig-penguin-hull-2D-html-1.png){#fig-penguin-hull-2D-html width=384}\n:::\n:::\n\n\n![4D](gifs/penguins_chull.gif){#fig-penguins-chull-pD}\n:::\n\nConvex hulls summarising the extent of Wards linkage clustering in 2D and 4D.\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-penguins-chull-pdf layout-ncol=2}\n\n\n![2D](images/fig-penguin-hull-2D-1.png){#fig-penguins-chull-2D-pdf}\n\n![4D](images/penguins_chull_105.png){#fig-penguins-chull-pD}\n\nConvex hulls summarising the extent of Wards linkage clustering in 2D and 4D.\n:::\n:::\n\n## Comparing two clusterings\n\nEach cluster analysis will result in a vector of class labels for the data. To compare two results we would tabulate and plot the pair of integer variables. The labels given to each cluster will likely differ. If the two methods agree, there will be just a few cells with large counts among mostly empty cells. \n\nBelow is a comparison between the three-cluster results of Wards linkage hierarchical clustering (rows) and model-based clustering (columns). The two methods mostly agree, as seen from the three cells with large counts, and most cells with zeros. They disagree only on eight penguins. These eight penguins would be considered to be part of cluster 1 by Wards, but model-based considers them to be members of cluster 2.\n\nThe two methods also label the clusters differently: what Wards labels as cluster 3, model-based labels as cluster 2. The labels given by any algorithm are arbitrary, and can easily be changed to coordinate between methods. 
\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for confusion table\"}\np_cl %>% \n count(cl_w, cl_mc) %>% \n pivot_wider(names_from = cl_mc, \n values_from = n, \n values_fill = 0) %>%\n gt() %>%\n tab_spanner(label = \"cl_mc\", columns=c(`2`, `3`, `1`)) %>%\n cols_width(everything() ~ px(60))\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n <tr><th></th><th colspan=\"3\">cl_mc</th></tr>\n <tr><th>cl_w</th><th>2</th><th>3</th><th>1</th></tr>\n <tr><td>1</td><td>8</td><td>0</td><td>149</td></tr>\n <tr><td>2</td><td>0</td><td>119</td><td>0</td></tr>\n <tr><td>3</td><td>57</td><td>0</td><td>0</td></tr>\n</table>
\n```\n\n:::\n:::\n\n\nWe can examine the disagreement by linking a plot of the table with a tour plot. Here is how to do this with `liminal`. @fig-compare-clusters1 and @fig-compare-clusters2 show screenshots of the exploration of the eight penguins on which the methods disagree. It makes sense that there is some confusion. These penguins are part of the large clump of observations that don't separate cleanly into two clusters. The eight penguins are in the middle of this clump. Realistically, both methods result in a plausible clustering, and it is not clear how these penguins should be grouped. \n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to do linked brushing with liminal\"}\nlibrary(liminal)\nlimn_tour_link(\n p_cl[,3:4],\n penguins_cl,\n cols = bl:bm,\n color = cl_w\n)\n```\n:::\n\n\n![Linking the confusion table with a tour using liminal. Points are coloured according to Wards linkage. The disagreement on eight penguins is with cluster 1 from Wards and cluster 2 from model-based.](images/compare-clusters1.png){#fig-compare-clusters1}\n\n![Highlighting the penguins where the methods disagree so we can see where these observations are located relative to the two clusters.](images/compare-clusters2.png){#fig-compare-clusters2}\n\n## Exercises {-}\n\n1. Compare the results of the four-cluster model-based clustering with that of the four-cluster Wards linkage clustering of the penguins data.\n2. Compare the results from clustering of the `fake_trees` data for two different choices of $k$. (This follows from the exercise in @sec-kmeans.) Which choice of $k$ is best? And what choice of $k$ best captures the 10 known branches?\n3. Compare and contrast the cluster solutions for the first four PCs of the `aflw` data, conducted in @sec-hclust and @sec-kmeans. Which provides the most useful clustering of this data?\n4. Pick two clusterings of one of the challenge data sets, `c1`-`c7` from the `mulgar` package, that give very different results. 
Compare and contrast the two solutions, and decide which is the better solution.\n\n## Project {-}\n\nMost of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox will be useful in comparing and contrasting different methods, understanding to what extent a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. To explore this we will use survey data that examines the risk-taking behavior of tourists. The data was collected in Australia in 2015 [@risk-survey] and includes six types of risks (recreational, health, career, financial, safety and social) with responses on a scale from 1 (never) to 5 (very often). The data is available in `risk_MSA.rds` from the book web site.\n\n1. We first examine the data in a grand tour. Do you notice that each variable was measured on a discrete scale?\n2. Next we explore different solutions from hierarchical clustering of the data. For comparison we will keep the number of clusters fixed to 6 and we will perform the hierarchical clustering with different combinations of distance functions (Manhattan distance and Euclidean distance) and linkage (single, complete and Ward linkage). Which combinations make sense based on what we know about the method and the data?\n3. For each of the hierarchical clustering solutions draw the dendrogram in 2D and also in the data space. You can also map the grouping into 6 clusters to different colors. How would you describe the different solutions?\n4. Using the method introduced in this chapter, compare the solution using Manhattan distance and complete linkage to one using Euclidean distance and Ward linkage. First compute a confusion table and then use `liminal` to explore some of the differences. 
For example, you should be able to see how small subsets where the two clustering solutions disagree can be outlying and are grouped differently depending on the choices we make.\n5. Selecting your preferred solution from hierarchical clustering, we will now compare it to what is found using $k$-means clustering with $k=6$. Use a tour to show the cluster means together with the data points (make sure to pick an appropriate symbol for the data points to avoid too much overplotting). What can you say about the variation within the clusters? Can you match some of the clusters with the most relevant variables from following the movement of the cluster means during the tour?\n6. Use a projection pursuit guided tour to best separate the clusters identified with $k$-means clustering. How are the clusters related to the different types of risk?\n7. Use the approaches from this chapter to summarize and compare the $k$-means solution to your selected hierarchical clustering results. Are the groupings mostly similar?\nYou can also use convex hulls to better compare what part of the space is occupied. Either look at subsets (selected from the liminal display) or you could facet the display using `tourr::animate_groupxy`.\n8. Some other possible activities include examining how model-based methods would cluster the data. We expect it to be similar to Wards hierarchical or $k$-means, partitioning into roughly equal chunks with an EII variance-covariance model being optimal. Another is examining an SOM fit. SOM is not ideal for this data because the data fills the space. If the SOM model is fitted properly it should be a tangled net where the nodes (cluster means) are fairly evenly spread out. Thus the result should again be similar to Wards hierarchical or $k$-means. A common problem with fitting an SOM is that optimisation stops early, before fully capturing the data set. This is the reason to use the tour for SOM. 
If the net is bunched in one part of the data space, it means that the optimisation wasn't successful.\n\n\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "12-summary-clust_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/12-summary-clust/execute-results/tex.json b/_freeze/12-summary-clust/execute-results/tex.json index 5a8195e..a6e4b6c 100644 --- a/_freeze/12-summary-clust/execute-results/tex.json +++ b/_freeze/12-summary-clust/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "72deae634cda47ed61fbd7f862ef97f9", + "hash": "69af921d44c147f2bf065c771549a5e4", "result": { "engine": "knitr", - "markdown": "# Summarising and comparing clustering results {#sec-clust-compare}\n\n\\index{cluster analysis!confusion table}\n\n\n\n## Summarising results\n\nThe key elements for summarising cluster results are the centres of the clusters and the within-cluster variability of the observations. Adding cluster means to any plot, including tour plots, is easy. You add the additional rows, or a new data set, and set the point shape to be distinct. \n\nSummarising the variability is difficult. For model-based clustering, the shape of the clusters is assumed to be elliptical, so $p$-dimensional ellipses can be used to show the solution, as done in @sec-mclust. Generally, it is common to plot a convex hull of the clusters, as in @fig-penguins-chull-pdf. 
This can also be done in high-dimensions, using the R package `cxhull` to compute the $p$-D convex hull.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Load libraries\"}\nlibrary(mclust) \nlibrary(tidyr)\nlibrary(dplyr)\nlibrary(gt)\nlibrary(cxhull)\nlibrary(ggplot2)\nlibrary(colorspace)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to do clustering\"}\nlibrary(mclust) \nlibrary(tidyr)\nlibrary(dplyr)\nlibrary(gt)\nlibrary(cxhull)\nlibrary(ggplot2)\nlibrary(colorspace)\nload(\"data/penguins_sub.rda\")\np_dist <- dist(penguins_sub[,1:4])\np_hcw <- hclust(p_dist, method=\"ward.D2\")\n\np_cl <- data.frame(cl_w = cutree(p_hcw, 3))\n\npenguins_mc <- Mclust(penguins_sub[,1:4], \n G=3, \n modelNames = \"EEE\")\np_cl <- p_cl %>% \n mutate(cl_mc = penguins_mc$classification)\n\np_cl <- p_cl %>% \n mutate(cl_w_j = jitter(cl_w),\n cl_mc_j = jitter(cl_mc))\n\n# Arranging by cluster id is important to define edges \npenguins_cl <- penguins_sub %>%\n mutate(cl_w = p_cl$cl_w,\n cl_mc = p_cl$cl_mc) %>%\n arrange(cl_w)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for convex hulls in 2D\"}\n# Penguins in 2D\n# Duplicate observations need to be removed fo convex hull calculation\npsub <- penguins_cl %>%\n select(bl, bd) \ndup <- duplicated(psub)\npsub <- penguins_cl %>%\n select(bl, bd, cl_w) %>%\n filter(!dup) %>%\n arrange(cl_w)\n\nncl <- psub %>%\n count(cl_w) %>%\n arrange(cl_w) %>%\n mutate(cumn = cumsum(n))\nphull <- NULL\nfor (i in unique(psub$cl_w)) {\n x <- psub %>%\n dplyr::filter(cl_w == i) %>%\n select(bl, bd) \n ph <- cxhull(as.matrix(x))$edges\n if (i > 1) {\n ph <- ph + ncl$cumn[i-1]\n }\n ph <- cbind(ph, rep(i, nrow(ph)))\n phull <- rbind(phull, ph)\n}\nphull <- as.data.frame(phull)\ncolnames(phull) <- c(\"from\", \"to\", \"cl_w\") \nphull_segs <- data.frame(x = psub$bl[phull$from],\n y = psub$bd[phull$from],\n xend = psub$bl[phull$to],\n yend = psub$bd[phull$to],\n cl_w = phull$cl_w)\nphull_segs$cl_w <- 
factor(phull$cl_w) \npsub$cl_w <- factor(psub$cl_w)\np_chull2D <- ggplot() +\n geom_point(data=psub, aes(x=bl, y=bd, \n colour=cl_w)) + \n geom_segment(data=phull_segs, aes(x=x, xend=xend,\n y=y, yend=yend,\n colour=cl_w)) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\") +\n theme_minimal() +\n theme(aspect.ratio = 1)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate pD convex hull and view in tour\"}\nncl <- penguins_cl %>%\n count(cl_w) %>%\n arrange(cl_w) %>%\n mutate(cumn = cumsum(n))\nphull <- NULL\nfor (i in unique(penguins_cl$cl_w)) {\n x <- penguins_cl %>%\n dplyr::filter(cl_w == i) \n ph <- cxhull(as.matrix(x[,1:4]))$edges\n if (i > 1) {\n ph <- ph + ncl$cumn[i-1]\n }\n ph <- cbind(ph, rep(i, nrow(ph)))\n phull <- rbind(phull, ph)\n}\nphull <- as.data.frame(phull)\ncolnames(phull) <- c(\"from\", \"to\", \"cl_w\") \nphull$cl_w <- factor(phull$cl_w)\npenguins_cl$cl_w <- factor(penguins_cl$cl_w)\n\nanimate_xy(penguins_cl[,1:4], col=penguins_cl$cl_w,\n edges=as.matrix(phull[,1:2]), edges.col=phull$cl_w)\nrender_gif(penguins_cl[,1:4], \n tour_path = grand_tour(),\n display = display_xy(col=penguins_cl$cl_w,\n edges=as.matrix(phull[,1:2]),\n edges.col=phull$cl_w),\n gif_file = \"gifs/penguins_chull.gif\",\n frames = 500, \n width = 400,\n height = 400)\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-chull-html layout-ncol=2}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![2D](12-summary-clust_files/figure-pdf/fig-penguin-hull-2D-html-1.pdf){#fig-penguin-hull-2D-html width=80%}\n:::\n:::\n\n\n\n![4D](gifs/penguins_chull.gif){#fig-penguins-chull-pD}\n:::\n\nConvex hulls summarising the extent of Wards linkage clustering in 2D and 4D.\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-penguins-chull-pdf 
layout-ncol=2}\n\n\n![2D](images/fig-penguin-hull-2D-1.png){#fig-penguins-chull-2D-pdf}\n\n![4D](images/penguins_chull_105.png){#fig-penguins-chull-pD}\n\nConvex hulls summarising the extent of Wards linkage clustering in 2D and 4D.\n:::\n:::\n\n## Comparing two clusterings\n\nEach cluster analysis will result in a vector of class labels for the data. To compare two results we would tabulate and plot the pair of integer variables. The labels given to each cluster will likely differ. If the two methods agree, there will be just a few cells with large counts among mostly empty cells. \n\nBelow is a comparison between the three cluster results of Wards linkage hierarchical clustering (rows) and model-based clustering (columns). The two methods mostly agree, as seen from the three cells with large counts, and most cells with zeros. They disagree only on eight penguins. These eight penguins would be considered to be part of cluster 1 by Wards, but model-based considers them to be members of cluster 2.\n\nThe two methods label them clusters differently: what Wards labels as cluster 3, model-based labels as cluster 2. The labels given by any algorithm are arbitrary, and can easily be changed to coordinate between methods. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for confusion table\"}\np_cl %>% \n count(cl_w, cl_mc) %>% \n pivot_wider(names_from = cl_mc, \n values_from = n, \n values_fill = 0) %>%\n gt() %>%\n tab_spanner(label = \"cl_mc\", columns=c(`2`, `3`, `1`)) %>%\n cols_width(everything() ~ px(60))\n```\n\n::: {.cell-output-display}\n\\begin{longtable}{rrrr}\n\\toprule\n & \\multicolumn{3}{c}{cl\\_mc} \\\\ \n\\cmidrule(lr){2-4}\ncl\\_w & 2 & 3 & 1 \\\\ \n\\midrule\n1 & 8 & 0 & 149 \\\\ \n2 & 0 & 119 & 0 \\\\ \n3 & 57 & 0 & 0 \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n\nWe can examine the disagreement by linking a plot of the table, with a tour plot. Here is how to do this with `liminal`. 
@fig-compare-clusters1 and @fig-compare-clusters2 show screenshots of the exploration of the eight penguins on which the methods disagree. It makes sense that there is some confusion. These penguins are part of the large clump of observations that don't separate cleanly into two clusters. The eight penguins are in the middle of this clump. Realistically, both methods result in a plausible clustering, and it is not clear how these penguins should be grouped. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to do linked brushing with liminal\"}\nlibrary(liminal)\nlimn_tour_link(\n p_cl[,3:4],\n penguins_cl,\n cols = bl:bm,\n color = cl_w\n)\n```\n:::\n\n\n\n![Linking the confusion table with a tour using liminal. Points are coloured according to Wards linkage. The disagreement on eight penguins is with cluster 1 from Wards and cluster 2 from model-based.](images/compare-clusters1.png){#fig-compare-clusters1}\n\n![Highlighting the penguins where the methods disagree so we can see where these observations are located relative to the two clusters.](images/compare-clusters2.png){#fig-compare-clusters2}\n\n## Exercises {-}\n\n1. Compare the results of the four cluster model-based clustering with that of the four cluster Wards linkage clustering of the penguins data.\n2. Compare the results from clustering of the `fake_trees` data for two different choices of $k$. (This follows from the exercise in @sec-kmeans.) Which choice of $k$ is best? And what choice of $k$ best captures the 10 known branches?\n3. Compare and contrast the cluster solutions for the first four PCs of the `aflw` data, conducted in @sec-hclust and @sec-kmeans. Which provides the most useful clustering of this data?\n4. Pick your two clusterings on one of the challenge data sets, `c1`-`c7` from the `mulgar` package, that give very different results. 
Compare and contrast the two solutions, and decide which is the better solution.\n\n## Project {-}\n\nMost of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox will be useful in comparing and contrasting different methods, understanding to what extend a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. To explore this we will use survey data that examines the risk taking behavior of tourists. The data was collected in Australia in 2015 [@risk-survey] and includes six types of risks (recreational, health, career, financial, safety and social) with responses on a scale from 1 (never) to 5 (very often). The data is available in `risk_MSA.rds` from the book web site.\n\n1. We first examine the data in a grand tour. Do you notice that each variable was measured on a discrete scale?\n2. Next we explore different solutions from hierarchical clustering of the data. For comparison we will keep the number of clusters fixed to 6 and we will perform the hierarchical clustering with different combinations of distance functions (Manhattan distance and Euclidean distance) and linkage (single, complete and Ward linkage). Which combinations make sense based on what we know about the method and the data?\n3. For each of the hierarchical clustering solutions draw the dendrogram in 2D and also in the data space. You can also map the grouping into 6 clusters to different colors. How would you describe the different solutions?\n4. Using the method introduced in this chapter, compare the solution using Manhattan distance and complete linkage to one using Euclidean distance and Ward linkage. First compute a confusion table and then use `liminal` to explore some of the differences. 
For example, you should be able to see how small subsets where the two clustering solutions disagree can be outlying and are grouped differently depending on the choices we make.\n5. Selecting your preferred solution from hierarchical clustering, we will now compare it to what is found using $k$-means clustering with $k=6$. Use a tour to show the cluster means together with the data points (make sure to pick an appropriate symbol for the data points to avoid too much overplotting). What can you say about the variation within the clusters? Can you match some of the clusters with the most relevant variables from following the movement of the cluster means during the tour?\n6. Use a projection pursuit guided tour to best separate the clusters identified with $k$-means clustering. How are the clusters related to the different types of risk?\n7. Use the approaches from this chapter to summarize and compare the $k$-means solution to your selected hierarchical clustering results. Are the groupings mostly similar?\nYou can also use convex hulls to better compare what part of the space is occupied. Either look at subsets (selected from the liminal display) or you could facet the display using `tourr::animate_groupxy`.\n8. Some other possible activities include examining how model-based methods would cluster the data. We expect it should be similar to Wards hierarchical or $k$-means, that it will partition into roughly equal chunks with an EII variance-covariance model being optimal. Also examining an SOM fit. SOM is not ideal for this data because the data fills the space. If the SOM model is fitted properly it should be a tangled net where the nodes (cluster means) are fairly evenly spread out. Thus the result should again be similar to Wards hierarchical or $k$-means. A common problem with fitting an SOM is that optimisation stops early, before fully capturing the data set. This is the reasons to use the tour for SOM. 
If the net is bunched in one part of the data space, it means that the optimisation wasn't successful.\n\n\n\n\n::: {.cell}\n\n:::\n", + "markdown": "# Summarising and comparing clustering results {#sec-clust-compare}\n\n\\index{cluster analysis!confusion table}\n\n\n\n## Summarising results\n\nThe key elements for summarising cluster results are the centres of the clusters and the within-cluster variability of the observations. Adding cluster means to any plot, including tour plots, is easy. You add the additional rows, or a new data set, and set the point shape to be distinct. \n\nSummarising the variability is difficult. For model-based clustering, the shape of the clusters is assumed to be elliptical, so $p$-dimensional ellipses can be used to show the solution, as done in @sec-mclust. Generally, it is common to plot a convex hull of the clusters, as in @fig-penguins-chull-pdf. This can also be done in high-dimensions, using the R package `cxhull` to compute the $p$-D convex hull.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Load libraries\"}\nlibrary(mclust) \nlibrary(tidyr)\nlibrary(dplyr)\nlibrary(gt)\nlibrary(cxhull)\nlibrary(ggplot2)\nlibrary(colorspace)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to do clustering\"}\nlibrary(mclust) \nlibrary(tidyr)\nlibrary(dplyr)\nlibrary(gt)\nlibrary(cxhull)\nlibrary(ggplot2)\nlibrary(colorspace)\nload(\"data/penguins_sub.rda\")\np_dist <- dist(penguins_sub[,1:4])\np_hcw <- hclust(p_dist, method=\"ward.D2\")\n\np_cl <- data.frame(cl_w = cutree(p_hcw, 3))\n\npenguins_mc <- Mclust(penguins_sub[,1:4], \n G=3, \n modelNames = \"EEE\")\np_cl <- p_cl %>% \n mutate(cl_mc = penguins_mc$classification)\n\np_cl <- p_cl %>% \n mutate(cl_w_j = jitter(cl_w),\n cl_mc_j = jitter(cl_mc))\n\n# Arranging by cluster id is important to define edges \npenguins_cl <- penguins_sub %>%\n mutate(cl_w = p_cl$cl_w,\n cl_mc = p_cl$cl_mc) %>%\n arrange(cl_w)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code 
code-summary=\"Code for convex hulls in 2D\"}\n# Penguins in 2D\n# Duplicate observations need to be removed for convex hull calculation\npsub <- penguins_cl %>%\n select(bl, bd) \ndup <- duplicated(psub)\npsub <- penguins_cl %>%\n select(bl, bd, cl_w) %>%\n filter(!dup) %>%\n arrange(cl_w)\n\nncl <- psub %>%\n count(cl_w) %>%\n arrange(cl_w) %>%\n mutate(cumn = cumsum(n))\nphull <- NULL\nfor (i in unique(psub$cl_w)) {\n x <- psub %>%\n dplyr::filter(cl_w == i) %>%\n select(bl, bd) \n ph <- cxhull(as.matrix(x))$edges\n if (i > 1) {\n ph <- ph + ncl$cumn[i-1]\n }\n ph <- cbind(ph, rep(i, nrow(ph)))\n phull <- rbind(phull, ph)\n}\nphull <- as.data.frame(phull)\ncolnames(phull) <- c(\"from\", \"to\", \"cl_w\") \nphull_segs <- data.frame(x = psub$bl[phull$from],\n y = psub$bd[phull$from],\n xend = psub$bl[phull$to],\n yend = psub$bd[phull$to],\n cl_w = phull$cl_w)\nphull_segs$cl_w <- factor(phull$cl_w) \npsub$cl_w <- factor(psub$cl_w)\np_chull2D <- ggplot() +\n geom_point(data=psub, aes(x=bl, y=bd, \n colour=cl_w)) + \n geom_segment(data=phull_segs, aes(x=x, xend=xend,\n y=y, yend=yend,\n colour=cl_w)) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\") +\n theme_minimal() +\n theme(aspect.ratio = 1)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate pD convex hull and view in tour\"}\nncl <- penguins_cl %>%\n count(cl_w) %>%\n arrange(cl_w) %>%\n mutate(cumn = cumsum(n))\nphull <- NULL\nfor (i in unique(penguins_cl$cl_w)) {\n x <- penguins_cl %>%\n dplyr::filter(cl_w == i) \n ph <- cxhull(as.matrix(x[,1:4]))$edges\n if (i > 1) {\n ph <- ph + ncl$cumn[i-1]\n }\n ph <- cbind(ph, rep(i, nrow(ph)))\n phull <- rbind(phull, ph)\n}\nphull <- as.data.frame(phull)\ncolnames(phull) <- c(\"from\", \"to\", \"cl_w\") \nphull$cl_w <- factor(phull$cl_w)\npenguins_cl$cl_w <- factor(penguins_cl$cl_w)\n\nanimate_xy(penguins_cl[,1:4], col=penguins_cl$cl_w,\n edges=as.matrix(phull[,1:2]), edges.col=phull$cl_w)\nrender_gif(penguins_cl[,1:4], \n tour_path 
= grand_tour(),\n display = display_xy(col=penguins_cl$cl_w,\n edges=as.matrix(phull[,1:2]),\n edges.col=phull$cl_w),\n gif_file = \"gifs/penguins_chull.gif\",\n frames = 500, \n width = 400,\n height = 400)\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-chull-html layout-ncol=2}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![2D](12-summary-clust_files/figure-pdf/fig-penguin-hull-2D-html-1.pdf){#fig-penguin-hull-2D-html width=80%}\n:::\n:::\n\n\n\n![4D](gifs/penguins_chull.gif){#fig-penguins-chull-pD}\n:::\n\nConvex hulls summarising the extent of Wards linkage clustering in 2D and 4D.\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-penguins-chull-pdf layout-ncol=2}\n\n\n![2D](images/fig-penguin-hull-2D-1.png){#fig-penguins-chull-2D-pdf}\n\n![4D](images/penguins_chull_105.png){#fig-penguins-chull-pD}\n\nConvex hulls summarising the extent of Wards linkage clustering in 2D and 4D.\n:::\n:::\n\n## Comparing two clusterings\n\nEach cluster analysis will result in a vector of class labels for the data. To compare two results we would tabulate and plot the pair of integer variables. The labels given to each cluster will likely differ. If the two methods agree, there will be just a few cells with large counts among mostly empty cells. \n\nBelow is a comparison between the three cluster results of Wards linkage hierarchical clustering (rows) and model-based clustering (columns). The two methods mostly agree, as seen from the three cells with large counts, and most cells with zeros. They disagree only on eight penguins. These eight penguins would be considered to be part of cluster 1 by Wards, but model-based considers them to be members of cluster 2.\n\nThe two methods also label the clusters differently: what Wards labels as cluster 3, model-based labels as cluster 2. The labels given by any algorithm are arbitrary, and can easily be changed to coordinate between methods. 
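This label arbitrariness is easy to see in a toy cross-tabulation built with base R's `table()`. The sketch below uses made-up labels (not the book's data): the two clusterings agree perfectly, yet the large counts sit off the diagonal because the label sets do not line up.

```r
# Two clusterings of the same ten observations, with arbitrary labels
cl_a <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
cl_b <- c(3, 3, 3, 3, 1, 1, 1, 2, 2, 2)
# Perfect agreement shows as one dominant cell per row, wherever
# that cell happens to fall in the table
table(cl_a, cl_b)
```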
\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for confusion table\"}\np_cl %>% \n count(cl_w, cl_mc) %>% \n pivot_wider(names_from = cl_mc, \n values_from = n, \n values_fill = 0) %>%\n gt() %>%\n tab_spanner(label = \"cl_mc\", columns=c(`2`, `3`, `1`)) %>%\n cols_width(everything() ~ px(60))\n```\n\n::: {.cell-output-display}\n\\begin{longtable}{rrrr}\n\\toprule\n & \\multicolumn{3}{c}{cl\\_mc} \\\\ \n\\cmidrule(lr){2-4}\ncl\\_w & 2 & 3 & 1 \\\\ \n\\midrule\n1 & 8 & 0 & 149 \\\\ \n2 & 0 & 119 & 0 \\\\ \n3 & 57 & 0 & 0 \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n\nWe can examine the disagreement by linking a plot of the table, with a tour plot. Here is how to do this with `liminal`. @fig-compare-clusters1 and @fig-compare-clusters2 show screenshots of the exploration of the eight penguins on which the methods disagree. It makes sense that there is some confusion. These penguins are part of the large clump of observations that don't separate cleanly into two clusters. The eight penguins are in the middle of this clump. Realistically, both methods result in a plausible clustering, and it is not clear how these penguins should be grouped. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to do linked brushing with liminal\"}\nlibrary(liminal)\nlimn_tour_link(\n p_cl[,3:4],\n penguins_cl,\n cols = bl:bm,\n color = cl_w\n)\n```\n:::\n\n\n\n![Linking the confusion table with a tour using liminal. Points are coloured according to Wards linkage. The disagreement on eight penguins is with cluster 1 from Wards and cluster 2 from model-based.](images/compare-clusters1.png){#fig-compare-clusters1}\n\n![Highlighting the penguins where the methods disagree so we can see where these observations are located relative to the two clusters.](images/compare-clusters2.png){#fig-compare-clusters2}\n\nLinking the confusion matrix with the tour can also be accomplished with `crosstalk` and `detourr`. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(crosstalk)\nlibrary(plotly)\nlibrary(viridis)\np_cl_shared <- SharedData$new(penguins_cl)\n\ndetour_plot <- detour(p_cl_shared, tour_aes(\n projection = bl:bm,\n colour = cl_w)) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.7, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\nconf_mat <- plot_ly(p_cl_shared, \n x = ~cl_mc_j,\n y = ~cl_w_j,\n color = ~cl_w,\n colors = viridis_pal(option = \"D\")(3),\n height = 450) |>\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n \nbscols(\n detour_plot, conf_mat,\n widths = c(5, 6)\n ) \n```\n:::\n\n\n\n## Exercises {-}\n\n1. Compare the results of the four cluster model-based clustering with that of the four cluster Wards linkage clustering of the penguins data.\n2. Compare the results from clustering of the `fake_trees` data for two different choices of $k$. (This follows from the exercise in @sec-kmeans.) Which choice of $k$ is best? And what choice of $k$ best captures the 10 known branches?\n3. Compare and contrast the cluster solutions for the first four PCs of the `aflw` data, conducted in @sec-hclust and @sec-kmeans. Which provides the most useful clustering of this data?\n4. Pick two clusterings of one of the challenge data sets, `c1`-`c7` from the `mulgar` package, that give very different results. Compare and contrast the two solutions, and decide which is the better solution.\n\n## Project {-}\n\nMost of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox will be useful in comparing and contrasting different methods, understanding to what extent a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. 
To explore this we will use survey data that examines the risk-taking behaviour of tourists: the `risk_MSA` data (see the Appendix for details).\n\n1. We first examine the data in a grand tour. Do you notice that each variable was measured on a discrete scale?\n2. Next we explore different solutions from hierarchical clustering of the data. For comparison we will keep the number of clusters fixed to 6 and we will perform the hierarchical clustering with different combinations of distance functions (Manhattan distance and Euclidean distance) and linkage (single, complete and Ward linkage). Which combinations make sense based on what we know about the method and the data?\n3. For each of the hierarchical clustering solutions draw the dendrogram in 2D and also in the data space. You can also map the grouping into 6 clusters to different colors. How would you describe the different solutions?\n4. Using the method introduced in this chapter, compare the solution using Manhattan distance and complete linkage to one using Euclidean distance and Ward linkage. First compute a confusion table and then use `liminal` to explore some of the differences. For example, you should be able to see how small subsets where the two clustering solutions disagree can be outlying and are grouped differently depending on the choices we make.\n5. Selecting your preferred solution from hierarchical clustering, we will now compare it to what is found using $k$-means clustering with $k=6$. Use a tour to show the cluster means together with the data points (make sure to pick an appropriate symbol for the data points to avoid too much overplotting). What can you say about the variation within the clusters? Can you match some of the clusters with the most relevant variables from following the movement of the cluster means during the tour?\n6. Use a projection pursuit guided tour to best separate the clusters identified with $k$-means clustering. 
How are the clusters related to the different types of risk?\n7. Use the approaches from this chapter to summarize and compare the $k$-means solution to your selected hierarchical clustering results. Are the groupings mostly similar?\nYou can also use convex hulls to better compare what part of the space is occupied. Either look at subsets (selected from the liminal display) or you could facet the display using `tourr::animate_groupxy`.\n8. Some other possible activities include examining how model-based methods would cluster the data. We expect it to be similar to Wards hierarchical or $k$-means, in that it will partition into roughly equal chunks with an EII variance-covariance model being optimal. Another possibility is examining an SOM fit. SOM is not ideal for this data because the data fills the space. If the SOM model is fitted properly it should be a tangled net where the nodes (cluster means) are fairly evenly spread out. Thus the result should again be similar to Wards hierarchical or $k$-means. A common problem with fitting an SOM is that optimisation stops early, before fully capturing the data set. This is the reason to use the tour for SOM. 
If the net is bunched in one part of the data space, it means that the optimisation wasn't successful.\n\n\n\n\n::: {.cell}\n\n:::\n", "supporting": [ "12-summary-clust_files/figure-pdf" ], diff --git a/_freeze/12-summary-clust/figure-html/fig-penguin-hull-2D-html-1.png b/_freeze/12-summary-clust/figure-html/fig-penguin-hull-2D-html-1.png new file mode 100644 index 0000000..fd14d9b Binary files /dev/null and b/_freeze/12-summary-clust/figure-html/fig-penguin-hull-2D-html-1.png differ diff --git a/_freeze/12-summary-clust/figure-pdf/fig-penguin-hull-2D-html-1.pdf b/_freeze/12-summary-clust/figure-pdf/fig-penguin-hull-2D-html-1.pdf new file mode 100644 index 0000000..f0e1fed Binary files /dev/null and b/_freeze/12-summary-clust/figure-pdf/fig-penguin-hull-2D-html-1.pdf differ diff --git a/_freeze/12-summary-clust/figure-pdf/fig-penguin-hull-2D-pdf-1.pdf b/_freeze/12-summary-clust/figure-pdf/fig-penguin-hull-2D-pdf-1.pdf new file mode 100644 index 0000000..22ca50b Binary files /dev/null and b/_freeze/12-summary-clust/figure-pdf/fig-penguin-hull-2D-pdf-1.pdf differ diff --git a/_freeze/13-intro-class/execute-results/html.json b/_freeze/13-intro-class/execute-results/html.json new file mode 100644 index 0000000..7277fd7 --- /dev/null +++ b/_freeze/13-intro-class/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "22d6e34389ae89caefba99f4da7b8260", + "result": { + "engine": "knitr", + "markdown": "# Overview\n\n\n\nMethods for supervised classification originated in the field of Statistics in the early twentieth century, under the moniker *discriminant analysis* (see, for example, @Fi36). An increase in the collection of data, and storage in databases, in the late twentieth century has inspired a growing desire to extract knowledge from data, particularly to be able to accurately predict the class labels. 
This has contributed to an explosion of research on new methods, especially on algorithms that focus on accurate prediction of new data based on training samples.\n\n\\index{classification!supervised}\n\nIn contrast to unsupervised classification, the class label (categorical response variable) is known in the training sample. The training sample is used to build the prediction model, and also to estimate the accuracy, or inversely error, of the model for future data. It is also important to understand the model and to interpret it, so that we can know how predictions are made. High-dimensional visualisation can help with this, and helps to tackle questions like:\n\n- Are the classes well separated in the data space, so that they\ncorrespond to distinct clusters? If so, what are the shapes of the clusters? Is each cluster sufficiently ellipsoidal so that we can assume that the data arises from a mixture of multivariate normal distributions? Do the clusters exhibit characteristics that suggest one algorithm in preference to others?\n- Where does the boundary between classes fall? Are the classes\nlinearly separable, or does the difference between classes suggest\na non-linear boundary? How do changes in the input parameters affect these boundaries? How do the boundaries generated by different methods vary?\n- What cases are misclassified, or have more uncertain predictions? Are there places in the data space where predictions are especially good or bad?\n- Which predictors most contribute to the model predictions? Is it possible to reduce the set of explanatory variables?\n \nAddressing these types of queries also motivates the emerging field called explainable artificial intelligence (XAI), which goes beyond predictive accuracy to more completely satisfy the *desire to extract knowledge from data*. \n\nAlthough we focus on a categorical response, some of the techniques here can be modified or adapted for problems with a numeric, or continuous, response variable. 
With a categorical response, and numerical predictors, we map colour to the response variable and use the tour to examine the relationship between predictors, and the different classes. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Examples of supervised classification patterns: (a) linearly separable, (b) linear but not completely separable, (c) non-linearly separable, (d) non-linear, but not completely separable.](13-intro-class_files/figure-html/fig-sup-example-1.png){#fig-sup-example width=672}\n:::\n:::\n\n\n@fig-sup-example shows some 2D examples where the two classes are (a) linearly separable, (b) not completely separable but linearly different, (c) non-linearly separable and (d) not completely separable but with a non-linear difference. We can also see that in (a) only the horizontal variable would be important for the model because the two classes are completely separable in this direction. Although the classes in (c) are separable, most models would have difficulty capturing the separation. It is for this reason that it is important to understand the boundary between classes produced by a fitted model. In each of (b), (c) and (d) it is likely that some observations would be misclassified. Identifying these cases, and inspecting where they are in the data space is important for understanding the model's future performance. 
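The panel (a) situation can be sketched with simulated data (not the book's): when two classes are fully separated along one variable, a rule using that variable alone is enough, and the other variable carries no class information.

```r
# Sketch: two classes separated only in x1, as in panel (a)
set.seed(42)
n <- 100
d <- data.frame(
  x1 = c(rnorm(n, mean = -4), rnorm(n, mean = 4)), # separates A and B
  x2 = rnorm(2 * n),                               # pure noise
  class = rep(c("A", "B"), each = n)
)
# A threshold on x1 alone recovers the classes
table(predicted = ifelse(d$x1 > 0, "B", "A"), actual = d$class)
```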
\n", + "supporting": [ + "13-intro-class_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/13-intro-class/execute-results/tex.json b/_freeze/13-intro-class/execute-results/tex.json index 2dddc53..b7e332e 100644 --- a/_freeze/13-intro-class/execute-results/tex.json +++ b/_freeze/13-intro-class/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "ee005f3a74c760a8e4483823b4185bd8", + "hash": "0565c8397795711a0b3f9751925858e2", "result": { "engine": "knitr", - "markdown": "# Introduction to supervised classification\n\n\n\nMethods for supervised classification originated in the field of Statistics in the early nineteenth century, under the moniker *discriminant analysis* (see, for example, @Fi36). An increase in the collection of data, and storage in databases, in the late twentieth century has inspired a growing desire to extract knowledge from data, particularly to be able accurately predict the class labels. This has contributed to an explosion of research on new methods, especially on algorithms that focus on accurate prediction of new data based on training samples.\n\n\\index{classification!supervised}\n\nIn contrast to unsupervised classification, the class label (categorical response variable) is known, in the training sample. The training sample is used to build the prediction model, and also to estimate the accuracy, or inversely error, of the model for future data. It is also important to understand the model and to interpret it, so that we can know how predictions are made. High-dimensional visualisation can help with this, and helps to tackle questions like:\n\n- Are the classes well separated in the data space, so that they\ncorrespond to distinct clusters? If so, what are the shapes of the clusters? 
Is each cluster sufficiently ellipsoidal so that we can assume that the data arises from a mixture of multivariate normal distributions? Do the clusters exhibit characteristics that suggest one algorithm in preference to others?\n- Where does the boundary between classes fall? Are the classes\nlinearly separable, or does the difference between classes suggest\na non-linear boundary? How do changes in the input parameters affect these boundaries? How do the boundaries generated by different methods vary?\n- What cases are misclassified, or have more uncertain predictions? Are there places in the data space where predictions are especially good or bad?\n- Which predictors most contribute to the model predictions? Is it possible to reduce the set of explanatory variables?\n \nAddressing these types of queries also motivate the emerging field called explainable artificial intelligence (XAI), which goes beyond predictive accuracy to more completely satisfy the *desire to extract knowledge from data*. \n\nAlthough we focus on categorical response, some of the techniques here can be modified or adapted for problems with a numeric, or continuous, response variable. With a categorical response, and numerical predictors, we map colour to the response variable and use the tour to examine the relationship between predictors, and the different classes. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Examples of supervised classification patterns: (a) linearly separable, (b) linear but not completely separable, (c) non-linearly separable, (d) non-linear, but not completely separable.](13-intro-class_files/figure-pdf/fig-sup-example-1.pdf){#fig-sup-example width=80%}\n:::\n:::\n\n\n\n@fig-sup-example shows some 2D examples where the two classes are (a) linearly separable, (b) not completely separable but linearly different, (c) non-linearly separable and (d) not completely separable but with a non-linear difference. 
We can also see that in (a) only the horizontal variable would be important for the model because the two classes are completely separable in this direction. Although the pattern in (c) is separable classes, most models would have difficulty capturing the separation. It is for this reason that it is important to understand the boundary between classes produced by a fitted model. In each of b, c, d it is likely that some observations would be misclassified. Identifying these cases, and inspecting where they are in the data space is important for understanding the model's future performance. \n", + "markdown": "# Introduction to supervised classification\n\n\n\nMethods for supervised classification originated in the field of Statistics in the early twentieth century, under the moniker *discriminant analysis* (see, for example, @Fi36). An increase in the collection of data, and storage in databases, in the late twentieth century has inspired a growing desire to extract knowledge from data, particularly to be able to accurately predict the class labels. This has contributed to an explosion of research on new methods, especially on algorithms that focus on accurate prediction of new data based on training samples.\n\n\\index{classification!supervised}\n\nIn contrast to unsupervised classification, the class label (categorical response variable) is known in the training sample. The training sample is used to build the prediction model, and also to estimate the accuracy, or inversely error, of the model for future data. It is also important to understand the model and to interpret it, so that we can know how predictions are made. High-dimensional visualisation can help with this, and helps to tackle questions like:\n\n- Are the classes well separated in the data space, so that they\ncorrespond to distinct clusters? If so, what are the shapes of the clusters? 
Is each cluster sufficiently ellipsoidal so that we can assume that the data arises from a mixture of multivariate normal distributions? Do the clusters exhibit characteristics that suggest one algorithm in preference to others?\n- Where does the boundary between classes fall? Are the classes\nlinearly separable, or does the difference between classes suggest\na non-linear boundary? How do changes in the input parameters affect these boundaries? How do the boundaries generated by different methods vary?\n- What cases are misclassified, or have more uncertain predictions? Are there places in the data space where predictions are especially good or bad?\n- Which predictors most contribute to the model predictions? Is it possible to reduce the set of explanatory variables?\n \nAddressing these types of queries also motivates the emerging field called explainable artificial intelligence (XAI), which goes beyond predictive accuracy to more completely satisfy the *desire to extract knowledge from data*. \n\nAlthough we focus on a categorical response, some of the techniques here can be modified or adapted for problems with a numeric, or continuous, response variable. With a categorical response, and numerical predictors, we map colour to the response variable and use the tour to examine the relationship between predictors, and the different classes. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Examples of supervised classification patterns: (a) linearly separable, (b) linear but not completely separable, (c) non-linearly separable, (d) non-linear, but not completely separable.](13-intro-class_files/figure-pdf/fig-sup-example-1.pdf){#fig-sup-example width=80%}\n:::\n:::\n\n\n\n@fig-sup-example shows some 2D examples where the two classes are (a) linearly separable, (b) not completely separable but linearly different, (c) non-linearly separable and (d) not completely separable but with a non-linear difference. 
We can also see that in (a) only the horizontal variable would be important for the model because the two classes are completely separable in this direction. Although the classes in (c) are separable, most models would have difficulty capturing the separation. It is for this reason that it is important to understand the boundary between classes produced by a fitted model. In each of (b), (c) and (d) it is likely that some observations would be misclassified. Identifying these cases, and inspecting where they are in the data space is important for understanding the model's future performance. \n\n## Exercises {-}\n\n1. For the penguins data, use the tour to decide if the species are separable, and if the boundaries between species are linear or non-linear.\n2. Using just the variables `se`, `maxt`, `mint`, `log_dist_road`, and \"accident\" or \"lightning\" causes, use the tour to decide whether the two classes are separable, and whether the boundary might be linear or non-linear.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-hidden}\nQ1 answer: Not separable, but boundary could be linear.\n\nQ2 answer: Gentoo and others are separable. Chinstrap and Adelie are not separable. 
All boundaries are linear.\n:::\n", "supporting": [ "13-intro-class_files/figure-pdf" ], diff --git a/_freeze/13-intro-class/figure-html/fig-sup-example-1.png b/_freeze/13-intro-class/figure-html/fig-sup-example-1.png new file mode 100644 index 0000000..647dc04 Binary files /dev/null and b/_freeze/13-intro-class/figure-html/fig-sup-example-1.png differ diff --git a/_freeze/13-intro-class/figure-pdf/fig-sup-example-1.pdf b/_freeze/13-intro-class/figure-pdf/fig-sup-example-1.pdf index 869d76a..b56d3e4 100644 Binary files a/_freeze/13-intro-class/figure-pdf/fig-sup-example-1.pdf and b/_freeze/13-intro-class/figure-pdf/fig-sup-example-1.pdf differ diff --git a/_freeze/14-lda/execute-results/html.json b/_freeze/14-lda/execute-results/html.json new file mode 100644 index 0000000..ab4deb2 --- /dev/null +++ b/_freeze/14-lda/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "1ff1061ecf5a9acf11f32de6c11ae55a", + "result": { + "engine": "knitr", + "markdown": "# Linear discriminant analysis {#sec-lda}\n\\index{classification!linear discriminant analysis (LDA)}\n\nLinear discriminant analysis (LDA) dates to the early 1900s. It's one of the most elegant and simple techniques for both modeling separation between groups, and as an added bonus, producing a low-dimensional representation of the differences between groups. LDA has two strong assumptions: the groups are samples from multivariate normal distributions, and each has the same variance-covariance. If the latter assumption is relaxed, a slightly less elegant solution results from quadratic discriminant analysis.\n\nUseful explanations can be found in @VR02 and @Ri96. A good general treatment of parametric methods for supervised classification can be found in @JW02 or another similar multivariate analysis textbook. It's also useful to know that hypothesis testing for the difference in multivariate means using multivariate analysis of variance (MANOVA) has similar assumptions to LDA. 
Also model-based clustering assumes that each cluster arises from a multivariate normal distribution, and is related to LDA. The approaches described here can be used to check these assumptions when applying those methods, too.\\index{classification!multivariate analysis of variance (MANOVA)} \\index{cluster analysis!model-based} \n\n::: {.content-visible when-format=\"html\"}\n::: info\nBecause LDA is a parametric model it is important to check that these assumptions are reasonably satisfied:\n\n- the shape of the clusters is elliptical.\n- the spread of the observations is the same.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Because LDA is a parametric model it is important to check that these assumptions are reasonably satisfied:\n\\begin{itemize} \\itemsep 0in\n\\item the shape of the clusters is elliptical.\n\\item the spread of the observations is the same.\n\\end{itemize}\n}\n:::\n\n\n\n## Extracting the key elements of the model\n\nLDA builds the model on the between-group sum-of-squares matrix\n\n$$B=\\sum_{k=1}^g n_k(\\bar{X}_k-\\bar{X})(\\bar{X}_k-\\bar{X})^\\top$$\nwhich measures the differences between the class means, \ncompared with the overall data mean $\\bar{X}$ and the within-group sum-of-squares matrix,\n\n$$\nW =\n\\sum_{k=1}^g\\sum_{i=1}^{n_k}\n(X_{ki}-\\bar{X}_k)(X_{ki}-\\bar{X}_k)^\\top\n$$\n\nwhich measures the variation of values around each class mean. The linear discriminant space is generated by computing the eigenvectors (canonical coordinates) of $W^{-1}B$, and this is the $(g-1)$-D space where the group means are most separated with respect to the\npooled variance-covariance. For each class we compute\n\\index{classification!discriminant space}\n\n$$\n\\delta_k(x) = (x-\\frac{1}{2}\\mu_k)^\\top W^{-1}\\mu_k + \\log \\pi_k\n$$\n\nwhere $\\pi_k$ is a prior probability for class $k$ that might be based on unequal sample sizes, or cost of misclassification. 
The LDA classifier rule is to *assign a new observation to the class with the largest value* of $\\delta_k(x)$.\n\nWe can fit an LDA model using the `lda()` function from the `MASS` package. Here we have used the `penguins` data, assuming equal prior probability, to illustrate. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Code to fit the model\nlibrary(dplyr)\nlibrary(mulgar)\nlibrary(MASS)\nload(\"data/penguins_sub.rda\")\n\np_lda <- lda(species~bl+bd+fl+bm, \n data=penguins_sub,\n prior=c(1/3, 1/3, 1/3))\noptions(digits=2)\n# p_lda\n```\n:::\n\n\nBecause there are three classes the dimension of the discriminant space is 2D. We can easily extract the group means from the model. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Extract the sample means\np_lda$means\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n bl bd fl bm\nAdelie -0.95 0.60 -0.78 -0.62\nChinstrap 0.89 0.64 -0.37 -0.59\nGentoo 0.65 -1.10 1.16 1.10\n```\n\n\n:::\n:::\n\n\nThe coefficients to project the data into the discriminant space, that is the eigenvectors of $W^{-1}B$ are: \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Extract the discriminant space\np_lda$scaling\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n LD1 LD2\nbl -0.24 -2.31\nbd 2.04 0.19\nfl -1.20 0.08\nbm -1.22 1.24\n```\n\n\n:::\n:::\n\n\nand the predicted values, which include class predictions, and coordinates in the discriminant space are generated as:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Extract the fitted values\np_lda_pred <- predict(p_lda, penguins_sub)\n```\n:::\n\n\nThe best separation between classes can be viewed from this object, which can be shown to match the original data projected using the `scaling` component of the model object (see @fig-p-lda).\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate LDA plots\"}\n# Check calculations from the fitted model, and 
equations\nlibrary(colorspace)\nlibrary(ggplot2)\nlibrary(ggpubr)\n# Using the predicted values from the model object\np_lda_pred_x1 <- data.frame(p_lda_pred$x)\np_lda_pred_x1$species <- penguins_sub$species\np_lda1 <- ggplot(p_lda_pred_x1, \n aes(x=LD1, y=LD2, \n colour=species)) + \n geom_point() +\n xlim(-6, 8) + ylim(-6.5, 5.5) +\n scale_color_discrete_divergingx(\"Zissou 1\") +\n ggtitle(\"(a)\") +\n theme_minimal() +\n theme(aspect.ratio = 1, legend.title = element_blank()) \n\n# matches the calculations done manually\np_lda_pred_x2 <- data.frame(as.matrix(penguins_sub[,1:4]) %*%\n p_lda$scaling)\np_lda_pred_x2$species <- penguins_sub$species\np_lda2 <- ggplot(p_lda_pred_x2, \n aes(x=LD1, y=LD2, \n colour=species)) + \n geom_point() +\n xlim(-6, 8) + ylim(-7, 5.5) +\n scale_color_discrete_divergingx(\"Zissou 1\") +\n ggtitle(\"(b)\") +\n theme_minimal() +\n theme(aspect.ratio = 1, legend.title = element_blank()) \nggarrange(p_lda1, p_lda2, ncol=2, \n common.legend = TRUE, legend = \"bottom\")\n```\n\n::: {.cell-output-display}\n![Penguins projected into the 2D discriminant space, done two ways: (a) using the predicted values, (b) directly projecting using the model component. The scale is not quite the same but the projected data is identical in shape.](14-lda_files/figure-html/fig-p-lda-1.png){#fig-p-lda width=768}\n:::\n:::\n\n\nThe $W$ and $B$ matrices cannot be extracted from the model object, so we need to compute these separately. We only need $W$ actually. It is useful to think of this as the pooled variance-covariance matrix. Because the assumption for LDA is that the population group variance-covariances are identical, we estimate this by computing them for each class and then averaging them to get the pooled variance-covariance matrix. 
It's laborious, but easy.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Compute pooled variance-covariance\np_vc_pool <- mulgar::pooled_vc(penguins_sub[,1:4],\n penguins_sub$species)\np_vc_pool\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n bl bd fl bm\nbl 0.31 0.18 0.13 0.18\nbd 0.18 0.32 0.14 0.20\nfl 0.13 0.14 0.23 0.16\nbm 0.18 0.20 0.16 0.31\n```\n\n\n:::\n:::\n\n\nThis can be used to draw an ellipse corresponding to the pooled variance-covariance that is used by the LDA model.\n\n\n::: {.cell}\n\n:::\n\n\n## Checking assumptions\n\nThis LDA approach is widely applicable, but it is useful\nto check the underlying assumptions on which it depends: (1) that the cluster structure corresponding to each class forms an ellipse, showing that the class is consistent with a sample from a multivariate normal distribution, and (2) that the variance of values around each mean is nearly the same. @fig-lda-assumptions1 and @fig-lda-assumptions2 illustrate two datasets, of which only one is consistent with these assumptions. Other parametric models, such as quadratic discriminant analysis or logistic regression, also depend on assumptions about the data which should be validated. \index{classification!quadratic discriminant analysis (QDA)} \index{classification!logistic regression}\n\n::: {.content-visible when-format=\"html\"}\n::: info\nTo check the equal and elliptical variance-covariance assumption, generate points on the surface of an ellipse corresponding to the variance-covariance for each group. When watching these ellipses in a tour, they should be similar in all projections.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\infobox{To check the equal and elliptical variance-covariance assumption, generate points on the surface of an ellipse corresponding to the variance-covariance for each group. 
When watching these ellipses in a tour, they should be similar in all projections.\n}\n:::\n\n\index{classification!variance-covariance}\n\index{variance-covariance}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Generate ellipses for each group's variance-covariance\np_ell <- NULL\nfor (i in unique(penguins_sub$species)) {\n x <- penguins_sub %>% dplyr::filter(species == i)\n e <- gen_xvar_ellipse(x[,1:2], n=150, nstd=1.5)\n e$species <- i\n p_ell <- bind_rows(p_ell, e)\n}\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code for penguins data and ellipse plots\"}\nlda1 <- ggplot(penguins_sub, aes(x=bl, \n y=bd, \n colour=species)) +\n geom_point() +\n scale_color_discrete_divergingx(\"Zissou 1\") +\n xlim(-2.5, 3) + ylim(-2.5, 2.5) +\n ggtitle(\"(a)\") +\n theme_minimal() +\n theme(aspect.ratio = 1) \nlda2 <- ggplot(p_ell, aes(x=bl, \n y=bd, \n colour=species)) +\n geom_point() +\n scale_color_discrete_divergingx(\"Zissou 1\") +\n xlim(-2.5, 3) + ylim(-2.5, 2.5) +\n ggtitle(\"(b)\") +\n theme_minimal() +\n theme(aspect.ratio = 1)\nggarrange(lda1, lda2, ncol=2, \n common.legend = TRUE, legend = \"bottom\")\n```\n\n::: {.cell-output-display}\n![Scatterplot of bill depth by bill length of the penguins data, and corresponding variance-covariance ellipses. 
There is a small amount of difference between the ellipses, but they are similar enough to be confident in assuming the population variance-covariances are equal.](14-lda_files/figure-html/fig-lda-assumptions1-1.png){#fig-lda-assumptions1 width=768}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code for bushfires data and ellipse plots\"}\n# Now repeat for a data set that violates assumptions\ndata(bushfires)\nlda3 <- ggplot(bushfires, aes(x=log_dist_cfa, \n y=log_dist_road, \n colour=cause)) +\n geom_point() +\n scale_color_discrete_divergingx(\"Zissou 1\") +\n xlim(6, 11) + ylim(-1, 10.5) +\n ggtitle(\"(a)\") +\n theme_minimal() +\n theme(aspect.ratio = 1)\nb_ell <- NULL\nfor (i in unique(bushfires$cause)) {\n x <- bushfires %>% dplyr::filter(cause == i)\n e <- gen_xvar_ellipse(x[,c(57, 59)], n=150, nstd=2)\n e$cause <- i\n b_ell <- bind_rows(b_ell, e)\n}\nlda4 <- ggplot(b_ell, aes(x=log_dist_cfa, \n y=log_dist_road, \n colour=cause)) +\n geom_point() +\n scale_color_discrete_divergingx(\"Zissou 1\") +\n xlim(6, 11) + ylim(-1, 10.5) +\n ggtitle(\"(b)\") +\n theme_minimal() +\n theme(aspect.ratio = 1)\nggarrange(lda3, lda4, ncol=2, \n common.legend = TRUE, legend = \"bottom\")\n```\n\n::: {.cell-output-display}\n![Scatterplot of distance to cfa and road for the bushfires data, and corresponding variance-covariance ellipses. There is a lot of difference between the ellipses, so it cannot be assumed that the population variance-covariances are equal.](14-lda_files/figure-html/fig-lda-assumptions2-1.png){#fig-lda-assumptions2 width=768}\n:::\n:::\n\n\\index{data!bushfires}\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe equal and elliptical variance-covariance assumption is reasonable for the penguins data because the ellipse shapes roughly match the spread of the data. 
It is not a suitable assumption for the bushfires data, because the spread is not elliptically-shaped and varies in size between groups.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\insightbox{The equal and elliptical variance-covariance assumption is reasonable for the penguins data because the ellipse shapes roughly match the spread of the data. It is not a suitable assumption for the bushfires data, because the spread is not elliptically-shaped and varies in size between groups.}\n:::\n\nThis approach extends to any dimension. We would use the same projection sequence to view both the data and the variance-covariance ellipses, as in @fig-penguins-lda-ellipses-html. It can be seen that there is some difference in the shape and size of the ellipses between species, in some projections, and also in the spread of points in the projected data. However, the differences are small, so it would be safe to assume that the population variance-covariances are equal.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code for making animated gifs\"}\nlibrary(tourr)\np_ell <- NULL\nfor (i in unique(penguins_sub$species)) {\n x <- penguins_sub %>% dplyr::filter(species == i)\n e <- gen_xvar_ellipse(x[,1:4], n=150, nstd=1.5)\n e$species <- i\n p_ell <- bind_rows(p_ell, e)\n}\np_ell$species <- factor(p_ell$species)\nload(\"data/penguins_tour_path.rda\")\nanimate_xy(p_ell[,1:4], col=factor(p_ell$species))\nrender_gif(penguins_sub[,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, axes=\"off\", col=penguins_sub$species),\n gif_file=\"gifs/penguins_lda1.gif\",\n frames=500,\n loop=FALSE)\nrender_gif(p_ell[,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, axes=\"off\", col=p_ell$species),\n gif_file=\"gifs/penguins_lda2.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-lda-ellipses-html 
layout-ncol=2}\n\n![Data](gifs/penguins_lda1.gif){#fig-lda-4D-assumptions1 fig-alt=\"Animation showing a tour of the penguins data, with colour indicating species. The spread of points in each group is reasonably similar regardless of projection.\" width=300}\n\n![Variance-covariance ellipses](gifs/penguins_lda2.gif){#fig-lda-4D-assumptions2 fig-alt=\"Animation showing a tour of the ellipses corresponding to variance-covariance matrices for each species. The shape of the ellipse for each group is reasonably similar regardless of projection.\" width=300}\n\nChecking the assumption of equal variance-covariance matrices for the 4D penguins data. Each ellipse corresponds to the sample variance-covariance for each species.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-penguins-lda-ellipses-pdf layout-ncol=2}\n![Data](images/penguins_lda1.png){#fig-lda-4D-assumptions1 fig-alt=\"Single frame from a tour of the penguins data, with colour indicating species. Each group has slightly negatively elliptical shaped spread.\"}\n\n![Variance-covariance ellipses](images/penguins_lda2.png){#fig-lda-4D-assumptions2 fig-alt=\"Single frame from a tour of the ellipses corresponding to variance-covariance matrices for each species. Each ellipse has a north-west to south-east orientation. The red group is slightly smaller and more elliptical than the other two groups.\"}\n\nChecking the assumption of equal variance-covariance matrices for the 4D penguins data. Each ellipse corresponds to the sample variance-covariance for each species.\n:::\n:::\n\nAs a further check, we could generate three ellipses corresponding to the pooled variance-covariance matrix, as would be used in the model, centered at each of the means. Overlay these with the data, as done in @fig-penguins-lda-pooled-html. Then compare the spread of the observations in the data with the elliptical shape of the pooled variance-covariance. If they match reasonably well, we can safely use LDA. 
This can also be done group by group when multiple groups make it difficult to view all together. \n\n::: {.content-visible when-format=\"html\"}\n\n::: info\nTo check the fit of the equal variance-covariance assumption, simulate points on the ellipse corresponding to the **pooled sample variance-covariance matrix**. Generate one for each group centered at the group mean, and compare with the data.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{To check the fit of the equal variance-covariance assumption, simulate points on the ellipse corresponding to the \\emph{pooled sample variance-covariance matrix}. Generate one for each group centered at the group mean, and compare with the data.}\n:::\n\n\\index{classification!pooled variance-covariance}\n\\index{pooled variance-covariance}\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code for adding ellipses to data\"}\n# Create an ellipse corresponding to pooled vc\npool_ell <- gen_vc_ellipse(p_vc_pool, \n xm=rep(0, ncol(p_vc_pool)))\n\n# Add means to produce ellipses for each species\np_lda_pool <- data.frame(rbind(\n pool_ell +\n matrix(rep(p_lda$means[1,],\n each=nrow(pool_ell)), ncol=4),\n pool_ell +\n matrix(rep(p_lda$means[2,],\n each=nrow(pool_ell)), ncol=4),\n pool_ell +\n matrix(rep(p_lda$means[3,],\n each=nrow(pool_ell)), ncol=4)))\n# Create one data set with means, data, ellipses\np_lda_pool$species <- factor(rep(levels(penguins_sub$species),\n rep(nrow(pool_ell), 3)))\np_lda_pool$type <- \"ellipse\"\np_lda_means <- data.frame(\n p_lda$means,\n species=factor(rownames(p_lda$means)),\n type=\"mean\")\np_data <- data.frame(penguins_sub[,1:5], \n type=\"data\")\np_lda_all <- bind_rows(p_lda_means,\n p_data,\n p_lda_pool)\np_lda_all$type <- factor(p_lda_all$type, \n levels=c(\"mean\", \"data\", \"ellipse\"))\nshapes <- c(3, 4, 20)\np_pch <- shapes[p_lda_all$type]\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code to generate animated 
gifs\"}\n# Code to run the tour\nanimate_xy(p_lda_all[,1:4], col=p_lda_all$species, pch=p_pch)\nload(\"data/penguins_tour_path.rda\")\nrender_gif(p_lda_all[,1:4], \n planned_tour(pt1), \n display_xy(col=p_lda_all$species, pch=p_pch, \n axes=\"off\", half_range = 0.7),\n gif_file=\"gifs/penguins_lda_pooled1.gif\",\n frames=500,\n loop=FALSE)\n\n# Focus on one species\nrender_gif(p_lda_all[p_lda_all$species == \"Gentoo\",1:4], \n planned_tour(pt1), \n display_xy(col=\"#F5191C\", \n pch=p_pch[p_lda_all$species == \"Gentoo\"], \n axes=\"off\", half_range = 0.7),\n gif_file=\"gifs/penguins_lda_pooled2.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-lda-pooled-html layout-ncol=2}\n\n![All species](gifs/penguins_lda_pooled1.gif){#fig-lda-pooled1 fig-alt=\"Animation showing a tour of the pooled variance-covariance ellipse, computed for each species, overlaid on the data. The shape of the ellipse for each group is reasonably similar to the spread of the points in all projections.\" width=300}\n\n\n![Gentoo](gifs/penguins_lda_pooled2.gif){#fig-lda-pooled2 fig-alt=\"Animation showing a tour of the pooled variance-covariance ellipse overlaid on the data for the Gentoo penguins (red). The shape of the ellipse is reasonably similar to the spread of the points in all projections.\" width=300}\n\nChecking how the pooled variance-covariance matches the spread of points in each group.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-penguins-lda-pooled-pdf layout-ncol=2}\n\n![All species](images/penguins_lda_pooled1.png){#fig-lda-pooled1 fig-alt=\"A single frame from a tour of the pooled variance-covariance ellipse, computed for each species, overlaid on the data. 
The ellipse has a north-east to south-west orientation, and is almost circular, and reasonably matches the spread of the points in each group.\"}\n\n![Gentoo](images/penguins_lda_pooled2.png){#fig-lda-pooled2 fig-alt=\"A single frame from a tour of the pooled variance-covariance ellipse overlaid on the data for the Gentoo penguins (red). The ellipse has a north-west to south-east orientation, and is relatively skinny. This roughly matches the spread of the points.\"}\n\nChecking how the pooled variance-covariance matches the spread of points in each group.\n:::\n\n:::\n\n::: {.content-visible when-format=\"html\"}\n\n::: insight\nFrom the tour, we can see that equal elliptical variance-covariance is a reasonable assumption for the penguins data. In all projections the ellipses reasonably match the spread of the observations.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\insightbox{From the tour, we can see that equal elliptical variance-covariance is a reasonable assumption for the penguins data. In all projections the ellipses reasonably match the spread of the observations.\n}\n:::\n\n## Examining results\n\nThe boundaries for a classification model can be examined by: \n\n1. generating a large number of test points in the domain of the data\n2. predicting the class for each test point\n\nWe'll look at this for 2D using the LDA model fitted to `bl` and `bd` of the `penguins` data.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\np_bl_bd_lda <- lda(species~bl+bd, data=penguins_sub, \n prior = c(1/3, 1/3, 1/3))\n```\n:::\n\n\nThe fitted model means \n$\bar{x}_{Adelie} = ($ -0.95, 0.6$)^\top$, $\bar{x}_{Chinstrap} = ($ 0.89, 0.64$)^\top$, and $\bar{x}_{Gentoo} = ($ 0.65, -1.1$)^\top$ can be added to the plots. 
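\n\nIt is also worth recalling why LDA produces linear boundaries between the classes. Under the equal variance-covariance assumption, with pooled variance-covariance $\Sigma$ and prior probability $\pi_k$ for class $k$, the discriminant function is commonly written as\n\n$$\delta_k(x) = x^\top\Sigma^{-1}\bar{x}_k - \frac{1}{2}\bar{x}_k^\top\Sigma^{-1}\bar{x}_k + \log\pi_k$$\n\nThe boundary between classes $j$ and $k$ is the set of points where $\delta_j(x) = \delta_k(x)$. Any term quadratic in $x$ is common to all classes and cancels, leaving an equation that is linear in $x$.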
 \n\nThe boundaries can be examined using the `explore()` function from the `classifly` package, which generates observations in the range of all values of `bl` and `bd` and predicts their class. @fig-lda-2D-boundary shows the resulting prediction regions, with the observed data and the sample means overlaid.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Compute points in domain of data and predict\nlibrary(classifly)\n\np_bl_bd_lda_boundaries <- explore(p_bl_bd_lda, penguins_sub)\np_bl_bd_lda_m1 <- ggplot(p_bl_bd_lda_boundaries) +\n geom_point(aes(x=bl, y=bd, \n colour=species, \n shape=.TYPE), alpha=0.8) + \n scale_color_discrete_divergingx(\"Zissou 1\") +\n scale_shape_manual(values=c(46, 16)) +\n theme_minimal() +\n theme(aspect.ratio = 1, legend.position = \"none\")\n\np_bl_bd_lda_means <- data.frame(p_bl_bd_lda$means,\n species=rownames(p_bl_bd_lda$means))\np_bl_bd_lda_m1 + \n geom_point(data=p_bl_bd_lda_means, \n aes(x=bl, y=bd), \n colour=\"black\", \n shape=3,\n size=3) \n```\n\n::: {.cell-output-display}\n![Prediction regions of the LDA model for two variables of the three species of penguins, indicated by the small points. Large points are the observations, and the sample mean of each species is represented by the plus. The boundaries between groups can be seen to be roughly half-way between the means, taking the elliptical spread into account, and mostly distinguish the three species.](14-lda_files/figure-html/fig-lda-2D-boundary-1.png){#fig-lda-2D-boundary fig-alt='Square divided into three regions by the coloured points, primarily a wedge of yellow (Chinstrap) extending from the top right divides the other two groups. The regions mostly contain the observed values, with just a few over the boundary in the wrong region.' width=70%}\n:::\n:::\n\n\n\index{classification!boundaries}\n\index{tour!slice}\n\nThis approach can be readily extended to higher dimensions. 
One first fits the model with all four variables, and uses `explore()` to generate points in the 4D space with predictions, generating a representation of the prediction regions. @fig-penguins-lda-boundaries-html(a) shows the results using a slice tour [@slicetour]. Points inside the slice are shown in larger size. The slice is made in the centre of the data, to show the boundaries in this neighbourhood. As the tour progresses we see a thin slice through the centre of the data, parallel with the projection plane. In most projections there is some small overlap of points between groups, which happens because we are examining a 4D object with 2D projections. The slice helps to alleviate this, allowing a focus on the boundaries in the centre of the cube. In all projections the boundaries between groups are linear, as would be expected when using LDA. We can also see that the model roughly divides the cube into three relatively equally-sized regions.\n\n@fig-penguins-lda-boundaries-html(b) shows the three prediction regions, represented by points in 4D, projected into the discriminant space. Linear boundaries neatly divide the full space, which is to be expected because the LDA model computes its classification rules in this 2D space. 
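\n\nThe projection into the discriminant space can be written explicitly. If $A$ is the $4\times 2$ matrix of discriminant coefficients (the `scaling` component of the fitted model, the eigenvectors of $W^{-1}B$), then each observation $x$ is mapped to\n\n$$(\text{LD1}, \text{LD2}) = x^\top A$$\n\nso a boundary that is linear in the full 4D space remains linear after this mapping.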
\n\\index{classification!discriminant space}\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\np_lda <- lda(species ~ ., penguins_sub[,1:5], prior = c(1/3, 1/3, 1/3))\np_lda_boundaries <- explore(p_lda, penguins_sub)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for generating slice tour\"}\n# Code to run the tour\np_lda_boundaries$species\nanimate_slice(p_lda_boundaries[p_lda_boundaries$.TYPE == \"simulated\",1:4], col=p_lda_boundaries$species[p_lda_boundaries$.TYPE == \"simulated\"], v_rel=0.02, axes=\"bottomleft\")\nrender_gif(p_lda_boundaries[p_lda_boundaries$.TYPE == \"simulated\",1:4],\n planned_tour(pt1),\n display_slice(v_rel=0.02, \n col=p_lda_boundaries$species[p_lda_boundaries$.TYPE == \"simulated\"], \n axes=\"bottomleft\"), gif_file=\"gifs/penguins_lda_boundaries.gif\",\n frames=500,\n loop=FALSE\n )\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for projecting into LDA space\"}\n# Project the boundaries into the 2D discriminant space\np_lda_b_sub <- p_lda_boundaries[\n p_lda_boundaries$.TYPE == \"simulated\", \n c(1:4, 6)]\np_lda_b_sub_ds <- data.frame(as.matrix(p_lda_b_sub[,1:4]) %*%\n p_lda$scaling)\np_lda_b_sub_ds$species <- p_lda_b_sub$species\np_lda_b_sub_ds_p <- ggplot(p_lda_b_sub_ds, \n aes(x=LD1, y=LD2, \n colour=species)) +\n geom_point(alpha=0.5) + \n geom_point(data=p_lda_pred_x1, aes(x=LD1, \n y=LD2, \n shape=species),\n inherit.aes = FALSE) +\n scale_color_discrete_divergingx(\"Zissou 1\") +\n scale_shape_manual(values=c(1, 2, 3)) +\n theme_minimal() +\n theme(aspect.ratio = 1, \n legend.position = \"bottom\",\n legend.title = element_blank()) \n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-lda-boundaries-html layout-ncol=2}\n\n![4D](gifs/penguins_lda_boundaries.gif){#fig-lda-4D-boundaries fig-alt=\"Animation plot, where three groups of coloured points can be seen. 
They roughly break the square into three regions with linear boundaries, which varies in each projection.\" width=300}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Discriminant space](14-lda_files/figure-html/fig-lda-2D-boundaries-1.png){#fig-lda-2D-boundaries fig-alt='Scatterplot of points divided neatly into three colour groups by linear boundaries. The regions are split at the middle, and with the boundary between blue and red, blue and yellow, and yellow and red, starting near 11, 4, and o\\'clock, respectively.' width=480}\n:::\n:::\n\n\nExamining the boundaries produced by the LDA model in the full 4D with a slice tour (left) and in the discriminant space (right). Large points indicate observations within the slice, and dots are observations outside the slice. Focusing on the within-slice points, there is some overlap of points between regions in most views which represents the occlusion of 4D shapes when examining projections with thin slices. The linear boundaries are seen exactly in the discriminant space, that is they are orthogonal to these two dimensions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-penguins-lda-boundaries-pdf layout-ncol=2 fig-alt=\"FIX-ME\"}\n\n![4D](images/penguins_lda_boundaries.png){#fig-lda-4D-boundaries fig-alt=\"A scatterplot with large points showing roughly linear boundaries between three groups.\" width=200}\n\n![Discriminant space](images/fig-lda-2D-boundaries-1.png){#fig-lda-4D-ds fig-alt=\"Scatterplot of points divided neatly into three colour groups by linear boundaries. The regions are split at the middle, and with the boundary between blue and red, blue and yellow, and yellow and red, starting near 11, 4, and o'clock, respectively.\" width=200}\n\n\n::: {.cell}\n\n:::\n\n\nExamining the boundaries produced by the LDA model in the full 4D with a slice tour (left shows a single frame) and in the discriminant space (right). 
Large points indicate observations within the slice, and dots are observations outside the slice. Focusing on the within-slice points, there is some overlap of points between regions in most views which represents the occlusion of 4D shapes when examining projections with thin slices. The linear boundaries are seen exactly in the discriminant space, that is, they are orthogonal to these two dimensions.\n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nFrom the tour, we can see that the LDA boundaries divide the classes only in the discriminant space. It is not using the space orthogonal to the 2D discriminant space. You can see this because the boundary is sharp in just one 2D projection, while most of the projections show some overlap of regions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\insightbox{From the tour, we can see that the LDA boundaries divide the classes only in the discriminant space. It is not using the space orthogonal to the 2D discriminant space. You can see this because the boundary is sharp in just one 2D projection, while most of the projections show some overlap of regions.}\n:::\n\n## Exercises {-}\n\n1. For the `simple_clusters` data, compute the LDA model, and make a plot of the data, with points coloured by the class. Overlay variance-covariance ellipses, and a $+$ indicating the sample mean for each class. Is it reasonable to assume that the two classes are sampled from populations with the same variance-covariance?\n2. Examine the clusters corresponding to the classes in the `clusters` data set, using a tour. Based on the shape of the data, is the assumption of equal variance-covariance reasonable?\n3. Examine the pooled variance-covariance for the `clusters` data, overlaid on the data in a tour of the 5D space. Does it fit the variance of each cluster nicely?\n4. Fit an LDA model to the `simple_clusters` data. Examine the boundaries produced by the model, in 2D. \n5. Fit an LDA model to the `clusters` data. 
Examine the boundaries produced by the model in 5D.\n6. Assess the LDA assumptions for the `multicluster` data. Is LDA an appropriate model?\n7. Compute the first 12 PCs of the `sketches` data. Check the assumption of equal, elliptical variance-covariance of the 6 groups. Regardless of whether you decide that the assumption is satisfied or not, fit an LDA to the 12 PCs. Extract the discriminant space (the `x` component of the `predict` object), and examine the separation (or not) of the 6 groups in this 5D space. Is LDA providing a good classification model for this data?\n8. Even though the `bushfires` data does not satisfy the assumptions for LDA, fit LDA to the first five PCs. Examine the class differences in the 3D discriminant space. \n9. Compute the boundary between classes, for the LDA model where the prior probability reflects the sample size, and the LDA model where the priors are equal for all groups. How does the boundary between lightning caused fires and the other groups change?\n\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "14-lda_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/14-lda/figure-html/fig-lda-2D-boundaries-1.png b/_freeze/14-lda/figure-html/fig-lda-2D-boundaries-1.png new file mode 100644 index 0000000..cf80e47 Binary files /dev/null and b/_freeze/14-lda/figure-html/fig-lda-2D-boundaries-1.png differ diff --git a/_freeze/14-lda/figure-html/fig-lda-2D-boundary-1.png b/_freeze/14-lda/figure-html/fig-lda-2D-boundary-1.png new file mode 100644 index 0000000..2f555ba Binary files /dev/null and b/_freeze/14-lda/figure-html/fig-lda-2D-boundary-1.png differ diff --git a/_freeze/14-lda/figure-html/fig-lda-assumptions1-1.png b/_freeze/14-lda/figure-html/fig-lda-assumptions1-1.png new file mode 100644 index 0000000..6870e59 Binary files /dev/null and 
b/_freeze/14-lda/figure-html/fig-lda-assumptions1-1.png differ diff --git a/_freeze/14-lda/figure-html/fig-lda-assumptions2-1.png b/_freeze/14-lda/figure-html/fig-lda-assumptions2-1.png new file mode 100644 index 0000000..4f37e6b Binary files /dev/null and b/_freeze/14-lda/figure-html/fig-lda-assumptions2-1.png differ diff --git a/_freeze/14-lda/figure-html/fig-p-lda-1.png b/_freeze/14-lda/figure-html/fig-p-lda-1.png new file mode 100644 index 0000000..403a781 Binary files /dev/null and b/_freeze/14-lda/figure-html/fig-p-lda-1.png differ diff --git a/_freeze/14-lda/figure-pdf/fig-lda-2D-boundaries-1.pdf b/_freeze/14-lda/figure-pdf/fig-lda-2D-boundaries-1.pdf index 75d5189..4024e31 100644 Binary files a/_freeze/14-lda/figure-pdf/fig-lda-2D-boundaries-1.pdf and b/_freeze/14-lda/figure-pdf/fig-lda-2D-boundaries-1.pdf differ diff --git a/_freeze/14-lda/figure-pdf/fig-lda-2D-boundary-1.pdf b/_freeze/14-lda/figure-pdf/fig-lda-2D-boundary-1.pdf index 0d2a2b6..16c5ec8 100644 Binary files a/_freeze/14-lda/figure-pdf/fig-lda-2D-boundary-1.pdf and b/_freeze/14-lda/figure-pdf/fig-lda-2D-boundary-1.pdf differ diff --git a/_freeze/14-lda/figure-pdf/fig-lda-assumptions1-1.pdf b/_freeze/14-lda/figure-pdf/fig-lda-assumptions1-1.pdf index d4b1d63..c1a1772 100644 Binary files a/_freeze/14-lda/figure-pdf/fig-lda-assumptions1-1.pdf and b/_freeze/14-lda/figure-pdf/fig-lda-assumptions1-1.pdf differ diff --git a/_freeze/14-lda/figure-pdf/fig-lda-assumptions2-1.pdf b/_freeze/14-lda/figure-pdf/fig-lda-assumptions2-1.pdf index c067819..74247a5 100644 Binary files a/_freeze/14-lda/figure-pdf/fig-lda-assumptions2-1.pdf and b/_freeze/14-lda/figure-pdf/fig-lda-assumptions2-1.pdf differ diff --git a/_freeze/14-lda/figure-pdf/fig-p-lda-1.pdf b/_freeze/14-lda/figure-pdf/fig-p-lda-1.pdf index 68750a7..4c224f4 100644 Binary files a/_freeze/14-lda/figure-pdf/fig-p-lda-1.pdf and b/_freeze/14-lda/figure-pdf/fig-p-lda-1.pdf differ diff --git a/_freeze/15-forests/execute-results/html.json 
b/_freeze/15-forests/execute-results/html.json new file mode 100644 index 0000000..91a5d2d --- /dev/null +++ b/_freeze/15-forests/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "e4a0043cf7b524ba5a6b0c73116cad6d", + "result": { + "engine": "knitr", + "markdown": "# Trees and forests {#sec-trees-forests}\n\n## Trees {#sec-trees}\n\\index{classification!trees}\n\nThe tree algorithm [@BFOS84] is a simple and versatile algorithmic method for supervised classification. The basic tree algorithm generates a classification rule by sequentially splitting the data into two buckets. Splits are made between sorted data values of individual variables, with the goal of obtaining pure classes on each side of the split. The inputs for a simple tree classifier commonly include (1) an impurity measure, an indication of the relative diversity among the cases in the terminal nodes; (2) a parameter that sets the minimum number of cases in a node, or the minimum number of observations in a terminal node of the tree; and (3) a complexity measure that controls the growth of a tree, balancing the use of a simple generalizable tree against a more accurate tree\ntailored to the sample. When applying tree methods, exploring the effects of the input parameters on the tree is instructive; for example, it helps us to assess the stability of the tree model.\n\nAlthough algorithmic models do not depend on distributional assumptions, that does not mean that every algorithm is suitable for all data. For example, the tree model works best when all variables are independent within each class, because it does not take such dependencies into account. Visualization can help us to determine whether a particular model should be applied. In classification problems, it is useful to explore the cluster structure, comparing the clusters with the classes and looking for evidence of correlation within each class. 
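\n\nThe within-class correlation that causes trouble for a tree can also be checked numerically. Here is a minimal sketch using the standardised penguins data from the previous chapter (the variable pair `bl`, `bd` is chosen just for illustration):\n\n```{.r .cell-code}\n# A tree splits on one variable at a time, so strong correlation\n# within each class suggests single-variable splits will fit poorly\nlibrary(dplyr)\nload(\"data/penguins_sub.rda\")\npenguins_sub %>%\n  group_by(species) %>%\n  summarise(r = cor(bl, bd))\n```\n\nLarge values of `r` within each species indicate that the best separations use combinations of variables, which single-variable splits cannot capture.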
\nThe plots in @fig-lda-assumptions1 and @fig-penguins-lda-ellipses-html show a strong correlation between the variables within each species, which indicates that the tree model may not give good results for the penguins data. We'll show how this is the case with two variables initially, and then extend to the four variables.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Draw tree and model boundaries\"}\nlibrary(mulgar)\nlibrary(rpart)\nlibrary(rpart.plot)\nlibrary(colorspace)\nlibrary(classifly)\nlibrary(ggplot2)\nlibrary(ggdendro)\nlibrary(patchwork)\nlibrary(ggthemes)\n\nload(\"data/penguins_sub.rda\")\np_bl_bd_tree <- rpart(species~bl+bd, data=penguins_sub)\n#f1 <- rpart.plot(p_bl_bd_tree, box.palette=\"Grays\")\nd <- dendro_data(p_bl_bd_tree)\nf1 <- ggplot() +\n geom_segment(data = d$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) +\n geom_text(data = d$labels, \n aes(x = x, y = y, \n label = label), size = 2.7, \n vjust = 1.2) +\n geom_text(data = d$leaf_labels,\n aes(x = x, y = y, \n label = label), size = 2.5, \n vjust = 2, hjust=c(0,0.5,0,0.5)) + \n expand_limits(x=0.9, y=0) +\n theme_dendro()\n\np_bl_bd_tree_boundaries <- explore(p_bl_bd_tree, penguins_sub)\nf2 <- ggplot(p_bl_bd_tree_boundaries) +\n geom_point(aes(x=bl, y=bd, colour=species, shape=.TYPE)) + \n scale_color_discrete_divergingx(palette=\"Zissou 1\") +\n scale_shape_manual(values=c(46, 16)) +\n theme_minimal() +\n theme(aspect.ratio = 1, legend.position = \"none\")\n\nf1 + f2 + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![The association between variables in the penguins data causes problems for fitting a tree model. 
Although the model, computed using only bl and bd, is simple (left), the fit is poor (right) because it doesn't adequately utilise combinations of variables.](15-forests_files/figure-html/fig-p-bl-bd-tree-1.png){#fig-p-bl-bd-tree fig-alt='Tree diagram with top split bl<0.3004, leading to Adelie branch, second split at bd >= -0.4138, leading to Gentoo branch, and final split at bl< 0.1476, leading to Adelie and Chinstrap branches. The scatterplot at right shows bd vs bl, with three predictive region partitions, and the data is overplotted. The elliptical spreads of data points cross the rectangular partitions in places.' width=100%}\n:::\n:::\n\n\nThe plots in @fig-p-bl-bd-tree show the inadequacies of the tree fit. The background color indicates the class predictions, and thus the boundaries produced by the tree fit. They can be seen to be boxy, missing the elliptical nature of the penguin clusters. This produces indefensible errors in the classification of observations. One could always force the tree to fit the data more closely by adjusting the parameters, but the main problem persists: one is trying to fit elliptical shapes using boxes.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nThere are less strict assumptions for a non-parametric model but it is still important to understand the model fit relative to the data. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\infobox{There are less strict assumptions for a non-parametric model but it is still important to understand the model fit relative to the data. \n}\n:::\n\nThe boundaries for the tree model on all four variables of the penguins data can be viewed similarly, by predicting a set of points randomly generated in the 4D domain of observed values. @fig-penguins-lda-tree-html shows the prediction regions for LDA and a default tree in a slice tour [@slicetour]. The slice tour is used to help see into the middle of the 4D cube. 
It slices the cube through the centre of the data, where the boundaries of the regions should meet. \n\nThe prediction regions of the default fitted tree are shown in comparison to those from the LDA model. We don't show the tree diagram here, but the tree makes only six splits, which is delightfully simple. However, just like the model fitted to two variables, the result is not adequate for the penguins data. The tree model generates boxy boundaries, whereas the LDA model splits the 4D cube obliquely. The boxy regions don't capture the differences between the elliptically-shaped clusters. Overlaying the observed data on this display would make this clearer, but the boundaries are easier to examine without them.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make animated gifs of slice tour of boundaries\"}\np_tree <- rpart(species~., data=penguins_sub[,1:5])\nrpart.plot(p_tree, box.palette=\"Grays\")\n\np_tree_boundaries <- explore(p_tree, penguins_sub)\nanimate_slice(p_tree_boundaries[p_tree_boundaries$.TYPE == \"simulated\",1:4], col=p_tree_boundaries[p_tree_boundaries$.TYPE == \"simulated\",6], v_rel=0.02, axes=\"bottomleft\")\nload(\"data/penguins_tour_path.rda\")\nrender_gif(p_tree_boundaries[p_tree_boundaries$.TYPE == \"simulated\",1:4],\n planned_tour(pt1),\n display_slice(v_rel=0.02, \n col=p_tree_boundaries[p_tree_boundaries$.TYPE == \"simulated\",6], \n axes=\"bottomleft\"), gif_file=\"gifs/penguins_tree_boundaries.gif\",\n frames=500,\n loop=FALSE\n )\n```\n:::\n\n\n\\index{tour!slice} \n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-lda-tree-html layout-ncol=2}\n\n![LDA model](gifs/penguins_lda_boundaries.gif){#fig-lda-boundary fig-alt=\"FIX ME\" width=300}\n\n![Tree model](gifs/penguins_tree_boundaries.gif){#fig-tree-boundary fig-alt=\"FIX ME\" width=300}\n\nComparison of the boundaries produced by the LDA (a) and the tree (b) model, using a slice tour. 
The tree boundaries are more box-shaped than the LDA boundaries, and do not adequately capture the differences between the elliptically-shaped clusters of the penguins data.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-penguins-lda-tree-pdf layout-ncol=2}\n\n![LDA model](images/penguins_lda_boundaries.png){#fig-lda-boundary fig-alt=\"FIX ME\"}\n\n![Tree model](images/penguins_tree_boundaries.png){#fig-tree-boundary fig-alt=\"FIX ME\"}\n\nComparison of the boundaries produced by the LDA (a) and the tree (b) model, using a slice tour. (Here only a single frame is shown.) The tree boundaries are more box-shaped than the LDA boundaries, and do not adequately capture the differences between the elliptically-shaped clusters of the penguins data.\n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nWith the penguins data, a tree model may not be a good choice due to the strong correlation between variables. The best separation is in combinations of variables, not the single-variable tree splits. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{With the penguins data, a tree model may not be a good choice due to the strong correlation between variables. The best separation is in combinations of variables, not the single-variable tree splits.}\n:::\n\n\n\n## Random forests \n\\index{classification!random forest}\n\nA random forest [@Br01] is a classifier that is built from multiple trees generated by randomly sampling the cases and the variables. The random sampling (with replacement) of cases has the fortunate effect of creating a training (\"in-bag\") and a test (\"out-of-bag\") sample for each tree computed. The class of each case in the out-of-bag sample for each tree is predicted, and the predictions for all trees are combined into a vote for the class identity. 
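The in-bag/out-of-bag split arises purely from the bootstrap: sampling $n$ cases with replacement leaves, on average, about a third of the cases out of each sample. A small self-contained sketch (not part of the book's code) illustrating this:

```r
# Bootstrap sampling: cases not drawn form the "out-of-bag" set for that tree
set.seed(123)
n <- 1000
in_bag <- unique(sample(seq_len(n), n, replace = TRUE))
oob <- setdiff(seq_len(n), in_bag)
length(oob) / n  # close to (1 - 1/n)^n, i.e. about exp(-1) ~ 0.37
```

Each tree therefore gets its own test set for free, which is what makes the out-of-bag error estimate possible.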
\n\nA random forest is a computationally intensive method, a \"black box\" classifier, but it produces several diagnostics that make the outcome less mysterious. Some diagnostics that help us to assess the model are the votes, the measure of variable importance, and the proximity matrix.\n\n### Examining the votes matrix {#sec-votes}\n\nHere we show how to use the `randomForest` [@randomForest2002] votes matrix for the penguins data to investigate confusion between classes, and observations which are problematic to classify. The votes matrix can be considered to be a predictive probability distribution, where the values for each observation sum to 1. With only three classes the votes matrix is only a 2D object, and thus easy to examine. With four or more classes the votes matrix needs to be examined in a tour. \n\\index{classification!vote matrix}\n\\index{classification!predictive probability distribution}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(randomForest)\nlibrary(dplyr)\npenguins_rf <- randomForest(species~.,\n data=penguins_sub[,1:5],\n importance=TRUE)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins_rf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\n randomForest(formula = species ~ ., data = penguins_sub[, 1:5], importance = TRUE) \n Type of random forest: classification\n Number of trees: 500\nNo. of variables tried at each split: 2\n\n OOB estimate of error rate: 2.4%\nConfusion matrix:\n Adelie Chinstrap Gentoo class.error\nAdelie 143 2 1 0.020547945\nChinstrap 4 64 0 0.058823529\nGentoo 0 1 118 0.008403361\n```\n\n\n:::\n:::\n\n\n\nTo examine the votes matrix, we extract the `votes` element from the random forest model object. 
The first five rows are:\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(penguins_rf$votes, 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Adelie Chinstrap Gentoo\n1 1.0000 0.00000 0\n2 0.9737 0.02632 0\n3 0.9884 0.01156 0\n4 1.0000 0.00000 0\n5 1.0000 0.00000 0\n```\n\n\n:::\n:::\n\n\nThis has three columns corresponding to the three species, but because each row is a set of proportions it is only a 2D object. To reduce the dimension from 3D to 2D we use a Helmert matrix [@helmert]. A Helmert matrix has a first row of all 1's. The remaining rows have 1's in the lower triangle, 0's in the upper triangle, and each diagonal element is the negative sum of the other entries in its row. The rows are usually normalised to have length 1. They are used to create contrasts to test combinations of factor levels for post-testing after Analysis of Variance (ANOVA). For compositional data, like the votes matrix, when the first row is removed a Helmert matrix can be used to reduce the dimension appropriately. For three classes, this will generate the common 2D ternary diagram, but for higher dimensions it will reduce to a $(g-1)$-dimensional simplex. 
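To make this construction concrete, here is a from-scratch sketch of a normalised Helmert matrix (the function name `helmert` is ours; `geozoo::f_helmert()`, used below, provides the same values):

```r
# Build a g x g Helmert matrix: first row all 1's, 1's in the lower
# triangle, each diagonal element the negative sum of the rest of its
# row, then normalise each row to length 1
helmert <- function(g) {
  h <- matrix(0, nrow = g, ncol = g)
  h[1, ] <- 1
  for (i in 2:g) {
    h[i, 1:(i - 1)] <- 1
    h[i, i] <- -(i - 1)
  }
  h / sqrt(rowSums(h^2))
}
round(helmert(3), 4)
```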
For the penguins data, the Helmert matrix for 3D is \n\\index{ternary diagram}\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to compute Helmert matrix\"}\ngeozoo::f_helmert(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2] [,3]\nhelmert 0.5774 0.5774 0.5774\nx 0.7071 -0.7071 0.0000\nx 0.4082 0.4082 -0.8165\n```\n\n\n:::\n:::\n\n\nWe drop the first row, transpose it, and use matrix multiplication with the votes matrix to get the ternary diagram.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Project the 3D votes matrix into 2D\nlibrary(geozoo)\nproj <- t(geozoo::f_helmert(3)[-1,])\np_rf_v_p <- as.matrix(penguins_rf$votes) %*% proj\ncolnames(p_rf_v_p) <- c(\"x1\", \"x2\")\np_rf_v_p <- p_rf_v_p %>%\n as.data.frame() %>%\n mutate(species = penguins_sub$species)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Add simplex\nsimp <- simplex(p=2)\nsp <- data.frame(cbind(simp$points), simp$points[c(2,3,1),])\ncolnames(sp) <- c(\"x1\", \"x2\", \"x3\", \"x4\")\nsp$species = sort(unique(penguins_sub$species))\np_ternary <- ggplot() +\n geom_segment(data=sp, aes(x=x1, y=x2, xend=x3, yend=x4)) +\n geom_text(data=sp, aes(x=x1, y=x2, label=species),\n nudge_x=c(-0.06, 0.07, 0),\n nudge_y=c(0.05, 0.05, -0.05)) +\n geom_point(data=p_rf_v_p, aes(x=x1, y=x2, \n colour=species), \n size=2, alpha=0.5) +\n scale_color_discrete_divergingx(palette=\"Zissou 1\") +\n theme_map() +\n theme(aspect.ratio=1, legend.position=\"none\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate animated gifs\"}\n# Look at the votes matrix, in its 3D space\nanimate_xy(penguins_rf$votes, col=penguins_sub$species)\n\n# Save an animated gif\nrender_gif(penguins_rf$votes,\n grand_tour(),\n display_xy(v_rel=0.02, \n col=penguins_sub$species, \n axes=\"bottomleft\"), \n gif_file=\"gifs/penguins_rf_votes.gif\",\n frames=500,\n loop=FALSE\n)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-votes-html 
layout-ncol=2}\n\n![3D](gifs/penguins_rf_votes.gif){#fig-p-votes-tour fig-alt=\"FIX ME\" width=300}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![2D ternary diagram](15-forests_files/figure-html/fig-p-votes-ggplot-html-1.png){#fig-p-votes-ggplot-html width=384}\n:::\n:::\n\n\nExamining the votes matrix from a random forest fit to the penguins: (a) from a tour of the 3D, (b) projected into 2D, to make a ternary diagram. In 3D the points can be seen to lie along a 2D plane, which is due to the constraint that the values sum to 1. From the ternary diagram, the classes can be seen to be reasonably well distinguished because points mostly lie at the vertices. There are a few penguins that are confused with a different species, as seen from the few points spread between vertices.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-penguins-votes-pdf layout-ncol=2}\n\n![3D](images/penguins_rf_votes.png){#fig-p-votes-tour fig-alt=\"FIX ME\"}\n\n![2D ternary diagram](images/fig-p-votes-ggplot-1.png){#fig-p-votes-ggplot-pdf fig-alt=\"FIX ME\"}\n\nExamining the votes matrix from a random forest fit to the penguins: (a) in a frame from a tour of the 3D, (b) projected into 2D, to make a ternary diagram. In 3D the points can be seen to lie along a 2D plane, which is due to the constraint that the values sum to 1. From the ternary diagram, the classes can be seen to be reasonably well distinguished because points mostly lie at the vertices. There are a few penguins that are confused with a different species, as seen from the few points spread between vertices.\n:::\n:::\n\n\nWe can use the `geozoo` package to generate the surrounding simplex, which for 2D is a triangle.\n\nThe votes matrix reports the proportion of trees in which each observation is classified as each class. From the tour of the votes matrix, as in @fig-penguins-votes-html(a), it can be seen to be 2D in 3D space. 
This is due to the constraint that the three proportions for each observation sum to 1. Using a Helmert matrix, this data can be projected into the 2D space, or more generally the $(g-1)$-dimensional space where it resides, shown in @fig-penguins-votes-html(b). In 2D this is called a ternary diagram, and in higher dimensions the bounding shapes might be considered to be a simplex. The vertices of this shape correspond to $(1,0,0), (0,1,0), (0,0,1)$ (and analogously for higher dimensions), which represent perfect confidence that an observation is classified into that group all the time.\n\nThe concentration of points in the corners of the triangle indicates that most of the penguins are confidently classified into their correct class. There is more separation between the Gentoo and the others than between Chinstrap and Adelie. That means that as a group Gentoo are more distinguishable. Only one of the Gentoo penguins has substantial confusion, mostly confused as a Chinstrap, but occasionally confused as an Adelie -- if it was only ever confused as a Chinstrap it would fall on the edge between Gentoo and Chinstrap. There are quite a few Chinstrap and Adelie penguins confused as each other, with a couple of each more confidently predicted to be the other class. This can be seen because there are points of the wrong colour close to those vertices. \n\nThe votes matrix is useful for investigating the fit, but one should remember that there are some structural elements of the penguins data that don't lend themselves to tree models. Although a forest has the capacity to generate non-linear boundaries by combining predictions from multiple trees, it is still based on the boxy boundaries of trees. This makes it less suitable for the penguins data with elliptical classes. 
You could use the techniques from the previous section to explore the boundaries produced by the forest, and you will find that they are more boxy than those of the LDA model.\n\\index{classification!vote matrix}\n\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{By visualising the votes matrix we can understand which observations are harder to classify, and which of the classes are more easily confused with each other.}\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nBy visualising the votes matrix we can understand which observations are harder to classify, and which of the classes are more easily confused with each other.\n:::\n:::\n\nTo examine a votes matrix for a problem with more classes, we will examine the 10-class `fake_trees` data example. The full data has 100 variables, and we have seen from @sec-clust-graphics that reducing to 10 principal components allows the linear branching structure in the data to be seen. Given that the branches correspond to the classes, it will be interesting to see how well the random forest model performs.\n\\index{data!fake trees}\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(mulgar)\nlibrary(dplyr)\nlibrary(liminal)\nft_pca <- prcomp(fake_trees[,1:100], \n scale=TRUE, retx=TRUE)\nft_pc <- as.data.frame(ft_pca$x[,1:10])\nft_pc$branches <- fake_trees$branches\nlibrary(randomForest)\nft_rf <- randomForest(branches~., data=ft_pc, \n importance=TRUE)\n```\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(ft_rf$votes, 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n 0 1 2 3 4 5 6 7 8 9\n1 0.9 0 0.02 0.000 0.01 0.11 0 0 0.006 0.00\n2 0.7 0 0.02 0.000 0.02 0.31 0 0 0.000 0.00\n3 0.8 0 0.04 0.000 0.11 0.02 0 0 0.000 0.02\n4 0.9 0 0.01 0.000 0.00 0.08 0 0 0.000 0.00\n5 0.6 0 0.04 0.005 0.03 0.28 0 0 0.005 0.00\n```\n\n\n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nft_rf_votes <- ft_rf$votes %>%\n 
as_tibble() %>%\n mutate(branches = fake_trees$branches)\n\nproj <- t(geozoo::f_helmert(10)[-1,])\nf_rf_v_p <- as.matrix(ft_rf_votes[,1:10]) %*% proj\ncolnames(f_rf_v_p) <- c(\"x1\", \"x2\", \"x3\", \"x4\", \"x5\", \"x6\", \"x7\", \"x8\", \"x9\")\nf_rf_v_p <- f_rf_v_p %>%\n as.data.frame() %>%\n mutate(branches = fake_trees$branches)\n\nsimp <- geozoo::simplex(p=9)\nsp <- data.frame(simp$points)\ncolnames(sp) <- c(\"x1\", \"x2\", \"x3\", \"x4\", \"x5\", \"x6\", \"x7\", \"x8\", \"x9\")\nsp$branches = \"\"\nf_rf_v_p_s <- bind_rows(sp, f_rf_v_p) %>%\n mutate(branches = factor(branches))\nlabels <- c(\"0\" , \"1\", \"2\", \"3\", \"4\", \"5\", \"6\", \"7\", \"8\", \"9\",\n rep(\"\", 3000))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make animated gifs\"}\nanimate_xy(f_rf_v_p_s[,1:9], col = f_rf_v_p_s$branches, \n axes = \"off\", half_range = 0.8,\n edges = as.matrix(simp$edges),\n obs_labels = labels, palette = \"Viridis\")\n\nrender_gif(f_rf_v_p_s[,1:9],\n grand_tour(),\n display_xy(col = f_rf_v_p_s$branches, \n axes = \"off\", half_range = 0.8,\n edges = as.matrix(simp$edges),\n obs_labels = labels, palette=\"Viridis\"),\n gif_file=\"gifs/ft_votes.gif\",\n frames=500) \n```\n:::\n\n:::\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-ft-votes-html layout-ncol=2}\n\n![The 9D votes matrix for the 10-class fake_trees data in a tour.](gifs/ft_votes.gif){#fig-ft-votes-tour fig-alt=\"FIX ME\" width=300}\n\n![Several static views from the tour revealing how clusters connect.](images/ft-votes.png){#fig-ft-votes-prj fig-alt=\"FIX ME\" width=300}\n\nA tour and several static views of the votes matrix. Lines are the edges of the 9D simplex, which bounds the shape. Points mostly concentrate in the vertices, or spread along one of the edges, which means that most observations clearly belong to one group, or are confused with a single other group. 
The exception to this is class 0, which spreads in many directions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-ft-votes-pdf layout-ncol=2}\n\n![](images/ft-votes.png){fig-alt=\"FIX ME\" width=400}\n\nSeveral static views from the tour of the votes matrix. Lines are the edges of the 9D simplex, which bounds the shape. Points mostly concentrate in the vertices, or spread along one of the edges, which means that most observations clearly belong to one group, or are confused with a single other group. The exception to this is class 0, which spreads in many directions.\n:::\n:::\n\n\\index{data!fake trees}\nThe votes matrix is 9D, because the proportions for the 10 groups sum to 1. With this many dimensions, if the cluster structure is weak, it will look messy in a tour. However, what we can see in @fig-ft-votes-html is that the structure is relatively simple, and very interesting in that it suggests a strong clustering of classes. Points are coloured by their true class. The lines represent the 9D simplex that bounds the observations, akin to the triangle in the ternary diagram.\n\nPoints concentrate at the vertices, which means that most are confidently predicted to be their true class. The greatest spread of points is along single edges, between pairs of vertices. This means that when there is confusion it is mostly with just one other group. One vertex (0) has connections to all other vertices. That is, there are points stretching from this vertex to every other. It means that some observations in every other class can be confused with class 0, and class 0 observations can be confused with every other class. This information suggests that cluster 0 is central to all the other clusters. \n\n\nSome of this information could also be inferred from the confusion matrix for the model. However, visualising the votes matrix provides more intricate details. Here we have seen that the points spread out from a vertex, with fewer and fewer points the further one gets. 
It allows us to see the distribution of points, which is not possible from the confusion matrix alone. The same misclassification rate could be due to a variety of distributions. The visual pattern in the votes matrix is striking, and gives additional information about how the cluster distribution, and the shapes of the clusters, match the class labels. It reinforces that the clusters are linear, extending into different dimensions of the 100D space, but really only into about 8D (as we'll see from the variable importance explanation below). We also see that nine of the clusters are all connected to a single cluster.\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe votes matrix for the fake trees has a striking geometric structure, with one central cluster connected to all other clusters, each of which is distinct from the others.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{The votes matrix for the fake trees has a striking geometric structure, with one central cluster connected to all other clusters, each of which is distinct from the others.}\n:::\n\n### Using variable importance {#sec-forest-var-imp}\n\\index{classification!variable importance}\n\nThe variable importance scores, across all classes and for each class, are useful for choosing variables to enter into a tour, to explore class differences. This is particularly so when there are many variables, as in the fake_trees data. 
We would also expect that, for this data, the importance will differ between some classes.\n\n\n\n::: {#tbl-ft-importance .cell tbl-cap='Variable importance from the random forest fit to the fake trees data, for each of the 10 classes, and using the accuracy and Gini metrics.'}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(gt)\nft_rf$importance %>% \n as_tibble(rownames=\"Var\") %>% \n rename(Acc=MeanDecreaseAccuracy,\n Gini=MeanDecreaseGini) %>%\n #arrange(desc(Gini)) %>%\n gt() %>%\n fmt_number(columns = c(`0`,`1`,`2`,`3`,`4`,\n `5`,`6`,`7`,`8`,`9`),\n decimals = 1) %>% \n fmt_number(columns = Acc,\n decimals = 2) %>%\n fmt_number(columns = Gini,\n decimals = 0)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>Var</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>Acc</th><th>Gini</th></tr>
</thead>
<tbody>
<tr><td>PC1</td><td>0.1</td><td>0.3</td><td>0.5</td><td>0.3</td><td>0.2</td><td>0.5</td><td>0.4</td><td>0.2</td><td>0.3</td><td>0.3</td><td>0.31</td><td>484</td></tr>
<tr><td>PC2</td><td>0.1</td><td>0.2</td><td>0.2</td><td>0.5</td><td>0.3</td><td>0.3</td><td>0.2</td><td>0.4</td><td>0.2</td><td>0.3</td><td>0.28</td><td>376</td></tr>
<tr><td>PC3</td><td>0.1</td><td>0.1</td><td>0.1</td><td>0.1</td><td>0.5</td><td>0.1</td><td>0.1</td><td>0.1</td><td>0.2</td><td>0.2</td><td>0.16</td><td>304</td></tr>
<tr><td>PC4</td><td>0.1</td><td>0.5</td><td>0.1</td><td>0.0</td><td>0.1</td><td>0.0</td><td>0.4</td><td>0.1</td><td>0.1</td><td>0.1</td><td>0.14</td><td>342</td></tr>
<tr><td>PC5</td><td>0.1</td><td>0.1</td><td>0.3</td><td>0.1</td><td>0.2</td><td>0.2</td><td>0.1</td><td>0.1</td><td>0.3</td><td>0.2</td><td>0.18</td><td>337</td></tr>
<tr><td>PC6</td><td>0.1</td><td>0.2</td><td>0.2</td><td>0.2</td><td>0.0</td><td>0.1</td><td>0.0</td><td>0.3</td><td>0.1</td><td>0.2</td><td>0.15</td><td>282</td></tr>
<tr><td>PC7</td><td>0.1</td><td>0.0</td><td>0.2</td><td>0.0</td><td>0.1</td><td>0.1</td><td>0.1</td><td>0.3</td><td>0.1</td><td>0.1</td><td>0.11</td><td>258</td></tr>
<tr><td>PC8</td><td>0.0</td><td>0.1</td><td>0.0</td><td>0.2</td><td>0.1</td><td>0.1</td><td>0.0</td><td>0.0</td><td>0.1</td><td>0.3</td><td>0.09</td><td>216</td></tr>
<tr><td>PC9</td><td>0.1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.02</td><td>58</td></tr>
<tr><td>PC10</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.01</td><td>43</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\nFrom the variable importance (@tbl-ft-importance), we can see that PC9 and PC10 do not substantially contribute. That means the 100D data can be reduced to 8D without losing the information about the cluster structure. PC1 is most important overall, and the order matches the PC order, as might be expected because highest variance corresponds to the most spread clusters. Each cluster has a different set of variables that are important. For example, the variables important for distinguishing cluster 1 are PC1 and PC4, and for cluster 2 they are PC1 and PC5. \n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nClass-wise variable importance helps to find a subspace on which to tour to examine how this class cluster differs from the others.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Class-wise variable importance helps to find a subspace on which to tour to examine how this class cluster differs from the others.}\n:::\n\nWe can use the accuracy information to choose variables to provide to the tour. Overall, one would sequentially add the variables into a tour based on their accuracy or Gini value. Here we simply start with the first three PCs, and then sequentially add the PCs to examine how distinct the clusters are with or without the extra variable. It can be helpful to focus on a single class against all the others. 
To do this create a new binary class variable, indicating that the observation belongs to class $k$ or not, as follows: \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nft_pc <- ft_pc %>%\n mutate(cl1 = factor(case_when(\n branches == \"0\" ~ \"0\",\n branches == \"1\" ~ \"1\",\n .default = \"other\"\n )))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make animated gifs\"}\nanimate_xy(ft_pc[,c(\"PC1\", \"PC2\", \"PC4\", \"PC6\")], col=ft_pc$cl1, palette=\"Viridis\")\nrender_gif(ft_pc[,c(\"PC1\", \"PC2\", \"PC4\", \"PC6\")],\n grand_tour(),\n display_xy(col=ft_pc$cl1, palette=\"Viridis\"),\n gif_file=\"gifs/ft_cl1.gif\",\n frames=500)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plot\"}\nft_pc_cl1 <- ggplot(ft_pc, aes(x=PC4, y=PC1, col=cl1)) +\n geom_point(alpha=0.7, size=1) +\n scale_color_discrete_sequential(palette=\"Viridis\", rev=FALSE) +\n theme_minimal() +\n theme(aspect.ratio = 1)\n```\n:::\n\n\nFrom @fig-ft-cl-html we can see how cluster 1 is distinct from all of the other observations, albeit with a close connection to the trunk of the tree (cluster 0). The distinction is visible whenever PC4 contributes to the projection, but can be seen clearly with only PC1 and PC4.\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-ft-cl-html layout-ncol=2}\n\n![Tour of most important variables for class 1.](gifs/ft_cl1.gif){#fig-ft-cl1 fig-alt=\"FIX ME\" width=300}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![PC1 and PC4 together reveal cluster 1.](15-forests_files/figure-html/fig-ft-cl1-pc-1.png){#fig-ft-cl1-pc width=100%}\n:::\n:::\n\n\nFocusing on class 1 in the `fake_trees` data. The most important variables were PC1 and PC4. 
A combination of PC1 and PC4 reveals the difference between cluster 1 and all the other clusters.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-ft-cl-pdf layout-ncol=2}\n\n![](images/ft_cl1.png){fig-alt=\"FIX ME\"}\n\n\n![](images/fig-ft-cl1-pc-1.png){fig-alt=\"FIX ME\"} \n\nFocusing on class 1 in the `fake_trees` data. The most important variables were PC1 and PC4. A combination of PC1 and PC4 reveals the difference between cluster 1 and all the other clusters.\n:::\n:::\n\nFor a problem like this, it can be useful to examine several classes together. We've chosen to start with class 8 (light green), because from @fig-ft-votes-html it appears to have less connection with class 0, and closer connection with another class. This is class 6 (medium green). This is a good guess because it has one observation confused with class 8 according to the confusion matrix (printed below). \n\\index{classification!confusion matrix}\n\nWhen we examine these two clusters in association with class 0, we can see that there is a third cluster that is connected with clusters 6 and 8. It turns out to be cluster 1. It's confusing, because the confusion matrix would suggest that all clusters overlap with cluster 0, but not with each other. 
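The link between the two summaries is direct: each observation is predicted as the class with the largest vote, and cross-tabulating these predictions against the true classes gives the confusion matrix. A small sketch with a made-up votes matrix (not the fitted model):

```r
# Hypothetical votes matrix for five observations and three classes
votes <- matrix(c(0.9, 0.1, 0.0,
                  0.2, 0.7, 0.1,
                  0.1, 0.2, 0.7,
                  0.4, 0.5, 0.1,
                  0.8, 0.1, 0.1),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("A", "B", "C")))
truth <- c("A", "B", "C", "A", "A")
# Predicted class = column with the maximum vote for each row
pred <- colnames(votes)[apply(votes, 1, which.max)]
table(observed = truth, predicted = pred)
```

The fourth observation shows why the votes matrix is richer than the confusion matrix: it is misclassified, but the votes (0.4 vs 0.5) reveal that it was a close call.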
\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nft_rf$confusion\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n 0 1 2 3 4 5 6 7 8 9 class.error\n0 263 7 2 3 3 6 5 2 6 3 0.123\n1 14 286 0 0 0 0 0 0 0 0 0.047\n2 8 0 290 0 2 0 0 0 0 0 0.033\n3 5 0 0 289 0 0 0 5 0 1 0.037\n4 13 0 0 0 287 0 0 0 0 0 0.043\n5 11 0 0 0 0 289 0 0 0 0 0.037\n6 13 0 0 0 0 0 286 0 1 0 0.047\n7 6 0 0 4 0 0 0 290 0 0 0.033\n8 7 0 0 0 0 0 0 0 293 0 0.023\n9 6 0 0 0 0 0 0 1 0 293 0.023\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nft_pc <- ft_pc %>%\n mutate(cl8 = factor(case_when(\n branches == \"0\" ~ \"0\",\n branches == \"6\" ~ \"6\",\n branches == \"1\" ~ \"1\",\n branches == \"8\" ~ \"8\",\n .default = \"other\"\n )))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make animated gif\"}\nanimate_xy(ft_pc[,c(\"PC1\", \"PC2\", \"PC4\", \"PC5\", \"PC6\")], col=ft_pc$cl8, palette=\"Viridis\")\nrender_gif(ft_pc[,c(\"PC1\", \"PC2\", \"PC4\", \"PC5\", \"PC6\")],\n grand_tour(),\n display_xy(col=ft_pc$cl8, palette=\"Viridis\"),\n gif_file=\"gifs/ft_cl8.gif\",\n frames=500)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plot\"}\nft_pc_cl8 <- ggplot(ft_pc, aes(x=PC1, y=PC5, col=cl8)) +\n geom_point(alpha=0.7, size=1) +\n scale_color_discrete_sequential(palette=\"Viridis\", rev=FALSE) +\n theme_minimal() +\n theme(aspect.ratio = 1)\n```\n:::\n\n\nFrom the tour in @fig-ft-cl2 we can see that clusters 1, 6, and 8 share one end of the trunk (cluster 0). Cluster 8 is arguably more closely connected with cluster 6 than with cluster 0, though. 
PC1 and PC5 mostly show the distinction between cluster 8 and the rest of the points, but it is clearer if more variables are used.\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-ft-cl2 layout-ncol=2}\n\n![Tour of most important variables for class 8.](gifs/ft_cl8.gif){#fig-ft-cl8 fig-alt=\"FIX ME\" width=300}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![PC1 and PC5 together mostly reveal cluster 8.](15-forests_files/figure-html/fig-ft-cl8-pc-1.png){#fig-ft-cl8-pc width=100%}\n:::\n:::\n\n\nFocusing on class 8 in the `fake_trees` data, relative to nearby clusters 1 and 6. The most important variables for cluster 8 are PC1, PC2, PC5, but to explore in association with clusters 1 and 6, we include PC4 and PC6. A combination of PC1 and PC5 reveals the difference between clusters 8, 6, 1 and 0.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-ft-cl2 layout-ncol=2}\n\n![](images/ft_cl8.png){fig-alt=\"FIX ME\"}\n\n![](images/fig-ft-cl8-pc-1.png){fig-alt=\"FIX ME\"}\n\nFocusing on class 8 in the `fake_trees` data using a tour (left) reveals that it shares an end of cluster 0 with clusters 1 and 6. A combination of PC1 and PC5 reveals that the difference between the observations in class 8 and those in 6, 1 and 0 is largely due to PC5 (right).\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nAlthough the confusion matrix suggests that class clusters are separated except for class 0, focusing on a few classes and using the variable importance to examine smaller subspaces reveals that they are connected in groups of three to class 0.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{Although the confusion matrix suggests that class clusters are separated except for class 0, focusing on a few classes and using the variable importance to examine smaller subspaces reveals that they are connected in groups of three to class 0.}\n:::\n\n## Exercises {-}\n\n1. 
Using a grand tour, compare the boundaries from the random forest model on the `penguins` data to those of (a) a default tree model, (b) an LDA model. Is it less boxy than the tree model, but still more boxy than that of the LDA model?\n2. Tinker with the parameters of the tree model to force it to fit a tree more closely to the data. Compare the boundaries from this with the default tree, and with the forest model. Is it less boxy than the default tree, but more boxy than the forest model?\n3. Fit a random forest model to the `bushfires` data using the `cause` variable as the class. It is a highly imbalanced classification problem. What is the out-of-bag error rate for the forest? Are there some classes that have a lower error rate than others? Examine the 4D votes matrix with a tour, and describe the confusion between classes. This is interesting because it is difficult to accurately classify the fire ignition cause, and only some groups are often confused with each other. You should be able to see this from the reduced 3D representation of the votes matrix. \n4. Fit a forest model to the first 21 PCs of the `sketches` data. Explore the 5D votes matrix. Why does it look star-shaped?\n5. Choose a cluster (or group of clusters) from the `fake_trees` data (2, 3, 4, 5, 7, 9) to explore in detail, as done in @sec-forest-var-imp. Be sure to choose which PCs are the most useful using a tour, and follow up by making a scatterplot showing the best distinction between your chosen cluster and the other observations. 
\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "15-forests_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/15-forests/figure-html/fig-ft-cl1-pc-1.png b/_freeze/15-forests/figure-html/fig-ft-cl1-pc-1.png new file mode 100644 index 0000000..c97a654 Binary files /dev/null and b/_freeze/15-forests/figure-html/fig-ft-cl1-pc-1.png differ diff --git a/_freeze/15-forests/figure-html/fig-ft-cl8-pc-1.png b/_freeze/15-forests/figure-html/fig-ft-cl8-pc-1.png new file mode 100644 index 0000000..64919ce Binary files /dev/null and b/_freeze/15-forests/figure-html/fig-ft-cl8-pc-1.png differ diff --git a/_freeze/15-forests/figure-html/fig-p-bl-bd-tree-1.png b/_freeze/15-forests/figure-html/fig-p-bl-bd-tree-1.png new file mode 100644 index 0000000..100fa84 Binary files /dev/null and b/_freeze/15-forests/figure-html/fig-p-bl-bd-tree-1.png differ diff --git a/_freeze/15-forests/figure-html/fig-p-votes-ggplot-html-1.png b/_freeze/15-forests/figure-html/fig-p-votes-ggplot-html-1.png new file mode 100644 index 0000000..aebc779 Binary files /dev/null and b/_freeze/15-forests/figure-html/fig-p-votes-ggplot-html-1.png differ diff --git a/_freeze/16-svm/execute-results/html.json b/_freeze/16-svm/execute-results/html.json new file mode 100644 index 0000000..3aae964 --- /dev/null +++ b/_freeze/16-svm/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "38cd7644c384e474c417f79971ba1345", + "result": { + "engine": "knitr", + "markdown": "# Support vector machines\n\\index{classification!support vector machines (SVM)}\n\nA support vector machine (SVM) [@Va99] looks for gaps between clusters in the data, based on the extreme observations in each class. In this sense it mirrors the graphical approach described @sec-clust-graphics, in which we searched for gaps between groups. 
It can be viewed as similar to LDA, in that the boundary between classes is a hyperplane. The difference between LDA and SVM is the placement of the boundary. LDA uses the means and covariance matrices of the classes to place the boundary, but SVM uses extreme observations.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nThe key elements of the SVM model to extract are:\n\n- support vectors\n- separating hyperplane.\n\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{\nThe key elements of the SVM model to extract are:\n\\begin{itemize} \\itemsep 0in\n\\item support vectors\n\\item separating hyperplane.\n\\end{itemize}\n}\n:::\n\nSVM is widely used for its ability to fit non-linear classification models in a simple fashion, using kernels in the boundary equation. We focus on linear SVM here because it makes for a useful comparison with the boundaries provided by other models, particularly LDA. SVM tends to place the boundary between groups in a gap, if it exists. This is nice from a visual perspective because when we look at differences between classes using a tour, we naturally focus on the gaps. SVM better fits this perception than LDA. \n\nNon-linear SVM models are also interesting to examine. Mostly one would examine the boundaries between classes, which can be done in the same way as documented in @sec-lda and @sec-trees-forests.\n\n## Components of the SVM model\n\nTo illustrate the approach, we use two simple simulated data examples. Both have only two variables, and two classes. Explaining SVM is easier when there are just two groups. In the first data set the two classes have different covariance matrices, which will cause trouble for LDA, but SVM should see the gap between the two clusters and place the separating hyperplane in the middle of the gap. In the second data set the two groups are concentric circles, with the inner one solid. 
A non-linear SVM should be fitted to this data, which should find the circular gap between the two classes. \n\nNote that the `svm` function in the `e1071` package will automatically scale observations into the range $[0,1]$. To make it easier to examine the fitted model, it is best to scale your data first, and then fit the model.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to simulate data examples\"}\n# Toy examples\nlibrary(mulgar)\nlibrary(ggplot2)\nlibrary(geozoo)\nlibrary(tourr)\n\nset.seed(1071)\nn1 <- 162\nvc1 <- matrix(c(1, -0.7, -0.7, 1), ncol=2, byrow=TRUE)\nc1 <- rmvn(n=n1, p=2, mn=c(-2, -2), vc=vc1)\nvc2 <- matrix(c(1, -0.4, -0.4, 1)*2, ncol=2, byrow=TRUE)\nn2 <- 138\nc2 <- rmvn(n=n2, p=2, mn=c(2, 2), vc=vc2)\ndf1 <- data.frame(x1=mulgar:::scale2(c(c1[,1], c2[,1])), \n x2=mulgar:::scale2(c(c1[,2], c2[,2])), \n cl = factor(c(rep(\"A\", n1), \n rep(\"B\", n2))))\nc1 <- sphere.hollow(p=2, n=n1)$points*3 + \n c(rnorm(n1, sd=0.3), rnorm(n1, sd=0.3))\nc2 <- sphere.solid.random(p=2, n=n2)$points\ndf2 <- data.frame(x1=mulgar:::scale2(c(c1[,1], c2[,1])), \n x2=mulgar:::scale2(c(c1[,2], c2[,2])), \n cl = factor(c(rep(\"A\", n1), \n rep(\"B\", n2))))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(classifly)\nlibrary(e1071)\ndf1_svm <- svm(cl~., data=df1, \n probability=TRUE, \n kernel=\"linear\", \n scale=FALSE)\ndf1_svm_e <- explore(df1_svm, df1)\n\ndf2_svm <- svm(cl~., data=df2, \n probability=TRUE, \n kernel=\"radial\")\ndf2_svm_e <- explore(df2_svm, df2)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plots\"}\nlibrary(patchwork)\nlibrary(colorspace)\ns1 <- ggplot() + \n geom_point(data=df1, aes(x=x1, y=x2, colour=cl),\n shape=20) +\n scale_colour_discrete_divergingx(palette=\"Zissou 1\") +\n geom_point(data=df1_svm_e[(!df1_svm_e$.BOUNDARY)&(df1_svm_e$.TYPE==\"simulated\"),], \n aes(x=x1, y=x2, colour=cl), shape=3) +\n geom_point(data=df1[df1_svm$index,], \n aes(x=x1, y=x2, colour=cl), \n 
shape=4, size=4) +\n theme_minimal() +\n theme(aspect.ratio=1, legend.position = \"none\") +\n ggtitle(\"(a)\")\n\ns2 <- ggplot() + \n geom_point(data=df2, aes(x=x1, y=x2, colour=cl), shape=20) +\n scale_colour_discrete_divergingx(palette=\"Zissou 1\") +\n geom_point(data=df2_svm_e[(!df2_svm_e$.BOUNDARY)&(df2_svm_e$.TYPE==\"simulated\"),], \n aes(x=x1, y=x2, colour=cl), \n shape=3) +\n geom_point(data=df2[df2_svm$index,], \n aes(x=x1, y=x2, colour=cl), \n shape=4, size=4) +\n theme_minimal() +\n theme(aspect.ratio=1, legend.position = \"none\") +\n ggtitle(\"(b)\")\n\ns1+s2\n```\n\n::: {.cell-output-display}\n![SVM classifier fit overlaid on two simulated data examples: (a) groups with different variance-covariance, fitted using a linear kernel, (b) groups with non-linear separation, fitted using a radial kernel. The band of points shown as '+' mark the SVM boundary, and points marked by 'x' are the support vectors used to define the boundary. ](16-svm_files/figure-html/fig-svm-toy-1.png){#fig-svm-toy width=100%}\n:::\n:::\n\n\n@fig-svm-toy shows the two data sets and the important aspects of the fitted SVM model for each. The observations are represented by dots, and the separating hyperplane (just a line in 2D) is represented by '+'. Where the two colours merge is the actual location of the boundary between classes. It can be seen that this is located right down the middle of the gap, for both data sets. Even though the boundary is circular for the second data set, in a transformed high-dimensional space it would be linear.\n\nSVMs use a subset of the observations to define the boundary, and these are called the support vectors. For each of the data sets these are marked with 'x'. For the linear boundary, there are nine support vectors, five in one group and four in the other. There is one interesting observation in the red group, which falls on the other side of the boundary. 
It is marked as a support vector, but its contribution to the fitted hyperplane is limited by a control parameter in the model fitting process. \n\nLinear SVMs can be assessed similarly to regression models. The components of the model are:\n\n1. The points that are the support vectors:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndf1_svm$index\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 15 45 123 135 155 180 202 239 292\n```\n\n\n:::\n:::\n\n\n2. Their coefficients:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndf1_svm$coefs\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\n [1,] 0.3771240\n [2,] 0.1487726\n [3,] 1.0000000\n [4,] 1.0000000\n [5,] 1.0000000\n [6,] -0.5258966\n [7,] -1.0000000\n [8,] -1.0000000\n [9,] -1.0000000\n```\n\n\n:::\n:::\n\n\nwhich indicate that all but 15, 45 and 180 are actually bounded support vectors (their coefficients are bounded to magnitude 1). \n\n3. The intercept:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndf1_svm$rho\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.3520001\n```\n\n\n:::\n:::\n\n\nwhich, together with the support vectors and their coefficients, can be used to compute the equation of the fitted hyperplane. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nw = t(df1_svm$SV) %*% df1_svm$coefs\nw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\nx1 -1.501086\nx2 -1.356237\n```\n\n\n:::\n:::\n\n\nThis gives the equation $-1.5 x_1 - 1.36 x_2 - 0.35 = 0$, or alternatively, $x_2 = -1.11 x_1 - 0.26$, which can be used to generate a line to show the boundary with the data. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ns1 + geom_abline(intercept=df1_svm$rho/w[2],\n slope=-w[1]/w[2])\n```\n:::\n\n\n**Note that** care in scaling the data is important to get the intercept calculated exactly. We have standardised the data, and set the `scale=FALSE` parameter in the `svm` function. 
The slope calculation is quite robust to the data scaling.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nLike LDA, a linear SVM model for two groups can be written using the equation of a hyperplane. The fitted model coefficients are then used to generate points on this plane, to examine the boundary between groups.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Like LDA, a linear SVM model for two groups can be written using the equation of a hyperplane. The fitted model coefficients are then used to generate points on this plane, to examine the boundary between groups.\n}\n:::\n\n## Examining the model components in high-dimensions\n\nFor higher dimensions, the procedures are similar, with the hyperplane and support vectors being examined using a tour. Here we examine the model for differentiating male and female Chinstrap penguins. The Chinstrap penguins have a noticeable difference in size of the sexes, unlike the other two species. Working with a two-class problem is easier for explaining SVM, but multi-class calculations can also follow this approach.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(dplyr)\nload(\"data/penguins_sub.rda\")\nchinstrap <- penguins_sub %>%\n filter(species == \"Chinstrap\") %>%\n select(-species) %>%\n mutate_if(is.numeric, mulgar:::scale2)\nchinstrap_svm <- svm(sex~., data=chinstrap, \n kernel=\"linear\",\n probability=TRUE, \n scale=FALSE)\nchinstrap_svm_e <- explore(chinstrap_svm, chinstrap)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make the tours\"}\n# Tour raw data\nanimate_xy(chinstrap[,1:4], col=chinstrap$sex)\n# Add all SVs, including bounded\nc_pch <- rep(20, nrow(chinstrap))\nc_pch[chinstrap_svm$index] <- 4\nanimate_xy(chinstrap[,1:4], col=chinstrap$sex, pch=c_pch)\n# Only show the SVs with |coefs| < 1\nc_pch <- rep(20, nrow(chinstrap))\nc_pch[chinstrap_svm$index[abs(chinstrap_svm$coefs)<1]] <- 4\nc_cex <- rep(1, 
nrow(chinstrap))\nc_cex[chinstrap_svm$index[abs(chinstrap_svm$coefs)<1]] <- 2\nanimate_xy(chinstrap[,1:4], col=chinstrap$sex, \n pch=c_pch, cex=c_cex)\nrender_gif(chinstrap[,1:4],\n grand_tour(),\n display_xy(col=chinstrap$sex, pch=c_pch, cex=c_cex),\n gif_file=\"gifs/chinstrap_svs.gif\",\n width=400,\n height=400,\n frames=500)\n\n# Tour the separating hyperplane also\nsymbols <- c(3, 20)\nc_pch <- symbols[as.numeric(chinstrap_svm_e$.TYPE[!chinstrap_svm_e$.BOUNDARY])]\nanimate_xy(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4], \n col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY],\n pch=c_pch)\nrender_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],\n grand_tour(),\n display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),\n gif_file=\"gifs/chinstrap_svm.gif\",\n width=400,\n height=400,\n frames=500)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-svm-html layout-ncol=2}\n\n![Support vectors](gifs/chinstrap_svs.gif){#fig-chinstrap-svs fig-alt=\"FIX ME\" width=300}\n\n![SVM boundary](gifs/chinstrap_svm.gif){#fig-chinstrap-svm fig-alt=\"FIX ME\" width=300}\n\nSVM model for distinguishing the sexes of the Chinstrap penguins. The separating hyperplane is 3D, and separates primarily on variables `bl` and `bd`, as seen because these two axes extend out from the plane when it is seen on its side, separating the two groups.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-svm-pdf layout-ncol=2}\n\n![Support vectors](images/chinstrap_svs.png){#fig-chinstrap-svs fig-alt=\"FIX ME\" width=200}\n\n![SVM boundary](images/chinstrap_svm.png){#fig-chinstrap-svm fig-alt=\"FIX ME\" width=200}\n\n\nSVM model for distinguishing the sexes of the Chinstrap penguins. 
The separating hyperplane is 3D, and separates primarily on variables `bl` and `bd`, as seen because these two axes extend out from the plane when it is seen on its side, separating the two groups.\n:::\n:::\n\\index{classification!separating hyperplane}\n\\index{classification!support vectors}\n\n::: {.content-visible when-format=\"html\"}\n::: info\nMark the support vectors by point shape, and examine where these are relative to the difference between groups. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Mark the support vectors by point shape, and examine where these are relative to the difference between groups.\n}\n:::\n\n\nExamining the hyperplane in a grand tour display (@fig-p-svm-html) indicates that two of the variables, `bl` and `bd`, are important for separating the two classes. We can check this interpretation using the radial tour. Using the components from the model, the coefficients of the hyperplane are: \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nt(chinstrap_svm$SV) %*% chinstrap_svm$coefs\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\nbl -0.9102439\nbd -1.1073475\nfl -0.5223364\nbm -0.2846370\n```\n\n\n:::\n:::\n\n\nThe coefficients for `bl` and `bd` are the largest (in magnitude), which supports the interpretation that they are most important. This vector can be used to set the starting point for the radial tour, once it is normalised. Combining it with any other vector, and orthonormalising, turns this into a 2D projection in which to visualise the boundary. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nset.seed(1022)\nprj1 <- mulgar::norm_vec(t(chinstrap_svm$SV) %*%\n chinstrap_svm$coefs)\nprj2 <- basis_random(4, 1)\nprj <- orthonormalise(cbind(prj1, prj2))\nprj\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\nbl -0.5865081 -0.06412875\nbd -0.7135101 0.51192498\nfl -0.3365631 -0.77713899\nbm -0.1834035 -0.36038216\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to conduct the radial tours\"}\nanimate_xy(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4], \n tour_path = radial_tour(start=prj, mvar = 2),\n col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY],\n pch=c_pch)\nrender_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],\n radial_tour(start=prj, mvar = 2),\n display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),\n gif_file=\"gifs/chinstrap_rad_bd.gif\",\n apf = 1/30,\n width=400,\n height=400,\n frames=500)\nrender_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],\n radial_tour(start=prj, mvar = 1),\n display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),\n gif_file=\"gifs/chinstrap_rad_bl.gif\",\n apf = 1/30,\n width=400,\n height=400,\n frames=500)\nrender_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],\n radial_tour(start=prj, mvar = 3),\n display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),\n gif_file=\"gifs/chinstrap_rad_fl.gif\",\n apf = 1/30,\n width=400,\n height=400,\n frames=500)\nrender_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],\n radial_tour(start=prj, mvar = 4),\n display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),\n gif_file=\"gifs/chinstrap_rad_bm.gif\",\n apf = 1/30,\n width=400,\n height=400,\n frames=500)\n```\n:::\n\n\nThis projection is shown in @fig-chinstrap-radial-html. You can see the boundary between the two sexes as a clear line, marked by a sample of points on either side. 
We use the radial tour to remove each of the variables from the projection in turn, to examine its importance to the model, and hence the boundary. If the clear view of the boundary gets jumbled when a variable is removed we infer that this variable is very important for the model (as seen for `bl` and `bd`). If there is little change in the clarity when a variable is removed, then it is less important (as seen for `fl` and `bm`). \n\\index{tour!radial}\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-chinstrap-radial-html layout-ncol=2}\n\n![bl](gifs/chinstrap_rad_bl.gif){#fig-chinstrap-radial-bl}\n\n![bd](gifs/chinstrap_rad_bd.gif){#fig-chinstrap-radial-bd}\n\n![fl](gifs/chinstrap_rad_fl.gif){#fig-chinstrap-radial-fl}\n\n![bm](gifs/chinstrap_rad_bm.gif){#fig-chinstrap-radial-bm}\n\nExploring the importance of the four variables to the separating hyperplane using a radial tour where the contribution of each variable is reduced to 0, and then increased to its original value. You can see that `bl` and `bd` contribute most to the plane, because when they are removed the plane is no longer on its side marking the boundary. Variables `fl` and `bm` contribute a small amount to the separating hyperplane, but it is possible that these two could be removed without affecting the strength of the separation between the sexes. 
\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-chinstrap-radial-pdf layout-ncol=2}\n\n![bl in](images/chinstrap_rad_bl1.png){#fig-chinstrap-radial-bl1 width=200}\n\n![bl reduced](images/chinstrap_rad_bl2.png){#fig-chinstrap-radial-bl2 width=200}\n\n![bd reduced](images/chinstrap_rad_bd.png){#fig-chinstrap-radial-bd width=200}\n\n![bm out](images/chinstrap_rad_bm.png){#fig-chinstrap-radial-bm width=200}\n\n\nExploring the importance of the four variables to the separating hyperplane using a radial tour to reduce the contribution of each variable to 0, and then back to its original value: (a) separating hyperplane visible, (b) `bl` contribution reduced, (c) `bd` contribution decreased, (d) `bm` contribution removed. You can see that `bl` and `bd` contribute most to the plane, because when they are removed the plane is no longer on its side marking the boundary. Variables `fl` (not shown) and `bm` contribute a small amount to the separating hyperplane, but it is possible that these two could be removed with only a small effect on the strength of the separation between the sexes. \n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nUse a radial tour to zero out coefficients defining the separating hyperplane to explore the variable importance. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Use a radial tour to zero out coefficients defining the separating hyperplane to explore the variable importance. \n}\n:::\n\nIn this example, we can see that the clarity of the boundary changes substantially when either `bl` or `bd` is removed. There is a small change when `fl` and `bm` are removed, so they are less important. This matches the interpretation that would be made from the magnitude of the coefficients of the hyperplane (printed earlier). They reinforce each other. 
It is possible that the interpretation of the coefficients could differ after using the radial tour, most likely in terms of simplifying the vector, that is, supporting forcing some coefficients to zero. \n\n::: {.content-visible when-format=\"html\"}\n::: insight\nWhen we use the radial tour to examine how the different variables contribute to the separating hyperplane between the sexes, we learn that `bl` and `bd` are the most important variables. We could (almost) ignore `fl` and `bm` for this classification.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{When we use the radial tour to examine how the different variables contribute to the separating hyperplane between the sexes, we learn that {\\textsf bl} and {\\textsf bd} are the most important variables. We could (almost) ignore {\\textsf fl} and {\\textsf bm} for this classification.}\n:::\n\n\n\n\n## Exercises {-}\n\n1. Generate a small subset from the `bushfire` data: we keep the variables `log_dist_cfa`, `log_dist_road` and `cause`, and we select only observations where `cause` is either lightning or arson. Fit a linear SVM model to separate the two classes and show the decision boundary together with the data. Compare to the boundary obtained by LDA and argue how the two models place the separating hyperplane in different ways.\n2. We extend this into a multivariate setting by also including `amaxt180` and `amaxt720` as predictors. Fit a linear SVM model and calculate the hyperplane to judge which of these variables are important.\n3. Calculate the decision boundary and look at it with a radial tour to see the effect from including individual predictors in a projection. Also explore what happens when rotating out multiple variables together. What can you learn from this?\n4. From the `sketches_train` data select all observations of class banana or boomerang. For this subset, use PCA to find the first 5 PCs. 
Fit two SVM models: one with a linear kernel and one with a radial kernel with the default value of the gamma parameter. Compare the number of misclassified observations in the training data for the two models.\n5. Compute the model predictions and compare the decision boundaries between the linear and radial SVM using a slice tour. Does the shape match what you expect given the respective kernel function?\n6. SVM models are defined for separating two classes, but an ensemble of such models can be used when we want to distinguish more than two classes. Look up the documentation of the `svm` function to learn how this works, then fit an SVM model to separate the three penguin species. In this case we primarily use the model predictions to investigate the decision boundaries; you can use `explore` together with the slice tour to do this. You can use different kernels and compare the resulting decision boundaries.\n\n\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "16-svm_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/16-svm/execute-results/tex.json b/_freeze/16-svm/execute-results/tex.json index 29d901f..26e77f6 100644 --- a/_freeze/16-svm/execute-results/tex.json +++ b/_freeze/16-svm/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "694e27722cb6cf807ff5d2fe760afee8", + "hash": "56fc1abba90993f9d58ee5ecaa549b07", "result": { "engine": "knitr", - "markdown": "# Support vector machines\n\\index{classification!support vector machines (SVM)}\n\nA support vector machine (SVM) [@Va99] looks for gaps between clusters in the data, based on the extreme observations in each class. In this sense it mirrors the graphical approach described @sec-clust-graphics, in which we searched for gaps between groups. It can be viewed as similar to LDA, in that the boundary between classes is a hyperplane. 
The difference between LDA and SVM is the placement of the boundary. LDA uses the means and covariance matrices of the classes to place the boundary, but SVM uses extreme observations.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nThe key elements of the SVM model to extract are:\n\n- support vectors\n- separating hyperplane.\n\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{\nThe key elements of the SVM model to extract are:\n\\begin{itemize} \\itemsep 0in\n\\item support vectors\n\\item separating hyperplane.\n\\end{itemize}\n}\n:::\n\nSVM is widely used for it's ability to fit non-linear classification models in a simple fashion using kernels in the boundary equation. We are focusing on linear methods here because it makes for a useful comparison with how the models differ from those provided by SVM. SVM tends to place the boundary between groups in a gap, if it exists. This is nice from a visual perspective because when we look at differences between classes using a tour, we naturally focus on the gaps. SVM better fits this perception than LDA. \n\nNon-linear SVM models are interesting to examine also. Mostly one would examine the boundaries between classes which can be done in the same way that is documented in the @sec-lda and @sec-trees-forests.\n\n## Components of the SVM model\n\nTo illustrate the approach, we use two simple simulated data examples. Both have only two variables, and two classes. Explaining SVM is easier when there are just two groups. In the first data set the two classes have different covariances matrices, which will cause trouble for LDA, but SVM should see the gap between the two clusters and place the separating hyperplane in the middle of the gap. In the second data set the two groups are concentric circles, with the inner one solid. A non-linear SVM should be fitted to this data, which should see circular gap between the two classes. 
\n\nNote that the `svm` function in the `e1071` package will automatically scale observations into the range $[0,1]$. To make it easier to examine the fitted model, it is best to scale your data first, and then fit the model.\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(classifly)\nlibrary(e1071)\ndf1_svm <- svm(cl~., data=df1, \n probability=TRUE, \n kernel=\"linear\", \n scale=FALSE)\ndf1_svm_e <- explore(df1_svm, df1)\n\ndf2_svm <- svm(cl~., data=df2, \n probability=TRUE, \n kernel=\"radial\")\ndf2_svm_e <- explore(df2_svm, df2)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![SVM classifier fit overlaid on two simulated data examples: (a) groups with different variance-covariance, fitted using a linear kernel, (b) groups with non-linear separation, fitted using a radial kernel. The band of points shown as '+' mark the SVM boundary, and points marked by 'x' are the support vectors used to define the boundary. ](16-svm_files/figure-pdf/fig-svm-toy-1.pdf){#fig-svm-toy width=100%}\n:::\n:::\n\n\n\n@fig-svm-toy shows the two data sets and the important aspects of the fitted SVM model for each. The observations are represented by dots, the separating hyperplane (just a line for 2D) is represented by '+'. Where the two colours merge is the actual location of the boundary between classes. It can be seen that this is located right down the middle of the gap, for both data sets. Even though the boundary is circular for the second data set, in a transformed high-dimensional space it would be linear.\n\nSVMs use a subset of the observations to define the boundary, and these are called the support vectors. For each of the data sets these are marked with 'x'. For the linear boundary, there are nine support vectors, five in one group and four in the other. There is one interesting observation in the red group, which falls on the other side of the boundary. 
It is marked as a support vector, but its contribution to the fitted hyperplane is limited by a control parameter in the model fitting process. \n\nLinear SVMs can be assessed similarly to regression models. The components of the model are:\n\n1. The points that are the support vectors:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndf1_svm$index\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 15 45 123 135 155 180 202 239 292\n```\n\n\n:::\n:::\n\n\n\n2. Their coefficients:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndf1_svm$coefs\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\n [1,] 0.3771240\n [2,] 0.1487726\n [3,] 1.0000000\n [4,] 1.0000000\n [5,] 1.0000000\n [6,] -0.5258966\n [7,] -1.0000000\n [8,] -1.0000000\n [9,] -1.0000000\n```\n\n\n:::\n:::\n\n\n\nwhich indicate that all but 15, 45 and 180 are actually bounded support vectors (their coefficients are bounded to magnitude 1). \n\n3. that when used with the intercept:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndf1_svm$rho\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.3520001\n```\n\n\n:::\n:::\n\n\n\ncan be used to compute the equation of the fitted hyperplane. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nw = t(df1_svm$SV) %*% df1_svm$coefs\nw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\nx1 -1.501086\nx2 -1.356237\n```\n\n\n:::\n:::\n\n\n\nGiving the equation to be -1.5 $x_1 +$ -1.36 $x_2 +$ -0.35 $=0$, or alternatively, $x_2 =$ -1.11 $x_1 +$ -0.26.\n\nwhich can be used to generate a line to show the boundary with the data. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ns1 + geom_abline(intercept=df1_svm$rho/w[2],\n slope=-w[1]/w[2])\n```\n:::\n\n\n\n**Note that** care in scaling of data is important to get the intercept calculated exactly. We have standardised the data, and set the `scale=FALSE` parameter in the `svm` function. 
The slope calculation is quite robust to the data scaling.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nLike LDA, a linear SVM model for two groups can be written using the equation of a hyperplane. The fitted model coefficients are then used to generate points on this plane, to examine the boundary between groups.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Like LDA, a linear SVM model for two groups can be written using the equation of a hyperplane. The fitted model coefficients are then used to generate points on this plane, to examine the boundary between groups.\n}\n:::\n\n## Examining the model components in high-dimensions\n\nFor higher dimensions, the procedures are similar, with the hyperplane and support vectors being examined using a tour. Here we examine the model for differentiating male and female Chinstrap penguins. The Chinstrap penguins have a noticeable difference in size of the sexes, unlike the other two species. Working with a two-class problem is easier for explaining SVM, but multi-class calculations can also follow this approach.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(dplyr)\nload(\"data/penguins_sub.rda\")\nchinstrap <- penguins_sub %>%\n filter(species == \"Chinstrap\") %>%\n select(-species) %>%\n mutate_if(is.numeric, mulgar:::scale2)\nchinstrap_svm <- svm(sex~., data=chinstrap, \n kernel=\"linear\",\n probability=TRUE, \n scale=FALSE)\nchinstrap_svm_e <- explore(chinstrap_svm, chinstrap)\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-svm-html layout-ncol=2}\n\n![Support vectors](gifs/chinstrap_svs.gif){#fig-chinstrap-svs fig-alt=\"FIX ME\" width=300}\n\n![SVM boundary](gifs/chinstrap_svm.gif){#fig-chinstrap-svm fig-alt=\"FIX ME\" width=300}\n\nSVM model for distinguishing the sexes of the Chinstrap penguins. 
The separating hyperplane is 3D, and separates primarily on variables `bl` and `bd`, as seen because these two axes extend out from the plane when it is seen on its side, separating the two groups.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-svm-pdf layout-ncol=2}\n\n![Support vectors](images/chinstrap_svs.png){#fig-chinstrap-svs fig-alt=\"FIX ME\" width=200}\n\n![SVM boundary](images/chinstrap_svm.png){#fig-chinstrap-svm fig-alt=\"FIX ME\" width=200}\n\n\nSVM model for distinguishing the sexes of the Chinstrap penguins. The separating hyperplane is 3D, and separates primarily on variables `bl` and `bd`, as seen because these two axes extend out from the plane when it is seen on its side, separating the two groups.\n:::\n:::\n\\index{classification!separating hyperplane}\n\\index{classification!support vectors}\n\n::: {.content-visible when-format=\"html\"}\n::: info\nMark the support vectors by point shape, and examine where these are relative to the difference between groups. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Mark the support vectors by point shape, and examine where these are relative to the difference between groups.\n}\n:::\n\n\nExamining the hyperplane in a grand tour display (@fig-p-svm-pdf) indicates that two of the variables, `bl` and `bd`, are important for separating the two classes. We can check this interpretation using the radial tour. Using the components from the model, the coefficients of the hyperplane are: \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nt(chinstrap_svm$SV) %*% chinstrap_svm$coefs\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\nbl -0.9102439\nbd -1.1073475\nfl -0.5223364\nbm -0.2846370\n```\n\n\n:::\n:::\n\n\n\nThe coefficients for `bl` and `bd` are the largest (in magnitude) which supports the the interpretation that they are most important. This vector can be used to set the starting point for radial tour, once it is normalised. 
Any orthonormal vector serves to turn this into a 2D projection, to visualise the boundary. \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nset.seed(1022)\nprj1 <- mulgar::norm_vec(t(chinstrap_svm$SV) %*%\n chinstrap_svm$coefs)\nprj2 <- basis_random(4, 1)\nprj <- orthonormalise(cbind(prj1, prj2))\nprj\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\nbl -0.5865081 -0.06412875\nbd -0.7135101 0.51192498\nfl -0.3365631 -0.77713899\nbm -0.1834035 -0.36038216\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\nThis projection is show in @fig-chinstrap-radial-pdf. You can see the boundary between the two sexes as a clear line, marked by a sample of points on either side. We use the radial tour to remove each of the variables from the projection using the radial tour to examine it's importance on the model, and hence the boundary. If the clear view of the boundary gets jumbled when a variable is removed we infer that this variable is very important for the model (as seen for `bl` and `bd`). If there is little change in the clarity when a variable is removed, then it is less important (as seen for `fl` and `bm`). \n\\index{tour!radial}\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-chinstrap-radial-html layout-ncol=2}\n\n![bl](gifs/chinstrap_rad_bl.gif){#fig-chinstrap-radial-bl}\n\n![bd](gifs/chinstrap_rad_bd.gif){#fig-chinstrap-radial-bd}\n\n![fl](gifs/chinstrap_rad_fl.gif){#fig-chinstrap-radial-fl}\n\n![bm](gifs/chinstrap_rad_bm.gif){#fig-chinstrap-radial-bm}\n\nExploring the importance of the four variables to the separating hyperplane using a radial tour where the contribution of each variable is reduced to 0, and then increased to it's original value. You can see that `bl` and `bd` contribute most to the plane, because when they are removed the plane is no longer on it side marking the boundary. 
Variables `fl` and `bm` contribute a small amount to the separating hyperplane, but it is possible that these two could be removed without affecting the strength of the separation between the sexes. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-chinstrap-radial-pdf layout-ncol=2}\n\n![bl in](images/chinstrap_rad_bl1.png){#fig-chinstrap-radial-bl1 width=200}\n\n![bl reduced](images/chinstrap_rad_bl2.png){#fig-chinstrap-radial-bl2 width=200}\n\n![bd reduced](images/chinstrap_rad_bd.png){#fig-chinstrap-radial-bd width=200}\n\n![bm out](images/chinstrap_rad_bm.png){#fig-chinstrap-radial-bm width=200}\n\n\nExploring the importance of the four variables to the separating hyperplane using a radial tour to reduce the contribution of each variable to 0, and then back to its original value: (a) separating hyperplane visible, (b) `bl` contribution reduced, (c) `bd` contribution decreased, (d) `bm` contribution removed. You can see that `bl` and `bd` contribute most to the plane, because when they are removed the plane is no longer on its side marking the boundary. Variables `fl` (not shown) and `bm` contribute a small amount to the separating hyperplane, but it is possible that these two could be removed with only a small effect on the strength of the separation between the sexes. \n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nUse a radial tour to zero out coefficients defining the separating hyperplane to explore the variable importance. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Use a radial tour to zero out coefficients defining the separating hyperplane to explore the variable importance. \n}\n:::\n\nIn this example, we can see that the clarity of the boundary changes substantially when either `bl` or `bd` is removed. There is only a small change when `fl` or `bm` is removed, so they are less important. 
This matches the interpretation that would be made from the magnitude of the coefficients of the hyperplane (printed earlier); the two approaches reinforce each other. It is possible that the interpretation of the coefficients could differ after using the radial tour, most likely in terms of simplifying the vector, by forcing some coefficients to zero. \n\n::: {.content-visible when-format=\"html\"}\n::: insight\nWhen we use the radial tour to examine how the different variables contribute to the separating hyperplane between the sexes, we learn that `bl` and `bd` are the most important variables. We could (almost) ignore `fl` and `bm` for this classification.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{When we use the radial tour to examine how the different variables contribute to the separating hyperplane between the sexes, we learn that {\\textsf bl} and {\\textsf bd} are the most important variables. We could (almost) ignore {\\textsf fl} and {\\textsf bm} for this classification.}\n:::\n\n\n\n\n## Exercises {-}\n\n1. Generate a small subset from the `bushfire` data: keep the variables `log_dist_cfa`, `log_dist_road` and `cause`, and select only observations where `cause` is either lightning or arson. Fit a linear SVM model to separate the two classes and show the decision boundary together with the data. Compare to the boundary obtained by LDA and argue how the two models place the separating hyperplane in different ways.\n2. We extend this into a multivariate setting by also including `amaxt180` and `amaxt720` as predictors. Fit a linear SVM model and calculate the hyperplane to judge which of these variables are important.\n3. Calculate the decision boundary and look at it with a radial tour to see the effect of including individual predictors in a projection. Also explore what happens when rotating out multiple variables together. What can you learn from this?\n4. 
From the `sketches_train` data select all observations of class banana or boomerang. For this subset use PCA to find the first 5 PCs. Fit two SVM models: one with a linear kernel and one with a radial kernel with the default value of the gamma parameter. Compare the number of misclassified observations in the training data for the two models.\n5. Compute the model predictions and compare the decision boundaries between the linear and radial SVM using a slice tour. Does the shape match what you expect given the respective kernel function?\n6. SVM models are defined for separating two classes, but an ensemble of such models can be used when we want to distinguish more than two classes. Look up the documentation of the `svm` function to learn how this works, then fit an SVM model to separate the three penguin species. In this case we primarily use the model predictions to investigate the decision boundaries; you can use `explore` together with the slice tour to do this. You can use different kernels and compare the resulting decision boundaries.\n\n\n\n\n::: {.cell}\n\n:::\n", + "markdown": "# Support vector machines\n\\index{classification!support vector machines (SVM)}\n\nA support vector machine (SVM) [@Va99] looks for gaps between clusters in the data, based on the extreme observations in each class. In this sense it mirrors the graphical approach described in @sec-clust-graphics, in which we searched for gaps between groups. It can be viewed as similar to LDA, in that the boundary between classes is a hyperplane. The difference between LDA and SVM is the placement of the boundary. 
LDA uses the means and covariance matrices of the classes to place the boundary, but SVM uses extreme observations.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nThe key elements of the SVM model to extract are:\n\n- support vectors\n- separating hyperplane.\n\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{\nThe key elements of the SVM model to extract are:\n\\begin{itemize} \\itemsep 0in\n\\item support vectors\n\\item separating hyperplane.\n\\end{itemize}\n}\n:::\n\nSVM is widely used for its ability to fit non-linear classification models in a simple fashion using kernels in the boundary equation. We focus on linear methods here because they make for a useful comparison with LDA, showing how the two models place the boundary differently. SVM tends to place the boundary between groups in a gap, if it exists. This is nice from a visual perspective because when we look at differences between classes using a tour, we naturally focus on the gaps. SVM better fits this perception than LDA. \n\nNon-linear SVM models are also interesting to examine. Mostly one would examine the boundaries between classes, which can be done in the same way as documented in @sec-lda and @sec-trees-forests.\n\n## Components of the SVM model\n\nTo illustrate the approach, we use two simple simulated data examples. Both have only two variables, and two classes. Explaining SVM is easier when there are just two groups. In the first data set the two classes have different covariance matrices, which will cause trouble for LDA, but SVM should see the gap between the two clusters and place the separating hyperplane in the middle of the gap. In the second data set the two groups are concentric circles, with the inner one solid. A non-linear SVM should be fitted to this data, which should see the circular gap between the two classes. \n\nNote that the `svm()` function in the `e1071` package will automatically scale observations into the range $[0,1]$. 
To make it easier to examine the fitted model, it is best to scale your data first, and then fit the model.\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(classifly)\nlibrary(e1071)\ndf1_svm <- svm(cl~., data=df1, \n probability=TRUE, \n kernel=\"linear\", \n scale=FALSE)\ndf1_svm_e <- explore(df1_svm, df1)\n\ndf2_svm <- svm(cl~., data=df2, \n probability=TRUE, \n kernel=\"radial\")\ndf2_svm_e <- explore(df2_svm, df2)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![SVM classifier fit overlaid on two simulated data examples: (a) groups with different variance-covariance, fitted using a linear kernel, (b) groups with non-linear separation, fitted using a radial kernel. The band of points shown as '+' mark the SVM boundary, and points marked by 'x' are the support vectors used to define the boundary. ](16-svm_files/figure-pdf/fig-svm-toy-1.pdf){#fig-svm-toy width=100%}\n:::\n:::\n\n\n\n@fig-svm-toy shows the two data sets and the important aspects of the fitted SVM model for each. The observations are represented by dots, the separating hyperplane (just a line for 2D) is represented by '+'. Where the two colours merge is the actual location of the boundary between classes. It can be seen that this is located right down the middle of the gap, for both data sets. Even though the boundary is circular for the second data set, in a transformed high-dimensional space it would be linear.\n\nSVMs use a subset of the observations to define the boundary, and these are called the support vectors. For each of the data sets these are marked with 'x'. For the linear boundary, there are nine support vectors, five in one group and four in the other. There is one interesting observation in the red group, which falls on the other side of the boundary. It is marked as a support vector, but its contribution to the fitted hyperplane is limited by a control parameter in the model fitting process. 
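The way these components combine into a boundary can be sketched in base R. The support vectors, coefficients and intercept below are hypothetical stand-ins for the values stored in a fitted `svm()` object (`$SV`, `$coefs`, `$rho`), chosen only to illustrate the arithmetic, not taken from the `df1_svm` model.

```r
# Hypothetical components of a fitted linear SVM (stand-ins for
# df1_svm$SV, df1_svm$coefs and df1_svm$rho)
sv    <- matrix(c(1,  0.5,
                 -1, -0.5), ncol = 2, byrow = TRUE)
coefs <- c(1, -1)   # one coefficient per support vector
rho   <- 0.25       # intercept term

# Hyperplane normal: w = t(SV) %*% coefs
w <- drop(t(sv) %*% coefs)

# Decision function f(x) = w . x - rho; its sign gives the class side
svm_side <- function(x) sign(sum(w * x) - rho)

# In 2D the boundary f(x) = 0 is the line x2 = -(w1/w2) x1 + rho/w2
slope     <- -w[1] / w[2]
intercept <- rho / w[2]
```

The slope and intercept computed this way are exactly what is passed to `geom_abline()` to draw the boundary over the data.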
\n\nLinear SVMs can be assessed similarly to regression models. The components of the model are:\n\n1. The points that are the support vectors:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndf1_svm$index\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 15 45 123 135 155 180 202 239 292\n```\n\n\n:::\n:::\n\n\n\n2. Their coefficients:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndf1_svm$coefs\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\n [1,] 0.3771240\n [2,] 0.1487726\n [3,] 1.0000000\n [4,] 1.0000000\n [5,] 1.0000000\n [6,] -0.5258966\n [7,] -1.0000000\n [8,] -1.0000000\n [9,] -1.0000000\n```\n\n\n:::\n:::\n\n\n\nwhich indicate that all but 15, 45 and 180 are actually bounded support vectors (their coefficients are bounded to magnitude 1). \n\n3. The intercept:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndf1_svm$rho\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.3520001\n```\n\n\n:::\n:::\n\n\n\nwhich, together with the support vectors and coefficients, can be used to compute the equation of the fitted hyperplane. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nw = t(df1_svm$SV) %*% df1_svm$coefs\nw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\nx1 -1.501086\nx2 -1.356237\n```\n\n\n:::\n:::\n\n\n\nThis gives the equation -1.5 $x_1 +$ -1.36 $x_2 +$ -0.35 $=0$, or alternatively, $x_2 =$ -1.11 $x_1 +$ -0.26, which can be used to generate a line to show the boundary with the data. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ns1 + geom_abline(intercept=df1_svm$rho/w[2],\n slope=-w[1]/w[2])\n```\n:::\n\n\n\n**Note that** care in scaling the data is important to get the intercept calculated exactly. We have standardised the data, and set the `scale=FALSE` parameter in the `svm()` function. 
The slope calculation is quite robust to the data scaling.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nLike LDA, a linear SVM model for two groups can be written using the equation of a hyperplane. The fitted model coefficients are then used to generate points on this plane, to examine the boundary between groups.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Like LDA, a linear SVM model for two groups can be written using the equation of a hyperplane. The fitted model coefficients are then used to generate points on this plane, to examine the boundary between groups.\n}\n:::\n\n## Examining the model components in high-dimensions\n\nFor higher dimensions, the procedures are similar, with the hyperplane and support vectors being examined using a tour. Here we examine the model for differentiating male and female Chinstrap penguins. The Chinstrap penguins have a noticeable difference in size of the sexes, unlike the other two species. Working with a two-class problem is easier for explaining SVM, but multi-class calculations can also follow this approach.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(dplyr)\nload(\"data/penguins_sub.rda\")\nchinstrap <- penguins_sub %>%\n filter(species == \"Chinstrap\") %>%\n select(-species) %>%\n mutate_if(is.numeric, mulgar:::scale2)\nchinstrap_svm <- svm(sex~., data=chinstrap, \n kernel=\"linear\",\n probability=TRUE, \n scale=FALSE)\nchinstrap_svm_e <- explore(chinstrap_svm, chinstrap)\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-svm-html layout-ncol=2}\n\n![Support vectors](gifs/chinstrap_svs.gif){#fig-chinstrap-svs fig-alt=\"FIX ME\" width=300}\n\n![SVM boundary](gifs/chinstrap_svm.gif){#fig-chinstrap-svm fig-alt=\"FIX ME\" width=300}\n\nSVM model for distinguishing the sexes of the Chinstrap penguins. 
The separating hyperplane is 3D, and separates primarily on variables `bl` and `bd`, as seen because these two axes extend out from the plane when it is seen on its side, separating the two groups.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-svm-pdf layout-ncol=2}\n\n![Support vectors](images/chinstrap_svs.png){#fig-chinstrap-svs fig-alt=\"FIX ME\" width=200}\n\n![SVM boundary](images/chinstrap_svm.png){#fig-chinstrap-svm fig-alt=\"FIX ME\" width=200}\n\n\nSVM model for distinguishing the sexes of the Chinstrap penguins. The separating hyperplane is 3D, and separates primarily on variables `bl` and `bd`, as seen because these two axes extend out from the plane when it is seen on its side, separating the two groups.\n:::\n:::\n\\index{classification!separating hyperplane}\n\\index{classification!support vectors}\n\n::: {.content-visible when-format=\"html\"}\n::: info\nMark the support vectors by point shape, and examine where these are relative to the difference between groups. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Mark the support vectors by point shape, and examine where these are relative to the difference between groups.\n}\n:::\n\n\nExamining the hyperplane in a grand tour display (@fig-p-svm-pdf) indicates that two of the variables, `bl` and `bd`, are important for separating the two classes. We can check this interpretation using the radial tour. Using the components from the model, the coefficients of the hyperplane are: \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nt(chinstrap_svm$SV) %*% chinstrap_svm$coefs\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\nbl -0.9102439\nbd -1.1073475\nfl -0.5223364\nbm -0.2846370\n```\n\n\n:::\n:::\n\n\n\nThe coefficients for `bl` and `bd` are the largest (in magnitude), which supports the interpretation that they are most important. This vector can be used to set the starting point for a radial tour, once it is normalised. 
Any vector orthonormal to it serves to turn this into a 2D projection, to visualise the boundary. \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nset.seed(1022)\nprj1 <- mulgar::norm_vec(t(chinstrap_svm$SV) %*%\n chinstrap_svm$coefs)\nprj2 <- basis_random(4, 1)\nprj <- orthonormalise(cbind(prj1, prj2))\nprj\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\nbl -0.5865081 -0.06412875\nbd -0.7135101 0.51192498\nfl -0.3365631 -0.77713899\nbm -0.1834035 -0.36038216\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\nThis projection is shown in @fig-chinstrap-radial-pdf. You can see the boundary between the two sexes as a clear line, marked by a sample of points on either side. We use the radial tour to remove each of the variables from the projection in turn, to examine its importance to the model, and hence the boundary. If the clear view of the boundary gets jumbled when a variable is removed, we infer that this variable is very important for the model (as seen for `bl` and `bd`). If there is little change in the clarity when a variable is removed, then it is less important (as seen for `fl` and `bm`). \n\\index{tour!radial}\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-chinstrap-radial-html layout-ncol=2}\n\n![bl](gifs/chinstrap_rad_bl.gif){#fig-chinstrap-radial-bl}\n\n![bd](gifs/chinstrap_rad_bd.gif){#fig-chinstrap-radial-bd}\n\n![fl](gifs/chinstrap_rad_fl.gif){#fig-chinstrap-radial-fl}\n\n![bm](gifs/chinstrap_rad_bm.gif){#fig-chinstrap-radial-bm}\n\nExploring the importance of the four variables to the separating hyperplane using a radial tour where the contribution of each variable is reduced to 0, and then increased to its original value. You can see that `bl` and `bd` contribute most to the plane, because when they are removed the plane is no longer on its side marking the boundary. 
Variables `fl` and `bm` contribute a small amount to the separating hyperplane, but it is possible that these two could be removed without affecting the strength of the separation between the sexes. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-chinstrap-radial-pdf layout-ncol=2}\n\n![bl in](images/chinstrap_rad_bl1.png){#fig-chinstrap-radial-bl1 width=200}\n\n![bl reduced](images/chinstrap_rad_bl2.png){#fig-chinstrap-radial-bl2 width=200}\n\n![bd reduced](images/chinstrap_rad_bd.png){#fig-chinstrap-radial-bd width=200}\n\n![bm out](images/chinstrap_rad_bm.png){#fig-chinstrap-radial-bm width=200}\n\n\nExploring the importance of the four variables to the separating hyperplane using a radial tour to reduce the contribution of each variable to 0, and then back to its original value: (a) separating hyperplane visible, (b) `bl` contribution reduced, (c) `bd` contribution decreased, (d) `bm` contribution removed. You can see that `bl` and `bd` contribute most to the plane, because when they are removed the plane is no longer on its side marking the boundary. Variables `fl` (not shown) and `bm` contribute a small amount to the separating hyperplane, but it is possible that these two could be removed with only a small effect on the strength of the separation between the sexes. \n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nUse a radial tour to zero out coefficients defining the separating hyperplane to explore the variable importance. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Use a radial tour to zero out coefficients defining the separating hyperplane to explore the variable importance. \n}\n:::\n\nIn this example, we can see that the clarity of the boundary changes substantially when either `bl` or `bd` is removed. There is only a small change when `fl` or `bm` is removed, so they are less important. 
This matches the interpretation that would be made from the magnitude of the coefficients of the hyperplane (printed earlier); the two approaches reinforce each other. It is possible that the interpretation of the coefficients could differ after using the radial tour, most likely in terms of simplifying the vector, by forcing some coefficients to zero. \n\n::: {.content-visible when-format=\"html\"}\n::: insight\nWhen we use the radial tour to examine how the different variables contribute to the separating hyperplane between the sexes, we learn that `bl` and `bd` are the most important variables. We could (almost) ignore `fl` and `bm` for this classification.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{When we use the radial tour to examine how the different variables contribute to the separating hyperplane between the sexes, we learn that {\\textsf bl} and {\\textsf bd} are the most important variables. We could (almost) ignore {\\textsf fl} and {\\textsf bm} for this classification.}\n:::\n\n\n\n\n## Exercises {-}\n\n1. Generate a small subset from the `bushfire` data: keep the variables `log_dist_cfa`, `log_dist_road` and `cause`, and select only observations where `cause` is either lightning or arson. Fit a linear SVM model to separate the two classes and show the decision boundary together with the data. Compare to the boundary obtained by LDA and argue how the two models place the separating hyperplane in different ways.\n2. We extend this into a multivariate setting by also including `amaxt180` and `amaxt720` as predictors. Fit a linear SVM model and calculate the hyperplane to judge which of these variables are important.\n3. Calculate the decision boundary and look at it with a radial tour to see the effect of including individual predictors in a projection. Also explore what happens when rotating out multiple variables together. What can you learn from this?\n4. 
From the `sketches_train` data select all observations of class banana or boomerang. For this subset use PCA to find the first 5 PCs. Fit two SVM models: one with a linear kernel and one with a radial kernel with the default value of the gamma parameter. Compare the number of misclassified observations in the training data for the two models.\n5. Compute the model predictions and compare the decision boundaries between the linear and radial SVM using a slice tour. Does the shape match what you expect given the respective kernel function?\n6. SVM models are defined for separating two classes, but an ensemble of such models can be used when we want to distinguish more than two classes. Look up the documentation of the `svm` function to learn how this works, then fit an SVM model to separate the three penguin species. In this case we primarily use the model predictions to investigate the decision boundaries; you can use `explore` together with the slice tour to do this. You can use different kernels and compare the resulting decision boundaries.\n\n\n\n\n::: {.cell}\n\n:::\n", "supporting": [ "16-svm_files/figure-pdf" ], diff --git a/_freeze/16-svm/figure-html/fig-svm-toy-1.png b/_freeze/16-svm/figure-html/fig-svm-toy-1.png new file mode 100644 index 0000000..5380691 Binary files /dev/null and b/_freeze/16-svm/figure-html/fig-svm-toy-1.png differ diff --git a/_freeze/16-svm/figure-pdf/fig-svm-toy-1.pdf b/_freeze/16-svm/figure-pdf/fig-svm-toy-1.pdf index 38438c4..4825e8b 100644 Binary files a/_freeze/16-svm/figure-pdf/fig-svm-toy-1.pdf and b/_freeze/16-svm/figure-pdf/fig-svm-toy-1.pdf differ diff --git a/_freeze/17-nn/execute-results/html.json b/_freeze/17-nn/execute-results/html.json new file mode 100644 index 0000000..134e05e --- /dev/null +++ b/_freeze/17-nn/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "3c79cb7ae6fbe9d39730143a508022ad", + "result": { + "engine": "knitr", + "markdown": "# Neural networks and deep learning\n\\index{classification!neural 
networks}\n\nNeural networks (NN) can be considered to be nested additive (or even ensemble) models where explanatory variables are combined, and transformed through an activation function like a logistic. These transformed combinations are added recursively to yield class predictions. They are considered to be black box models, but there is a growing demand for interpretability. Although interpretability is possible, it can be daunting to understand a complex model constructed to tackle a difficult classification task. Nevertheless, this is the motivation for explaining how to visualise NN models in this chapter. \n\nIn the simplest form, we might write the equation for a NN as\n\n$$\n\\hat{y} = f(x) = a_0+\\sum_{h=1}^{s}\nw_{0h}\\phi(a_h+\\sum_{i=1}^{p} w_{ih}x_i)\n$$\nwhere $s$ indicates the number of nodes in the hidden (middle) layer, and $\\phi$ is a choice of activation function. In a simple situation where $p=3$, $s=2$, and the output layer is linear, the model could be written as:\n\n$$\n\\begin{aligned}\n\\hat{y} = a_0+ & w_{01}\\phi(a_1+w_{11}x_1+w_{21}x_2+w_{31}x_3) +\\\\\n & w_{02}\\phi(a_2+w_{12}x_1+w_{22}x_2+w_{32}x_3)\n\\end{aligned}\n$$\nwhich is a combination of two (linear) models, each of which could be examined for their role in making predictions. \n\nIn practice, a model may have many nodes and several hidden layers, a variety of activation functions, and regularisation modifications. One should keep in mind that the principle of parsimony is important when applying NNs, because it is tempting to make an overly complex, and thus over-parameterised, construction. Fitting NNs is still problematic. One would hope that fitting produces a stable result, with the same parameter estimates returned whatever the starting seed. However, this is not the case, and different, sometimes radically different, results are routinely obtained after each attempted fit [@wickham2015]. 
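To make the formula concrete, here is a minimal base-R sketch of this two-node model with a logistic activation. All weight values below are hypothetical, chosen only so the function runs; they are not from any fitted model.

```r
# Logistic activation, one common choice for phi
phi <- function(z) 1 / (1 + exp(-z))

# Forward pass of the p = 3, s = 2 model written out above
nn_predict <- function(x, W, a, w0, a0) {
  # W: p x s matrix of input-to-hidden weights w_ih
  # a: length-s hidden biases a_h
  # w0: hidden-to-output weights w_0h; a0: output bias
  hidden <- phi(a + drop(t(W) %*% x))
  a0 + sum(w0 * hidden)
}

# Hypothetical parameter values, for illustration only
W  <- matrix(c(0.5, -0.2, 0.1,
               0.3,  0.4, -0.6), nrow = 3)
a  <- c(0.1, -0.1)
w0 <- c(1.5, -2.0)
a0 <- 0.2
nn_predict(c(1, 0, -1), W, a, w0, a0)  # about -0.246
```

Each column of `W` together with its bias is one of the two "(linear) models" referred to above; the output is just their weighted sum after the activation.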
\n\nFor these examples we use the software `keras` [@keras] following the installation and tutorial details at [https://tensorflow.rstudio.com/tutorials/](https://tensorflow.rstudio.com/tutorials/). Because it is an interface to Python it can be tricky to install. If this is a problem, it should be possible to convert the example code to use `nnet` [@VR02] or `neuralnet` [@neuralnet]. We will use the penguins data to illustrate the fitting, because it makes it easier to understand the procedures and the fit. However, a NN is like using a jackhammer instead of a trowel to plant a seedling, more complicated than necessary to build a good classification model for this data.\n\n## Setting up the model \n\\index{classification!ANN architecture}\n\nA first step is to decide how many nodes the NN architecture should have, and what activation function should be used. To make these decisions, ideally you already have some knowledge of the shapes of class clusters. For the penguins classification, we have seen that it contains three elliptically shaped clusters of roughly the same size. This suggests two nodes in the hidden layer would be sufficient to separate three clusters (@fig-nn-diagram). Because the shapes of the clusters are convex, using the rectified linear activation (\"relu\") will also be sufficient. The model specification is as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(keras)\ntensorflow::set_random_seed(211)\n\n# Define model\np_nn_model <- keras_model_sequential()\np_nn_model %>% \n layer_dense(units = 2, activation = 'relu', \n input_shape = 4) %>% \n layer_dense(units = 3, activation = 'softmax')\np_nn_model %>% summary\n\nloss_fn <- loss_sparse_categorical_crossentropy(\n from_logits = TRUE)\n\np_nn_model %>% compile(\n optimizer = \"adam\",\n loss = loss_fn,\n metrics = c('accuracy')\n)\n```\n:::\n\n\nNote that `tensorflow::set_random_seed(211)` sets the seed for the model fitting so that we can obtain the same result to discuss later. 
It needs to be set before the model is defined in the code. The model will also be saved in order to diagnose and make predictions. \n\n![Network architecture for the model on the penguins data. The round nodes indicate original or transformed variables, and each arrow connecting these is represented as one of the weights $w_{ih}$ in the definition. The boxes indicate the additive constant entering the nodes, and the corresponding arrows represent the terms $a_h$. ](images/nn-diagram.png){#fig-nn-diagram align=\"center\"}\n\n\n::: {.cell}\n\n:::\n\n\n## Checking the training/test split\n\\index{classification!training/test split}\n\nSplitting the data into training and test is an essential way to protect against overfitting, for most classifiers, but especially so for the copiously parameterised NNs. The model specified for the penguins data with only two nodes is unlikely to be overfitted, but it is nevertheless good practice to use a training set for building and a test set for evaluation. \n\n@fig-p-split-html shows the tour being used to examine the split into training and test samples for the penguins data. Using random sampling, particularly stratified by group, should result in the two sets being very similar, as can be seen here. 
It does happen that several observations in the test set are on the extremes of their class cluster, so it could be that the model makes errors in the neighbourhoods of these points.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Split the data intro training and testing\nlibrary(ggthemes)\nlibrary(dplyr)\nlibrary(tidyr)\nlibrary(rsample)\nlibrary(ggbeeswarm)\n\nload(\"data/penguins_sub.rda\") # from mulgar book\n\nset.seed(821)\np_split <- penguins_sub %>% \n select(bl:species) %>%\n initial_split(prop = 2/3, \n strata=species)\np_train <- training(p_split)\np_test <- testing(p_split)\n\n# Check training and test split\np_split_check <- bind_rows(\n bind_cols(p_train, type = \"train\"), \n bind_cols(p_test, type = \"test\")) %>%\n mutate(type = factor(type))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code to run tours\"}\nanimate_xy(p_split_check[,1:4], \n col=p_split_check$species,\n pch=p_split_check$type)\nanimate_xy(p_split_check[,1:4], \n guided_tour(lda_pp(p_split_check$species)),\n col=p_split_check$species,\n pch=p_split_check$type)\nrender_gif(p_split_check[,1:4],\n grand_tour(),\n display_xy( \n col=p_split_check$species, \n pch=p_split_check$type,\n cex=1.5,\n axes=\"bottomleft\"), \n gif_file=\"gifs/p_split.gif\",\n frames=500,\n loop=FALSE\n)\nrender_gif(p_split_check[,1:4],\n guided_tour(lda_pp(p_split_check$species)),\n display_xy( \n col=p_split_check$species, \n pch=p_split_check$type,\n cex=1.5,\n axes=\"bottomleft\"), \n gif_file=\"gifs/p_split_guided.gif\",\n frames=500,\n loop=FALSE\n)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-split-html layout-ncol=2}\n\n![Grand tour](gifs/p_split.gif){#fig-split-grand fig-alt=\"FIX ME\" width=300}\n\n![Guided tour](gifs/p_split_guided.gif){#fig-split-guided fig-alt=\"FIX ME\" width=300}\n\nEvaluating the training/test split, where we expect that the two samples should roughly match. 
There are a few observations in the test set that are on the outer edges of the clusters, which will likely result in the model making an error in these regions. However, the two samples roughly match.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-split-pdf layout-ncol=2}\n\n![Grand tour](images/p_split.png){#fig-split-grand fig-alt=\"FIX ME\" width=220}\n\n![Guided tour](images/p_split_guided.png){#fig-split-guided fig-alt=\"FIX ME\" width=220}\n\nEvaluating the training/test split, where we expect that the two samples should roughly match. There are a few observations in the test set that are on the outer edges of the clusters, which will likely result in the model making an error in these regions. However, the two samples roughly match.\n:::\n:::\n\n## Fit the model\n\\index{classification!Fitting a NN}\n\nThe data needs to be specially formatted for the model fitted using `keras`. The explanatory variables need to be provided as a `matrix`, and the categorical response needs to be separate, and specified as a `numeric` variable, beginning with 0. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Data needs to be matrix, and response needs to be numeric\np_train_x <- p_train %>%\n select(bl:bm) %>%\n as.matrix()\np_train_y <- p_train %>% pull(species) %>% as.numeric() \np_train_y <- p_train_y-1 # Needs to be 0, 1, 2\np_test_x <- p_test %>%\n select(bl:bm) %>%\n as.matrix()\np_test_y <- p_test %>% pull(species) %>% as.numeric() \np_test_y <- p_test_y-1 # Needs to be 0, 1, 2\n```\n:::\n\n\nThe specified model is reasonably simple, four input variables, two nodes in the hidden layer and a three column binary matrix for output. This corresponds to 5+5+3+3+3=19 parameters. 
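This count can be checked directly: a dense layer with $n$ inputs and $u$ units has $(n+1)u$ parameters ($nu$ weights plus $u$ biases). A small base-R sketch (the function name is ours, not part of `keras`):

```r
# Parameters in one dense layer: weights (inputs x units) plus
# one bias per unit
dense_params <- function(inputs, units) (inputs + 1) * units

dense_params(4, 2)  # hidden layer: 4 inputs, 2 nodes -> 10
dense_params(2, 3)  # output layer: 2 inputs, 3 nodes -> 9
dense_params(4, 2) + dense_params(2, 3)  # 19 in total
```

The per-layer values match the `Param #` column in the model summary.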
\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nModel: \"sequential\"\n________________________________________________________________________________\n Layer (type) Output Shape Param # \n================================================================================\n dense_1 (Dense) (None, 2) 10 \n dense (Dense) (None, 3) 9 \n================================================================================\nTotal params: 19 (76.00 Byte)\nTrainable params: 19 (76.00 Byte)\nNon-trainable params: 0 (0.00 Byte)\n________________________________________________________________________________\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit model\np_nn_fit <- p_nn_model %>% keras::fit(\n x = p_train_x, \n y = p_train_y,\n epochs = 200,\n verbose = 0\n)\n```\n:::\n\n::: {.cell}\n\n:::\n\n\nBecause we set the random number seed we will get the same fit each time the code provided here is run. However, if the model is re-fit without setting the seed, you will see that there is a surprising amount of variability in the fits. Setting `epochs = 200` helps to usually get a good fit. One expects that `keras` is reasonably stable so one would not expect the huge array of fits as observed in @wickham2015 using `nnet`. That this can happen with the simple model used here reinforces the notion that fitting of NN models is fiddly, and great care needs to be taken to validate and diagnose the fit. \n\n::: {.content-visible when-format=\"html\"}\n::: info\nFitting NN models is fiddly, and very different fitted models can result from restarts, parameter choices, and architecture. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Fitting NN models is fiddly, and very different fitted models can result from restarts, parameter choices, and architecture. 
\n}\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(keras)\nlibrary(ggplot2)\nlibrary(colorspace)\n\n# load fitted model\np_nn_model <- load_model_tf(\"data/penguins_cnn\")\n```\n:::\n\n\nThe fitted model that we have chosen as the final one has reasonably small loss and high accuracy. Loss and accuracy across epochs can be plotted to show how the fit progressed, but we don't show them here because they are generally not very interesting.\n\n\n::: {.cell}\n\n```{.r .cell-code}\np_nn_model %>% evaluate(p_test_x, p_test_y, verbose = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n loss accuracy \n0.2563850 0.9553571 \n```\n\n\n:::\n:::\n\n\nThe model object can be saved for later use with:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsave_model_tf(p_nn_model, \"data/penguins_cnn\")\n```\n:::\n\n\n## Extracting model components\n\\index{classification!hidden layers}\n\n::: {.content-visible when-format=\"html\"}\n::: info\nView the individual node models to understand how they combine to produce the overall model.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{View the individual node models to understand how they combine to produce the overall model.\n}\n:::\n\nBecause nodes in the hidden layers of NNs are themselves (relatively simple regression) models, it can be interesting to examine these to understand how the model is making its predictions. Although it's rarely easy, most software will allow the coefficients for the models at these nodes to be extracted.
With the penguins NN model there are two nodes, so we can extract the coefficients and plot the resulting two linear combinations to examine the separation between classes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Extract hidden layer model weights\np_nn_wgts <- keras::get_weights(p_nn_model, trainable=TRUE)\np_nn_wgts \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n [,1] [,2]\n[1,] 0.6216676 1.33304155\n[2,] 0.1851478 -0.01596385\n[3,] -0.1680396 -0.30432791\n[4,] -0.8867414 -0.36627045\n\n[[2]]\n[1] 0.12708087 -0.09466381\n\n[[3]]\n [,1] [,2] [,3]\n[1,] -0.1646167 1.527644 -1.9215064\n[2,] -0.7547278 1.555889 0.3210194\n\n[[4]]\n[1] 0.4554813 -0.9371488 0.3577386\n```\n\n\n:::\n:::\n\n\nThe linear coefficients for the first node in the model are 0.62, 0.19, -0.17, -0.89, and for the second node are 1.33, -0.02, -0.3, -0.37. We can use these like we used the linear discriminants in LDA to make a 2D view of the data, where the model is separating the three species. The constants 0.13, -0.09 are not important for this. They are only useful for drawing the location of the boundaries between classes produced by the model.\n\nThese two sets of model coefficients provide linear combinations of the original variables. Together, they define a plane on which the data is projected to view the classification produced by the model. Ideally, though, this plane should be defined using an orthonormal basis, otherwise the shape of the data distribution might be warped.
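Orthonormalisation can also be done with base R's QR decomposition. A small sketch checking that the result has orthonormal columns, with the weight values copied (rounded) from the printed output above; note that `qr.Q()` may differ from `tourr::orthonormalise()` by column signs:

```r
# Orthonormalise a p x d weight matrix via QR decomposition;
# qr.Q() returns an orthonormal basis for the column space.
W <- matrix(c(0.62, 0.19, -0.17, -0.89,
              1.33, -0.02, -0.30, -0.37), ncol = 2)
B <- qr.Q(qr(W))
round(t(B) %*% B, 10)  # orthonormal columns: the 2 x 2 identity
```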
So we orthonormalise this matrix before computing the data projection.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Orthonormalise the weights to make 2D projection\np_nn_wgts_on <- tourr::orthonormalise(p_nn_wgts[[1]])\np_nn_wgts_on\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 0.5593355 0.7969849\n[2,] 0.1665838 -0.2145664\n[3,] -0.1511909 -0.1541475\n[4,] -0.7978314 0.5431528\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Hidden layer\np_train_m <- p_train %>%\n mutate(nn1 = as.matrix(p_train[,1:4]) %*%\n as.matrix(p_nn_wgts_on[,1], ncol=1),\n nn2 = as.matrix(p_train[,1:4]) %*%\n matrix(p_nn_wgts_on[,2], ncol=1))\n\n# Now add the test points on.\np_test_m <- p_test %>%\n mutate(nn1 = as.matrix(p_test[,1:4]) %*%\n as.matrix(p_nn_wgts_on[,1], ncol=1),\n nn2 = as.matrix(p_test[,1:4]) %*%\n matrix(p_nn_wgts_on[,2], ncol=1))\np_train_m <- p_train_m %>%\n mutate(set = \"train\")\np_test_m <- p_test_m %>%\n mutate(set = \"test\")\np_all_m <- bind_rows(p_train_m, p_test_m)\nggplot(p_all_m, aes(x=nn1, y=nn2, \n colour=species, shape=set)) + \n geom_point() +\n scale_colour_discrete_divergingx(palette=\"Zissou 1\") +\n scale_shape_manual(values=c(16, 1)) +\n theme_minimal() +\n theme(aspect.ratio=1)\n```\n\n::: {.cell-output-display}\n![Plot of the data in the linear combinations from the two nodes in the hidden layer. The three species are clearly different, although with some overlap between all three. A main issue to notice is that there isn't a big gap between Gentoo and the other species, which we know is there based on our data exploration done in other chapters. This suggests this fitted model is sub-optimal.](17-nn_files/figure-html/fig-hidden-layer-1.png){#fig-hidden-layer fig-alt='FIXME' width=80%}\n:::\n:::\n\n\n@fig-hidden-layer shows the data projected into the plane determined by the two linear combinations of the two nodes in the hidden layer. 
Training and test sets are indicated by empty and solid circles. The three species are clearly different but there is some overlap or confusion for a few penguins. The most interesting aspect to learn is that there is no big gap between the Gentoo and other species, which we know exists in the data. The model has not found this gap, and is thus likely to erroneously confuse some Gentoo penguins, particularly with Adelie.\n\nWhat we have shown here is a process to use the models at the nodes of the hidden layer to produce a reduced dimensional space where the classes are best separated, at least as determined by the model. The process will work in higher dimensions also. \n\nWhen there are more nodes in the hidden layer than original variables, the space is extended to achieve useful classifications that need more complicated non-linear boundaries. The extra nodes describe the non-linearity. @wickham2015 provides a good illustration of this in 2D. The process of examining each of the node models can be useful for understanding this non-linear separation, also in high dimensions.\n\n## Examining predictive probabilities\n\\index{classification!predictive probabilities}\n\nWhen the predictive probabilities are returned by a model, as is done by this NN, we can use a ternary diagram for three class problems, or a high-dimensional simplex when there are more classes, to examine the strength of the classification. This is done in the same way that was used for the votes matrix from a random forest in @sec-votes.
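The geometry behind this is that $k$-class probability vectors live in a $(k-1)$-dimensional simplex, which a Helmert basis maps into $(k-1)$ coordinates. A minimal sketch in base R, with the $3\times 3$ Helmert matrix built by hand as a stand-in for `geozoo::f_helmert(3)`:

```r
# Project 3-class probability vectors into 2D with a Helmert
# basis, so the probability simplex becomes a ternary diagram.
helmert3 <- rbind(c(1,  1,  1) / sqrt(3),
                  c(1, -1,  0) / sqrt(2),
                  c(1,  1, -2) / sqrt(6))
proj <- t(helmert3[-1, ])           # drop the constant row
p <- matrix(c(1/3, 1/3, 1/3), nrow = 1)
p %*% proj                          # equal uncertainty maps to the origin
```

A perfectly uncertain prediction, probability $1/3$ for each class, lands at the centre of the ternary diagram; confident predictions land near the vertices.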
\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Predict training and test set\np_train_pred <- p_nn_model %>% \n predict(p_train_x, verbose = 0)\np_train_pred_cat <- levels(p_train$species)[\n apply(p_train_pred, 1,\n which.max)]\np_train_pred_cat <- factor(\n p_train_pred_cat,\n levels=levels(p_train$species))\ntable(p_train$species, p_train_pred_cat)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n p_train_pred_cat\n Adelie Chinstrap Gentoo\n Adelie 92 4 1\n Chinstrap 0 45 0\n Gentoo 1 0 78\n```\n\n\n:::\n\n```{.r .cell-code}\np_test_pred <- p_nn_model %>% \n predict(p_test_x, verbose = 0)\np_test_pred_cat <- levels(p_test$species)[\n apply(p_test_pred, 1, \n which.max)]\np_test_pred_cat <- factor(\n p_test_pred_cat,\n levels=levels(p_test$species))\ntable(p_test$species, p_test_pred_cat)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n p_test_pred_cat\n Adelie Chinstrap Gentoo\n Adelie 45 3 1\n Chinstrap 0 23 0\n Gentoo 1 0 39\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\n# Set up the data to make the ternary diagram\n# Join data sets\ncolnames(p_train_pred) <- c(\"Adelie\", \"Chinstrap\", \"Gentoo\")\ncolnames(p_test_pred) <- c(\"Adelie\", \"Chinstrap\", \"Gentoo\")\np_train_pred <- as_tibble(p_train_pred)\np_train_m <- p_train_m %>%\n mutate(pspecies = p_train_pred_cat) %>%\n bind_cols(p_train_pred) %>%\n mutate(set = \"train\")\np_test_pred <- as_tibble(p_test_pred)\np_test_m <- p_test_m %>%\n mutate(pspecies = p_test_pred_cat) %>%\n bind_cols(p_test_pred) %>%\n mutate(set = \"test\")\np_all_m <- bind_rows(p_train_m, p_test_m)\n\n# Add simplex to make ternary\nlibrary(geozoo)\nproj <- t(geozoo::f_helmert(3)[-1,])\np_nn_v_p <- as.matrix(p_all_m[,c(\"Adelie\", \"Chinstrap\", \"Gentoo\")]) %*% proj\ncolnames(p_nn_v_p) <- c(\"x1\", \"x2\")\np_nn_v_p <- p_nn_v_p %>%\n as.data.frame() %>%\n mutate(species = p_all_m$species,\n set = p_all_m$set)\n\nsimp <- geozoo::simplex(p=2)\nsp <- data.frame(cbind(simp$points), 
simp$points[c(2,3,1),])\ncolnames(sp) <- c(\"x1\", \"x2\", \"x3\", \"x4\")\nsp$species = sort(unique(penguins_sub$species))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\n# Plot it\nggplot() +\n geom_segment(data=sp, aes(x=x1, y=x2, xend=x3, yend=x4)) +\n geom_text(data=sp, aes(x=x1, y=x2, label=species),\n nudge_x=c(-0.1, 0.15, 0),\n nudge_y=c(0.05, 0.05, -0.05)) +\n geom_point(data=p_nn_v_p, aes(x=x1, y=x2, \n colour=species,\n shape=set), \n size=2, alpha=0.5) +\n scale_color_discrete_divergingx(palette=\"Zissou 1\") +\n scale_shape_manual(values=c(19, 1)) +\n theme_map() +\n theme(aspect.ratio=1, legend.position = \"right\")\n```\n\n::: {.cell-output-display}\n![Ternary diagram for the three groups of the predictive probabilities of both training and test sets. From what we already know about the penguins data this fit is not good. Both Chinstrap and Gentoo penguins are confused with Adelie, or at risk of it. Gentoo is very well-separated from the other two species when several variables are used, and this fitted model is blind to it. One useful finding is that there is little difference between training and test sets, so the model has not been over-fitted.](17-nn_files/figure-html/unnamed-chunk-17-1.png){width=70%}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nIf the training and test sets look similar when plotted in the model space then the model is not suffering from over-fitting.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{If the training and test sets look similar when plotted in the model space then the model is not suffering from over-fitting. \n}\n:::\n\n## Local explanations\n\\index{classification!local explanations}\n\\index{classification!XAI}\n\nIt is especially important to be able to interpret or explain a model, even more so when the model is complex or a black box. A good resource for learning about the range of methods is @iml.
Local explanations provide some information about variables that are important for making the prediction for a particular observation. The method that we use here is the Shapley value, as computed using the `kernelshap` package [@kernelshap]. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Explanations\n# https://www.r-bloggers.com/2022/08/kernel-shap/\nlibrary(kernelshap)\nlibrary(shapviz)\np_explain <- kernelshap(\n p_nn_model,\n p_train_x, \n bg_X = p_train_x,\n verbose = FALSE\n )\np_exp_sv <- shapviz(p_explain)\nsave(p_exp_sv, file=\"data/p_exp_sv.rda\")\n```\n:::\n\n\nA Shapley value for an observation indicates how the variable contributes to the model prediction for that observation, relative to other variables. It is an average, computed from the change in prediction across all combinations of presence or absence of the other variables. In the computation, for each combination, the prediction is computed by substituting absent variables with their average value, like one might do when imputing missing values. \n\n@fig-shapley-pcp shows the Shapley values for Gentoo observations (both training and test sets) in the penguins data, as a parallel coordinate plot. The values for the single misclassified Gentoo penguin (in the training set) are coloured orange. Overall, the Shapley values don't vary much on `bl`, `bd` and `fl` but they do on `bm`. That is, the presence or absence of other variables seems to matter only for `bm`. \n\nFor the misclassified penguin, the effect of `bm` for all combinations of other variables leads to a decline in predicted value, thus less confidence in it being a Gentoo. In contrast, for this same penguin, considering the effect of `bl`, the predicted value increases on average.
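To make that averaging concrete, here is a brute-force Shapley computation for a single observation, substituting absent variables by their means. This is a toy version of what `kernelshap` approximates; the function `shapley` and the linear model `f` are ours, for illustration only:

```r
# Brute-force Shapley value for one observation x, given a
# prediction function f and training data X: average the change
# in prediction from adding variable j over all subsets of the
# other variables, with absent variables set to their means.
shapley <- function(f, x, X) {
  p <- length(x); mu <- colMeans(X)
  phi <- numeric(p)
  subsets <- expand.grid(rep(list(c(FALSE, TRUE)), p))
  for (j in seq_len(p)) {
    S <- subsets[!subsets[, j], , drop = FALSE]   # subsets excluding j
    for (k in seq_len(nrow(S))) {
      s <- unlist(S[k, ])
      w <- factorial(sum(s)) * factorial(p - sum(s) - 1) / factorial(p)
      z1 <- ifelse(s | seq_len(p) == j, x, mu)    # j present
      z0 <- ifelse(s, x, mu)                      # j absent
      phi[j] <- phi[j] + w * (f(z1) - f(z0))
    }
  }
  phi
}

set.seed(211)
X <- matrix(rnorm(300), ncol = 3)
f <- function(z) 1 + 2 * z[1] - 3 * z[2] + 0.5 * z[3]
x <- X[1, ]
shapley(f, x, X)
```

For a linear model the Shapley value of variable $j$ reduces to $\beta_j(x_j-\bar{x}_j)$, which this brute-force version reproduces.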
\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nload(\"data/p_exp_sv.rda\")\np_exp_gentoo <- p_exp_sv$Class_3$S\np_exp_gentoo <- p_exp_gentoo %>%\n as_tibble() %>%\n mutate(species = p_train$species,\n pspecies = p_train_pred_cat,\n ) %>%\n mutate(error = ifelse(species == pspecies, 0, 1))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\np_exp_gentoo %>%\n filter(species == \"Gentoo\") %>%\n pivot_longer(bl:bm, names_to=\"var\", values_to=\"shap\") %>%\n mutate(var = factor(var, levels=c(\"bl\", \"bd\", \"fl\", \"bm\"))) %>%\n ggplot(aes(x=var, y=shap, colour=factor(error))) +\n geom_quasirandom(alpha=0.8) +\n scale_colour_discrete_divergingx(palette=\"Geyser\") +\n #facet_wrap(~var) +\n xlab(\"\") + ylab(\"SHAP\") +\n theme_minimal() + \n theme(legend.position = \"none\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(ggpcp)\np_exp_gentoo %>%\n filter(species == \"Gentoo\") %>%\n pcp_select(1:4) %>%\n ggplot(aes_pcp()) +\n geom_pcp_axes() + \n geom_pcp_boxes(fill=\"grey80\") + \n geom_pcp(aes(colour = factor(error)), \n linewidth = 2, alpha=0.3) +\n scale_colour_discrete_divergingx(palette=\"Geyser\") +\n xlab(\"\") + ylab(\"SHAP\") +\n theme_minimal() + \n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![SHAP values focused on Gentoo class, for each variable. The one misclassified penguin (orange) has a much lower value for body mass, suggesting that this variable is used differently for the prediction than for other penguins.](17-nn_files/figure-html/fig-shapley-pcp-1.png){#fig-shapley-pcp fig-alt='FIXME' width=80%}\n:::\n:::\n\n\nIf we examine the data [@fig-penguins-bl-bm-bd] the explanation makes some sense. The misclassified penguin has an unusually small value on `bm`. That the SHAP value for `bm` was quite different pointed to this being a potential issue with the model, particularly for this penguin. 
This penguin's prediction is negatively impacted by `bm` being in the model.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(patchwork)\n# Check position on bm\nshap_proj <- p_exp_gentoo %>%\n filter(species == \"Gentoo\", error == 1) %>%\n select(bl:bm)\nshap_proj <- as.matrix(shap_proj/sqrt(sum(shap_proj^2)))\np_exp_gentoo_proj <- p_exp_gentoo %>%\n rename(shap_bl = bl, \n shap_bd = bd,\n shap_fl = fl, \n shap_bm = bm) %>%\n bind_cols(as_tibble(p_train_x)) %>%\n mutate(shap1 = shap_proj[1]*bl+\n shap_proj[2]*bd+\n shap_proj[3]*fl+\n shap_proj[4]*bm)\nsp1 <- ggplot(p_exp_gentoo_proj, aes(x=bm, y=bl, \n colour=species, \n shape=factor(1-error))) +\n geom_point(alpha=0.8) +\n scale_colour_discrete_divergingx(palette=\"Zissou 1\") +\n scale_shape_manual(\"error\", values=c(19, 1)) +\n theme_minimal() + \n theme(aspect.ratio=1, legend.position=\"bottom\")\nsp2 <- ggplot(p_exp_gentoo_proj, aes(x=bm, y=shap1, \n colour=species, \n shape=factor(1-error))) +\n geom_point(alpha=0.8) +\n scale_colour_discrete_divergingx(palette=\"Zissou 1\") +\n scale_shape_manual(\"error\", values=c(19, 1)) +\n ylab(\"SHAP\") +\n theme_minimal() + \n theme(aspect.ratio=1, legend.position=\"bottom\")\nsp2 <- ggplot(p_exp_gentoo_proj, aes(x=shap1, \n fill=species, colour=species)) +\n geom_density(alpha=0.5) +\n geom_vline(xintercept = p_exp_gentoo_proj$shap1[\n p_exp_gentoo_proj$species==\"Gentoo\" &\n p_exp_gentoo_proj$error==1], colour=\"black\") +\n scale_fill_discrete_divergingx(palette=\"Zissou 1\") +\n scale_colour_discrete_divergingx(palette=\"Zissou 1\") +\n theme_minimal() + \n theme(aspect.ratio=1, legend.position=\"bottom\")\nsp2 <- ggplot(p_exp_gentoo_proj, aes(x=bm, y=bd, \n colour=species, \n shape=factor(1-error))) +\n geom_point(alpha=0.8) +\n scale_colour_discrete_divergingx(palette=\"Zissou 1\") +\n scale_shape_manual(\"error\", values=c(19, 1)) +\n theme_minimal() + \n theme(aspect.ratio=1, legend.position=\"bottom\")\nsp1 + sp2 + plot_layout(ncol=2, 
guides = \"collect\") &\n theme(legend.position=\"bottom\",\n legend.direction=\"vertical\")\n```\n\n::: {.cell-output-display}\n![Plots of the data to help understand what the SHAP values indicate. The misclassified Gentoo penguin has an unusually low body mass value which makes it appear to be more like an Adelie penguin, particularly when considered in relation to it's bill length.](17-nn_files/figure-html/fig-penguins-bl-bm-bd-1.png){#fig-penguins-bl-bm-bd fig-alt='FIXME' width=100%}\n:::\n:::\n\n\n## Examining boundaries\n\n\n@fig-penguins-lda-nn shows the boundaries for this NN model along with those of the LDA model. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\n# Generate grid over explanatory variables\np_grid <- tibble(\n bl = runif(10000, min(penguins_sub$bl), max(penguins_sub$bl)),\n bd = runif(10000, min(penguins_sub$bd), max(penguins_sub$bd)),\n fl = runif(10000, min(penguins_sub$fl), max(penguins_sub$fl)),\n bm = runif(10000, min(penguins_sub$bm), max(penguins_sub$bm))\n)\n# Predict grid\np_grid_pred <- p_nn_model %>%\n predict(as.matrix(p_grid), verbose=0)\np_grid_pred_cat <- levels(p_train$species)[apply(p_grid_pred, 1, which.max)]\np_grid_pred_cat <- factor(p_grid_pred_cat,\n levels=levels(p_train$species))\n\n# Project into weights from the two nodes\np_grid_proj <- as.matrix(p_grid) %*% p_nn_wgts_on\ncolnames(p_grid_proj) <- c(\"nn1\", \"nn2\")\np_grid_proj <- p_grid_proj %>% \n as_tibble() %>%\n mutate(species = p_grid_pred_cat)\n\n# Plot\nggplot(p_grid_proj, aes(x=nn1, y=nn2, \n colour=species)) + \n geom_point(alpha=0.5) +\n geom_point(data=p_all_m, aes(x=nn1, \n y=nn2, \n shape=species),\n inherit.aes = FALSE) +\n scale_colour_discrete_divergingx(palette=\"Zissou 1\") +\n scale_shape_manual(values=c(1, 2, 3)) +\n theme_minimal() +\n theme(aspect.ratio=1, \n legend.position = \"bottom\",\n legend.title = element_blank())\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-lda-nn-html 
layout-ncol=2}\n\n![LDA model](gifs/penguins_lda_boundaries.gif){#fig-lda-boundary fig-alt=\"FIX ME\" width=300}\n\n![NN model](gifs/penguins_nn_boundaries.gif){#fig-nn-boundary-html fig-alt=\"FIX ME\" width=300}\n\nComparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour. \n:::\n:::\n\n::: {#fig-penguins-lda-nn layout-ncol=2}\n\n![LDA model](images/fig-lda-2D-boundaries-1.png){#fig-lda-boundary2 fig-alt=\"FIX ME\" width=200}\n\n![NN model](images/penguins-nn-boundaries-1.png){#fig-nn-boundary fig-alt=\"FIX ME\" width=290}\n\nComparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour. \n:::\n\n\\index{tour!slice} \n\n## Application to a large dataset\n\nTo see how these methods apply in the setting where we have a large number of variables, observations and classes, we will look at a neural network that predicts the category for the fashion MNIST data. The code for designing and fitting the model follows the tutorial available at https://tensorflow.rstudio.com/tutorials/keras/classification, where you can find additional information. Below we only replicate the steps needed to build the model from scratch. We also note that a similar investigation was presented in @li2020visualizing, with a focus on investigating the model at different epochs during training.\n\\index{data!fashion MNIST}\n\nThe first step is to download and prepare the data.
Here we scale the observations to range between zero and one, and we define the label names.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(keras)\n\n# download the data\nfashion_mnist <- dataset_fashion_mnist()\n\n# split into input variables and response\nc(train_images, train_labels) %<-% fashion_mnist$train\nc(test_images, test_labels) %<-% fashion_mnist$test\n\n# for interpretation we also define the category names\nclass_names = c('T-shirt/top',\n 'Trouser',\n 'Pullover',\n 'Dress',\n 'Coat',\n 'Sandal',\n 'Shirt',\n 'Sneaker',\n 'Bag',\n 'Ankle boot')\n\n# rescaling to the range (0,1)\ntrain_images <- train_images / 255\ntest_images <- test_images / 255\n```\n:::\n\n\nIn the next step we define the neural network and train the model. Note that because we have many observations, even a very simple structure returns a good model. And because this example is well-known, we do not need to tune the model or check the validation accuracy.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# defining the model\nmodel_fashion_mnist <- keras_model_sequential()\nmodel_fashion_mnist %>%\n # flatten the image data into a long vector\n layer_flatten(input_shape = c(28, 28)) %>%\n # hidden layer with 128 units\n layer_dense(units = 128, activation = 'relu') %>%\n # output layer for 10 categories\n layer_dense(units = 10, activation = 'softmax')\n\nmodel_fashion_mnist %>% compile(\n optimizer = 'adam',\n loss = 'sparse_categorical_crossentropy',\n metrics = c('accuracy')\n)\n\n# fitting the model, if we did not know the model yet we\n# would add a validation split to diagnose the training\nmodel_fashion_mnist %>% fit(train_images,\n train_labels,\n epochs = 5)\nsave_model_tf(model_fashion_mnist, \"data/fashion_nn\")\n```\n:::\n\n\nWe have defined a flat neural network with a single hidden layer with 128 nodes. To investigate the model we can start by comparing the activations to the original input data distribution. 
Since both the input space and the space of activations are large, and they are of different dimensionality, we will first use principal component analysis. This simplifies the analysis, and in general we do not need the original pixel or hidden node information for the interpretation here. The comparison uses the test subset of the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# get the fitted model\nmodel_fashion_mnist <- load_model_tf(\"data/fashion_nn\")\n# observed response labels in the test set\ntest_tags <- factor(class_names[test_labels + 1],\n levels = class_names)\n\n# calculate activation for the hidden layer, this can be done\n# within the keras framework\nactivations_model_fashion <- keras_model(\n inputs = model_fashion_mnist$input,\n outputs = model_fashion_mnist$layers[[2]]$output\n)\nactivations_fashion <- predict(\n activations_model_fashion,\n test_images, verbose = 0)\n\n# PCA for activations\nactivations_pca <- prcomp(activations_fashion)\nactivations_pc <- as.data.frame(activations_pca$x)\n\n# PCA on the original data\n# we first need to flatten the image input\ntest_images_flat <- test_images\ndim(test_images_flat) <- c(nrow(test_images_flat), 784)\nimages_pca <- prcomp(as.data.frame(test_images_flat))\nimages_pc <- as.data.frame(images_pca$x)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code to make plots\"}\np2 <- ggplot(activations_pc,\n aes(PC1, PC2, color = test_tags)) +\n geom_point(size = 0.1) +\n ggtitle(\"Activations\") +\n scale_color_discrete_qualitative(palette = \"Dynamic\") +\n theme_bw() +\n theme(legend.position = \"none\", aspect.ratio = 1)\n\np1 <- ggplot(images_pc,\n aes(PC1, PC2, color = test_tags)) +\n geom_point(size = 0.1) +\n ggtitle(\"Input space\") +\n scale_color_discrete_qualitative(palette = \"Dynamic\") +\n theme_bw() +\n theme(legend.position = \"none\", aspect.ratio = 1)\n\nlegend_labels <- cowplot::get_legend(\n p1 + \n guides(color = guide_legend(nrow = 1)) +\n
theme(legend.position = \"bottom\",\n legend.title = element_blank()) +\n guides(color = guide_legend(override.aes = list(size = 1))) \n)\n# hide plotting code\ncowplot::plot_grid(cowplot::plot_grid(p1, p2), legend_labels,\n rel_heights = c(1, .3), nrow = 2)\n```\n\n::: {.cell-output-display}\n![](17-nn_files/figure-html/unnamed-chunk-27-1.png){width=672}\n:::\n:::\n\n\nLooking only at the first two principal components we note some clear differences from the transformation in the hidden layer. The observations seem to be more evenly spread in the input space, while in the activations space we notice grouping along specific directions. In particular the category \"Bag\" appears to be most different from all other classes, and the non-linear transformation in the activations space shows that they are clearly different from the shoe categories, while in the input space we could note some overlap in the linear projection. To better identify differences between other groups we will use the tour on the first five principal components.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code to run tours\"}\nanimate_xy(images_pc[,1:5], col = test_tags,\n cex=0.2, palette = \"Dynamic\")\nanimate_xy(activations_pc[,1:5], col = test_tags,\n cex=0.2, palette = \"Dynamic\")\n\nrender_gif(images_pc[,1:5],\n grand_tour(),\n display_xy( \n col=test_tags, \n cex=0.2,\n palette = \"Dynamic\",\n axes=\"bottomleft\"), \n gif_file=\"gifs/fashion_images_gt.gif\",\n frames=500,\n loop=FALSE\n)\nrender_gif(activations_pc[,1:5],\n grand_tour(),\n display_xy( \n col=test_tags, \n cex=0.2,\n palette = \"Dynamic\",\n axes=\"bottomleft\"), \n gif_file=\"gifs/fashion_activations_gt.gif\",\n frames=500,\n loop=FALSE\n)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-fashion-gt-html layout-ncol=2}\n\n![Input space](gifs/fashion_images_gt.gif){#fig-fashion-input fig-alt=\"FIX ME\" 
width=200}\n\n![Activations](gifs/fashion_activations_gt.gif){#fig-fashion-activation fig-alt=\"FIX ME\" width=200}\n\nComparison of the test observations in the first five principal components of the input space (left) and in the hidden layer activations (right). The activation function results in more clearly defined grouping of the different classes.\n:::\n:::\n\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-fashion-gt-pdf layout-ncol=2}\n\n![Input space](images/fashion_images_gt_36.png){#fig-fashion-input fig-alt=\"FIX ME\" width=200}\n\n![Activations](images/fashion_activation_gt_126.png){#fig-fashion-activation fig-alt=\"FIX ME\" width=200}\n\nComparison of the test observations in the first five principal components of the input space (left) and in the hidden layer activations (right). The activation function results in more clearly defined grouping of the different classes.\n:::\n:::\n\nAs with the first two principal components we get a much more spread out distribution in the original space. Nevertheless we can see differences between the classes, and that some groups are varying along specific directions in that space. Overall the activations space shows tighter clusters as expected after including the ReLU activation function, but the picture is not as neat as the first two principal components would suggest. While certain groups appear very compact even in this larger subspace, others vary quite a bit within part of the space. For example we can clearly see the \"Bag\" observations as different from all other images, but also notice that there is a large variation within this class along certain directions.\n\nFinally we will investigate the model performance through the misclassifications and uncertainty between classes. We start with the error matrix for the test observations.
To compute the error matrix we use the numeric labels; the ordering is as defined above for the label names.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfashion_test_pred <- predict(model_fashion_mnist,\n test_images, verbose = 0)\nfashion_test_pred_cat <- levels(test_tags)[\n apply(fashion_test_pred, 1,\n which.max)]\npredicted <- factor(\n fashion_test_pred_cat,\n levels=levels(test_tags)) %>%\n as.numeric() - 1\nobserved <- as.numeric(test_tags) -1\ntable(observed, predicted)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n predicted\nobserved 0 1 2 3 4 5 6 7 8 9\n 0 128 0 84 719 10 0 51 7 1 0\n 1 0 42 15 941 0 0 0 2 0 0\n 2 6 0 824 38 121 0 3 8 0 0\n 3 1 0 14 976 8 0 1 0 0 0\n 4 1 0 205 181 605 0 0 8 0 0\n 5 1 0 0 2 0 77 0 902 1 17\n 6 24 0 231 282 405 0 44 12 2 0\n 7 0 0 0 0 0 0 0 1000 0 0\n 8 17 1 78 102 49 0 5 730 16 2\n 9 0 0 0 2 0 0 0 947 0 51\n```\n\n\n:::\n:::\n\n\nFrom this we see that the model mainly confuses certain categories with each other, and within expected groups (e.g. different types of shoes can be confused with each other, or different types of shirts).
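Overall accuracy can be read off any error matrix as the sum of the diagonal over the total count. A small sketch in base R, using the penguins test table shown earlier in the chapter:

```r
# Overall accuracy from a confusion matrix: correct predictions
# lie on the diagonal. Values copied from the penguins test set
# table earlier in the chapter (rows = observed, cols = predicted).
cm <- matrix(c(45,  3,  1,
                0, 23,  0,
                1,  0, 39), nrow = 3, byrow = TRUE)
sum(diag(cm)) / sum(cm)  # 107/112, matching the reported test accuracy
```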
We can further investigate this by visualizing the full probability matrix for the test observations, to see which categories the model is uncertain about.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code to visualize probabilities\"}\n# getting the probabilities from the output layer\nfashion_test_pred <- predict(model_fashion_mnist,\n test_images, verbose = 0)\n\n# copying this from RF fake tree vote matrix\nproj <- t(geozoo::f_helmert(10)[-1,])\nf_nn_v_p <- as.matrix(fashion_test_pred) %*% proj\ncolnames(f_nn_v_p) <- c(\"x1\", \"x2\", \"x3\", \"x4\", \"x5\", \"x6\", \"x7\", \"x8\", \"x9\")\n\nf_nn_v_p <- f_nn_v_p %>%\n as.data.frame() %>%\n mutate(class = test_tags)\n\nsimp <- geozoo::simplex(p=9)\nsp <- data.frame(simp$points)\ncolnames(sp) <- c(\"x1\", \"x2\", \"x3\", \"x4\", \"x5\", \"x6\", \"x7\", \"x8\", \"x9\")\nsp$class = \"\"\nf_nn_v_p_s <- bind_rows(sp, f_nn_v_p) %>%\n mutate(class = ifelse(class %in% c(\"T-shirt/top\",\n \"Pullover\",\n \"Shirt\",\n \"Coat\"), class, \"Other\")) %>%\n mutate(class = factor(class, levels=c(\"Other\",\n \"T-shirt/top\",\n \"Pullover\",\n \"Shirt\",\n \"Coat\"))) \n# nicely shows confusion between certain classes is common\nanimate_xy(f_nn_v_p_s[,1:9], col = f_nn_v_p_s$class, \n axes = \"off\", pch = \".\",\n edges = as.matrix(simp$edges),\n edges.width = 0.05,\n palette = \"Lajolla\")\n```\n:::\n\n\nFor this data, using explainers like SHAP is less interesting, since the contributions of individual pixels to a prediction are typically not of interest. With image classification a next step might be to further investigate which part of the image is important for a prediction, and this can be visualized as a heat map placed over the original image. This is especially interesting in the case of difficult or misclassified images. This, however, is beyond the scope of this book.\n\n\n\n\n## Exercises {-}\n\n1.
The problem with the NN model fitted to the penguins is that the Gentoo are poorly classified, when they should be perfectly predictable due to the big gap between class clusters. Re-fit the NN to the penguins data, to find a better model that perfectly predicts Gentoo penguins. Support this by plotting the model (using the hidden layer), and the predictive probabilities as a ternary plot. Do the SHAP values also support that `bd` plays a stronger role in your best model? (`bd` is the main variable for distinguishing Gentoos from the other species, particularly when used with `fl` or `bl`.)\n2. For the fashion MNIST data we have seen that certain categories are more likely to be confused with each other. Select a subset of the data including only the categories Ankle boot, Sneaker and Sandal and see if you can reproduce the analysis of the penguins data in this chapter with this subset.\n3. XXX fake trees, can we think about the number of nodes and make it work with a simple NN similar to penguins data?\n4. The sketches data could also be considered a classic image classification problem, and we have seen that we can get a reasonable accuracy with a random forest model. Because we have a smaller number of observations (compared to the fashion MNIST data), when fitting a neural network we need to be very careful not to overfit the training data. Try fitting a flat neural network (similar to what we did for the fashion MNIST data) and check the test accuracy of the model.\n5. Challenge: try to design a more accurate neural network for the sketches data. Here you can investigate using a convolutional neural network in combination with data augmentation.
In addition, using batch normalization should improve the model performance.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "17-nn_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/17-nn/execute-results/tex.json b/_freeze/17-nn/execute-results/tex.json index 27277a1..62f84c2 100644 --- a/_freeze/17-nn/execute-results/tex.json +++ b/_freeze/17-nn/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "e80e81800e2cb4576a434dc4c07c520c", + "hash": "f21d7d85a8971a46f56c0f72440797c5", "result": { "engine": "knitr", - "markdown": "# Neural networks and deep learning\n\\index{classification!neural networks}\n\nNeural networks (NN) can be considered to be nested additive (or even ensemble) models where explanatory variables are combined, and transformed through an activation function like a logistic. These transformed combinations are added recursively to yield class predictions. They are considered to be black box models, but there is a growing demand for interpretability. Although interpretability is possible, it can be unappealing to understand a complex model constructed to tackle a difficult classification task. Nevertheless, this is the motivation for the explanation of visualisation for NN models in this chapter. \n\nIn the simplest form, we might write the equation for a NN as\n\n$$\n\\hat{y} = f(x) = a_0+\\sum_{h=1}^{s}\nw_{0h}\\phi(a_h+\\sum_{i=1}^{p} w_{ih}x_i)\n$$\nwhere $s$ indicates the number of nodes in the hidden (middle layer), and $\\phi$ is a choice of activation function. 
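This prediction function can be written out directly. A minimal sketch in base R with a logistic activation; `nn_predict` and its argument names are ours, not from any package, and the weights in the example are arbitrary zeros for illustration:

```r
# A single-hidden-layer NN prediction function, directly from
# the equation above: a0 and w0 are the output intercept and
# weights, a and W the hidden-layer intercepts and weights.
phi <- function(z) 1 / (1 + exp(-z))   # logistic activation
nn_predict <- function(x, a0, w0, a, W) {
  a0 + sum(w0 * phi(a + drop(t(W) %*% x)))
}

# p = 3 inputs, s = 2 hidden nodes, all weights zero:
# phi(0) = 0.5, so the prediction is a0 + 0.5 * sum(w0)
nn_predict(c(1, 2, 3), a0 = 1, w0 = c(2, -2),
           a = c(0, 0), W = matrix(0, nrow = 3, ncol = 2))
```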
In a simple situation where $p=3$, $s=2$, and linear output layer, the model could be written as:\n\n$$\n\\begin{aligned}\n\\hat{y} = a_0+ & w_{01}\\phi(a_1+w_{11}x_1+w_{21}x_2+w_{31}x_3) +\\\\\n & w_{02}\\phi(a_2+w_{12}x_1+w_{22}x_2+w_{32}x_3)\n\\end{aligned}\n$$\nwhich is a combination of two (linear) models, each of which could be examined for their role in making predictions. \n\nIn practice, a model may have many nodes, and several hidden layers, a variety of activation functions, and regularisation modifications. One should keep in mind the principle of parsimony is important when applying NNs, because it is tempting to make an overly complex, and thus over-parameterised, construction. Fitting NNs is still problematic. One would hope that fitting produces a stable result, whatever the starting seed the same parameter estimates are returned. However, this is not the case, and different, sometimes radically different, results are routinely obtained after each attempted fit [@wickham2015]. \n\nFor these examples we use the software `keras` [@keras] following the installation and tutorial details at [https://tensorflow.rstudio.com/tutorials/](https://tensorflow.rstudio.com/tutorials/). Because it is an interface to python it can be tricky to install. If this is a problem, the example code should be possible to convert to use `nnet` [@VR02] or `neuralnet` [@neuralnet]. We will use the penguins data to illustrate the fitting, because it makes it easier to understand the procedures and the fit. However, a NN is like using a jackhammer instead of a trowel to plant a seedling, more complicated than necessary to build a good classification model for this data.\n\n## Setting up the model \n\\index{classification!ANN architecture}\n\nA first step is to decide how many nodes the NN architecture should have, and what activation function should be used. To make these decisions, ideally you already have some knowledge of the shapes of class clusters. 
For the penguins classification, we have seen that it contains three elliptically shaped clusters of roughly the same size. This suggests two nodes in the hidden layer would be sufficient to separate three clusters (@fig-nn-diagram). Because the shapes of the clusters are convex, using linear activation (\"relu\") will also be sufficient. The model specification is as follows:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(keras)\ntensorflow::set_random_seed(211)\n\n# Define model\np_nn_model <- keras_model_sequential()\np_nn_model %>% \n layer_dense(units = 2, activation = 'relu', \n input_shape = 4) %>% \n layer_dense(units = 3, activation = 'softmax')\np_nn_model %>% summary\n\nloss_fn <- loss_sparse_categorical_crossentropy(\n from_logits = TRUE)\n\np_nn_model %>% compile(\n optimizer = \"adam\",\n loss = loss_fn,\n metrics = c('accuracy')\n)\n```\n:::\n\n\n\nNote that `tensorflow::set_random_seed(211)` sets the seed for the model fitting so that we can obtain the same result to discuss later. It needs to be set before the model is defined in the code. The model will also be saved in order to diagnose and make predictions. \n\n![Network architecture for the model on the penguins data. The round nodes indicate original or transformed variables, and each arrow connecting these is represented as one of the weights $w_{ih}$ in the definition. The boxes indicate the additive constant entering the nodes, and the corresponding arrows represent the terms $a_h$. ](images/nn-diagram.png){#fig-nn-diagram align=\"center\"}\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Checking the training/test split\n\\index{classification!training/test split}\n\nSplitting the data into training and test is an essential way to protect against overfitting, for most classifiers, but especially so for the copiously parameterised NNs. 
The model specified for the penguins data with only two nodes is unlikely to be overfitted, but it is nevertheless good practice to use a training set for building and a test set for evaluation. \n\n@fig-p-split-pdf shows the tour being used to examine the split into training and test samples for the penguins data. Using random sampling, particularly stratified by group, should result the two sets being very similar, as can be seen here. It does happen that several observations in the test set are on the extremes of their class cluster, so it could be that the model makes errors in the neighbourhoods of these points.\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-split-html layout-ncol=2}\n\n![Grand tour](gifs/p_split.gif){#fig-split-grand fig-alt=\"FIX ME\" width=300}\n\n![Guided tour](gifs/p_split_guided.gif){#fig-split-guided fig-alt=\"FIX ME\" width=300}\n\nEvaluating the training/test split, where we expect that the two samples should roughly match. There are a few observations in the test set that are on the outer edges of the clusters, which will likely result in the model making an error in these regions. However, the two samples roughly match.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-split-pdf layout-ncol=2}\n\n![Grand tour](images/p_split.png){#fig-split-grand fig-alt=\"FIX ME\" width=220}\n\n![Guided tour](images/p_split_guided.png){#fig-split-guided fig-alt=\"FIX ME\" width=220}\n\nEvaluating the training/test split, where we expect that the two samples should roughly match. There are a few observations in the test set that are on the outer edges of the clusters, which will likely result in the model making an error in these regions. However, the two samples roughly match.\n:::\n:::\n\n## Fit the model\n\\index{classification!Fitting a NN}\n\nThe data needs to be specially formatted for the model fitted using `keras`. 
The explanatory variables need to be provided as a `matrix`, and the categorical response needs to be separate, and specified as a `numeric` variable, beginning with 0. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Data needs to be matrix, and response needs to be numeric\np_train_x <- p_train %>%\n select(bl:bm) %>%\n as.matrix()\np_train_y <- p_train %>% pull(species) %>% as.numeric() \np_train_y <- p_train_y-1 # Needs to be 0, 1, 2\np_test_x <- p_test %>%\n select(bl:bm) %>%\n as.matrix()\np_test_y <- p_test %>% pull(species) %>% as.numeric() \np_test_y <- p_test_y-1 # Needs to be 0, 1, 2\n```\n:::\n\n\n\nThe specified model is reasonably simple, four input variables, two nodes in the hidden layer and a three column binary matrix for output. This corresponds to 5+5+3+3+3=19 parameters. \n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nModel: \"sequential\"\n______________________________________________________________\n Layer (type) Output Shape Param # \n==============================================================\n dense_1 (Dense) (None, 2) 10 \n dense (Dense) (None, 3) 9 \n==============================================================\nTotal params: 19 (76.00 Byte)\nTrainable params: 19 (76.00 Byte)\nNon-trainable params: 0 (0.00 Byte)\n______________________________________________________________\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit model\np_nn_fit <- p_nn_model %>% keras::fit(\n x = p_train_x, \n y = p_train_y,\n epochs = 200,\n verbose = 0\n)\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\nBecause we set the random number seed we will get the same fit each time the code provided here is run. However, if the model is re-fit without setting the seed, you will see that there is a surprising amount of variability in the fits. Setting `epochs = 200` helps to usually get a good fit. One expects that `keras` is reasonably stable so one would not expect the huge array of fits as observed in @wickham2015 using `nnet`. 
That this can happen with the simple model used here reinforces the notion that fitting of NN models is fiddly, and great care needs to be taken to validate and diagnose the fit. \n\n::: {.content-visible when-format=\"html\"}\n::: info\nFitting NN models is fiddly, and very different fitted models can result from restarts, parameter choices, and architecture. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Fitting NN models is fiddly, and very different fitted models can result from restarts, parameter choices, and architecture. \n}\n:::\n\n\n\n::: {.cell}\n\n:::\n\n\n\nThe fitted model that we have chosen as the final one has reasonably small loss and high accuracy. Plots of loss and accuracy across epochs showing the change during fitting can be plotted, but we don't show them here, because they are generally not very interesting.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\np_nn_model %>% evaluate(p_test_x, p_test_y, verbose = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n loss accuracy \n0.2563850 0.9553571 \n```\n\n\n:::\n:::\n\n\n\nThe model object can be saved for later use with:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsave_model_tf(p_nn_model, \"data/penguins_cnn\")\n```\n:::\n\n\n\n## Extracting model components\n\\index{classification!hidden layers}\n\n::: {.content-visible when-format=\"html\"}\n::: info\nView the individual node models to understand how they combine to produce the overall model.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{View the individual node models to understand how they combine to produce the overall model.\n}\n:::\n\nBecause nodes in the hidden layers of NNs are themselves (relatively simple regression) models, it can be interesting to examine these to understand how the model is making it's predictions. Although it's rarely easy, most software will allow the coefficients for the models at these nodes to be extracted. 
With the penguins NN model there are two nodes, so we can extract the coefficients and plot the resulting two linear combinations to examine the separation between classes.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Extract hidden layer model weights\np_nn_wgts <- keras::get_weights(p_nn_model, trainable=TRUE)\np_nn_wgts \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n [,1] [,2]\n[1,] 0.6216676 1.33304155\n[2,] 0.1851478 -0.01596385\n[3,] -0.1680396 -0.30432791\n[4,] -0.8867414 -0.36627045\n\n[[2]]\n[1] 0.12708087 -0.09466381\n\n[[3]]\n [,1] [,2] [,3]\n[1,] -0.1646167 1.527644 -1.9215064\n[2,] -0.7547278 1.555889 0.3210194\n\n[[4]]\n[1] 0.4554813 -0.9371488 0.3577386\n```\n\n\n:::\n:::\n\n\n\nThe linear coefficients for the first node in the model are 0.62, 0.19, -0.17, -0.89, and the second node in the model are 1.33, -0.02, -0.3, -0.37. We can use these like we used the linear discriminants in LDA to make a 2D view of the data, where the model is separating the three species. The constants 0.13, -0.09 are not important for this. They are only useful for drawing the location of the boundaries between classes produced by the model.\n\nThese two sets of model coefficients provide linear combinations of the original variables. Together, they define a plane on which the data is projected to view the classification produced by the model. Ideally, though this plane should be defined using an orthonormal basis otherwise the shape of the data distribution might be warped. 
So we orthonormalise this matrix before computing the data projection.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Orthonormalise the weights to make 2D projection\np_nn_wgts_on <- tourr::orthonormalise(p_nn_wgts[[1]])\np_nn_wgts_on\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 0.5593355 0.7969849\n[2,] 0.1665838 -0.2145664\n[3,] -0.1511909 -0.1541475\n[4,] -0.7978314 0.5431528\n```\n\n\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Plot of the data in the linear combinations from the two nodes in the hidden layer. The three species are clearly different, although with some overlap between all three. A main issue to notice is that there isn't a big gap between Gentoo and the other species, which we know is there based on our data exploration done in other chapters. This suggests this fitted model is sub-optimal.](17-nn_files/figure-pdf/fig-hidden-layer-1.pdf){#fig-hidden-layer fig-alt='FIXME' width=80%}\n:::\n:::\n\n\n\n@fig-hidden-layer shows the data projected into the plane determined by the two linear combinations of the two nodes in the hidden layer. Training and test sets are indicated by empty and solid circles. The three species are clearly different but there is some overlap or confusion for a few penguins. The most interesting aspect to learn is that there is no big gap between the Gentoo and other species, which we know exists in the data. The model has not found this gap, and thus is likely to unfortunately and erroneously confuse some Gentoo penguins, particularly with Adelie.\n\nWhat we have shown here is a process to use the models at the nodes of the hidden layer to produce a reduced dimensional space where the classes are best separated, at least as determined by the model. The process will work in higher dimensions also. 
\n\nWhen there are more nodes in the hidden layer than the number of original variables it means that the space is extended to achieve useful classifications that need more complicated non-linear boundaries. The extra nodes describe the non-linearity. @wickham2015 provides a good illustration of this in 2D. The process of examining each of the node models can be useful for understanding this non-linear separation, also in high dimensions.\n\n## Examining predictive probabilities\n\\index{classification!predictive probabilities}\n\nWhen the predictive probabilities are returned by a model, as is done by this NN, we can use a ternary diagram for three class problems, or high-dimensional simplex when there are more classes to examine the strength of the classification. This done in the same way that was used for the votes matrix from a random forest in @sec-votes. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Predict training and test set\np_train_pred <- p_nn_model %>% \n predict(p_train_x, verbose = 0)\np_train_pred_cat <- levels(p_train$species)[\n apply(p_train_pred, 1,\n which.max)]\np_train_pred_cat <- factor(\n p_train_pred_cat,\n levels=levels(p_train$species))\ntable(p_train$species, p_train_pred_cat)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n p_train_pred_cat\n Adelie Chinstrap Gentoo\n Adelie 92 4 1\n Chinstrap 0 45 0\n Gentoo 1 0 78\n```\n\n\n:::\n\n```{.r .cell-code}\np_test_pred <- p_nn_model %>% \n predict(p_test_x, verbose = 0)\np_test_pred_cat <- levels(p_test$species)[\n apply(p_test_pred, 1, \n which.max)]\np_test_pred_cat <- factor(\n p_test_pred_cat,\n levels=levels(p_test$species))\ntable(p_test$species, p_test_pred_cat)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n p_test_pred_cat\n Adelie Chinstrap Gentoo\n Adelie 45 3 1\n Chinstrap 0 23 0\n Gentoo 1 0 39\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Ternary diagram for the three groups of the predictive probabilities of both 
training ans test sets. From what we already know about the penguins data this fit is not good. Both Chinstrap and Gentoo penguins are confused with Adelie, or at risk of it. Gentoo is very well-separated from the other two species when several variables are used, and this fitted model is blind to it. One useful finding is that there is little difference between training and test sets, so the model has not been over-fitted.](17-nn_files/figure-pdf/unnamed-chunk-17-1.pdf){width=70%}\n:::\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nIf the training and test sets look similar when plotted in the model space then the model is not suffering from over-fitting.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{If the training and test sets look similar when plotted in the model space then the model is not suffering from over-fitting. \n}\n:::\n\n## Local explanations\n\\index{classification!local explanations}\n\\index{classification!XAI}\n\nIt especially important to be able to interpret or explain a model, even more so when the model is complex or black-box'y. A good resource for learning about the range of methods is @iml. Local explanations provide some information about variables that are important for making the prediction for a particular observation. The method that we use here is Shapley value, as computed using the `kernelshap` package [@kernelshap]. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Explanations\n# https://www.r-bloggers.com/2022/08/kernel-shap/\nlibrary(kernelshap)\nlibrary(shapviz)\np_explain <- kernelshap(\n p_nn_model,\n p_train_x, \n bg_X = p_train_x,\n verbose = FALSE\n )\np_exp_sv <- shapviz(p_explain)\nsave(p_exp_sv, file=\"data/p_exp_sv.rda\")\n```\n:::\n\n\n\nA Shapley value for an observation indicates how the variable contributes to the model prediction for that observation, relative to other variables. 
It is an average, computed from the change in prediction when all combinations of presence or absence of other variables. In the computation, for each combination, the prediction is computed by substituting absent variables with their average value, like one might do when imputing missing values. \n\n@fig-shapley-pcp shows the Shapley values for Gentoo observations (both training and test sets) in the penguins data, as a parallel coordinate plot. The values for the single misclassified Gentoo penguin (in the training set) is coloured orange. Overall, the Shapley values don't vary much on `bl`, `bd` and `fl` but they do on `bm`. The effect of other variables is seems to be only important for `bm`. \n\nFor the misclassified penguin, the effect of `bm` for all combinations of other variables leads to a decline in predicted value, thus less confidence in it being a Gentoo. In contrast, for this same penguin when considering the effect of `bl` the predicted value increases on average. \n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![SHAP values focused on Gentoo class, for each variable. The one misclassified penguin (orange) has a much lower value for body mass, suggesting that this variable is used differently for the prediction than for other penguins.](17-nn_files/figure-pdf/fig-shapley-pcp-1.pdf){#fig-shapley-pcp fig-alt='FIXME' width=80%}\n:::\n:::\n\n\n\nIf we examine the data [@fig-penguins-bl-bm-bd] the explanation makes some sense. The misclassified penguin has an unusually small value on `bm`. That the SHAP value for `bm` was quite different pointed to this being a potential issue with the model, particularly for this penguin. This penguin's prediction is negatively impacted by `bm` being in the model.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Plots of the data to help understand what the SHAP values indicate. 
The misclassified Gentoo penguin has an unusually low body mass value which makes it appear to be more like an Adelie penguin, particularly when considered in relation to it's bill length.](17-nn_files/figure-pdf/fig-penguins-bl-bm-bd-1.pdf){#fig-penguins-bl-bm-bd fig-alt='FIXME' width=100%}\n:::\n:::\n\n\n\n## Examining boundaries\n\n\n@fig-penguins-lda-nn shows the boundaries for this NN model along with those of the LDA model. \n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-lda-nn-html layout-ncol=2}\n\n![LDA model](gifs/penguins_lda_boundaries.gif){#fig-lda-boundary fig-alt=\"FIX ME\" width=300}\n\n![NN model](gifs/penguins_nn_boundaries.gif){#fig-tree-boundary fig-alt=\"FIX ME\" width=300}\n\nComparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour. \n:::\n:::\n\n::: {#fig-penguins-lda-nn layout-ncol=2}\n\n![LDA model](images/fig-lda-2D-boundaries-1.png){#fig-lda-boundary2 fig-alt=\"FIX ME\" width=200}\n\n![NN model](images/penguins-nn-boundaries-1.png){#fig-nn-boundary fig-alt=\"FIX ME\" width=290}\n\nComparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour. \n:::\n\n\\index{tour!slice} \n\n## Application to a large dataset\n\nTo see how these methods apply in the setting where we have a large number of variables, observations and classes we will look at a neural network that predicts the category for the fashion MNIST data. The code for designing and fitting the model is following the tutorial available from https://tensorflow.rstudio.com/tutorials/keras/classification and you can find additional information there. Below we only replicate the steps needed to build the model from scratch. We also note that a similar investigation was presented in @li2020visualizing, with a focus on investigating the model at different epochs during the training.\n\\index{data!fashion MNIST}\n\nThe first step is to download and prepare the data. 
Here we scale the observations to range between zero and one, and we define the label names.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(keras)\n\n# download the data\nfashion_mnist <- dataset_fashion_mnist()\n\n# split into input variables and response\nc(train_images, train_labels) %<-% fashion_mnist$train\nc(test_images, test_labels) %<-% fashion_mnist$test\n\n# for interpretation we also define the category names\nclass_names = c('T-shirt/top',\n 'Trouser',\n 'Pullover',\n 'Dress',\n 'Coat',\n 'Sandal',\n 'Shirt',\n 'Sneaker',\n 'Bag',\n 'Ankle boot')\n\n# rescaling to the range (0,1)\ntrain_images <- train_images / 255\ntest_images <- test_images / 255\n```\n:::\n\n\n\nIn the next step we define the neural network and train the model. Note that because we have many observations, even a very simple structure returns a good model. And because this example is well-known, we do not need to tune the model or check the validation accuracy.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# defining the model\nmodel_fashion_mnist <- keras_model_sequential()\nmodel_fashion_mnist %>%\n # flatten the image data into a long vector\n layer_flatten(input_shape = c(28, 28)) %>%\n # hidden layer with 128 units\n layer_dense(units = 128, activation = 'relu') %>%\n # output layer for 10 categories\n layer_dense(units = 10, activation = 'softmax')\n\nmodel_fashion_mnist %>% compile(\n optimizer = 'adam',\n loss = 'sparse_categorical_crossentropy',\n metrics = c('accuracy')\n)\n\n# fitting the model, if we did not know the model yet we\n# would add a validation split to diagnose the training\nmodel_fashion_mnist %>% fit(train_images,\n train_labels,\n epochs = 5)\nsave_model_tf(model_fashion_mnist, \"data/fashion_nn\")\n```\n:::\n\n\n\nWe have defined a flat neural network with a single hidden layer with 128 nodes. To investigate the model we can start by comparing the activations to the original input data distribution. 
Since both the input space and the space of activations is large, and they are of different dimensionality, we will first use principal component analysis. This simplifies the analysis, and in general we do not need the original pixel or hidden node information for the interpretation here. The comparison is using the test-subset of the data.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# get the fitted model\nmodel_fashion_mnist <- load_model_tf(\"data/fashion_nn\")\n# observed response labels in the test set\ntest_tags <- factor(class_names[test_labels + 1],\n levels = class_names)\n\n# calculate activation for the hidden layer, this can be done\n# within the keras framework\nactivations_model_fashion <- keras_model(\n inputs = model_fashion_mnist$input,\n outputs = model_fashion_mnist$layers[[2]]$output\n)\nactivations_fashion <- predict(\n activations_model_fashion,\n test_images, verbose = 0)\n\n# PCA for activations\nactivations_pca <- prcomp(activations_fashion)\nactivations_pc <- as.data.frame(activations_pca$x)\n\n# PCA on the original data\n# we first need to flatten the image input\ntest_images_flat <- test_images\ndim(test_images_flat) <- c(nrow(test_images_flat), 784)\nimages_pca <- prcomp(as.data.frame(test_images_flat))\nimages_pc <- as.data.frame(images_pca$x)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](17-nn_files/figure-pdf/unnamed-chunk-27-1.pdf){width=80%}\n:::\n:::\n\n\n\nLooking only at the first two principal components we note some clear differences from the transformation in the hidden layer. The observations seem to be more evenly spread in the input space, while in the activations space we notice grouping along specific directions. In particular the category \"Bag\" appears to be most different from all other classes, and the non-linear transformation in the activations space shows that they are clearly different from the shoe categories, while in the input space we could note some overlap in the linear projection. 
To better identify differences between other groups we will use the tour on the first five principal components.\n\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-fashion-gt-html layout-ncol=2}\n\n![Input space](gifs/fashion_images_gt.gif){#fig-fashion-input fig-alt=\"FIX ME\" width=200}\n\n![Activations](gifs/fashion_activations_gt.gif){#fig-fashion-activation fig-alt=\"FIX ME\" width=200}\n\nComparison of the test observations in the first five principal components of the input space (left) and in the hidden layer activations (right). The activation function results in more clearly defined grouping of the different classes.\n:::\n:::\n\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-split-pdf layout-ncol=2}\n\n![Input space](images/fashion_images_gt_36.png){#fig-fashion-input fig-alt=\"FIX ME\" width=200}\n\n![Activations](images/fashion_activation_gt_126.png){#fig-fashion-activation fig-alt=\"FIX ME\" width=200}\n\nComparison of the test observations in the first five principal components of the input space (left) and in the hidden layer activations (right). The activation function results in more clearly defined grouping of the different classes.\n:::\n:::\n\nAs with the first two principal components we get a much more spread out distribution in the original space. Nevertheless we can see differences between the classes, and that some groups are varying along specific directions in that space. Overall the activations space shows tighter clusters as expected after including the ReLU activation function, but the picture is not as neat as the first two principal components would suggest. While certain groups appear very compact even in this larger subspace, others vary quite a bit within part of the space. 
For example we can clearly see the \"Bag\" observations as different from all other images, but also notice that there is a large variation within this class along certain directions.\n\nFinally we will investigate the model performance through the misclassifications and uncertainty between classes. We start with the error matrix for the test observations. To fit the error matrix we use the numeric labels, the ordering is as defined above for the labels.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfashion_test_pred <- predict(model_fashion_mnist,\n test_images, verbose = 0)\nfashion_test_pred_cat <- levels(test_tags)[\n apply(fashion_test_pred, 1,\n which.max)]\npredicted <- factor(\n fashion_test_pred_cat,\n levels=levels(test_tags)) %>%\n as.numeric() - 1\nobserved <- as.numeric(test_tags) -1\ntable(observed, predicted)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n predicted\nobserved 0 1 2 3 4 5 6 7 8 9\n 0 128 0 84 719 10 0 51 7 1 0\n 1 0 42 15 941 0 0 0 2 0 0\n 2 6 0 824 38 121 0 3 8 0 0\n 3 1 0 14 976 8 0 1 0 0 0\n 4 1 0 205 181 605 0 0 8 0 0\n 5 1 0 0 2 0 77 0 902 1 17\n 6 24 0 231 282 405 0 44 12 2 0\n 7 0 0 0 0 0 0 0 1000 0 0\n 8 17 1 78 102 49 0 5 730 16 2\n 9 0 0 0 2 0 0 0 947 0 51\n```\n\n\n:::\n:::\n\n\n\nFrom this we see that the model mainly confuses certain categories with each other, and within expected groups (e.g. different types of shoes can be confused with each other, or different types of shirts). We can further investigate this by visualizing the full probability matrix for the test observations, to see which categories the model is uncertain about.\n\n\n\n::: {.cell}\n\n:::\n\n\n\nFor this data using explainers like SHAP is not so interesting, since the individual pixel contribution to a prediction are typically not of interest. With image classification a next step might be to further investigate which part of the image is important for a prediction, and this can be visualized as a heat map placed over the original image. 
This is especially interesting in the case of difficult or misclassified images. This however is beyond the scope of this book.\n\n\n\n\n## Exercises {-}\n\n1. The problem with the NN model fitted to the penguins is that the Gentoo are poorly classified, when they should be perfectly predictable due to the big gap between class clusters. Re-fit the NN to the penguins data, to find a better model that appropriately perfectly predicts Gentoo penguins. Support this by plotting the model (using the hidden layer), and the predictive probabilities as a ternary plot. Do the SHAP values also support that `bd` plays a stronger role in your best model? (`bd` is the main variable for distinguishing Gentoo's from the other species, particularly when used with `fl` or `bl`.)\n2. For the fashion MNIST data we have seen that certain categories are more likely to be confused with each other. Select a subset of the data including only the categories Ankle boot, Sneaker and Sandal and see if you can reproduce the analysis of the penguins data in this chapter with this subset.\n3. XXX fake trees, can we think about the number of nodes and make it work with a simple NN similar to penguins data?\n4. The sketches data could also be considered a classic image classification problem, and we have seen that we can get a reasonable accuracy with a random forest model. Because we only have a smaller number of observations (compared to the fashion MNIST data) when fitting a neural network we need to be very careful not to overfit the training data. Try fitting a flat neural network (similar to what we did for the fashion MNIST data) and check the test accuracy of the model.\n5. Challenge: try to design a more accurate neural network for the sketches data. Here you can investigate using a convolutional neural network in combination with data augmentation. 
In addition, using batch normalization should improve the model performance.\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n", + "markdown": "# Neural networks and deep learning\n\n\\index{classification!neural networks}\n\nNeural networks (NN) can be considered to be nested additive (or even ensemble) models where explanatory variables are combined, and transformed through an activation function like a logistic. These transformed combinations are added recursively to yield class predictions. They are considered to be black box models, but there is a growing demand for interpretability. Although interpretability is possible, it can be unappealing to understand a complex model constructed to tackle a difficult classification task. Nevertheless, this is the motivation for the visualisation methods for NN models explained in this chapter.\n\nIn the simplest form, we might write the equation for a NN as\n\n$$\n\\hat{y} = f(x) = a_0+\\sum_{h=1}^{s}\nw_{0h}\\phi(a_h+\\sum_{i=1}^{p} w_{ih}x_i)\n$$ where $s$ indicates the number of nodes in the hidden (middle) layer, and $\\phi$ is a choice of activation function. In a simple situation where $p=3$, $s=2$, and a linear output layer is used, the model could be written as:\n\n$$\n\\begin{aligned}\n\\hat{y} = a_0+ & w_{01}\\phi(a_1+w_{11}x_1+w_{21}x_2+w_{31}x_3) +\\\\\n & w_{02}\\phi(a_2+w_{12}x_1+w_{22}x_2+w_{32}x_3)\n\\end{aligned}\n$$ which is a combination of two (linear) models, each of which could be examined for their role in making predictions.\n\nIn practice, a model may have many nodes, and several hidden layers, a variety of activation functions, and regularisation modifications. One should keep in mind that the principle of parsimony is important when applying NNs, because it is tempting to make an overly complex, and thus over-parameterised, construction. Fitting NNs is still problematic. One would hope that fitting produces a stable result, with the same parameter estimates returned whatever the starting seed. 
However, this is not the case, and different, sometimes radically different, results are routinely obtained after each attempted fit [@wickham2015].\n\nFor these examples we use the software `keras` [@keras] following the installation and tutorial details at . Because it is an interface to Python it can be tricky to install. If this is a problem, it should be possible to convert the example code to use `nnet` [@VR02] or `neuralnet` [@neuralnet]. We will use the penguins data to illustrate the fitting, because it makes it easier to understand the procedures and the fit. However, an NN is like using a jackhammer instead of a trowel to plant a seedling, more complicated than necessary to build a good classification model for this data.\n\n## Setting up the model\n\n\\index{classification!ANN architecture}\n\nA first step is to decide how many nodes the NN architecture should have, and what activation function should be used. To make these decisions, ideally you already have some knowledge of the shapes of class clusters. For the penguins classification, we have seen that it contains three elliptically shaped clusters of roughly the same size. This suggests that two nodes in the hidden layer would be sufficient to separate three clusters (@fig-nn-diagram). Because the shapes of the clusters are convex, using the piecewise-linear \"relu\" activation will also be sufficient. 
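As a small illustration (a sketch, not part of the original text): "relu" is piecewise linear, passing positive inputs through unchanged and mapping everything else to zero, so each hidden node contributes a piecewise-linear term to the decision boundary.

```r
# relu keeps positive values and maps the rest to zero, so each
# hidden node contributes a piecewise-linear term.
relu <- function(x) pmax(0, x)
relu(c(-2, -0.5, 0, 1, 3))  # 0 0 0 1 3
```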
The model specification is as follows:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(keras)\ntensorflow::set_random_seed(211)\n\n# Define model\np_nn_model <- keras_model_sequential()\np_nn_model %>% \n layer_dense(units = 2, activation = 'relu', \n input_shape = 4) %>% \n layer_dense(units = 3, activation = 'softmax')\np_nn_model %>% summary\n\nloss_fn <- loss_sparse_categorical_crossentropy(\n from_logits = TRUE)\n\np_nn_model %>% compile(\n optimizer = \"adam\",\n loss = loss_fn,\n metrics = c('accuracy')\n)\n```\n:::\n\n\n\nNote that `tensorflow::set_random_seed(211)` sets the seed for the model fitting so that we can obtain the same result to discuss later. It needs to be set before the model is defined in the code. The model will also be saved in order to diagnose and make predictions.\n\n![Network architecture for the model on the penguins data. The round nodes indicate original or transformed variables, and each arrow connecting these is represented as one of the weights $w_{ih}$ in the definition. The boxes indicate the additive constant entering the nodes, and the corresponding arrows represent the terms $a_h$.](images/nn-diagram.png){#fig-nn-diagram align=\"center\"}\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Checking the training/test split\n\n\\index{classification!training/test split}\n\nSplitting the data into training and test sets is an essential way to protect against overfitting, for most classifiers, but especially so for the copiously parameterised NNs. The model specified for the penguins data with only two nodes is unlikely to be overfitted, but it is nevertheless good practice to use a training set for building and a test set for evaluation.\n\n@fig-p-split-pdf shows the tour being used to examine the split into training and test samples for the penguins data. Using random sampling, particularly stratified by group, should result in the two sets being very similar, as can be seen here. 
It does happen that several observations in the test set are on the extremes of their class cluster, so it could be that the model makes errors in the neighbourhoods of these points.\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-split-html layout-ncol=\"2\"}\n![Grand tour](gifs/p_split.gif){#fig-split-grand fig-alt=\"FIX ME\" width=\"300\"}\n\n![Guided tour](gifs/p_split_guided.gif){#fig-split-guided fig-alt=\"FIX ME\" width=\"300\"}\n\nEvaluating the training/test split, where we expect that the two samples should roughly match. There are a few observations in the test set that are on the outer edges of the clusters, which will likely result in the model making an error in these regions. However, the two samples roughly match.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-split-pdf layout-ncol=\"2\"}\n![Grand tour](images/p_split.png){fig-alt=\"FIX ME\" width=\"220\"}\n\n![Guided tour](images/p_split_guided.png){fig-alt=\"FIX ME\" width=\"220\"}\n\nEvaluating the training/test split, where we expect that the two samples should roughly match. There are a few observations in the test set that are on the outer edges of the clusters, which will likely result in the model making an error in these regions. However, the two samples roughly match.\n:::\n:::\n\n## Fit the model\n\n\\index{classification!Fitting a NN}\n\nThe data needs to be specially formatted for the model fitted using `keras`. 
The explanatory variables need to be provided as a `matrix`, and the categorical response needs to be separate, and specified as a `numeric` variable, beginning with 0.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Data needs to be matrix, and response needs to be numeric\np_train_x <- p_train %>%\n select(bl:bm) %>%\n as.matrix()\np_train_y <- p_train %>% pull(species) %>% as.numeric() \np_train_y <- p_train_y-1 # Needs to be 0, 1, 2\np_test_x <- p_test %>%\n select(bl:bm) %>%\n as.matrix()\np_test_y <- p_test %>% pull(species) %>% as.numeric() \np_test_y <- p_test_y-1 # Needs to be 0, 1, 2\n```\n:::\n\n\n\nThe specified model is reasonably simple, four input variables, two nodes in the hidden layer and a three column binary matrix for output. This corresponds to 5+5+3+3+3=19 parameters.\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nModel: \"sequential\"\n______________________________________________________________\n Layer (type) Output Shape Param # \n==============================================================\n dense_1 (Dense) (None, 2) 10 \n dense (Dense) (None, 3) 9 \n==============================================================\nTotal params: 19 (76.00 Byte)\nTrainable params: 19 (76.00 Byte)\nNon-trainable params: 0 (0.00 Byte)\n______________________________________________________________\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit model\np_nn_fit <- p_nn_model %>% keras::fit(\n x = p_train_x, \n y = p_train_y,\n epochs = 200,\n verbose = 0\n)\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\nBecause we set the random number seed we will get the same fit each time the code provided here is run. However, if the model is re-fit without setting the seed, you will see that there is a surprising amount of variability in the fits. Setting `epochs = 200` helps to usually get a good fit. One expects that `keras` is reasonably stable so one would not expect the huge array of fits as observed in @wickham2015 using `nnet`. 
That this can happen with the simple model used here reinforces the notion that fitting of NN models is fiddly, and great care needs to be taken to validate and diagnose the fit.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nFitting NN models is fiddly, and very different fitted models can result from restarts, parameter choices, and architecture.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n```{=tex}\n\\infobox{Fitting NN models is fiddly, and very different fitted models can result from restarts, parameter choices, and architecture. \n}\n```\n\n\n:::\n\n\n\n::: {.cell}\n\n:::\n\n\n\nThe fitted model that we have chosen as the final one has reasonably small loss and high accuracy. Loss and accuracy across epochs can be plotted to show the change during fitting, but we don't show them here, because they are generally not very interesting.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\np_nn_model %>% evaluate(p_test_x, p_test_y, verbose = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n loss accuracy \n0.2563850 0.9553571 \n```\n\n\n:::\n:::\n\n\n\nThe model object can be saved for later use with:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsave_model_tf(p_nn_model, \"data/penguins_cnn\")\n```\n:::\n\n\n\n## Extracting model components\n\n\\index{classification!hidden layers}\n\n::: {.content-visible when-format=\"html\"}\n::: info\nView the individual node models to understand how they combine to produce the overall model.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n```{=tex}\n\\infobox{View the individual node models to understand how they combine to produce the overall model.\n}\n```\n\n\n:::\n\nBecause nodes in the hidden layers of NNs are themselves (relatively simple regression) models, it can be interesting to examine these to understand how the model is making its predictions. Although it's rarely easy, most software will allow the coefficients for the models at these nodes to be extracted. 
With the penguins NN model there are two nodes, so we can extract the coefficients and plot the resulting two linear combinations to examine the separation between classes.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Extract hidden layer model weights\np_nn_wgts <- keras::get_weights(p_nn_model, trainable=TRUE)\np_nn_wgts \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n [,1] [,2]\n[1,] 0.6216676 1.33304155\n[2,] 0.1851478 -0.01596385\n[3,] -0.1680396 -0.30432791\n[4,] -0.8867414 -0.36627045\n\n[[2]]\n[1] 0.12708087 -0.09466381\n\n[[3]]\n [,1] [,2] [,3]\n[1,] -0.1646167 1.527644 -1.9215064\n[2,] -0.7547278 1.555889 0.3210194\n\n[[4]]\n[1] 0.4554813 -0.9371488 0.3577386\n```\n\n\n:::\n:::\n\n\n\nThe linear coefficients for the first node in the model are 0.62, 0.19, -0.17, -0.89, and for the second node are 1.33, -0.02, -0.3, -0.37. We can use these like we used the linear discriminants in LDA to make a 2D view of the data, where the model is separating the three species. The constants 0.13, -0.09 are not important for this. They are only useful for drawing the location of the boundaries between classes produced by the model.\n\nThese two sets of model coefficients provide linear combinations of the original variables. Together, they define a plane on which the data is projected to view the classification produced by the model. Ideally, though, this plane should be defined using an orthonormal basis; otherwise the shape of the data distribution might be warped. 
So we orthonormalise this matrix before computing the data projection.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Orthonormalise the weights to make 2D projection\np_nn_wgts_on <- tourr::orthonormalise(p_nn_wgts[[1]])\np_nn_wgts_on\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 0.5593355 0.7969849\n[2,] 0.1665838 -0.2145664\n[3,] -0.1511909 -0.1541475\n[4,] -0.7978314 0.5431528\n```\n\n\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Plot of the data in the linear combinations from the two nodes in the hidden layer. The three species are clearly different, although with some overlap between all three. A main issue to notice is that there isn't a big gap between Gentoo and the other species, which we know is there based on our data exploration done in other chapters. This suggests this fitted model is sub-optimal.](17-nn_files/figure-pdf/fig-hidden-layer-1.pdf){#fig-hidden-layer fig-alt='FIXME' width=80%}\n:::\n:::\n\n\n\n@fig-hidden-layer shows the data projected into the plane determined by the two linear combinations of the two nodes in the hidden layer. Training and test sets are indicated by empty and solid circles. The three species are clearly different but there is some overlap or confusion for a few penguins. The most interesting aspect to learn is that there is no big gap between the Gentoo and other species, which we know exists in the data. The model has not found this gap, and thus is likely to unfortunately and erroneously confuse some Gentoo penguins, particularly with Adelie.\n\nWhat we have shown here is a process to use the models at the nodes of the hidden layer to produce a reduced dimensional space where the classes are best separated, at least as determined by the model. 
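The projection step itself is simple linear algebra; here is a minimal base-R sketch with random stand-ins for the weights and data (the analysis above uses `tourr::orthonormalise()` on the actual fitted weights; `qr.Q()` gives an equivalent orthonormal basis up to signs):

```r
# Orthonormalise a p x s weight matrix and project n x p data onto
# the resulting plane; W and X are random stand-ins, not the model.
set.seed(1)
W <- matrix(rnorm(8), nrow = 4)     # 4 variables, 2 nodes
W_on <- qr.Q(qr(W))                 # orthonormal basis of columns
X <- matrix(rnorm(40), nrow = 10)   # 10 observations
proj <- X %*% W_on                  # 10 x 2 projection
crossprod(W_on)                     # ~ 2 x 2 identity matrix
```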
The process will work in higher dimensions also.\n\nWhen there are more nodes in the hidden layer than the number of original variables, it means that the space is extended to achieve useful classifications that need more complicated non-linear boundaries. The extra nodes describe the non-linearity. @wickham2015 provides a good illustration of this in 2D. The process of examining each of the node models can be useful for understanding this non-linear separation, also in high dimensions.\n\n## Examining predictive probabilities\n\n\\index{classification!predictive probabilities}\n\nWhen the predictive probabilities are returned by a model, as is done by this NN, we can use a ternary diagram for three class problems, or a high-dimensional simplex when there are more classes, to examine the strength of the classification. This is done in the same way that was used for the votes matrix from a random forest in @sec-votes.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Predict training and test set\np_train_pred <- p_nn_model %>% \n predict(p_train_x, verbose = 0)\np_train_pred_cat <- levels(p_train$species)[\n apply(p_train_pred, 1,\n which.max)]\np_train_pred_cat <- factor(\n p_train_pred_cat,\n levels=levels(p_train$species))\ntable(p_train$species, p_train_pred_cat)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n p_train_pred_cat\n Adelie Chinstrap Gentoo\n Adelie 92 4 1\n Chinstrap 0 45 0\n Gentoo 1 0 78\n```\n\n\n:::\n\n```{.r .cell-code}\np_test_pred <- p_nn_model %>% \n predict(p_test_x, verbose = 0)\np_test_pred_cat <- levels(p_test$species)[\n apply(p_test_pred, 1, \n which.max)]\np_test_pred_cat <- factor(\n p_test_pred_cat,\n levels=levels(p_test$species))\ntable(p_test$species, p_test_pred_cat)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n p_test_pred_cat\n Adelie Chinstrap Gentoo\n Adelie 45 3 1\n Chinstrap 0 23 0\n Gentoo 1 0 39\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Ternary diagram for the three 
groups of the predictive probabilities of both training and test sets. From what we already know about the penguins data this fit is not good. Both Chinstrap and Gentoo penguins are confused with Adelie, or at risk of it. Gentoo is very well-separated from the other two species when several variables are used, and this fitted model is blind to it. One useful finding is that there is little difference between training and test sets, so the model has not been over-fitted.](17-nn_files/figure-pdf/unnamed-chunk-17-1.pdf){width=70%}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nIf the training and test sets look similar when plotted in the model space then the model is not suffering from over-fitting.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n```{=tex}\n\\infobox{If the training and test sets look similar when plotted in the model space then the model is not suffering from over-fitting. \n}\n```\n\n\n:::\n\n## Local explanations\n\n\\index{classification!local explanations} \\index{classification!XAI}\n\nIt is especially important to be able to interpret or explain a model, even more so when the model is complex or black-box. A good resource for learning about the range of methods is @iml. Local explanations provide some information about variables that are important for making the prediction for a particular observation. The method that we use here is the Shapley value, as computed using the `kernelshap` package [@kernelshap].\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Explanations\n# https://www.r-bloggers.com/2022/08/kernel-shap/\nlibrary(kernelshap)\nlibrary(shapviz)\np_explain <- kernelshap(\n p_nn_model,\n p_train_x, \n bg_X = p_train_x,\n verbose = FALSE\n )\np_exp_sv <- shapviz(p_explain)\nsave(p_exp_sv, file=\"data/p_exp_sv.rda\")\n```\n:::\n\n\n\nA Shapley value for an observation indicates how the variable contributes to the model prediction for that observation, relative to other variables. 
It is an average, computed from the change in prediction when all combinations of presence or absence of the other variables are considered. In the computation, for each combination, the prediction is computed by substituting absent variables with their average value, like one might do when imputing missing values.\n\n@fig-shapley-pcp shows the Shapley values for Gentoo observations (both training and test sets) in the penguins data, as a parallel coordinate plot. The values for the single misclassified Gentoo penguin (in the training set) are coloured orange. Overall, the Shapley values don't vary much on `bl`, `bd` and `fl` but they do on `bm`. The effect of the other variables seems to matter only for `bm`.\n\nFor the misclassified penguin, the effect of `bm` for all combinations of other variables leads to a decline in predicted value, thus less confidence in it being a Gentoo. In contrast, for this same penguin when considering the effect of `bl` the predicted value increases on average.\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![SHAP values focused on Gentoo class, for each variable. The one misclassified penguin (orange) has a much lower value for body mass, suggesting that this variable is used differently for the prediction than for other penguins.](17-nn_files/figure-pdf/fig-shapley-pcp-1.pdf){#fig-shapley-pcp fig-alt='FIXME' width=80%}\n:::\n:::\n\n\n\nIf we examine the data (@fig-penguins-bl-bm-bd) the explanation makes some sense. The misclassified penguin has an unusually small value on `bm`. That the SHAP value for `bm` was quite different pointed to this being a potential issue with the model, particularly for this penguin. This penguin's prediction is negatively impacted by `bm` being in the model.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Plots of the data to help understand what the SHAP values indicate. 
The misclassified Gentoo penguin has an unusually low body mass value, which makes it appear to be more like an Adelie penguin, particularly when considered in relation to its bill length.](17-nn_files/figure-pdf/fig-penguins-bl-bm-bd-1.pdf){#fig-penguins-bl-bm-bd fig-alt='FIXME' width=100%}\n:::\n:::\n\n\n\n## Examining boundaries\n\n\n\n@fig-penguins-lda-nn shows the boundaries for this NN model along with those of the LDA model.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-lda-nn-html layout-ncol=\"2\"}\n![LDA model](gifs/penguins_lda_boundaries.gif){#fig-lda-boundary fig-alt=\"FIX ME\" width=\"300\"}\n\n![NN model](gifs/penguins_nn_boundaries.gif){#fig-tree-boundary fig-alt=\"FIX ME\" width=\"300\"}\n\nComparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour.\n:::\n:::\n\n::: {#fig-penguins-lda-nn layout-ncol=\"2\"}\n![LDA model](images/fig-lda-2D-boundaries-1.png){#fig-lda-boundary2 fig-alt=\"FIX ME\" width=\"200\"}\n\n![NN model](images/penguins-nn-boundaries-1.png){#fig-nn-boundary fig-alt=\"FIX ME\" width=\"290\"}\n\nComparison of the boundaries produced by the LDA (a) and the NN (b) model, using a slice tour.\n:::\n\n\\index{tour!slice}\n\n## Application to a large dataset\n\nTo see how these methods apply in the setting where we have a large number of variables, observations and classes, we will look at a neural network that predicts the category for the fashion MNIST data. The code for designing and fitting the model follows the tutorial available at https://tensorflow.rstudio.com/tutorials/keras/classification, where you can find additional information. Below we only replicate the steps needed to build the model from scratch. We also note that a similar investigation was presented in @li2020visualizing, with a focus on investigating the model at different epochs during the training. 
\\index{data!fashion MNIST}\n\nThe first step is to download and prepare the data. Here we scale the observations to range between zero and one, and we define the label names.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(keras)\n\n# download the data\nfashion_mnist <- dataset_fashion_mnist()\n\n# split into input variables and response\nc(train_images, train_labels) %<-% fashion_mnist$train\nc(test_images, test_labels) %<-% fashion_mnist$test\n\n# for interpretation we also define the category names\nclass_names = c('T-shirt/top',\n 'Trouser',\n 'Pullover',\n 'Dress',\n 'Coat',\n 'Sandal',\n 'Shirt',\n 'Sneaker',\n 'Bag',\n 'Ankle boot')\n\n# rescaling to the range (0,1)\ntrain_images <- train_images / 255\ntest_images <- test_images / 255\n```\n:::\n\n\n\nIn the next step we define the neural network and train the model. Note that because we have many observations, even a very simple structure returns a good model. And because this example is well-known, we do not need to tune the model or check the validation accuracy.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# defining the model\nmodel_fashion_mnist <- keras_model_sequential()\nmodel_fashion_mnist %>%\n # flatten the image data into a long vector\n layer_flatten(input_shape = c(28, 28)) %>%\n # hidden layer with 128 units\n layer_dense(units = 128, activation = 'relu') %>%\n # output layer for 10 categories\n layer_dense(units = 10, activation = 'softmax')\n\nmodel_fashion_mnist %>% compile(\n optimizer = 'adam',\n loss = 'sparse_categorical_crossentropy',\n metrics = c('accuracy')\n)\n\n# fitting the model, if we did not know the model yet we\n# would add a validation split to diagnose the training\nmodel_fashion_mnist %>% fit(train_images,\n train_labels,\n epochs = 5)\nsave_model_tf(model_fashion_mnist, \"data/fashion_nn\")\n```\n:::\n\n\n\nWe have defined a flat neural network with a single hidden layer with 128 nodes. 
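As a sanity check on the size of this model, the parameter count can be worked out by hand: the flatten layer has no parameters, the hidden layer maps 784 inputs to 128 units, and the output layer maps 128 units to 10 classes.

```r
# Weights plus biases for each dense layer of the flat network.
params <- c(hidden = 784 * 128 + 128,
            output = 128 * 10 + 10)
sum(params)  # 101770 parameters in total
```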
To investigate the model we can start by comparing the activations to the original input data distribution. Since the input space and the space of activations are both large, and of different dimensionality, we will first use principal component analysis. This simplifies the analysis, and in general we do not need the original pixel or hidden node information for the interpretation here. The comparison uses the test subset of the data.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# get the fitted model\nmodel_fashion_mnist <- load_model_tf(\"data/fashion_nn\")\n# observed response labels in the test set\ntest_tags <- factor(class_names[test_labels + 1],\n levels = class_names)\n\n# calculate activation for the hidden layer, this can be done\n# within the keras framework\nactivations_model_fashion <- keras_model(\n inputs = model_fashion_mnist$input,\n outputs = model_fashion_mnist$layers[[2]]$output\n)\nactivations_fashion <- predict(\n activations_model_fashion,\n test_images, verbose = 0)\n\n# PCA for activations\nactivations_pca <- prcomp(activations_fashion)\nactivations_pc <- as.data.frame(activations_pca$x)\n\n# PCA on the original data\n# we first need to flatten the image input\ntest_images_flat <- test_images\ndim(test_images_flat) <- c(nrow(test_images_flat), 784)\nimages_pca <- prcomp(as.data.frame(test_images_flat))\nimages_pc <- as.data.frame(images_pca$x)\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in get_plot_component(plot, \"guide-box\"): Multiple\ncomponents found; returning the first one. To return all, use\n`return_all = TRUE`.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](17-nn_files/figure-pdf/unnamed-chunk-27-1.png){width=80%}\n:::\n:::\n\n\n\nLooking only at the first two principal components we note some clear differences from the transformation in the hidden layer. 
The observations seem to be more evenly spread in the input space, while in the activations space we notice grouping along specific directions. In particular the category \"Bag\" appears to be most different from all other classes, and the non-linear transformation in the activations space shows that they are clearly different from the shoe categories, while in the input space we could note some overlap in the linear projection. To better identify differences between other groups we will use the tour on the first five principal components.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-fashion-gt-html layout-ncol=\"2\"}\n![Input space](gifs/fashion_images_gt.gif){#fig-fashion-input fig-alt=\"FIX ME\" width=\"200\"}\n\n![Activations](gifs/fashion_activations_gt.gif){#fig-fashion-activation fig-alt=\"FIX ME\" width=\"200\"}\n\nComparison of the test observations in the first five principal components of the input space (left) and in the hidden layer activations (right). The activation function results in more clearly defined grouping of the different classes.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {layout-ncol=\"2\"}\n![Input space](images/fashion_images_gt_36.png){fig-alt=\"FIX ME\" width=\"200\"}\n\n![Activations](images/fashion_activation_gt_126.png){fig-alt=\"FIX ME\" width=\"200\"}\n\nComparison of the test observations in the first five principal components of the input space (left) and in the hidden layer activations (right). The activation function results in more clearly defined grouping of the different classes.\n:::\n:::\n\nAs with the first two principal components we get a much more spread out distribution in the original space. Nevertheless we can see differences between the classes, and that some groups are varying along specific directions in that space. 
Overall the activations space shows tighter clusters as expected after including the ReLU activation function, but the picture is not as neat as the first two principal components would suggest. While certain groups appear very compact even in this larger subspace, others vary quite a bit within part of the space. For example we can clearly see the \"Bag\" observations as different from all other images, but also notice that there is a large variation within this class along certain directions.\n\nFinally we will investigate the model performance through the misclassifications and uncertainty between classes. We start with the error matrix for the test observations. To fit the error matrix we use the numeric labels, the ordering is as defined above for the labels.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfashion_test_pred <- predict(model_fashion_mnist,\n test_images, verbose = 0)\nfashion_test_pred_cat <- levels(test_tags)[\n apply(fashion_test_pred, 1,\n which.max)]\npredicted <- factor(\n fashion_test_pred_cat,\n levels=levels(test_tags)) %>%\n as.numeric() - 1\nobserved <- as.numeric(test_tags) -1\ntable(observed, predicted)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n predicted\nobserved 0 1 2 3 4 5 6 7 8 9\n 0 128 0 84 719 10 0 51 7 1 0\n 1 0 42 15 941 0 0 0 2 0 0\n 2 6 0 824 38 121 0 3 8 0 0\n 3 1 0 14 976 8 0 1 0 0 0\n 4 1 0 205 181 605 0 0 8 0 0\n 5 1 0 0 2 0 77 0 902 1 17\n 6 24 0 231 282 405 0 44 12 2 0\n 7 0 0 0 0 0 0 0 1000 0 0\n 8 17 1 78 102 49 0 5 730 16 2\n 9 0 0 0 2 0 0 0 947 0 51\n```\n\n\n:::\n:::\n\n\n\nHere the labels are used as 0 - T-shirt/top, 1 - Trouser, 2 - Pullover, 3 - Dress, 4 - Coat, 5 - Sandal, 6 - Shirt, 7 - Sneaker, 8 - Bag, 9 - Ankle boot.\nFrom this we see that the model mainly confuses certain categories with each other, and within expected groups (e.g. different types of shoes can be confused with each other, or different types of shirts). 
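As an aside, per-class accuracy can be read directly off such an error matrix by dividing the diagonal by the row totals; a small sketch with made-up numbers (rows observed, columns predicted):

```r
# Per-class accuracy from a confusion matrix; values are made up,
# not taken from the fitted fashion MNIST model.
cm <- matrix(c(90, 10,
               25, 75), nrow = 2, byrow = TRUE)
diag(cm) / rowSums(cm)  # 0.90 0.75
```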
We can further investigate this by visualizing the full probability matrix for the test observations, to see which categories the model is uncertain about.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-fashion-conf-gt-html}\n![Input space](gifs/fashion_confusion_gt.gif){#fig-fashion-confusion fig-alt=\"FIX ME\" width=\"400\"}\n\nA tour of the confusion matrix for the fashion MNIST test observations, focusing on a subset of items. Often observations get confused between two of the classes; this appears as points falling along one of the edges, for example some Shirts look more like T-shirts/tops, while others get confused with Coats. We can also notice that a subset of three other classes, not mapped to colors, appears as very separate from this group.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-fashion-confusion-split-pdf layout-ncol=\"3\"}\n![](images/fashion_confustion_gt_36.png){fig-alt=\"FIX ME\" width=\"130\"}\n\n![](images/fashion_confusion_gt_58.png){fig-alt=\"FIX ME\" width=\"130\"}\n\n![](images/fashion_confusion_gt_69.png){fig-alt=\"FIX ME\" width=\"130\"}\n\n![](images/fashion_confusion_gt_161.png){fig-alt=\"FIX ME\" width=\"130\"}\n\n![](images/fashion_confusion_gt_212.png){fig-alt=\"FIX ME\" width=\"130\"}\n\n![](images/fashion_confusion_gt_333.png){fig-alt=\"FIX ME\" width=\"130\"}\n\nA tour of the confusion matrix for the fashion MNIST test observations, focusing on a subset of items. Often observations get confused between two of the classes; this appears as points falling along one of the edges, for example some Shirts look more like T-shirts/tops, while others get confused with Coats. We can also notice that a subset of three other classes, not mapped to colors, appears as very separate from this group.\n:::\n:::\n\nThe tour of the class probabilities shows that the model is often confused between two classes; this appears as points falling along one edge in the simplex. 
In particular, for the highlighted categories we can notice some interesting patterns, where pairs of classes get confused with each other. We also see some three-way confusions; these are observations that fall on a triangular face defined by three corners of the simplex, for example between Pullover, Shirt and Coat.\n\nFor this data using explainers like SHAP is not so interesting, since the individual pixel contributions to a prediction are typically not of interest. With image classification a next step might be to further investigate which part of the image is important for a prediction, and this can be visualized as a heat map placed over the original image. This is especially interesting in the case of difficult or misclassified images. This, however, is beyond the scope of this book.\n\n\n\n```{=html}\n\n```\n\n\n## Exercises {.unnumbered}\n\n1. The problem with the NN model fitted to the penguins is that the Gentoo are poorly classified, when they should be perfectly predictable due to the big gap between class clusters. Re-fit the NN to the penguins data, to find a better model that perfectly predicts Gentoo penguins. Support this by plotting the model (using the hidden layer), and the predictive probabilities as a ternary plot. Do the SHAP values also support that `bd` plays a stronger role in your best model? (`bd` is the main variable for distinguishing Gentoos from the other species, particularly when used with `fl` or `bl`.)\n2. For the fashion MNIST data we have seen that certain categories are more likely to be confused with each other. Select a subset of the data including only the categories Ankle boot, Sneaker and Sandal and see if you can reproduce the analysis of the penguins data in this chapter with this subset.\n3. Can you fit a neural network that can predict the class in the fake tree data? Because the data is noisy and we do not have that many observations, it can be easy to overfit the data. 
Once you find a setting that works, think about what aspects of the model might be interesting for visualization. What comparisons with a random forest model could be of interest?\n4. The sketches data could also be considered a classic image classification problem, and we have seen that we can get a reasonable accuracy with a random forest model. Because we only have a smaller number of observations (compared to the fashion MNIST data) when fitting a neural network we need to be very careful not to overfit the training data. Try fitting a flat neural network (similar to what we did for the fashion MNIST data) and check the test accuracy of the model.\n5. Challenge: try to design a more accurate neural network for the sketches data. Here you can investigate using a convolutional neural network in combination with data augmentation. In addition, using batch normalization should improve the model performance.\n\n\n\n::: {.cell}\n\n:::\n", "supporting": [ "17-nn_files/figure-pdf" ], diff --git a/_freeze/17-nn/figure-html/fig-hidden-layer-1.png b/_freeze/17-nn/figure-html/fig-hidden-layer-1.png new file mode 100644 index 0000000..c8dab41 Binary files /dev/null and b/_freeze/17-nn/figure-html/fig-hidden-layer-1.png differ diff --git a/_freeze/17-nn/figure-html/fig-penguins-bl-bm-bd-1.png b/_freeze/17-nn/figure-html/fig-penguins-bl-bm-bd-1.png new file mode 100644 index 0000000..c6da286 Binary files /dev/null and b/_freeze/17-nn/figure-html/fig-penguins-bl-bm-bd-1.png differ diff --git a/_freeze/17-nn/figure-html/fig-shapley-pcp-1.png b/_freeze/17-nn/figure-html/fig-shapley-pcp-1.png new file mode 100644 index 0000000..ed2cf80 Binary files /dev/null and b/_freeze/17-nn/figure-html/fig-shapley-pcp-1.png differ diff --git a/_freeze/17-nn/figure-html/unnamed-chunk-17-1.png b/_freeze/17-nn/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 0000000..6e42447 Binary files /dev/null and b/_freeze/17-nn/figure-html/unnamed-chunk-17-1.png differ diff --git 
a/_freeze/17-nn/figure-html/unnamed-chunk-27-1.png b/_freeze/17-nn/figure-html/unnamed-chunk-27-1.png new file mode 100644 index 0000000..311d8cb Binary files /dev/null and b/_freeze/17-nn/figure-html/unnamed-chunk-27-1.png differ diff --git a/_freeze/17-nn/figure-pdf/fig-hidden-layer-1.pdf b/_freeze/17-nn/figure-pdf/fig-hidden-layer-1.pdf index d82f686..dad4f1b 100644 Binary files a/_freeze/17-nn/figure-pdf/fig-hidden-layer-1.pdf and b/_freeze/17-nn/figure-pdf/fig-hidden-layer-1.pdf differ diff --git a/_freeze/17-nn/figure-pdf/fig-penguins-bl-bm-bd-1.pdf b/_freeze/17-nn/figure-pdf/fig-penguins-bl-bm-bd-1.pdf new file mode 100644 index 0000000..90aa639 Binary files /dev/null and b/_freeze/17-nn/figure-pdf/fig-penguins-bl-bm-bd-1.pdf differ diff --git a/_freeze/17-nn/figure-pdf/fig-shapley-pcp-1.pdf b/_freeze/17-nn/figure-pdf/fig-shapley-pcp-1.pdf new file mode 100644 index 0000000..729e762 Binary files /dev/null and b/_freeze/17-nn/figure-pdf/fig-shapley-pcp-1.pdf differ diff --git a/_freeze/17-nn/figure-pdf/unnamed-chunk-17-1.pdf b/_freeze/17-nn/figure-pdf/unnamed-chunk-17-1.pdf index b0a1a1e..a514ee4 100644 Binary files a/_freeze/17-nn/figure-pdf/unnamed-chunk-17-1.pdf and b/_freeze/17-nn/figure-pdf/unnamed-chunk-17-1.pdf differ diff --git a/_freeze/17-nn/figure-pdf/unnamed-chunk-26-1.pdf b/_freeze/17-nn/figure-pdf/unnamed-chunk-26-1.pdf new file mode 100644 index 0000000..8efdeb9 Binary files /dev/null and b/_freeze/17-nn/figure-pdf/unnamed-chunk-26-1.pdf differ diff --git a/_freeze/17-nn/figure-pdf/unnamed-chunk-27-1.pdf b/_freeze/17-nn/figure-pdf/unnamed-chunk-27-1.pdf new file mode 100644 index 0000000..bab5cec Binary files /dev/null and b/_freeze/17-nn/figure-pdf/unnamed-chunk-27-1.pdf differ diff --git a/_freeze/18-summary-class/execute-results/html.json b/_freeze/18-summary-class/execute-results/html.json new file mode 100644 index 0000000..3aeead0 --- /dev/null +++ b/_freeze/18-summary-class/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": 
"05884bf473d446af97b5d4ab51be7d64", + "result": { + "engine": "knitr", + "markdown": "# Exploring misclassifications\n\\index{classification!misclassification}\n\n## Errors for a single model\n\nTo examine misclassifications, we can create a separate variable that identifies whether or not each observation was misclassified. Constructing this for each class, and exploring in small steps, is helpful. Let's do this using the random forest model fitted to the penguins data. The random forest fit has only a few misclassifications. There are three Adelie penguins confused with Chinstrap, and four Chinstrap confused with Adelie. There is one Gentoo penguin confused with a Chinstrap. This is interesting, because the Gentoo cluster is well separated from the clusters of the other two penguin species. \n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to fit forest\"}\nlibrary(randomForest)\nlibrary(dplyr)\nload(\"data/penguins_sub.rda\")\n\npenguins_rf <- randomForest(species~.,\n data=penguins_sub[,1:5],\n importance=TRUE)\n```\n:::\n\n\\index{classification!confusion matrix}\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_rf$confusion\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Adelie Chinstrap Gentoo class.error\nAdelie 143 3 0 0.020547945\nChinstrap 4 64 0 0.058823529\nGentoo 0 1 118 0.008403361\n```\n\n\n:::\n\n```{.r .cell-code code-fold=\"false\"}\npenguins_errors <- penguins_sub %>%\n mutate(err = ifelse(penguins_rf$predicted !=\n penguins_rf$y, 1, 0))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make animated gifs\"}\nlibrary(tourr)\nsymbols <- c(1, 16)\np_pch <- symbols[penguins_errors$err+1]\np_cex <- rep(1, length(p_pch))\np_cex[penguins_errors$err==1] <- 2\nanimate_xy(penguins_errors[,1:4],\n col=penguins_errors$species,\n pch=p_pch, cex=p_cex)\nrender_gif(penguins_errors[,1:4],\n grand_tour(),\n display_xy(col=penguins_errors$species,\n pch=p_pch, cex=p_cex),\n gif_file=\"gifs/p_rf_errors.gif\",\n frames=500,\n width=400,\n 
height=400)\n\nanimate_xy(penguins_errors[,1:4],\n guided_tour(lda_pp(penguins_errors$species)),\n col=penguins_errors$species,\n pch=p_pch, cex=p_cex)\n\nrender_gif(penguins_errors[,1:4],\n guided_tour(lda_pp(penguins_errors$species)),\n display_xy(col=penguins_errors$species,\n pch=p_pch, cex=p_cex),\n gif_file=\"gifs/p_rf_errors_guided.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\n```\n:::\n\n\n@fig-p-errors-html shows a grand tour, and a guided tour, of the penguins data, where the misclassifications are marked by larger solid circles. (If the gifs are too small to see the different glyphs, you can zoom in to make the figures larger.) It can be seen that the one Gentoo penguin that is mistaken for a Chinstrap by the forest model is always moving with the rest of its Gentoo (yellow) family. It can occasionally be seen to be on the edge of the group, closer to the Chinstraps, in some projections in the grand tour. But in the final projection from the guided tour it is hiding well among the other Gentoos. This is an observation where a mistake has been made because of the inadequacies of the forest algorithm. Forests are only as good as the trees they are constructed from, and we have seen in @sec-trees that the splits on single variables made by trees do not adequately utilise the covariance structure in each class. They make mistakes based on the boxy nature of the boundaries. This can carry through to the forest model. Even though many trees are combined to generate smoother boundaries, forests do not effectively utilise covariance in clusters either. The other mistakes, where Chinstrap are predicted to be Adelie, and vice versa, are more sensible. 
These mistaken observations can be seen to lie in the border region between the two clusters, and reflect genuine uncertainty about the classification of penguins in these two species.\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-errors-html layout-ncol=2}\n\n![Grand tour](gifs/p_rf_errors.gif){#fig-rf-errors fig-alt=\"FIX ME\" width=300}\n\n![Guided tour](gifs/p_rf_errors_guided.gif){#fig-rf-errors-guided fig-alt=\"FIX ME\" width=300}\n\nExamining the misclassified cases (marked as larger solid circles) from a random forest fit to the penguins data. The one Gentoo penguin mistaken for a Chinstrap is a mistake made because the forest method suffers from the same problems as trees - cutting on single variables rather than effectively using covariance structure. The mistakes between the Adelie and Chinstrap penguins are more sensible because all of these observations lie in the border region between the two clusters.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-errors-pdf layout-ncol=2}\n\n![Grand tour](images/p_rf_errors.png){#fig-rf-errors fig-alt=\"FIX ME\" width=200}\n\n![Guided tour](images/p_rf_errors_guided.png){#fig-rf-errors-guided fig-alt=\"FIX ME\" width=200}\n\nExamining the misclassified cases (marked as larger solid circles) from a random forest fit to the penguins data. The one Gentoo penguin mistaken for a Chinstrap is a mistake made because the forest method suffers from the same problems as trees - cutting on single variables rather than effectively using covariance structure. The mistakes between the Adelie and Chinstrap penguins are more sensible because all of these observations lie in the border region between the two clusters.\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nSome errors are reasonable because there is overlap between the class clusters. 
Some errors are not reasonable because the model used is inadequate.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Some errors are reasonable because there is overlap between the class clusters. Some errors are not reasonable because the model used is inadequate.\n}\n:::\n\n## Comparison between LDA and CNN\n\n## Constructing data to diagnose your model\n\n## Explainability\n\n## Exercises {-}\n\n1. Examine misclassifications from a random forest model for the fake_trees data between cluster 1 and 0, using the \n a. principal components\n b. votes matrix. \nDescribe where these errors lie relative to their true and predicted class clusters. When examining the simplex, are the misclassifications the points that are furthest from any vertices?\n2. Examine the misclassifications for the random forest model on the sketches data, focusing on cactus sketches that were mistaken for bananas. Follow up by plotting the images of these errors, and describe whether the classifier is correct that these sketches are so poor their true cactus or banana identity cannot be determined. \n3. How do the errors from the random forest model compare with those of your best fitting CNN model? Are the corresponding images poor sketches of cacti or bananas?\n4. Now examine the misclassifications of the sketches data in the \n a. votes matrix from the random forest model\n b. predictive probability distribution from the CNN model,\nusing the simplex approach. 
Are they as expected, points lying in the middle or along an edge of the simplex?\n", + "supporting": [ + "18-summary-class_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/2-notation/execute-results/html.json b/_freeze/2-notation/execute-results/html.json new file mode 100644 index 0000000..e2e9164 --- /dev/null +++ b/_freeze/2-notation/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "c5de00d6ea01c3702b5c96105df12e07", + "result": { + "engine": "knitr", + "markdown": "# Notation conventions and R objects\n\nThe data can be considered to be a matrix of numbers with the columns corresponding to variables, and the rows correspond to observations. It can be helpful to write this in mathematical notation, like:\n\n\\begin{eqnarray*}\nX_{n\\times p} =\n[X_1~X_2~\\dots~X_p]_{n\\times p} = \\left[ \\begin{array}{cccc}\nX_{11} & X_{12} & \\dots & X_{1p} \\\\\nX_{21} & X_{22} & \\dots & X_{2p}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\nX_{n1} & X_{n2} & \\dots & X_{np} \\end{array} \\right]_{n\\times p}\n\\end{eqnarray*}\n\nwhere $X$ indicates the $n\\times p$ data matrix, $X_j$ indicates variable $j, j=1, \\dots, p$ and $X_{ij}$ indicates the value of the $j^{th}$ variable for the $i^{th}$ observation. (It can be confusing to distinguish whether one is referring to the observation or a variable, because $X_i$ is used to indicate observation also. When this is done it is usually accompanied by qualifying words such as **observation** $X_3$, or **variable** $X_3$.)\n\n::: info\nHaving notation is helpful for concise explanations of different methods, to explain how data is scaled, processed and projected for various tasks, and how different quantities are calculated from the data. 
\n:::\n\nWhen there is a response variable(s), it is common to consider $X$ to be the predictors, and use $Y$ to indicate the response variable(s). $Y$ could be a matrix, also, and would be $n\\times q$, where commonly $q=1$. $Y$ could be numeric or categorical, and this would change how it is handled with visualisation.\n\nTo make a low-dimensional projection (shadow) of the data, we need a projection matrix:\n\n\\begin{eqnarray*}\nA_{p\\times d} = \\left[ \\begin{array}{cccc}\nA_{11} & A_{12} & \\dots & A_{1d} \\\\\nA_{21} & A_{22} & \\dots & A_{2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\nA_{p1} & A_{p2} & \\dots & A_{pd} \\end{array} \\right]_{p\\times d}\n\\end{eqnarray*}\n\n\n$A$ should be an orthonormal matrix, which means that the $\\sum_{j=1}^p A_{jk}^2=1, k=1, \\dots, d$ (columns represent vectors of length 1) and $\\sum_{j=1}^p A_{jk}A_{jl}=0, k,l=1, \\dots, d; k\\neq l$ (columns represent vectors that are orthogonal to each other). In matrix notation, this can be written as $A^{\\top}A = I_d$.\n\nThen the projected data is written as:\n\n\\begin{eqnarray*}\nY_{n\\times d} = XA = \\left[ \\begin{array}{cccc}\ny_{11} & y_{12} & \\dots & y_{1d} \\\\\ny_{21} & y_{22} & \\dots & y_{2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\ny_{n1} & y_{n2} & \\dots & y_{nd} \\end{array} \\right]_{n\\times d}\n\\end{eqnarray*}\n\nwhere $y_{ij} = \\sum_{k=1}^p X_{ik}A_{kj}$. Note that we are using $Y$ as the projected data here, as well as it possibly being used for a response variable. Where necessary, this will be clarified with words in the text, when notation is used in explanations later.\n\nWhen using R, if we only have the data corresponding to $X$ it makes sense to use a `matrix` object. However, if the response variable is included and it is categorical, then we might use a `data.frame` or a `tibble` which can accommodate non-numerical values. 
Then to work with the data, we can use the base R methods:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX <- matrix(c(1.1, 1.3, 1.4, 1.2, \n 2.7, 2.6, 2.4, 2.5, \n 3.5, 3.4, 3.2, 3.6), \n ncol=4, byrow=TRUE)\nX\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2] [,3] [,4]\n[1,] 1.1 1.3 1.4 1.2\n[2,] 2.7 2.6 2.4 2.5\n[3,] 3.5 3.4 3.2 3.6\n```\n\n\n:::\n:::\n\n\nwhich is a data matrix with $n=3, p=4$ and to extract a column (variable):\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX[,2]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.3 2.6 3.4\n```\n\n\n:::\n:::\n\n\nor a row (observation):\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX[2,]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2.7 2.6 2.4 2.5\n```\n\n\n:::\n:::\n\n\nor an individual cell (value):\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX[3,2]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.4\n```\n\n\n:::\n:::\n\n\nTo make a projection we need an orthonormal matrix:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nA <- matrix(c(0.707,0.707,0,0,0,0,0.707,0.707), ncol=2, byrow=FALSE)\nA\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 0.707 0.000\n[2,] 0.707 0.000\n[3,] 0.000 0.707\n[4,] 0.000 0.707\n```\n\n\n:::\n:::\n\n\nYou can check that it is orthonormal by\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nsum(A[,1]^2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.999698\n```\n\n\n:::\n\n```{.r .cell-code code-fold=\"false\"}\nsum(A[,1]*A[,2])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\nand make a projection using matrix multiplication:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX %*% A\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 1.6968 1.8382\n[2,] 3.7471 3.4643\n[3,] 4.8783 4.8076\n```\n\n\n:::\n:::\n\n\nThe seemingly magical number `0.707` used 
above and to create the projection in @fig-explain-1D-html arises from normalising a vector with equal contributions from each variable, `(1, 1)`. Dividing by `sqrt(2)` gives `(0.707, 0.707)`.\n\n\n\n::: info\nThe notation convention used throughout the book is:\n\n `n =` number of observations
\n `p =` number of variables, dimension of data
\n `d =` dimension of the projection
\n `g =` number of groups, in classification
\n `X =` data matrix\n:::\n\n## Exercises {-}\n\n1. Generate a matrix $A$ with $p=5$ (rows) and $d=2$ (columns), where each value is randomly drawn from a standard normal distribution. Extract the element at row 3 and column 1.\n2. We will interpret $A$ as a projection matrix and therefore it needs to be orthonormalised. Use the function `tourr::orthonormalise` to do this, and explicitly check that each column is normalised and that the two columns are orthogonal now. Which dimensions contribute most to the projection for your $A$?\n3. Use matrix multiplication to calculate the projection of the `mulgar::clusters` data onto the 2D plane defined by $A$. Make a scatterplot of the projected data. Can you identify clustering in this view?\n\n \n\n::: {.cell}\n\n:::\n", + "supporting": [ + "2-notation_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/2-notation/execute-results/tex.json b/_freeze/2-notation/execute-results/tex.json index 764232d..9fc3c26 100644 --- a/_freeze/2-notation/execute-results/tex.json +++ b/_freeze/2-notation/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "c5de00d6ea01c3702b5c96105df12e07", + "hash": "74a77846b34130e709260a7c86ef4ea0", "result": { "engine": "knitr", - "markdown": "# Notation conventions and R objects\n\nThe data can be considered to be a matrix of numbers with the columns corresponding to variables, and the rows correspond to observations. 
It can be helpful to write this in mathematical notation, like:\n\n\\begin{eqnarray*}\nX_{n\\times p} =\n[X_1~X_2~\\dots~X_p]_{n\\times p} = \\left[ \\begin{array}{cccc}\nX_{11} & X_{12} & \\dots & X_{1p} \\\\\nX_{21} & X_{22} & \\dots & X_{2p}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\nX_{n1} & X_{n2} & \\dots & X_{np} \\end{array} \\right]_{n\\times p}\n\\end{eqnarray*}\n\nwhere $X$ indicates the the $n\\times p$ data matrix, $X_j$ indicates variable $j, j=1, \\dots, p$ and $X_{ij}$ indicates the value $j^{th}$ variable of the $i^{th}$ observation. (It can be confusing to distinguish whether one is referring to the observation or a variable, because $X_i$ is used to indicate observation also. When this is done it is usually accompanied by qualifying words such as **observation** $X_3$, or **variable** $X_3$.)\n\n::: info\nHaving notation is helpful for concise explanations of different methods, to explain how data is scaled, processed and projected for various tasks, and how different quantities are calculated from the data. \n:::\n\nWhen there is a response variable(s), it is common to consider $X$ to be the predictors, and use $Y$ to indicate the response variable(s). $Y$ could be a matrix, also, and would be $n\\times q$, where commonly $q=1$. $Y$ could be numeric or categorical, and this would change how it is handled with visualisation.\n\nTo make a low-dimensional projection (shadow) of the data, we need a projection matrix:\n\n\\begin{eqnarray*}\nA_{p\\times d} = \\left[ \\begin{array}{cccc}\nA_{11} & A_{12} & \\dots & A_{1d} \\\\\nA_{21} & A_{22} & \\dots & A_{2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\nA_{p1} & A_{p2} & \\dots & A_{pd} \\end{array} \\right]_{p\\times d}\n\\end{eqnarray*}\n\n\n$A$ should be an orthonormal matrix, which means that the $\\sum_{j=1}^p A_{jk}^2=1, k=1, \\dots, d$ (columns represent vectors of length 1) and $\\sum_{j=1}^p A_{jk}A_{jl}=0, k,l=1, \\dots, d; k\\neq l$ (columns represent vectors that are orthogonal to each other). 
In matrix notation, this can be written as $A^{\\top}A = I_d$.\n\nThen the projected data is written as:\n\n\\begin{eqnarray*}\nY_{n\\times d} = XA = \\left[ \\begin{array}{cccc}\ny_{11} & y_{12} & \\dots & y_{1d} \\\\\ny_{21} & y_{22} & \\dots & y_{2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\ny_{n1} & y_{n2} & \\dots & y_{nd} \\end{array} \\right]_{n\\times d}\n\\end{eqnarray*}\n\nwhere $y_{ij} = \\sum_{k=1}^p X_{ik}A_{kj}$. Note that we are using $Y$ as the projected data here, as well as it possibly being used for a response variable. Where necessary, this will be clarified with words in the text, when notation is used in explanations later.\n\nWhen using R, if we only have the data corresponding to $X$ it makes sense to use a `matrix` object. However, if the response variable is included and it is categorical, then we might use a `data.frame` or a `tibble` which can accommodate non-numerical values. Then to work with the data, we can use the base R methods:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX <- matrix(c(1.1, 1.3, 1.4, 1.2, \n 2.7, 2.6, 2.4, 2.5, \n 3.5, 3.4, 3.2, 3.6), \n ncol=4, byrow=TRUE)\nX\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2] [,3] [,4]\n[1,] 1.1 1.3 1.4 1.2\n[2,] 2.7 2.6 2.4 2.5\n[3,] 3.5 3.4 3.2 3.6\n```\n\n\n:::\n:::\n\n\n\nwhich is a data matrix with $n=3, p=4$ and to extract a column (variable):\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX[,2]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.3 2.6 3.4\n```\n\n\n:::\n:::\n\n\n\nor a row (observation):\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX[2,]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2.7 2.6 2.4 2.5\n```\n\n\n:::\n:::\n\n\n\nor an individual cell (value):\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX[3,2]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.4\n```\n\n\n:::\n:::\n\n\n\nTo make a projection we need an orthonormal matrix:\n\n\n\n::: 
{.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nA <- matrix(c(0.707,0.707,0,0,0,0,0.707,0.707), ncol=2, byrow=FALSE)\nA\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 0.707 0.000\n[2,] 0.707 0.000\n[3,] 0.000 0.707\n[4,] 0.000 0.707\n```\n\n\n:::\n:::\n\n\n\nYou can check that it is orthonormal by\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nsum(A[,1]^2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.999698\n```\n\n\n:::\n\n```{.r .cell-code code-fold=\"false\"}\nsum(A[,1]*A[,2])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n\nand make a projection using matrix multiplication:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX %*% A\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 1.6968 1.8382\n[2,] 3.7471 3.4643\n[3,] 4.8783 4.8076\n```\n\n\n:::\n:::\n\n\n\nThe seemingly magical number `0.707` used above and to create the projection in @fig-explain-1D-pdf arises from normalising a vector with equal contributions from each variable, `(1, 1)`. Dividing by `sqrt(2)` gives `(0.707, 0.707)`.\n\n\n\n::: info\nThe notation convention used throughout the book is:\n\n `n =` number of observations
\n `p =` number of variables, dimension of data
\n `d =` dimension of the projection
\n `g =` number of groups, in classification
\n `X =` data matrix\n:::\n\n## Exercises {-}\n\n1. Generate a matrix $A$ with $p=5$ (rows) and $d=2$ (columns), where each value is randomly drawn from a standard normal distribution. Extract the element at row 3 and column 1.\n2. We will interpret $A$ as a projection matrix and therefore it needs to be orthonormalised. Use the function `tourr::orthonormalise` to do this, and explicitly check that each column is normalised and that the two columns are orthogonal now. Which dimensions contribute most to the projection for your $A$?\n3. Use matrix multiplication to calculate the projection of the `mulgar::clusters` data onto the 2D plane defined by $A$. Make a scatterplot of the projected data. Can you identify clustering in this view?\n\n \n\n\n::: {.cell}\n\n:::\n", + "markdown": "# Notation conventions and R objects\n\nThe data can be considered to be a matrix of numbers with the columns corresponding to variables, and the rows correspond to observations. It can be helpful to write this in mathematical notation, like:\n\n\\begin{eqnarray*}\nX_{n\\times p} =\n[X_1~X_2~\\dots~X_p]_{n\\times p} = \\left[ \\begin{array}{cccc}\nX_{11} & X_{12} & \\dots & X_{1p} \\\\\nX_{21} & X_{22} & \\dots & X_{2p}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\nX_{n1} & X_{n2} & \\dots & X_{np} \\end{array} \\right]_{n\\times p}\n\\end{eqnarray*}\n\nwhere $X$ indicates the $n\\times p$ data matrix, $X_j$ indicates variable $j, j=1, \\dots, p$ and $X_{ij}$ indicates the value of the $j^{th}$ variable for the $i^{th}$ observation. (It can be confusing to distinguish whether one is referring to the observation or a variable, because $X_i$ is used to indicate observation also. 
When this is done it is usually accompanied by qualifying words such as **observation** $X_3$, or **variable** $X_3$.)\n\n::: {.content-visible when-format=\"html\"}\n::: info\nHaving notation is helpful for concise explanations of different methods, to explain how data is scaled, processed and projected for various tasks, and how different quantities are calculated from the data. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\\infobox{Having notation is helpful for concise explanations of different methods, to explain how data is scaled, processed and projected for various tasks, and how different quantities are calculated from the data. }\n\n:::\n\nWhen there is a response variable(s), it is common to consider $X$ to be the predictors, and use $Y$ to indicate the response variable(s). $Y$ could be a matrix, also, and would be $n\\times q$, where commonly $q=1$. $Y$ could be numeric or categorical, and this would change how it is handled with visualisation.\n\nTo make a low-dimensional projection (shadow) of the data onto $d$ dimensions ($d < p$), we need an orthonormal basis:\n\n\\begin{eqnarray*}\nA_{p\\times d} = \\left[ \\begin{array}{cccc}\nA_{11} & A_{12} & \\dots & A_{1d} \\\\\nA_{21} & A_{22} & \\dots & A_{2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\nA_{p1} & A_{p2} & \\dots & A_{pd} \\end{array} \\right]_{p\\times d}\n\\end{eqnarray*}\n\n\n$A$ should be an orthonormal matrix, which means that the $\\sum_{j=1}^p A_{jk}^2=1, k=1, \\dots, d$ (columns represent vectors of length 1) and $\\sum_{j=1}^p A_{jk}A_{jl}=0, k,l=1, \\dots, d; k\\neq l$ (columns represent vectors that are orthogonal to each other). 
In matrix notation, this can be written as $A^{\\top}A = I_d$.\n\nThen the projected data is written as:\n\n\\begin{eqnarray*}\nY_{n\\times d} = XA = \\left[ \\begin{array}{cccc}\ny_{11} & y_{12} & \\dots & y_{1d} \\\\\ny_{21} & y_{22} & \\dots & y_{2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\ny_{n1} & y_{n2} & \\dots & y_{nd} \\end{array} \\right]_{n\\times d}\n\\end{eqnarray*}\n\nwhere $y_{ij} = \\sum_{k=1}^p X_{ik}A_{kj}$. Note that we are using $Y$ as the projected data here, as well as it possibly being used for a response variable. Where necessary, this will be clarified with words in the text, when notation is used in explanations later.\n\nWhen using R, if we only have the data corresponding to $X$ it makes sense to use a `matrix` object. However, if the response variable is included and it is categorical, then we might use a `data.frame` or a `tibble` which can accommodate non-numerical values. Then to work with the data, we can use the base R methods:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX <- matrix(c(1.1, 1.3, 1.4, 1.2, \n 2.7, 2.6, 2.4, 2.5, \n 3.5, 3.4, 3.2, 3.6), \n ncol=4, byrow=TRUE)\nX\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2] [,3] [,4]\n[1,] 1.1 1.3 1.4 1.2\n[2,] 2.7 2.6 2.4 2.5\n[3,] 3.5 3.4 3.2 3.6\n```\n\n\n:::\n:::\n\n\n\nwhich is a data matrix with $n=3, p=4$ and to extract a column (variable):\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX[,2]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.3 2.6 3.4\n```\n\n\n:::\n:::\n\n\n\nor a row (observation):\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX[2,]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2.7 2.6 2.4 2.5\n```\n\n\n:::\n:::\n\n\n\nor an individual cell (value):\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX[3,2]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.4\n```\n\n\n:::\n:::\n\n\n\nTo make the data projection we need an orthonormal 
matrix:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nA <- matrix(c(0.707,0.707,0,0,0,0,0.707,0.707), ncol=2, byrow=FALSE)\nA\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 0.707 0.000\n[2,] 0.707 0.000\n[3,] 0.000 0.707\n[4,] 0.000 0.707\n```\n\n\n:::\n:::\n\n\n\nYou can check that it is orthonormal by\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nsum(A[,1]^2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.999698\n```\n\n\n:::\n\n```{.r .cell-code code-fold=\"false\"}\nsum(A[,1]*A[,2])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0\n```\n\n\n:::\n:::\n\n\n\nand compute the projected data using matrix multiplication:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nX %*% A\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 1.6968 1.8382\n[2,] 3.7471 3.4643\n[3,] 4.8783 4.8076\n```\n\n\n:::\n:::\n\n\n\nThe magical number `0.707` used above and to create the projection in @fig-explain-1D-pdf arises from normalising a vector with equal contributions from each variable, `(1, 1)`. Dividing by `sqrt(2)` gives `(0.707, 0.707)`.\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nThe notation convention used throughout the book is:\n\n `n =` number of observations
\n `p =` number of variables, dimension of data
\n `d =` dimension of the projection
\n `g =` number of groups, in classification
\n `X =` data matrix\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\\infobox{The notation convention used throughout the book is:\n\\begin{itemize}\n\\item n = number of observations\n\\item p = number of variables, dimension of data\n\\item d = dimension of the projection\n\\item g = number of groups, in classification\n\\item X = data matrix\n\\end{itemize}\n}\n:::\n\n## Exercises {-}\n\n1. Generate a matrix $A$ with $p=5$ (rows) and $d=2$ (columns), where each value is randomly drawn from a standard normal distribution. Extract the element at row 3 and column 1.\n2. We will interpret $A$ as an orthonormal basis and therefore it needs to be checked for orthonormality, and if it fails, then to be orthonormalised. Use the function `tourr::is_orthonormal` to explicitly check that each column is normalised and that the two columns are orthogonal. If they are not, then use `tourr::orthonormalise` to make them so. For the fixed version of $A$, which dimensions contribute most to the projection, horizontally and vertically?\n3. Use matrix multiplication to calculate the projection of the `mulgar::clusters` data onto the 2D plane defined by $A$. Make a scatterplot of the projected data. Can you identify clustering in this view?\n\n \n\n\n::: {.cell}\n\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/3-intro-dimred/execute-results/html.json b/_freeze/3-intro-dimred/execute-results/html.json new file mode 100644 index 0000000..73e485a --- /dev/null +++ b/_freeze/3-intro-dimred/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "d316a736bd82f864329d0f93f70fab0b", + "result": { + "engine": "knitr", + "markdown": "# Overview {#sec-dimension-overview}\n\nThis chapter will focus on methods for reducing dimension, and how the tour[^tour-link] can be used to assist with the common methods such as principal component analysis (PCA), multidimensional scaling (MDS), t-stochastic neighbour embedding (t-SNE), and factor analysis. 
\n\n[^tour-link]: Note that the animated tours from this chapter can be viewed at [https://dicook.github.io/mulgar_book/3-intro-dimred.html](https://dicook.github.io/mulgar_book/3-intro-dimred.html).\n\nDimension is perceived in a tour using the spread of points. When the points are spread far apart, then the data is filling the space. Conversely, when the points \"collapse\" into a sub-region, the data is only partially filling the space, and some dimension reduction to this smaller dimensional space may be worthwhile. \n\n::: {.content-visible when-format=\"html\"}\n::: info\nWhen points do not fill the plotting canvas fully, it means that the data lives in a lower dimension. This low-dimensional space might be linear or non-linear, with the latter being much harder to define and capture.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{When points do not fill the plotting canvas fully, it means that the data lives in a lower dimension. This low-dimensional space might be linear or non-linear, with the latter being much harder to define and capture.}\n:::\n\nLet's start with some 2D examples. You need at least two variables to be able to talk about association between variables. @fig-2D shows three plots of two variables. Plot (a) shows two variables that are strongly linearly associated[^correlation], because when `x1` is low, `x2` is low also, and conversely when `x1` is high, `x2` is also high. This can also be seen by the reduction in spread of points (or \"collapse\") in one direction, making the data fill less than the full square of the plot. *So from this we can conclude that the data is not fully 2D.* The second step is to infer which variables contribute to this reduction in dimension. The axes for `x1` and `x2` are drawn extending from $(0,0)$, and because they both extend out of the cloud of points, in the direction away from the collapse of points, we can say that they are jointly responsible for the dimension reduction. 
\n\n@fig-2D (b) shows a pair of variables that are **not** linearly associated. Variable `x1` is more varied than `x3` but knowing the value on `x1` tells us nothing about possible values on `x3`. Before running a tour all variables are typically scaled to have equal spread. The purpose of the tour is to capture association and relationships between the variables, so any univariate differences should be removed ahead of time. @fig-2D (c) shows what this would look like when `x3` is scaled - the points are fully spread in the full square of the plot. \n\n[^correlation]: It is generally better to use *associated* than *correlated*. Correlation is a statistical quantity, measuring linear association. The term *associated* can be prefaced with the type of association, such as *linear* or *non-linear*. \n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to produce 2D data examples\"}\nlibrary(tibble)\nset.seed(6045)\nx1 <- runif(123)\nx2 <- x1 + rnorm(123, sd=0.1)\nx3 <- rnorm(123, sd=0.2)\ndf <- tibble(x1 = (x1-mean(x1))/sd(x1), \n x2 = (x2-mean(x2))/sd(x2),\n x3, \n x3scaled = (x3-mean(x3))/sd(x3))\n```\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Explanation of how dimension reduction is perceived in 2D, relative to variables: (a) Two variables with strong linear association. Both variables contribute to the association, as indicated by their axes extending out from the 'collapsed' direction of the points; (b) Two variables with no linear association. But x3 has less variation, so points collapse in this direction; (c) The situation in plot (b) does not arise in a tour because all variables are (usually) scaled. 
When an axis extends out of a direction where the points are collapsed, it means that this variable is partially responsible for the reduced dimension.](3-intro-dimred_files/figure-html/fig-2D-1.png){#fig-2D fig-alt='Three scatterplots: (a) points lie close to a straight line in the x=y direction, (b) points lie close to a horizontal line, (c) points spread out in the full plot region. There are no axis labels or scales.' width=100%}\n:::\n:::\n\n\nNow let's think about what this looks like with five variables. @fig-dimension-html shows a grand tour on five variables, with (a) data that is primarily 2D, (b) data that is primarily 3D and (c) fully 5D data. You can see that in both (a) and (b) the spread of points collapses in some projections, with it happening more in (a). In (c) the data is always spread out in the square, although it does seem to concentrate or pile in the centre. This piling is typical when projecting from high dimensions to low dimensions. The sage tour [@sagetour] makes a correction for this. \n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make animated gifs\"}\nlibrary(mulgar)\ndata(plane)\ndata(box)\nrender_gif(plane,\n grand_tour(), \n display_xy(),\n gif_file=\"gifs/plane.gif\",\n frames=500,\n width=200,\n height=200)\nrender_gif(box,\n grand_tour(), \n display_xy(),\n gif_file=\"gifs/box.gif\",\n frames=500,\n width=200,\n height=200)\n# Simulate full cube\nlibrary(geozoo)\ncube5d <- data.frame(cube.solid.random(p=5, n=300)$points)\ncolnames(cube5d) <- paste0(\"x\", 1:5)\ncube5d <- data.frame(apply(cube5d, 2, function(x) (x-mean(x))/sd(x)))\nrender_gif(cube5d,\n grand_tour(), \n display_xy(),\n gif_file=\"gifs/cube5d.gif\",\n frames=500,\n width=200,\n height=200)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-dimension-html fig-align=\"center\" layout-ncol=3}\n\n![2D plane in 5D](gifs/plane.gif){#fig-plane width=180 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. 
You can see points collapsing into a thick straight line in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\n![3D plane in 5D](gifs/box.gif){#fig-box width=180 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see points collapsing into a thick straight line in various projections, but not as often as in the animation in (a). A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\n![5D plane in 5D](gifs/cube5d.gif){#fig-cube5 width=180 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see points are always spread out fully in the plot space, in all projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\nDifferent dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. Notice that the 5D in 5D always fills out the box (although it does concentrate some in the middle which is typical when projecting from high to low dimensions). Also you can see that the 2D in 5D concentrates into a line more than the 3D in 5D. This suggests that it is lower dimensional. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-dimension-pdf layout-ncol=3}\n\n![2D plane in 5D](images/plane.png){#fig-plane width=160}\n\n![3D plane in 5D](images/box.png){#fig-box width=160}\n\n![5D plane in 5D](images/cube5d.png){#fig-cube5 width=160}\n\nSingle frames from different dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. Notice that the 5D in 5D always fills out the box (although it does concentrate some in the middle which is typical when projecting from high to low dimensions). Also you can see that the 2D in 5D concentrates into a line more than the 3D in 5D. This suggests that it is lower dimensional. 
(Animations can be viewed [here](https://dicook.github.io/mulgar_book/3-intro-dimred.html).)\n:::\n:::\n\nThe next step is to determine which variables contribute. In the examples just provided, all variables are linearly associated in the 2D and 3D data. You can check this by making a scatterplot matrix, @fig-plane-scatmat.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for scatterplot matrix\"}\nlibrary(GGally)\nlibrary(mulgar)\ndata(plane)\nggscatmat(plane) +\n theme(panel.background = \n element_rect(colour=\"black\", fill=NA),\n axis.text = element_blank(),\n axis.ticks = element_blank())\n```\n\n::: {.cell-output-display}\n![Scatterplot matrix of plane data. You can see that x1-x3 are strongly linearly associated, and also x4 and x5. When you watch the tour of this data, any time the data collapses into a line you should see only (x1, x2, x3) or (x4, x5). When combinations of x1 and x4 or x5 show, the data should be spread out.](3-intro-dimred_files/figure-html/fig-plane-scatmat-1.png){#fig-plane-scatmat fig-alt='A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlation printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation. The remaining pairs of variables have no association.' 
width=80%}\n:::\n:::\n\n\nTo make an example where not all variables contribute, we have added two additional variables to the `plane` data set, which are purely noise.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Add two pure noise dimensions to the plane\nplane_noise <- plane\nplane_noise$x6 <- rnorm(100)\nplane_noise$x7 <- rnorm(100)\nplane_noise <- data.frame(apply(plane_noise, 2, \n function(x) (x-mean(x))/sd(x)))\nggduo(plane_noise, columnsX = 1:5, columnsY = 6:7, \n types = list(continuous = \"points\")) +\n theme(aspect.ratio=1,\n panel.background = \n element_rect(colour=\"black\", fill=NA),\n axis.text = element_blank(),\n axis.ticks = element_blank())\n```\n\n::: {.cell-output-display}\n![Scatterplots showing two additional noise variables that are not associated with any of the first five variables.](3-intro-dimred_files/figure-html/fig-plane-noise-scatter-1.png){#fig-plane-noise-scatter fig-alt='Two rows of scatterplots showing x6 and x7 against x1-x5. The points are spread out in the full plotting region, although x6 has one point with an unusually low value.' width=576}\n:::\n:::\n\n\nNow we have 2D structure in 7D, but only five of the variables contribute to the 2D structure, that is, five of the variables are linearly related to each other. The other two variables (x6, x7) are not linearly related to any of the others. \n\nThe data is viewed with a grand tour in @fig-plane-noise-html. We can still see the concentration of points along a line in some dimensions, which tells us that the data is not fully 7D. Then if you look closely at the variable axes you will see that the collapsing to a line only occurs when any of x1-x5 contribute strongly in the direction orthogonal to this. This does not happen when x6 or x7 contribute strongly to a projection - the data is always expanded to fill much of the space. 
That tells us that x6 and x7 don't substantially contribute to the dimension reduction, that is, they are not linearly related to the other variables.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code to generate animation\"}\nlibrary(ggplot2)\nlibrary(plotly)\nlibrary(htmlwidgets)\n\nset.seed(78)\nb <- basis_random(7, 2)\npn_t <- tourr::save_history(plane_noise, \n tour_path = grand_tour(),\n start = b,\n max_bases = 8)\npn_t <- interpolate(pn_t, 0.1)\npn_anim <- render_anim(plane_noise,\n frames=pn_t)\n\npn_gp <- ggplot() +\n geom_path(data=pn_anim$circle, \n aes(x=c1, y=c2,\n frame=frame), linewidth=0.1) +\n geom_segment(data=pn_anim$axes, \n aes(x=x1, y=y1, \n xend=x2, yend=y2, \n frame=frame), \n linewidth=0.1) +\n geom_text(data=pn_anim$axes, \n aes(x=x2, y=y2, \n frame=frame, \n label=axis_labels), \n size=5) +\n geom_point(data=pn_anim$frames, \n aes(x=P1, y=P2, \n frame=frame), \n alpha=0.8) +\n xlim(-1,1) + ylim(-1,1) +\n coord_equal() +\n theme_bw() +\n theme(axis.text=element_blank(),\n axis.title=element_blank(),\n axis.ticks=element_blank(),\n panel.grid=element_blank())\npn_tour <- ggplotly(pn_gp,\n width=500,\n height=550) %>%\n animation_button(label=\"Go\") %>%\n animation_slider(len=0.8, x=0.5,\n xanchor=\"center\") %>%\n animation_opts(easing=\"linear\", \n transition = 0)\n\nhtmlwidgets::saveWidget(pn_tour,\n file=\"html/plane_noise.html\",\n selfcontained = TRUE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-plane-noise-html}\n\n\n\nGrand tour of the plane with two additional dimensions of pure noise. The collapsing of the points indicates that this is not fully 7D. This only happens when any of x1-x5 are contributing strongly (frame 49 x4, x5; frame 79 x1; frame 115 x2, x3). If x6 or x7 are contributing strongly the data is spread out fully (frames 27, 96). 
This tells us that x6 and x7 are not linearly associated, but other variables are.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-plane-noise-pdf layout-ncol=2}\n\n![](images/plane_noise1.png){width=200 fig-align=\"center\"}\n\n![](images/plane_noise2.png){width=200 fig-align=\"center\"}\n\nTwo frames from a grand tour of the plane with two additional dimensions of pure noise. The collapsing of the points indicates that this is not fully 7D. This only happens when any of x1-x5 are contributing strongly (frame 49 x4, x5; frame 79 x1; frame 115 x2, x3). If x6 or x7 are contributing strongly the data is spread out fully (frames 27, 96). This tells us that x6 and x7 are not linearly associated, but other variables are. \n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nTo determine which variables are responsible for the reduced dimension look for the axes that extend out of the point cloud. These contribute to smaller variation in the observations, and thus indicate dimension reduction.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{To determine which variables are responsible for the reduced dimension look for the axes that extend out of the point cloud. These contribute to smaller variation in the observations, and thus indicate dimension reduction.}\n:::\n\nThe simulated data here is very simple, and what we have learned from the tour could also be learned from principal component analysis. However, if there are small complications, such as outliers or nonlinear relationships, that might not be visible from principal component analysis, the tour can help you to see them.\n\n@fig-plane-noise-outlier and @fig-outlier-nonlin-html(a) show example data with an outlier and @fig-outlier-nonlin-html(b) shows data with non-linear relationships. 
\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for scatterplot matrix\"}\n# Add several outliers to the plane_noise data\nplane_noise_outliers <- plane_noise\nplane_noise_outliers[101,] <- c(2, 2, -2, 0, 0, 0, 0)\nplane_noise_outliers[102,] <- c(0, 0, 0,-2, -2, 0, 0)\n\nggscatmat(plane_noise_outliers, columns = 1:5) +\n theme(aspect.ratio=1,\n panel.background = \n element_rect(colour=\"black\", fill=NA),\n axis.text = element_blank(),\n axis.ticks = element_blank())\n```\n\n::: {.cell-output-display}\n![Scatterplot matrix of the plane with noise data, with two added outliers in variables with strong correlation.](3-intro-dimred_files/figure-html/fig-plane-noise-outlier-1.png){#fig-plane-noise-outlier fig-alt='A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlation printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation, with an outlier in the corner of the plot. The remaining pairs of variables have no association, and thus also no outliers.' width=80%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate animated gif\"}\nrender_gif(plane_noise_outliers, \n grand_tour(), \n display_xy(),\n gif_file=\"gifs/pn_outliers.gif\",\n frames=500,\n width=200,\n height=200)\n\ndata(plane_nonlin)\nrender_gif(plane_nonlin, \n grand_tour(), \n display_xy(),\n gif_file=\"gifs/plane_nonlin.gif\",\n frames=500,\n width=200,\n height=200)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-outlier-nonlin-html fig-align=\"center\" layout-ncol=2}\n\n![Outliers](gifs/pn_outliers.gif){#fig-outlier width=200 fig-alt=\"Animation showing scatterplots of 2D projections from 5D. The points sometimes appear to be a plane viewed from the side, with two single points further away. 
A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\n![Non-linear relationship](gifs/plane_nonlin.gif){#fig-nonlinear width=200 fig-alt=\"Animation showing scatterplots of 2D projections from 5D. The points sometimes appear to be lying on a curve in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\nExamples of different types of dimensionality issues: outliers (a) and non-linearity (b). In (a) you can see two points far from the others in some projections. Also the two can be seen with different movement patterns -- moving faster and in different directions than the other points during the tour. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-outlier-nonlin-pdf fig-align=\"center\" layout-ncol=2}\n\n![Outliers](images/pn_outliers.png){#fig-outlier width=200}\n\n![Non-linear relationship](images/plane_nonlin.png){#fig-nonlinear width=200}\n\nTwo frames from tours of examples of different types of dimensionality issues: outliers (a) and non-linearity (b). In (a) you can see two points far from the others in the projection. During a tour the two can be seen with different movement patterns -- moving faster and in different directions than other points. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour. 
\n:::\n:::\n\n\n::: {.cell}\n\n:::\n\n\n## Exercises {-}\n\n1. Multicollinearity is when the predictors for a model are strongly linearly associated. It can adversely affect the fitting of most models, because many possible models may be equally good. Variable importance might be masked by correlated variables, and confidence intervals generated for linear models might be too wide. Check for multicollinearity or other associations between the predictors in:\n a. 2001 Australian election data\n b. 2016 Australian election data\n2. Examine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the `mvtnorm` package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?\n3. The following code shows how to hide a point in a four-dimensional space, so that it is not visible in any of the plots of two variables. Generate both `d` and `d_r` and confirm that the point is visible in a scatterplot matrix of `d`, but not in the scatterplot matrix of `d_r`. 
Also confirm that it is visible in both data sets when you use a tour.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tourr)\nlibrary(GGally)\nset.seed(946)\nd <- tibble(x1=runif(200, -1, 1), \n x2=runif(200, -1, 1), \n x3=runif(200, -1, 1))\nd <- d %>%\n mutate(x4 = x3 + runif(200, -0.1, 0.1))\n# outlier is visible in d\nd <- bind_rows(d, c(x1=0, x2=0, x3=-0.5, x4=0.5))\n\n# Point is hiding in d_r\nd_r <- d %>%\n mutate(x1 = cos(pi/6)*x1 + sin(pi/6)*x3,\n x3 = -sin(pi/6)*x1 + cos(pi/6)*x3,\n x2 = cos(pi/6)*x2 + sin(pi/6)*x4,\n x4 = -sin(pi/6)*x2 + cos(pi/6)*x4)\n```\n:::", + "supporting": [ + "3-intro-dimred_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/3-intro-dimred/execute-results/tex.json b/_freeze/3-intro-dimred/execute-results/tex.json index d5e8a41..d51a1dd 100644 --- a/_freeze/3-intro-dimred/execute-results/tex.json +++ b/_freeze/3-intro-dimred/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "9e5c73cbbc471844d9177f17b2ce6db1", + "hash": "c297cd56843a99cef9403ae20e91c6ba", "result": { "engine": "knitr", - "markdown": "# Dimension reduction overview {#sec-dimension-overview}\n\nThis chapter will focus on methods for reducing dimension, and how the tour[^tour-link] can be used to assist with the common methods such as principal component analysis (PCA), multidimensional scaling (MDS), t-stochastic neighbour embedding (t-SNE), and factor analysis. \n\n[^tour-link]: Note that the animated tours from this chapter can be viewed at [https://dicook.github.io/mulgar_book/3-intro-dimred.html](https://dicook.github.io/mulgar_book/3-intro-dimred.html).\n\nDimension is perceived in a tour using the spread of points. When the points are spread far apart, then the data is filling the space. 
Conversely when the points \"collapse\" into a sub-region then the data is only partially filling the space, and some dimension reduction to reduce to this smaller dimensional space may be worthwhile. \n\n::: {.content-visible when-format=\"html\"}\n::: info\nWhen points do not fill the plotting canvas fully, it means that it lives in a lower dimension. This low-dimensional space might be linear or non-linear, with the latter being much harder to define and capture.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{When points do not fill the plotting canvas fully, it means that it lives in a lower dimension. This low-dimensional space might be linear or non-linear, with the latter being much harder to define and capture.}\n:::\n\nLet's start with some 2D examples. You need at least two variables to be able to talk about association between variables. @fig-2D shows three plots of two variables. Plot (a) shows two variables that are strongly linearly associated[^correlation], because when `x1` is low, `x2` is low also, and conversely when `x1` is high, `x2` is also high. This can also be seen by the reduction in spread of points (or \"collapse\") in one direction making the data fill less than the full square of the plot. *So from this we can conclude that the data is not fully 2D.* The second step is to infer which variables contribute to this reduction in dimension. The axes for `x1` and `x2` are drawn extending from $(0,0)$ and because they both extend out of the cloud of points, in the direction away from the collapse of points we can say that they are jointly responsible for the dimension reduction. \n\n@fig-2D (b) shows a pair of variables that are **not** linearly associated. Variable `x1` is more varied than `x3` but knowing the value on `x1` tells us nothing about possible values on `x3`. Before running a tour all variables are typically scaled to have equal spread. 
The purpose of the tour is to capture association and relationships between the variables, so any univariate differences should be removed ahead of time. @fig-2D (c) shows what this would look like when `x3` is scaled - the points are fully spread in the full square of the plot. \n\n[^correlation]: It is generally better to use *associated* than *correlated*. Correlation is a statistical quantity, measuring linear association. The term *associated* can be prefaced with the type of association, such as *linear* or *non-linear*. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to produce 2D data examples\"}\nlibrary(tibble)\nset.seed(6045)\nx1 <- runif(123)\nx2 <- x1 + rnorm(123, sd=0.1)\nx3 <- rnorm(123, sd=0.2)\ndf <- tibble(x1 = (x1-mean(x1))/sd(x1), \n x2 = (x2-mean(x2))/sd(x2),\n x3, \n x3scaled = (x3-mean(x3))/sd(x3))\n```\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Explanation of how dimension reduction is perceived in 2D, relative to variables: (a) Two variables with strong linear association. Both variables contribute to the association, as indicated by their axes extending out from the 'collapsed' direction of the points; (b) Two variables with no linear association. But x3 has less variation, so points collapse in this direction; (c) The situation in plot (b) does not arise in a tour because all variables are (usually) scaled. When an axes extends out of a direction where the points are collapsed, it means that this variable is partially responsible for the reduced dimension.](3-intro-dimred_files/figure-pdf/fig-2D-1.pdf){#fig-2D fig-alt='Three scatterplots: (a) points lie close to a straight line in the x=y direction, (b) points lie close to a horizontal line, (c) points spread out in the full plot region. There are no axis labels or scales.' width=100%}\n:::\n:::\n\n\n\nNow let's think about what this looks like with five variables. 
@fig-dimension-pdf shows a grand tour on five variables, with (a) data that is primarily 2D, (b) data that is primarily 3D and (c) fully 5D data. You can see that both (a) and (b) the spread of points collapse in some projections, with it happening more in (a). In (c) the data is always spread out in the square, although it does seem to concentrate or pile in the centre. This piling is typical when projecting from high dimensions to low dimensions. The sage tour [@sagetour] makes a correction for this. \n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-dimension-html fig-align=\"center\" layout-ncol=3}\n\n![2D plane in 5D](gifs/plane.gif){#fig-plane width=180 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see points collapsing into a thick straight line in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\n![3D plane in 5D](gifs/box.gif){#fig-box width=180 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see points collapsing into a thick straight line in various projections, but not as often as in the animation in (a). A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\n![5D plane in 5D](gifs/cube5d.gif){#fig-cube5 width=180 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see points are always spread out fully in the plot space, in all projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\nDifferent dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. Notice that the 5D in 5D always fills out the box (although it does concentrate some in the middle which is typical when projecting from high to low dimensions). 
Also you can see that the 2D in 5D, concentrates into a line more than the 3D in 5D. This suggests that it is lower dimensional. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-dimension-pdf layout-ncol=3}\n\n![2D plane in 5D](images/plane.png){#fig-plane width=160}\n\n![3D plane in 5D](images/box.png){#fig-box width=160}\n\n![5D plane in 5D](images/cube5d.png){#fig-cube5 width=160}\n\nSingle frames from different dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. Notice that the 5D in 5D always fills out the box (although it does concentrate some in the middle which is typical when projecting from high to low dimensions). Also you can see that the 2D in 5D, concentrates into a line more than the 3D in 5D. This suggests that it is lower dimensional. (Animations can be viewed [here](https://dicook.github.io/mulgar_book/3-intro-dimred.html).)\n:::\n:::\n\nThe next step is to determine which variables contribute. In the examples just provided, all variables are linearly associated in the 2D and 3D data. You can check this by making a scatterplot matrix, @fig-plane-scatmat.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for scatterplot matrix\"}\nlibrary(GGally)\nlibrary(mulgar)\ndata(plane)\nggscatmat(plane) +\n theme(panel.background = \n element_rect(colour=\"black\", fill=NA),\n axis.text = element_blank(),\n axis.ticks = element_blank())\n```\n\n::: {.cell-output-display}\n![Scatterplot matrix of plane data. You can see that x1-x3 are strongly linearly associated, and also x4 and x5. When you watch the tour of this data, any time the data collapses into a line you should see only (x1, x2, x3) or (x4, x5). 
When combinations of x1 and x4 or x5 show, the data should be spread out.](3-intro-dimred_files/figure-pdf/fig-plane-scatmat-1.pdf){#fig-plane-scatmat fig-pos='H' fig-alt='A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlaton printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation. The remaining pairs of variables have no association.' width=80%}\n:::\n:::\n\n\n\nTo make an example where not all variables contribute, we have added two additional variables to the `plane` data set, which are purely noise.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Add two pure noise dimensions to the plane\nplane_noise <- plane\nplane_noise$x6 <- rnorm(100)\nplane_noise$x7 <- rnorm(100)\nplane_noise <- data.frame(apply(plane_noise, 2, \n function(x) (x-mean(x))/sd(x)))\nggduo(plane_noise, columnsX = 1:5, columnsY = 6:7, \n types = list(continuous = \"points\")) +\n theme(aspect.ratio=1,\n panel.background = \n element_rect(colour=\"black\", fill=NA),\n axis.text = element_blank(),\n axis.ticks = element_blank())\n```\n\n::: {.cell-output-display}\n![Scatterplots showing two additional noise variables that are not associated with any of the first five variables.](3-intro-dimred_files/figure-pdf/fig-plane-noise-scatter-1.pdf){#fig-plane-noise-scatter fig-pos='H' fig-alt='Two rows of scatterplots showing x6 and x7 against x1-x5. The points are spread out in the full plotting region, although x6 has one point with an unusually low value.' width=80%}\n:::\n:::\n\n\n\nNow we have 2D structure in 7D, but only five of the variables contribute to the 2D structure, that is, five of the variables are linearly related with each other. The other two variables (x6, x7) are not linearly related to any of the others. \n\nThe data is viewed with a grand tour in @fig-plane-noise-pdf. 
We can still see the concentration of points along a line in some dimensions, which tells us that the data is not fully 7D. Then if you look closely at the variable axes you will see that the collapsing to a line only occurs when any of x1-x5 contribute strongly in the direction orthogonal to this. This does not happen when x6 or x7 contribute strongly to a projection - the data is always expanded to fill much of the space. That tells us that x6 and x7 don't substantially contribute to the dimension reduction, that is, they are not linearly related to the other variables.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-plane-noise-html}\n\n\n\nGrand tour of the plane with two additional dimensions of pure noise. The collapsing of the points indicates that this is not fully 7D. This only happens when any of x1-x5 are contributing strongly (frame 49 x4, x5; frame 79 x1; frame 115 x2, x3). If x6 or x7 are contributing strongly the data is spread out fully (frames 27, 96). This tells us that x6 and x7 are not linearly associated, but other variables are.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-plane-noise-pdf layout-ncol=2}\n\n![](images/plane_noise1.png){width=200 fig-align=\"center\"}\n\n![](images/plane_noise2.png){width=200 fig-align=\"center\"}\n\nTwo frames from a grand tour of the plane with two additional dimensions of pure noise. The collapsing of the points indicates that this is not fully 7D. This only happens when any of x1-x5 are contributing strongly (frame 49 x4, x5; frame 79 x1; frame 115 x2, x3). If x6 or x7 are contributing strongly the data is spread out fully (frames 27, 96). This tells us that x6 and x7 are not linearly associated, but other variables are. \n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nTo determine which variables are responsible for the reduced dimension look for the axes that extend out of the point cloud. 
These contribute to smaller variation in the observations, and thus indicate dimension reduction.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{To determine which variables are responsible for the reduced dimension look for the axes that extend out of the point cloud. These contribute to smaller variation in the observations, and thus indicate dimension reduction.}\n:::\n\nThe simulated data here is very simple, and what we have learned from the tour could also be learned from principal component analysis. However, if there are small complications, such as outliers or nonlinear relationships, that might not be visible from principal component analysis, the tour can help you to see them.\n\n@fig-plane-noise-outlier and @fig-outlier-nonlin-pdf(a) show example data with an outlier and @fig-outlier-nonlin-pdf(b) shows data with non-linear relationships. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for scatterplot matrix\"}\n# Add several outliers to the plane_noise data\nplane_noise_outliers <- plane_noise\nplane_noise_outliers[101,] <- c(2, 2, -2, 0, 0, 0, 0)\nplane_noise_outliers[102,] <- c(0, 0, 0,-2, -2, 0, 0)\n\nggscatmat(plane_noise_outliers, columns = 1:5) +\n theme(aspect.ratio=1,\n panel.background = \n element_rect(colour=\"black\", fill=NA),\n axis.text = element_blank(),\n axis.ticks = element_blank())\n```\n\n::: {.cell-output-display}\n![Scatterplot matrix of the plane with noise data, with two added outliers in variables with strong correlation.](3-intro-dimred_files/figure-pdf/fig-plane-noise-outlier-1.pdf){#fig-plane-noise-outlier fig-pos='H' fig-alt='A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlaton printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation, with an outlier in the corner of the plot. 
The remaining pairs of variables have no association, and thus also no outliers.' width=80%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-outlier-nonlin-html fig-align=\"center\" layout-ncol=2}\n\n![Outliers](gifs/pn_outliers.gif){#fig-outlier width=200 fig-alt=\"Animation showing scatterplots of 2D projections from 5D. The points sometimes appear to be a plane viewed from the side, with two single points further away. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\n![Non-linear relationship](gifs/plane_nonlin.gif){#fig-nonlinear width=200 fig-alt=\"Animation showing scatterplots of 2D projections from 5D. The points sometimes appear to be lying on a curve in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\nExamples of different types of dimensionality issues: outliers (a) and non-linearity (b). In (a) you can see two points far from the others in some projections. Also the two can be seen with different movement patterns -- moving faster and in different directions than the other points during the tour. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-outlier-nonlin-pdf fig-align=\"center\" layout-ncol=2}\n\n![Outliers](images/pn_outliers.png){#fig-outlier width=200}\n\n![Non-linear relationship](images/plane_nonlin.png){#fig-nonlinear width=200}\n\nTwo frames from tours of examples of different types of dimensionality issues: outliers (a) and non-linearity (b). 
In (a) you can see two points far from the others in the projection. During a tour the two can be seen with different movement patterns -- moving faster and in different directions than other points. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour. \n:::\n:::\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Exercises {-}\n\n1. Multicollinearity is when the predictors for a model are strongly linearly associated. It can adversely affect the fitting of most models, because many possible models may be equally good. Variable importance might be masked by correlated variables, and confidence intervals generated for linear models might be too wide. Check for multicollinearity or other associations between the predictors in:\n a. 2001 Australian election data\n b. 2016 Australian election data\n2. Examine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the `mvtnorm` package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?\n3. The following code shows how to hide a point in a four-dimensional space, so that it is not visible in any of the plots of two variables. Generate both `d` and `d_r` and confirm that the point is visible in a scatterplot matrix of `d`, but not in the scatterplot matrix of `d_r`. 
Also confirm that it is visible in both data sets when you use a tour.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tourr)\nlibrary(GGally)\nset.seed(946)\nd <- tibble(x1=runif(200, -1, 1), \n x2=runif(200, -1, 1), \n x3=runif(200, -1, 1))\nd <- d %>%\n mutate(x4 = x3 + runif(200, -0.1, 0.1))\n# outlier is visible in d\nd <- bind_rows(d, c(x1=0, x2=0, x3=-0.5, x4=0.5))\n\n# Point is hiding in d_r\nd_r <- d %>%\n mutate(x1 = cos(pi/6)*x1 + sin(pi/6)*x3,\n x3 = -sin(pi/6)*x1 + cos(pi/6)*x3,\n x2 = cos(pi/6)*x2 + sin(pi/6)*x4,\n x4 = -sin(pi/6)*x2 + cos(pi/6)*x4)\n```\n:::", + "markdown": "# Dimension reduction overview {#sec-dimension-overview}\n\nThis chapter sets up the concepts related to methods for reducing dimension such as principal component analysis (PCA) and t-stochastic neighbour embedding (t-SNE), and how the tour can be used to assist with these methods. \n\n## The meaning of dimension\n\nDimension is perceived in a tour using the spread of points. When the points are spread far apart, then the data is filling the space. Conversely when the points \"collapse\" into a sub-region then the data is only partially filling the space, and some dimension reduction to reduce to this smaller dimensional space may be worthwhile. \n\n::: {.content-visible when-format=\"html\"}\n::: info\nWhen points do not fill the plotting canvas fully, it means that it lives in a lower dimension. This low-dimensional space might be linear or non-linear, with the latter being much harder to define and capture.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{When points do not fill the plotting canvas fully, it means that it lives in a lower dimension. This low-dimensional space might be linear or non-linear, with the latter being much harder to define and capture.}\n:::\n\nLet's start with some 2D examples. You need at least two variables to be able to talk about association between variables. @fig-2D shows three plots of two variables. 
Plot (a) shows two variables that are strongly linearly associated[^correlation], because when `x1` is low, `x2` is low also, and conversely when `x1` is high, `x2` is also high. This can also be seen by the reduction in spread of points (or \"collapse\") in one direction making the data fill less than the full square of the plot. *So from this we can conclude that the data is not fully 2D.* The second step is to infer which variables contribute to this reduction in dimension. The axes for `x1` and `x2` are drawn extending from $(0,0)$ and because they both extend out of the cloud of points, in the direction away from the collapse of points we can say that they are jointly responsible for the dimension reduction. \n\n@fig-2D (b) shows a pair of variables that are **not** linearly associated. Variable `x1` is more varied than `x3` but knowing the value on `x1` tells us nothing about possible values on `x3`. Before running a tour all variables are typically scaled to have equal spread. The purpose of the tour is to capture association and relationships between the variables, so any univariate differences should be removed ahead of time. @fig-2D (c) shows what this would look like when `x3` is scaled - the points are fully spread in the full square of the plot. \n\n[^correlation]: It is generally better to use *associated* than *correlated*. Correlation is a statistical quantity, measuring linear association. The term *associated* can be prefaced with the type of association, such as *linear* or *non-linear*. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to produce 2D data examples\"}\nlibrary(tibble)\nset.seed(6045)\nx1 <- runif(123)\nx2 <- x1 + rnorm(123, sd=0.1)\nx3 <- rnorm(123, sd=0.2)\ndf <- tibble(x1 = (x1-mean(x1))/sd(x1), \n x2 = (x2-mean(x2))/sd(x2),\n x3, \n x3scaled = (x3-mean(x3))/sd(x3))\n```\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Explanation of how dimension reduction is perceived in 2D, relative to variables: (a) Two variables with strong linear association. Both variables contribute to the association, as indicated by their axes extending out from the 'collapsed' direction of the points; (b) Two variables with no linear association. But x3 has less variation, so points collapse in this direction; (c) The situation in plot (b) does not arise in a tour because all variables are (usually) scaled. When an axis extends out of a direction where the points are collapsed, it means that this variable is partially responsible for the reduced dimension.](3-intro-dimred_files/figure-pdf/fig-2D-1.pdf){#fig-2D fig-alt='Three scatterplots: (a) points lie close to a straight line in the x=y direction, (b) points lie close to a horizontal line, (c) points spread out in the full plot region. There are no axis labels or scales.' width=100%}\n:::\n:::\n\n\n\n## How to perceive the dimensionality using a tour\n\nNow let's think about what this looks like with five variables. @fig-dimension-pdf shows a grand tour on five variables, with (a) data that is primarily 2D, (b) data that is primarily 3D and (c) fully 5D data. You can see that in both (a) and (b) the spread of points collapses in some projections, with it happening more in (a). In (c) the data is always spread out in the square, although it does seem to concentrate or pile in the centre. This piling is typical when projecting from high dimensions to low dimensions. The sage tour [@sagetour] makes a correction for this. 
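The collapsing that signals reduced dimension can also be checked numerically. Here is a minimal sketch in base R (not the book's code; it uses `svd()` in place of a tour, and all names are illustrative): a 2D plane embedded in 5D has projection directions with almost no spread.

```r
# Sketch: build a 2D plane in 5D and show that 1D projections
# collapse (have near-zero spread) in some directions.
set.seed(614)
n <- 100
latent <- matrix(rnorm(2 * n), ncol = 2)        # 2D coordinates
A <- matrix(rnorm(10), nrow = 5)                # 5 x 2 embedding
X <- latent %*% t(A) +
  matrix(rnorm(5 * n, sd = 0.01), ncol = 5)     # small noise off the plane

v <- svd(X)$v          # right singular vectors: candidate directions
sd(X %*% v[, 1])       # large spread: a direction within the plane
sd(X %*% v[, 5])       # near zero: points collapse, so the data is not 5D
```

The tour shows the same thing visually: any frame where the points collapse into a line corresponds to a projection close to one of the near-zero-spread directions.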
\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-dimension-html fig-align=\"center\" layout-ncol=3}\n\n![2D plane in 5D](gifs/plane.gif){#fig-plane width=180 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see points collapsing into a thick straight line in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\n![3D plane in 5D](gifs/box.gif){#fig-box width=180 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see points collapsing into a thick straight line in various projections, but not as often as in the animation in (a). A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\n![5D plane in 5D](gifs/cube5d.gif){#fig-cube5 width=180 fig-alt=\"Animation of sequences of 2D projections shown as scatterplots. You can see points are always spread out fully in the plot space, in all projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\nDifferent dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. Notice that the 5D in 5D always fills out the box (although it does concentrate some in the middle which is typical when projecting from high to low dimensions). Also you can see that the 2D in 5D concentrates into a line more than the 3D in 5D. This suggests that it is lower dimensional. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-dimension-pdf layout-ncol=3}\n\n![2D plane in 5D](images/plane.png){#fig-plane width=160}\n\n![3D plane in 5D](images/box.png){#fig-box width=160}\n\n![5D plane in 5D](images/cube5d.png){#fig-cube5 width=160}\n\nSingle frames from different dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. 
Notice that the 5D in 5D always fills out the box (although it does concentrate some in the middle which is typical when projecting from high to low dimensions). Also you can see that the 2D in 5D concentrates into a line more than the 3D in 5D. This suggests that it is lower dimensional. {{< fa play-circle >}}\n:::\n:::\n\nThe next step is to determine which variables contribute. In the examples just provided, all variables are linearly associated in the 2D and 3D data. You can check this by making a scatterplot matrix, @fig-plane-scatmat.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for scatterplot matrix\"}\nlibrary(GGally)\nlibrary(mulgar)\ndata(plane)\nggscatmat(plane) +\n theme(panel.background = \n element_rect(colour=\"black\", fill=NA),\n axis.text = element_blank(),\n axis.ticks = element_blank())\n```\n\n::: {.cell-output-display}\n![Scatterplot matrix of plane data. You can see that x1-x3 are strongly linearly associated, and also x4 and x5. When you watch the tour of this data, any time the data collapses into a line you should see only (x1, x2, x3) or (x4, x5). When combinations of x1 and x4 or x5 show, the data should be spread out.](3-intro-dimred_files/figure-pdf/fig-plane-scatmat-1.pdf){#fig-plane-scatmat fig-pos='H' fig-alt='A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlation printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation. The remaining pairs of variables have no association.' 
width=80%}\n:::\n:::\n\n\n\nTo make an example where not all variables contribute, we have added two additional variables to the `plane` data set, which are purely noise.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Add two pure noise dimensions to the plane\nplane_noise <- plane\nplane_noise$x6 <- rnorm(100)\nplane_noise$x7 <- rnorm(100)\nplane_noise <- data.frame(apply(plane_noise, 2, \n function(x) (x-mean(x))/sd(x)))\nggduo(plane_noise, columnsX = 1:5, columnsY = 6:7, \n types = list(continuous = \"points\")) +\n theme(aspect.ratio=1,\n panel.background = \n element_rect(colour=\"black\", fill=NA),\n axis.text = element_blank(),\n axis.ticks = element_blank())\n```\n\n::: {.cell-output-display}\n![Scatterplots showing two additional noise variables that are not associated with any of the first five variables.](3-intro-dimred_files/figure-pdf/fig-plane-noise-scatter-1.pdf){#fig-plane-noise-scatter fig-pos='H' fig-alt='Two rows of scatterplots showing x6 and x7 against x1-x5. The points are spread out in the full plotting region, although x6 has one point with an unusually low value.' width=80%}\n:::\n:::\n\n\n\nNow we have 2D structure in 7D, but only five of the variables contribute to the 2D structure, that is, five of the variables are linearly related with each other. The other two variables (x6, x7) are not linearly related to any of the others. \n\nThe data is viewed with a grand tour in @fig-plane-noise-pdf. We can still see the concentration of points along a line in some dimensions, which tells us that the data is not fully 7D. Then if you look closely at the variable axes you will see that the collapsing to a line only occurs when any of x1-x5 contribute strongly in the direction orthogonal to this. This does not happen when x6 or x7 contribute strongly to a projection - the data is always expanded to fill much of the space. 
That tells us that x6 and x7 don't substantially contribute to the dimension reduction, that is, they are not linearly related to the other variables.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-plane-noise-html}\n\n\n\nGrand tour of the plane with two additional dimensions of pure noise. The collapsing of the points indicates that this is not fully 7D. This only happens when any of x1-x5 are contributing strongly (frame 49 x4, x5; frame 79 x1; frame 115 x2, x3). If x6 or x7 are contributing strongly the data is spread out fully (frames 27, 96). This tells us that x6 and x7 are not linearly associated, but other variables are.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-plane-noise-pdf layout-ncol=2}\n\n![](images/plane_noise1.png){width=200 fig-align=\"center\"}\n\n![](images/plane_noise2.png){width=200 fig-align=\"center\"}\n\nTwo frames from a grand tour of the plane with two additional dimensions of pure noise. The collapsing of the points indicates that this is not fully 7D. This only happens when any of x1-x5 are contributing strongly (frame 49 x4, x5; frame 79 x1; frame 115 x2, x3). If x6 or x7 are contributing strongly the data is spread out fully (frames 27, 96). This tells us that x6 and x7 are not linearly associated, but other variables are. {{< fa play-circle >}}\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nTo determine which variables are responsible for the reduced dimension look for the axes that extend out of the point cloud. These contribute to smaller variation in the observations, and thus indicate dimension reduction.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{To determine which variables are responsible for the reduced dimension look for the axes that extend out of the point cloud. 
These contribute to smaller variation in the observations, and thus indicate dimension reduction.}\n:::\n\nThe simulated data here is very simple, and what we have learned from the tour could also be learned from principal component analysis. However, if there are small complications, such as outliers or nonlinear relationships, that might not be visible from principal component analysis, the tour can help you to see them.\n\n@fig-plane-noise-outlier and @fig-outlier-nonlin-pdf(a) show example data with an outlier and @fig-outlier-nonlin-pdf(b) shows data with non-linear relationships. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for scatterplot matrix\"}\n# Add several outliers to the plane_noise data\nplane_noise_outliers <- plane_noise\nplane_noise_outliers[101,] <- c(2, 2, -2, 0, 0, 0, 0)\nplane_noise_outliers[102,] <- c(0, 0, 0,-2, -2, 0, 0)\n\nggscatmat(plane_noise_outliers, columns = 1:5) +\n theme(aspect.ratio=1,\n panel.background = \n element_rect(colour=\"black\", fill=NA),\n axis.text = element_blank(),\n axis.ticks = element_blank())\n```\n\n::: {.cell-output-display}\n![Scatterplot matrix of the plane with noise data, with two added outliers in variables with strong correlation.](3-intro-dimred_files/figure-pdf/fig-plane-noise-outlier-1.pdf){#fig-plane-noise-outlier fig-pos='H' fig-alt='A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlation printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation, with an outlier in the corner of the plot. The remaining pairs of variables have no association, and thus also no outliers.' 
width=80%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-outlier-nonlin-html fig-align=\"center\" layout-ncol=2}\n\n![Outliers](gifs/pn_outliers.gif){#fig-outlier width=200 fig-alt=\"Animation showing scatterplots of 2D projections from 5D. The points sometimes appear to be a plane viewed from the side, with two single points further away. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\n![Non-linear relationship](gifs/plane_nonlin.gif){#fig-nonlinear width=200 fig-alt=\"Animation showing scatterplots of 2D projections from 5D. The points sometimes appear to be lying on a curve in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.\"}\n\nExamples of different types of dimensionality issues: outliers (a) and non-linearity (b). In (a) you can see two points far from the others in some projections. Also the two can be seen with different movement patterns -- moving faster and in different directions than the other points during the tour. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-outlier-nonlin-pdf fig-align=\"center\" layout-ncol=2}\n\n![Outliers](images/pn_outliers.png){#fig-outlier width=200}\n\n![Non-linear relationship](images/plane_nonlin.png){#fig-nonlinear width=200}\n\nTwo frames from tours of examples of different types of dimensionality issues: outliers (a) and non-linearity (b). In (a) you can see two points far from the others in the projection. 
During a tour the two can be seen with different movement patterns -- moving faster and in different directions than other points. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour. {{< fa play-circle >}}\n:::\n:::\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Exercises {-}\n\n1. Multicollinearity is when the predictors for a model are strongly linearly associated. It can adversely affect the fitting of most models, because many possible models may be equally good. Variable importance might be masked by correlated variables, and confidence intervals generated for linear models might be too wide. Check for multicollinearity or other associations between the predictors in:\n a. 2001 Australian election data\n b. 2016 Australian election data\n2. Examine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the `mvtnorm` package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?\n3. The following code shows how to hide a point in a four-dimensional space, so that it is not visible in any of the plots of two variables. Generate both `d` and `d_r` and confirm that the point is visible in a scatterplot matrix of `d`, but not in the scatterplot matrix of `d_r`. 
Also confirm that it is visible in both data sets when you use a tour.\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tourr)\nlibrary(GGally)\nset.seed(946)\nd <- tibble(x1=runif(200, -1, 1), \n x2=runif(200, -1, 1), \n x3=runif(200, -1, 1))\nd <- d %>%\n mutate(x4 = x3 + runif(200, -0.1, 0.1))\n# outlier is visible in d\nd <- bind_rows(d, c(x1=0, x2=0, x3=-0.5, x4=0.5))\n\n# Point is hiding in d_r\nd_r <- d %>%\n mutate(x1 = cos(pi/6)*x1 + sin(pi/6)*x3,\n x3 = -sin(pi/6)*x1 + cos(pi/6)*x3,\n x2 = cos(pi/6)*x2 + sin(pi/6)*x4,\n x4 = -sin(pi/6)*x2 + cos(pi/6)*x4)\n```\n:::", "supporting": [ "3-intro-dimred_files/figure-pdf" ], diff --git a/_freeze/3-intro-dimred/figure-html/fig-2D-1.png b/_freeze/3-intro-dimred/figure-html/fig-2D-1.png new file mode 100644 index 0000000..8789933 Binary files /dev/null and b/_freeze/3-intro-dimred/figure-html/fig-2D-1.png differ diff --git a/_freeze/3-intro-dimred/figure-html/fig-plane-noise-outlier-1.png b/_freeze/3-intro-dimred/figure-html/fig-plane-noise-outlier-1.png new file mode 100644 index 0000000..520d1cf Binary files /dev/null and b/_freeze/3-intro-dimred/figure-html/fig-plane-noise-outlier-1.png differ diff --git a/_freeze/3-intro-dimred/figure-html/fig-plane-noise-scatter-1.png b/_freeze/3-intro-dimred/figure-html/fig-plane-noise-scatter-1.png new file mode 100644 index 0000000..9d55475 Binary files /dev/null and b/_freeze/3-intro-dimred/figure-html/fig-plane-noise-scatter-1.png differ diff --git a/_freeze/3-intro-dimred/figure-html/fig-plane-scatmat-1.png b/_freeze/3-intro-dimred/figure-html/fig-plane-scatmat-1.png new file mode 100644 index 0000000..fc9846e Binary files /dev/null and b/_freeze/3-intro-dimred/figure-html/fig-plane-scatmat-1.png differ diff --git a/_freeze/3-intro-dimred/figure-pdf/fig-2D-1.pdf b/_freeze/3-intro-dimred/figure-pdf/fig-2D-1.pdf index c6b244f..b26e405 100644 Binary files a/_freeze/3-intro-dimred/figure-pdf/fig-2D-1.pdf and 
b/_freeze/3-intro-dimred/figure-pdf/fig-2D-1.pdf differ diff --git a/_freeze/3-intro-dimred/figure-pdf/fig-plane-noise-outlier-1.pdf b/_freeze/3-intro-dimred/figure-pdf/fig-plane-noise-outlier-1.pdf index bc6776c..821cb81 100644 Binary files a/_freeze/3-intro-dimred/figure-pdf/fig-plane-noise-outlier-1.pdf and b/_freeze/3-intro-dimred/figure-pdf/fig-plane-noise-outlier-1.pdf differ diff --git a/_freeze/3-intro-dimred/figure-pdf/fig-plane-noise-scatter-1.pdf b/_freeze/3-intro-dimred/figure-pdf/fig-plane-noise-scatter-1.pdf index e8eb628..3ded153 100644 Binary files a/_freeze/3-intro-dimred/figure-pdf/fig-plane-noise-scatter-1.pdf and b/_freeze/3-intro-dimred/figure-pdf/fig-plane-noise-scatter-1.pdf differ diff --git a/_freeze/3-intro-dimred/figure-pdf/fig-plane-scatmat-1.pdf b/_freeze/3-intro-dimred/figure-pdf/fig-plane-scatmat-1.pdf index 871b5dd..312b468 100644 Binary files a/_freeze/3-intro-dimred/figure-pdf/fig-plane-scatmat-1.pdf and b/_freeze/3-intro-dimred/figure-pdf/fig-plane-scatmat-1.pdf differ diff --git a/_freeze/4-pca/execute-results/html.json b/_freeze/4-pca/execute-results/html.json new file mode 100644 index 0000000..bfa5790 --- /dev/null +++ b/_freeze/4-pca/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "8cdc12c7c8dae022bd6f2aae3ce978c9", + "result": { + "engine": "knitr", + "markdown": "## Principal component analysis \n\\index{dimension reduction!principal component analysis (PCA)}\n\nReducing dimensionality using principal component analysis (PCA) dates back to @pearson-pca and @hotelling-pca, and @joliffe2016 provides a current overview. The goal is to find a smaller set of variables, $q (< p)$, that contain as much information as the original as possible. The new set of variables, known as principal components (PCs), are linear combinations of the original variables. The PCs can be used to represent the data in a lower-dimensional space.\n\nThe process is essentially an optimisation procedure, although PCA has an analytical solution. 
It solves the problem of \n\n$$\n\max_{a_k} ~\text{Var} (Xa_k), \quad \text{subject to}~ a_k^{\top}a_k = 1, ~a_k^{\top}a_j = 0 ~\text{for}~ j < k,\n$$\nwhere $X$ is the $n \times p$ data matrix, $a_k (k=1, ..., p)$ is a 1D projection vector, called an eigenvector, and $\text{Var} (Xa_k)$ is the corresponding eigenvalue. (The normalisation constraint on $a_k$ is needed, because otherwise the variance could be made arbitrarily large.) PCA is a sequential process: it finds the direction in the high-dimensional space (given by the first eigenvector) where the data is most varied, then the second most varied direction, and so on. The eigenvectors define the combinations of the original variables, and the eigenvalues give the amount of variance explained by the reduced number of variables.\n\index{dimension reduction!eigenvalue}\n\index{dimension reduction!eigenvector}\n\nPCA is very broadly useful for summarising linear association by using combinations of variables that are highly correlated. However, high correlation can also occur when there are outliers, or clustering. PCA is commonly used to detect these patterns also.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nWith visualisation we want to assess whether it is appropriate to use PCA to summarise any linear association by using combinations of variables that are highly correlated. It can help to detect other patterns that might affect the PCA results such as outliers, clustering or non-linear dependence. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\infobox{With visualisation we want to assess whether it is appropriate to use PCA to summarise any linear association by using combinations of variables that are highly correlated. It can help to detect other patterns that might affect the PCA results such as outliers, clustering or non-linear dependence.}\n:::\n\n\nPCA is not very effective when the distribution of the variables is highly skewed, so it can be helpful to transform variables to make them more symmetrically distributed before conducting PCA. 
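The optimisation view of PCA can be checked numerically. A small sketch in base R (not the book's code; the data and names are illustrative): the first eigenvector of the sample variance-covariance matrix maximises $\text{Var}(Xa)$, the maximum equals the first eigenvalue, and `prcomp()` returns the same direction up to sign.

```r
# Sketch: PCA via the eigendecomposition of the covariance matrix.
set.seed(1)
X <- matrix(rnorm(300), ncol = 3)
X[, 2] <- X[, 1] + 0.2 * X[, 2]   # induce strong correlation

e <- eigen(cov(X))
a1 <- e$vectors[, 1]              # first eigenvector
drop(var(X %*% a1))               # variance of the projection = e$values[1]

p <- prcomp(X)                    # prcomp agrees with eigen(cov(X)),
abs(p$rotation[, 1]) - abs(a1)    # up to the sign of each eigenvector
```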
It is also possible to summarise different types of structure by generalising the optimisation criterion to any function of projected data, $f(XA)$, which is called *projection pursuit* (PP). PP has a long history (@Kr64a, @FT74, @DF84, @JS87, @Hu85), and there are regularly new developments (e.g. @Lee2009, @perisic2009, @Lee2013, @loperfido, @bickel2018, @zhang2023). \n\n## Determining how many dimensions\n\nWe would start by examining the data using a grand tour. The goal is to check whether there might be potential issues for PCA, such as skewness, outliers or clustering, or even non-linear dependencies.\n\nWe'll start by showing PCA on the simulated data from @sec-dimension-overview. The scree plots show that PCA supports the conclusion that the data are 2D, 3D and 5D respectively.\n\n\index{dimension reduction!scree plot}\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(mulgar)\ndata(plane)\ndata(box)\nlibrary(geozoo)\ncube5d <- data.frame(cube.solid.random(p=5, n=300)$points)\ncolnames(cube5d) <- paste0(\"x\", 1:5)\ncube5d <- data.frame(apply(cube5d, 2, \n function(x) (x-mean(x))/sd(x)))\np_pca <- prcomp(plane)\nb_pca <- prcomp(box)\nc_pca <- prcomp(cube5d)\np_scree <- ggscree(p_pca, q = 5) + theme_minimal()\n\nb_scree <- ggscree(b_pca, q = 5) + theme_minimal()\nc_scree <- ggscree(c_pca, q = 5) + theme_minimal()\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Scree plots for the three simulated data sets shown in Figure 3.2. The 2D in 5D is clearly recognised by PCA to be 2D because the variance drops substantially between 2-3 principal components. The 3D in 5D is possibly 3D because the variance drops from 3-4 principal components. 
The fully 5D data has no drop in variance, and all values are close to the typical value one would observe if the data was fully 5D.](4-pca_files/figure-html/fig-2D-pca-1.png){#fig-2D-pca width=100%}\n:::\n:::\n\n\n\nThe next step is to look at the coefficients for the selected number of PCs. @tbl-plane-pcs shows the coefficients for the first two PCs of the `plane` data. All five variables contribute, with `x1`, `x2`, `x3` contributing more to `PC1`, and `x4`, `x5` contributing more to `PC2`. @tbl-box-pcs shows the coefficients for the first three PCs of the `box` data. Variables `x1`, `x2`, `x3` contribute strongly to `PC1`, `PC2` has contributions from all variables except `x3`, and variables `x4` and `x5` contribute strongly to `PC3`. \n\n\index{dimension reduction!coefficients}\n\index{dimension reduction!principal components}\n\n\n\n::: {#tbl-plane-pcs .cell tbl-cap='Coefficients for the first two PCs for the plane data.'}\n\n```{.r .cell-code code-summary=\"Code to print PC coefficients\"}\nlibrary(gt)\np_pca$rotation[,1:2] %>%\n as_tibble(rownames=\"Variable\") %>% \n gt() %>%\n fmt_number(columns = c(PC1, PC2),\n decimals = 2)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr><th>Variable</th><th>PC1</th><th>PC2</th></tr>\n</thead>\n<tbody>\n<tr><td>x1</td><td>0.58</td><td>−0.06</td></tr>\n<tr><td>x2</td><td>−0.55</td><td>0.21</td></tr>\n<tr><td>x3</td><td>0.47</td><td>−0.41</td></tr>\n<tr><td>x4</td><td>0.25</td><td>0.64</td></tr>\n<tr><td>x5</td><td>−0.29</td><td>−0.62</td></tr>\n</tbody>\n</table>
\n```\n\n:::\n:::\n\n::: {#tbl-box-pcs .cell tbl-cap='Coefficients for the first three PCs for the box data.'}\n\n```{.r .cell-code code-summary=\"Code to print PC coefficients\"}\nb_pca$rotation[,1:3] %>%\n as_tibble(rownames=\"Variable\") %>% \n gt() %>%\n fmt_number(columns = c(PC1, PC2, PC3),\n decimals = 2)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr><th>Variable</th><th>PC1</th><th>PC2</th><th>PC3</th></tr>\n</thead>\n<tbody>\n<tr><td>x1</td><td>−0.51</td><td>0.46</td><td>0.11</td></tr>\n<tr><td>x2</td><td>0.51</td><td>0.46</td><td>0.00</td></tr>\n<tr><td>x3</td><td>−0.65</td><td>−0.09</td><td>0.23</td></tr>\n<tr><td>x4</td><td>−0.22</td><td>0.36</td><td>−0.87</td></tr>\n<tr><td>x5</td><td>0.02</td><td>0.66</td><td>0.43</td></tr>\n</tbody>\n</table>
\n```\n\n:::\n:::\n\n\nIn each of these simulated data sets, all five variables contributed to the dimension reduction. If we added two purely noise variables to the plane data, as done in @sec-dimension-overview, the scree plot would indicate that the data is now 4D, and we would get a different interpretation of the coefficients from the PCA. We see that `PC1` and `PC2` are approximately the same as before, with main variables being (`x1`, `x2`, `x3`) and (`x4`, `x5`) respectively. `PC3` and `PC4` are combinations of `x6` and `x7` only. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nset.seed(5143)\nplane_noise <- plane\nplane_noise$x6 <- rnorm(100)\nplane_noise$x7 <- rnorm(100)\nplane_noise <- data.frame(apply(plane_noise, 2, function(x) (x-mean(x))/sd(x)))\n\npn_pca <- prcomp(plane_noise)\nggscree(pn_pca, q = 7) + theme_minimal()\n```\n\n::: {.cell-output-display}\n![Additional noise variables expand the data to 4D.](4-pca_files/figure-html/fig-plane-noise-scree-1.png){#fig-plane-noise-scree width=80%}\n:::\n:::\n\n::: {#tbl-plane-noise-pcs .cell tbl-cap='Coefficients for the first four PCs for the plane plus noise data.'}\n\n```{.r .cell-code code-summary=\"Code to print PC coefficients\"}\npn_pca$rotation[,1:4] %>%\n as_tibble(rownames=\"Variable\") %>% \n gt() %>%\n fmt_number(columns = c(PC1, PC2, PC3, PC4),\n decimals = 2)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
Variable    PC1    PC2    PC3    PC4
x1          0.58   0.04   0.01   0.00
x2         −0.55  −0.18  −0.03   0.07
x3          0.47   0.37   0.05  −0.20
x4          0.24  −0.62  −0.06   0.17
x5         −0.28   0.60   0.07  −0.14
x6          0.05   0.29  −0.58   0.76
x7         −0.02  −0.08  −0.81  −0.58
\n
\n```\n\n:::\n:::\n\n\n### Example: pisa\n\\index{data!pisa}\n\nThe `pisa` data contains simulated data from math, reading and science scores, totalling 30 variables. PCA is used here to examine the association between the scores. We might expect that it is 3D, but what we see suggests it is primarily 1D. This means that a student who scores well in math will also score well in reading and science. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndata(pisa)\npisa_std <- pisa %>%\n filter(CNT == \"Australia\") %>%\n select(-CNT) %>%\n mutate_all(mulgar:::scale2)\npisa_pca <- prcomp(pisa_std)\npisa_scree <- ggscree(pisa_pca, q = 15) + theme_minimal()\n```\n:::\n\n\nThe scree plot in @fig-pisa-pca-html shows a big drop from one to two PCs in the amount of variance explained. A grand tour on the 30 variables can be run using `animate_xy()`: \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nanimate_xy(pisa_std, half_range=1)\n```\n:::\n\n\nor rendered as an animated gif using `render_gif()`:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nrender_gif(pisa_std, \n grand_tour(), \n display_xy(half_range=0.9),\n gif_file=\"gifs/pisa_gt.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\n```\n:::\n\n\nand we can see that the data is elliptical in most projections, sometimes shrinking to be a small circle. This pattern strongly indicates that there is one primary direction of variation in the data, with only small variation in any direction away from it. 
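The idea of one primary direction of variation can be sketched numerically: project points orthogonally onto a unit vector and track the variance of the projection as the direction rotates, which is what a 1D tour displays. Below is a minimal illustrative sketch with simulated data (in Python with numpy, not the book's R code):

```python
import numpy as np

# Illustrative sketch (not the book's R code): project 2D points
# orthogonally onto a unit direction and measure the variance captured,
# mimicking what a 1D tour shows as the projection line rotates.
rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 500)
# Elongated cloud: almost all variation along the 45-degree direction
X = np.column_stack([x1, x1 + rng.normal(0, 0.1, 500)])
X = X - X.mean(axis=0)

def projected_variance(X, theta):
    """Variance of the orthogonal projection onto direction theta."""
    a = np.array([np.cos(theta), np.sin(theta)])  # unit vector
    return np.var(X @ a)

v_along = projected_variance(X, np.pi / 4)       # along the main axis
v_across = projected_variance(X, 3 * np.pi / 4)  # perpendicular to it
# Like a pencil: long from one angle, tiny from another
print(v_along > 50 * v_across)
```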
Shrinking to the small circle is analogous to how *a pencil or cigar or water bottle in 3D looks from some angles*.\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-pisa-pca-html fig-align=\"center\" layout-ncol=2}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Scree plot for the PCA on the pisa data suggests that the data is 1D.](4-pca_files/figure-html/unnamed-chunk-10-1.png){width=80%}\n:::\n:::\n\n\n![Grand tour of the pisa data.](gifs/pisa_gt.gif){#fig-pisa-gt fig-alt=\"Tour showing lots of linear projections of the pisa data. You can see strong linear dependence.\" fig.align=\"center\"}\n\nScree plot and tour of the `pisa` data, with 30 variables being the plausible scores for Australian students.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-pisa-pca-pdf fig-align=\"center\" layout-ncol=2}\n\n![](images/fig-pisa-scree-1.png){fig-alt=\"FIX ME\"}\n\n![Grand tour of the pisa data.](images/pisa_gt_249.png){#fig-pisa-gt fig-alt=\"Selected linear projection of the pisa data from a grand tour. You can see strong linear dependence.\" fig.align=\"center\"}\n\nScree plot and tour of the `pisa` data, with 30 variables being the plausible scores for Australian students.\n:::\n:::\n\nThe coefficients of the first PC (first eigenvector) are roughly equal in magnitude (as shown below), which tells us that all variables contribute roughly equally. Interestingly, they are all negative, which is not actually meaningful. With different software these could easily have been all positive. The sign of the coefficients can be reversed, as long as all are reversed; it is like an arrow pointing one way being flipped to point the other way. 
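This sign indeterminacy can be checked directly: flipping an eigenvector also flips its scores, so the fitted approximation to the data is untouched. A minimal numeric sketch with simulated data (Python, illustrative only; the book's code is R):

```python
import numpy as np

# Illustrative sketch: the sign of a principal component is arbitrary.
# Flipping an eigenvector also flips its scores, so the fitted
# approximation of the data is unchanged.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)

# PCA via eigendecomposition of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
v = evecs[:, -1]           # first PC direction (largest eigenvalue)

scores = X @ v             # scores for v
scores_flipped = X @ (-v)  # scores for the reversed arrow

# Rank-1 reconstructions agree exactly
same = np.allclose(np.outer(scores, v), np.outer(scores_flipped, -v))
print(same)
```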
\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to print PC coefficients\"}\nround(pisa_pca$rotation[,1], 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n PV1MATH PV2MATH PV3MATH PV4MATH PV5MATH PV6MATH PV7MATH PV8MATH \n -0.18 -0.18 -0.18 -0.18 -0.18 -0.18 -0.18 -0.18 \n PV9MATH PV10MATH PV1READ PV2READ PV3READ PV4READ PV5READ PV6READ \n -0.18 -0.18 -0.19 -0.18 -0.19 -0.19 -0.19 -0.19 \n PV7READ PV8READ PV9READ PV10READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE \n -0.19 -0.19 -0.19 -0.19 -0.18 -0.18 -0.19 -0.18 \n PV5SCIE PV6SCIE PV7SCIE PV8SCIE PV9SCIE PV10SCIE \n -0.19 -0.18 -0.19 -0.18 -0.19 -0.18 \n```\n\n\n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe tour verifies that the `pisa` data is primarily 1D, indicating that a student who scores well in math probably scores well in reading and science, too. More interestingly, the regular shape of the data strongly indicates that it is \"synthetic\", simulated rather than observed.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{The tour verifies that the `pisa` data is primarily 1D, indicating that a student who scores well in math probably scores well in reading and science, too. More interestingly, the regular shape of the data strongly indicates that it is \"synthetic\", simulated rather than observed.}\n:::\n\n### Example: aflw\n\\index{data!aflw}\n\nThis data has player statistics for all the matches in the 2021 season. We would be interested to know which variables contain similar information, and thus might be combined into single variables. We would expect many statistics to group into a few small sets, such as offensive and defensive skills. We might also expect that some of the statistics are skewed: most players have low values and just a handful of players are stellar. It is also possible that there are some extreme values. These are interesting features, but they will distract from the main purpose of grouping the statistics. 
Thus the tour is used to check for potential problems with the data prior to conducting PCA.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(tourr)\ndata(aflw)\naflw_std <- aflw %>%\n mutate_if(is.numeric, function(x) (x-\n mean(x, na.rm=TRUE))/\n sd(x, na.rm=TRUE))\n```\n:::\n\n\nWe look at all of the 29 player statistics in a grand tour, shown in @fig-aflw-gt-html.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate tour\"}\nanimate_xy(aflw_std[,7:35], half_range=0.9)\nrender_gif(aflw_std[,7:35], \n grand_tour(), \n display_xy(half_range=0.9),\n gif_file=\"gifs/aflw_gt.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-aflw-gt-html}\n\n![](gifs/aflw_gt.gif){fig-alt=\"Tour showing lots of linear projections of the aflw data. You can see linear dependence, and some outliers.\" fig.align=\"center\"}\n\nGrand tour of the AFLW player statistics.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-aflw-gt-pdf fig.align=\"center\" fig-alt=\"Example linear projection of the aflw data from a grand tour. You can see linear dependence, and some outliers.\"}\n\n![](images/aflw_gt_70.png){}\n\nGrand tour of the AFLW player statistics.\n:::\n:::\n\nNo major surprises! There is a small amount of skewness, and there are no major outliers. Skewness indicates that most players have reasonably similar skills (bunching of points), except for some key players (the moderate outliers). The skewness could be reduced by applying a log or square root transformation to some variables prior to running the PCA. However, we elect not to do this because the moderate outliers are of interest. These correspond to talented players that we'd like to explore further with the analysis.\n\nBelow we have the conventional summary of the PCA, a scree plot showing the reduction in variance to be explained when each additional PC is considered. 
It is also conventional to look at a table summarising the proportions of variance explained by PCs, but with almost 30 variables it is easier to make a decision on the number of PCs needed based on the scree plot.\n\n\n::: {.cell alt-text='Scree plot showing variance vertically against PC number horizontally. Variance drops from close to 10 for PC 1 to about 1.2 for PC 4 then slowly decays through to PC 29'}\n\n```{.r .cell-code code-summary=\"Code to make screeplot\"}\naflw_pca <- prcomp(aflw_std[,7:35], \n scale = FALSE, \n retx=TRUE)\n\nggscree(aflw_pca, q = 29) + theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scree plot showing decay in variance of PCs.](4-pca_files/figure-html/fig-aflw-pca-1.png){#fig-aflw-pca width=80%}\n:::\n:::\n\n\\index{dimension reduction!scree plot}\n\nFrom the scree plot in @fig-aflw-pca, we see sharp drops from one to two PCs and from two to three, followed by smaller drops. After four PCs the variance drops a little again at six PCs, and then gradually decays. We will choose four PCs to examine more closely. This explains 67.2% of the variance.\n\n\n::: {#tbl-aflw-pcs .cell tbl-cap='Coefficients for the first four PCs.'}\n\n```{.r .cell-code code-summary=\"Code to print PC coefficients\"}\nlibrary(gt)\naflw_pca$rotation[,1:4] %>%\n as_tibble(rownames=\"Variable\") %>% \n arrange(desc(PC1), desc(PC2), desc(PC3)) %>%\n gt() %>%\n fmt_number(columns = c(PC1, PC2, PC3, PC4),\n decimals = 2)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
Variable           PC1    PC2    PC3    PC4
disposals          0.31  −0.05  −0.03   0.07
possessions        0.31  −0.03  −0.07   0.09
kicks              0.29  −0.04   0.09  −0.12
metres             0.28  −0.03   0.10  −0.15
contested          0.28   0.01  −0.12   0.23
uncontested        0.28  −0.06  −0.01  −0.05
turnovers          0.27  −0.01  −0.01  −0.29
clearances         0.23   0.00  −0.29   0.19
clangers           0.23  −0.02  −0.06  −0.33
handballs          0.23  −0.04  −0.19   0.31
frees_for          0.21   0.02  −0.13   0.18
marks              0.21   0.03   0.32   0.02
tackles            0.20   0.01  −0.28   0.09
time_pct           0.16  −0.04   0.35  −0.02
intercepts         0.13  −0.28   0.24   0.03
rebounds_in50      0.13  −0.28   0.24  −0.06
frees_against      0.13   0.03  −0.16  −0.23
assists            0.09   0.23   0.00   0.05
bounces            0.09   0.03   0.02  −0.28
behinds            0.09   0.32   0.08  −0.02
shots              0.08   0.38   0.12  −0.03
tackles_in50       0.07   0.27  −0.18   0.03
marks_in50         0.06   0.34   0.18   0.04
contested_marks    0.05   0.16   0.34   0.15
goals              0.04   0.37   0.16   0.03
accuracy           0.04   0.34   0.10   0.06
one_pct            0.03  −0.21   0.33   0.08
disposal           0.02  −0.13   0.20   0.50
hitouts           −0.04   0.00  −0.03   0.32
\n
\n```\n\n:::\n:::\n\n\nWhen there are as many variables as this, it can be hard to digest the combinations of variables most contributing to each PC. Rearranging the table by sorting on a selected PC can help. @tbl-aflw-pcs has been sorted according to the PC 1 coefficients.\n\nPC 1 is primarily composed of `disposals`, `possessions`, `kicks`, `metres`, `uncontested`, `contested`, .... Actually, almost all variables contribute positively, albeit in different amounts! It is quite common in PCA for the first PC to be a combination of all variables, although the contributions are usually closer to equal, and it tells us that there is one main direction of variation in the data. For PC 1 in the `aflw` data, PCA is telling us that the primary variation is through a combination of skills, and this maps to basic football playing skills, where some skills (e.g. disposals, possessions, kicks, ...) are more important.\n\nThus the second PC might be more interesting. PC 2 is primarily a combination of `shots`, `goals`, `marks_in50`, `accuracy`, and `behinds` contrasted against `rebounds_in50` and `intercepts`. The positive coefficients are primarily offensive skills and the negative coefficients are defensive skills. This PC is a reasonable measure of the offensive vs defensive skills of a player.\n\n\\index{dimension reduction!interpretation}\n\nWe would continue to interpret each PC by examining large coefficients to help decide how many PCs are a suitable summary of the information in the data. Briefly, PC 3 is a measure of the worth of the player because `time_pct` has a large coefficient, so players that are on the field longer will contribute strongly to this new variable. It also has large (and opposite) contributions from `clearances`, `tackles`, `contested_marks`. PC 4 appears to be related to aggressive play with `clangers`, `turnovers`, `bounces` and `frees_against` featuring. So all four PCs have useful information. 
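Sorting a table of loadings, as done for @tbl-aflw-pcs, is easy to script. A small sketch using a handful of the PC 2 coefficients from the table above (Python with numpy, illustrative only; the book's code is R):

```python
import numpy as np

# Illustrative sketch: sorting loadings on one PC to aid interpretation.
# A handful of PC 2 coefficients taken from the table above.
variables = ["disposals", "shots", "intercepts", "goals", "rebounds_in50"]
pc2 = np.array([-0.05, 0.38, -0.28, 0.37, -0.28])

order = np.argsort(-pc2)  # descending by coefficient
ranked = [variables[i] for i in order]
print(ranked)
# Offensive skills (shots, goals) sit at one end, defensive skills
# (intercepts, rebounds_in50) at the other.
```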
(Note, if we had continued to examine large coefficients on PC 5 we would find that all variables already have had reasonably large coefficients on PC 1-4, which supports restricting attention to the first four.)\n\n\n\nIdeally, when we tour the four PCs, we'd like to be able to stop and identify players. This involves creating a pre-computed animation, with additional mouse-over. This is only feasible with a small number of observations, like the `aflw` data, because all of the animation frames are constructed in a single object and passed to `plotly`. This object gets large very quickly!\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Code to make tour animation\"}\nlibrary(plotly)\nlibrary(htmlwidgets)\nset.seed(20)\nb <- basis_random(4, 2)\naflw_pct <- tourr::save_history(aflw_pca$x[,1:4], \n tour_path = grand_tour(),\n start = b,\n max_bases = 5)\n# To reconstruct projected data plots, later\nsave(aflw_pct, file=\"data/aflw_pct.rda\") \naflw_pcti <- interpolate(aflw_pct, 0.1)\naflw_anim <- render_anim(aflw_pca$x[,1:4],\n frames=aflw_pcti, \n obs_labels=paste0(aflw$surname,\n aflw$given_name))\n\naflw_gp <- ggplot() +\n geom_path(data=aflw_anim$circle, \n aes(x=c1, y=c2,\n frame=frame), linewidth=0.1) +\n geom_segment(data=aflw_anim$axes, \n aes(x=x1, y=y1, \n xend=x2, yend=y2, \n frame=frame), \n linewidth=0.1) +\n geom_text(data=aflw_anim$axes, \n aes(x=x2, y=y2, \n frame=frame, \n label=axis_labels), \n size=5) +\n geom_point(data=aflw_anim$frames, \n aes(x=P1, y=P2, \n frame=frame, \n label=obs_labels), \n alpha=0.8) +\n xlim(-1,1) + ylim(-1,1) +\n coord_equal() +\n theme_bw() +\n theme(axis.text=element_blank(),\n axis.title=element_blank(),\n axis.ticks=element_blank(),\n panel.grid=element_blank())\naflw_pctour <- ggplotly(aflw_gp,\n width=500,\n height=550) %>%\n animation_button(label=\"Go\") %>%\n animation_slider(len=0.8, x=0.5,\n xanchor=\"center\") %>%\n animation_opts(easing=\"linear\", transition = 
0)\n\nhtmlwidgets::saveWidget(aflw_pctour,\n file=\"html/aflw_pca.html\",\n selfcontained = TRUE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-aflw-pcatour}\n\n\n\nAnimation of four PCs of the aflw data with interactive labelling.\n:::\n:::\n\n\n::: {.cell}\n\n:::\n\n\n::: {.content-visible when-format=\"html\"}\nFrom @fig-aflw-pcatour the shape of the four PCs is similar to that of all the variables, bunching of points in the centre with a lot of moderate outliers.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate interactive plot of frame 18\"}\nlibrary(plotly)\nload(\"data/aflw_pct.rda\")\naflw_pcti <- interpolate(aflw_pct, 0.1)\nf18 <- matrix(aflw_pcti[,,18], ncol=2)\np18 <- render_proj(aflw_pca$x[,1:4], f18, \n obs_labels=paste0(aflw$surname,\n aflw$given_name))\npg18 <- ggplot() +\n geom_path(data=p18$circle, aes(x=c1, y=c2)) +\n geom_segment(data=p18$axes, aes(x=x1, y=y1, xend=x2, yend=y2)) +\n geom_text(data=p18$axes, aes(x=x2, y=y2, label=rownames(p18$axes))) +\n geom_point(data=p18$data_prj, aes(x=P1, y=P2, label=obs_labels)) +\n xlim(-1,1) + ylim(-1, 1) +\n #ggtitle(\"Frame 18\") +\n theme_bw() +\n theme(\n axis.text=element_blank(),\n axis.title=element_blank(),\n axis.ticks=element_blank(),\n panel.grid=element_blank())\nggplotly(pg18, width=500, height=500)\n```\n\n::: {#fig-aflw-pcaplots .cell-output-display}\n\n```{=html}\n
\n\n```\n\n\nFrame 18 re-plotted so that players can be identified on mouse-over.\n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\nFor any particular frame, like 18 re-plotted in @fig-aflw-pcaplots, we can investigate further. Here there is a branching pattern, where the branch points in the direction of PC 1. Mouse-over the players at the tip of this branch and we find players like Alyce Parker, Brittany Bonnici, Dana Hooker, Kiara Bowers. If you look up the bios of these players you'll find they all have generally good player descriptions like \"elite disposals\", \"powerful left foot\", \"hard-running midfielder\", \"best and fairest\".\n:::\n\nIn the direction of PC 2, you'll find players like Lauren Ahrens, Stacey Livingstone who are star defenders. Players at this end of PC 2 have high scores on `intercepts` and `rebounds_in50`.\n\nAnother interesting frame for inspecting PC 2 is 59. PC 2 at one end has players with high goal scoring skills, and at the other, good defensive skills. So mousing over the other end of PC 2 finds players like Gemma Houghton and Katie Brennan who are known for their goal scoring. The branch pattern is an interesting one, because it tells us there is some combination of skills that is lacking among all players; primarily there appears to be some distinction between defensive skills and general playing skills. It's not as simple as this because the branching is only visible when PC 1 and PC 2 are examined with PC 3.\n\nPCA is useful for getting a sense of the variation in a high-dimensional data set. Interpreting the principal components is often useful, but it can be discombobulating. For the `aflw` data it would be good to think about it as a guide to the main directions of variation and to follow with a more direct engineering of variables into interesting player characteristics. For example, calculate offensive skill as an equal combination of goals, accuracy, shots, behinds. 
A set of new variables specifically computed to measure particular skills would make explaining an analysis easier.\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe tour verifies that PCA on the `aflw` data is complicated and doesn't capture all of the variation. However, it does provide useful insights. It detected outstanding players, and indicated the different skill sets of top goal scorers and top defensive players.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{The tour verifies that PCA on the `aflw` data is complicated and doesn't capture all of the variation. However, it does provide useful insights. It detected outstanding players, and indicated the different skill sets of top goal scorers and top defensive players.}\n:::\n\n## Examining the PCA model in the data space\n\\index{model-in-the-data-space}\n\nWhen you choose a smaller number of PCs $(k)$ than the number of original variables, this is essentially producing a model for the data. The model is the lower dimensional $k$-D space. It is analogous to a linear regression model, except that the residuals from the model are $(p-k)$-D. \n\nIt is common to show the model, that is, the data projected into the $k$-D model space. When $k=2$ this is called a \"biplot\". For the `plane` and `plane_noise` data the biplots are shown in @fig-plane-biplot. This is useful for checking which variables contribute most to the new principal component variables, and also to check for any problems that might have affected the fit, such as outliers, clusters or non-linearity. Interestingly, biplots are typically only made in 2D, even if the data should be summarised by more than two PCs. Occasionally you will see the biplot made for PC $j$ vs PC $k$ also. With the `pca_tour()` function in the `tourr` package you can view a $k$-D biplot. This will display the $k$ PCs with the axes displaying the original variables, and thus see their contribution to the PCs. 
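The $k$-D model and its $(p-k)$-D residuals can be computed directly: project each observation onto the span of the first $k$ eigenvectors and subtract. A minimal sketch with simulated data in the spirit of `plane` (Python with numpy, illustrative only; the book uses R):

```python
import numpy as np

# Illustrative sketch: the k-PC model is a k-D plane; residuals live in
# the remaining (p - k) dimensions, orthogonal to the plane.
rng = np.random.default_rng(3)
n, p, k = 200, 5, 2

# A 2D plane embedded in 5D plus small noise (in the spirit of `plane`)
basis = np.linalg.qr(rng.normal(size=(p, k)))[0]
X = rng.normal(size=(n, k)) @ basis.T + 0.05 * rng.normal(size=(n, p))
X = X - X.mean(axis=0)

# PCA via SVD; rows of Vt are the PC directions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T              # model space, a p x k orthonormal basis
model = X @ V_k @ V_k.T     # each point projected onto the plane
resid = X - model

# Residuals are orthogonal to the model, and small when the fit is good
ortho = np.allclose(resid @ V_k, 0, atol=1e-8)
print(ortho, resid.std() < 0.1)
```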
\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(ggfortify)\nlibrary(patchwork)\nplane_pca <- prcomp(plane)\npl1 <- autoplot(plane_pca, loadings = TRUE, \n loadings.label = TRUE) + \n ggtitle(\"(a)\") +\n theme_minimal() + \n theme(aspect.ratio=1)\nplane_noise_pca <- prcomp(plane_noise)\npl2 <- autoplot(plane_noise_pca, loadings = TRUE, \n loadings.label = TRUE) + \n ggtitle(\"(b)\") +\n theme_minimal() + \n theme(aspect.ratio=1)\npl1 + pl2\n```\n\n::: {.cell-output-display}\n![Biplots of the plane (a) and plane + noise (b) data. All five variables contribute strongly to the two principal components in (a): PC1 is primarily `x1`, `x2` and `x3` and PC2 is primarily `x4` and `x5`. In (b) the same five variables contribute in almost the same way, with variables `x6` and `x7` contributing very little. The data was constructed so that these two dimensions were purely noise.](4-pca_files/figure-html/fig-plane-biplot-1.png){#fig-plane-biplot width=768}\n:::\n:::\n\n\n\nIt can be useful to examine this model using the tour. The model is simply a plane in high dimensions. This would be considered to be the model in the data space. The reason to do this is to check how well the model fits the data. The plane corresponding to the model should be oriented along the main direction of the points, and the spread of points around the plane should be small. We should also be able to see if there has been any strong non-linear relationship missed by the model, or outliers and clusters.\n\nThe function `pca_model()` from the `mulgar` package can be used to represent the model as a $k$-D wire-frame plane. @fig-plane-box-model-html shows the models for the `plane` and `box` data, 2D and 3D respectively.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nWe look at the model in the data space to check how well the model fits the data. 
If it fits well, the points will cluster tightly around the model representation, with little spread in other directions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{We look at the model in the data space to check how well the model fits the data. If it fits well, the points will cluster tightly around the model representation, with little spread in other directions.}\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nplane_m <- pca_model(plane_pca)\nplane_m_d <- rbind(plane_m$points, plane)\nanimate_xy(plane_m_d, edges=plane_m$edges,\n axes=\"bottomleft\",\n edges.col=\"#E7950F\",\n edges.width=3)\nrender_gif(plane_m_d, \n grand_tour(), \n display_xy(half_range=0.9,\n edges=plane_m$edges, \n edges.col=\"#E7950F\",\n edges.width=3),\n gif_file=\"gifs/plane_model.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\nbox_pca <- prcomp(box)\nbox_m <- pca_model(box_pca, d=3)\nbox_m_d <- rbind(box_m$points, box)\nanimate_xy(box_m_d, edges=box_m$edges, \n axes=\"bottomleft\", edges.col=\"#E7950F\", edges.width=3)\nrender_gif(box_m_d, \n grand_tour(), \n display_xy(half_range=0.9,\n edges=box_m$edges, \n edges.col=\"#E7950F\",\n edges.width=3),\n gif_file=\"gifs/box_model.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-plane-box-model-html fig-align=\"center\" layout-ncol=2}\n\n![Model for the 2D in 5D data.](gifs/plane_model.gif){#fig-plane-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\n![Model for the 3D in 5D data.](gifs/box_model.gif){#fig-box-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\nPCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. 
\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-plane-box-model-pdf fig-align=\"center\" layout-ncol=2}\n\n![Model for the 2D in 5D data.](images/plane_model_55.png){#fig-plane-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\n![Model for the 3D in 5D data.](images/box_model_13.png){#fig-box-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\nPCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. \n:::\n:::\n\n### Example: pisa\n\\index{data!pisa}\n\nThe model for the `pisa` data is a 1D vector, shown in @fig-pisa-model-html. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\npisa_model <- pca_model(pisa_pca, d=1, s=2)\n\npisa_all <- rbind(pisa_model$points, pisa_std)\nanimate_xy(pisa_all, edges=pisa_model$edges,\n edges.col=\"#E7950F\", edges.width=3)\nrender_gif(pisa_all, \n grand_tour(), \n display_xy(half_range=0.9,\n edges=pisa_model$edges, \n edges.col=\"#E7950F\", \n edges.width=5),\n gif_file=\"gifs/pisa_model.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-pisa-model-html}\n\n![](gifs/pisa_model.gif){fig-alt=\"Something here\" fig.align=\"center\"}\n\nPCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-pisa-model-pdf fig-alt=\"Something here\" fig.align=\"center\"}\n\n![](images/pisa_model_17.png){}\n\nPCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model. \n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe `pisa` data fits fairly closely to the 1D PCA model. The variance of points away from the model is symmetric and relatively small. 
These suggest the 1D model is a reasonable summary of the test scores.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{The `pisa` data fits fairly closely to the 1D PCA model. The variance of points away from the model is symmetric and relatively small. These suggest the 1D model is a reasonable summary of the test scores.}\n:::\n\n### Example: aflw\n\\index{data!aflw}\n\nIt is less useful to examine the PCA model for the `aflw` data, because the main patterns that were of interest were the exceptional players. However, we will do it anyway! @fig-aflw-model-html shows the 4D PCA model overlain on the data. Even though the distribution of points is not as symmetric and balanced as the other examples, we can see that the cube structure mirrors the variation. We can see that the relationships between variables are not strictly linear, because the spread extends unevenly away from the box. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\naflw_model <- pca_model(aflw_pca, d=4, s=1)\n\naflw_all <- rbind(aflw_model$points, aflw_std[,7:35])\nanimate_xy(aflw_all, edges=aflw_model$edges,\n edges.col=\"#E7950F\", \n edges.width=3, \n half_range=0.8, \n axes=\"off\")\nrender_gif(aflw_all, \n grand_tour(), \n display_xy(half_range=0.8,\n edges=aflw_model$edges, \n edges.col=\"#E7950F\", \n edges.width=3, \n axes=\"off\"),\n gif_file=\"gifs/aflw_model.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-aflw-model-html}\n\n![](gifs/aflw_model.gif){ fig-alt=\"Something here\" fig.align=\"center\"}\n\nPCA model of the `aflw` data. The linear model is not ideal for this data, which has other patterns like outliers, and some branching. 
However, the model roughly captures the linear associations, and leaves unequal variation in different directions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-aflw-model-pdf fig-alt=\"Something here\" fig.align=\"center\"}\n\n![](images/aflw_model_70.png){}\n\nPCA model of the `aflw` data. The linear model is not ideal for this data, which has other patterns like outliers, and some branching. However, the model roughly captures the linear associations, and leaves unequal variation in different directions.\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nFrom the tour we see that the 4D model leaves substantial variation unexplained. It is also not symmetric, and there is some larger variation away from the model in some combinations of variables than others. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{From the tour we see that the 4D model leaves substantial variation unexplained. It is also not symmetric, and there is some larger variation away from the model in some combinations of variables than others.}\n:::\n\n## When relationships are not linear \n\n### Example: outliers\n\\index{outliers}\n\n@fig-plane-n-o-scree shows the scree plot for the planar data with noise and outliers. It is very similar to the scree plot on the data without the outliers (@fig-plane-noise-scree). However, what we see from @fig-p-o-pca-html is that PCA loses the outliers. The animation in (a) shows the full data, and the outliers marked by colour and labels 1, 2, are clearly unusual in some projections. When we examine the tour of the first four PCs (as suggested by the scree plot) the outliers are not unusual. They are almost contained in the point cloud. The reason is clear when all the PCs are plotted, and the outliers can be seen to be clearly detected only in PC5, PC6 and PC7. 
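Why PCA "loses" such outliers can be seen with a tiny construction: a point that is extreme only in a low-variance direction gets a small score on the leading PCs but a huge standardised score on the last one. An illustrative sketch with simulated data (Python with numpy, not the book's R code):

```python
import numpy as np

# Illustrative sketch: an outlier lying only in a low-variance direction
# barely registers on the leading PCs but stands out on the last one.
rng = np.random.default_rng(4)
n = 100
X = np.column_stack([rng.normal(0, 3, n),      # large variance
                     rng.normal(0, 2, n),
                     rng.normal(0, 0.05, n)])  # tiny variance
X[0, :2] = 0.0   # the outlying point is average on the big variables
X[0, 2] = 1.0    # ... and extreme only in the quiet direction
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T

# Standardised score of the outlying point on each PC:
# small on PC1 and PC2, very large on PC3
z = np.abs(scores[0]) / scores.std(axis=0)
print(z[2] > 5 and z[2] > z[0] and z[2] > z[1])
```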
\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nplane_n_o_pca <- prcomp(plane_noise_outliers)\nggscree(plane_n_o_pca, q = 7) + theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scree plot of the planar data with noise and two outliers. It is almost the same as the data without the outliers.](4-pca_files/figure-html/fig-plane-n-o-scree-1.png){#fig-plane-n-o-scree width=80%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nclrs <- hcl.colors(12, \"Zissou 1\")\np_col <- c(rep(\"black\", 100), clrs[11], clrs[11])\np_obs_labels <- c(rep(\"\", 100), \"1\", \"2\")\n\nanimate_xy(plane_n_o_pca$x[,1:4],\n col=p_col,\n obs_labels=p_obs_labels)\nanimate_xy(plane_noise_outliers,\n col=p_col,\n obs_labels=p_obs_labels)\nrender_gif(plane_noise_outliers, \n grand_tour(), \n display_xy(half_range=0.8,\n col=p_col,\n obs_labels=p_obs_labels),\n gif_file=\"gifs/plane_n_o_clr.gif\",\n frames=500,\n width=200,\n height=200,\n loop=FALSE)\nrender_gif(plane_n_o_pca$x[,1:4], \n grand_tour(), \n display_xy(half_range=0.8,\n col=p_col,\n obs_labels=p_obs_labels),\n gif_file=\"gifs/plane_n_o_pca.gif\",\n frames=500,\n width=200,\n height=200,\n loop=FALSE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-o-pca-html fig-align=\"center\" layout-ncol=2}\n\n![Outliers clearly visible.](gifs/plane_n_o_clr.gif){#fig-plane-n-o-clr width=250}\n\n![Outliers not clearly visible in PC1-4.](gifs/plane_n_o_pca.gif){#fig-plane-n-o-pca width=250}\n\nExamining the handling of outliers in the PCA of the planar data with noise variables and two outliers. 
PCA has lost these two extreme values.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-o-pca-pdf fig-align=\"center\" layout-ncol=2}\n\n![Outliers clearly visible.](images/plane_n_o_clr_181.png){#fig-plane-n-o-clr width=250}\n\n![Outliers not clearly visible in PC1-4.](images/plane_n_o_pca_181.png){#fig-plane-n-o-pca width=250}\n\nExamining the handling of outliers in the PCA of the planar data with noise variables and two outliers. PCA has lost these two extreme values.\n:::\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make scatterplot matrix\"}\nlibrary(GGally)\nggscatmat(plane_n_o_pca$x) + theme_minimal()\n```\n\n::: {.cell-output-display}\n![From the scatterplot matrix we can see that the outliers are present in PC5, PC6 and PC7. That means by reducing the dimensionality to the first four PCs the model has missed some important characteristics in the data.](4-pca_files/figure-html/fig-plane-o-n-pairs-1.png){#fig-plane-o-n-pairs width=80%}\n:::\n:::\n\n\n### Example: Non-linear associations\n\\index{nonlinearity}\n\n@fig-plane-nonlin-html shows the tour of the full 5D data containing non-linear relationships in comparison with a tour of the first three PCs, as recommended by the scree plot (@fig-plane-nonlin-scree). The PCs capture a clear and very clean non-linear relationship, but it looks like they have missed some of the complexities of the relationships. 
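The limitation is that PCA only removes *linear* association: the PC scores are uncorrelated by construction, yet they can remain strongly dependent. A minimal sketch with simulated data (Python with numpy, illustrative only; the book's code is R):

```python
import numpy as np

# Illustrative sketch: PCA removes linear association only. The PC
# scores are uncorrelated by construction, yet can stay strongly
# dependent when the relationship is curved.
rng = np.random.default_rng(5)
t = rng.uniform(-1, 1, 300)
X = np.column_stack([t, t**2 + 0.01 * rng.normal(size=300)])
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T

# Correlation between the two PC scores is (numerically) zero ...
corr = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]

# ... but PC2 is almost exactly a quadratic function of PC1
coef = np.polyfit(scores[:, 0], scores[:, 1], 2)
resid = scores[:, 1] - np.polyval(coef, scores[:, 0])
print(abs(corr) < 0.05, resid.std() < 0.5 * scores[:, 1].std())
```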
The scatterplot matrix of all 5 PCs (@fig-plane-nonlin-pairs) shows that PC4 and PC5 contain interesting features: more non-linearity, and curiously an outlier.\n \n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndata(plane_nonlin)\nplane_nonlin_pca <- prcomp(plane_nonlin)\nggscree(plane_nonlin_pca, q = 5) + theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scree plot of the non-linear data suggests three PCs.](4-pca_files/figure-html/fig-plane-nonlin-scree-1.png){#fig-plane-nonlin-scree width=80%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate tour\"}\nanimate_xy(plane_nonlin_pca$x[,1:3])\nrender_gif(plane_nonlin_pca$x[,1:3], \n grand_tour(), \n display_xy(half_range=0.8),\n gif_file=\"gifs/plane_nonlin_pca.gif\",\n frames=500,\n width=200,\n height=200)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-plane-nonlin-html fig-align=\"center\" layout-ncol=2}\n\n![Non-linear relationship between several variables seen in a tour on all five variables.](gifs/plane_nonlin.gif){#fig-nonlinear2 width=250}\n\n![The first three principal components reveal a strong non-linear relationship.](gifs/plane_nonlin_pca.gif){#fig-plane-nonlin-pca width=250}\n\nComparison of the full data and first three principal components. Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-plane-nonlin-pdf fig-align=\"center\" layout-ncol=2}\n\n![Non-linear relationship between several variables seen in a tour on all five variables.](images/plane_nonlin_61.png){#fig-nonlinear2 width=250}\n\n![The first three principal components reveal a strong non-linear relationship.](images/plane_nonlin_pca_129.png){#fig-plane-nonlin-pca width=250}\n\nComparison of the full data and first three principal components. 
Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities.\n:::\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make scatterplot matrix\"}\nggscatmat(plane_nonlin_pca$x)\n```\n\n::: {.cell-output-display}\n![From the scatterplot matrix we can see that there is a non-linear relationship visible in PC1 and PC2, with perhaps a small contribution from PC3. However, when the data is reduced to three PCs, it misses some of the non-linear relationships, and, interestingly, it seems that there is also an unusual observation.](4-pca_files/figure-html/fig-plane-nonlin-pairs-1.png){#fig-plane-nonlin-pairs width=672}\n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nOne of the dangers of PCA is that interesting and curious details of the data only emerge in the lowest PCs, which are usually discarded. The tour, and examining the smaller PCs, can help to discover them.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{One of the dangers of PCA is that interesting and curious details of the data only emerge in the lowest PCs, which are usually discarded. The tour, and examining the smaller PCs, can help to discover them.}\n:::\n\n## Exercises {-}\n\n1. Make a scatterplot matrix of the first four PCs of the `aflw` data. Is the branch pattern visible in any pair?\n2. Construct five new variables to measure these skills: offense, defense, playing time, ball movement, errors. Using the tour, examine the relationship between these variables. Map out how a few players could be characterised based on these directions of skills.\n3. Symmetrise any `aflw` variables that have skewed distributions using a log or square root transformation. Then re-do the PCA. What do we learn that is different about associations between the skill variables?\n4. 
Examine the `bushfires` data using a grand tour on the numeric variables, ignoring the `cause` (class) variable. Note any issues, such as outliers or skewness, that might affect PCA. How many principal components would be recommended by the scree plot? Examine this PCA model with the data, and explain how well it does or doesn't fit.\n5. Use the `pca_tour` to examine the first five PCs of the `bushfires` data. How do all of the variables contribute to this reduced space?\n6. Reduce the dimension of the `sketches` data to 12 PCs. How much variation does this explain? Is there any obvious clustering in this lower dimensional space?\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n## Project {-}\n\nLinear dimension reduction can optimise for other criteria, and here we will explore one example: the algorithm implemented in the `dobin` package finds a basis in which the first few directions are optimised for the detection of outliers in the data. We will examine how it performs for the `plane_noise_outliers` data (the example where the outliers were hidden from the first four principal components).\n\n1. Start by looking up the documentation of `dobin::dobin`. How many parameters does the method depend on?\n2. We first apply the function to the `plane_noise_outliers` data using default values for all parameters.\n3. Recall that the outliers were added in rows 101 and 102 of the data. Make scatterplots showing the projections onto the first, second and third components, using colour to highlight the outliers. Are they visible as outliers with three components?\n4. Adjust the `frac` parameter of the `dobin` function to `frac = 0.99` and repeat the graphical evaluation from point 3. 
How does it compare to the previous solution?\n\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "4-pca_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/4-pca/execute-results/tex.json b/_freeze/4-pca/execute-results/tex.json index 6bf6a87..97b3c5e 100644 --- a/_freeze/4-pca/execute-results/tex.json +++ b/_freeze/4-pca/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "11276e1daaa13b0167f7668c1822fbe8", + "hash": "99ea68edea8390ac5005f636def53dc0", "result": { "engine": "knitr", - "markdown": "## Principal component analysis \n\\index{dimension reduction!principal component analysis (PCA)}\n\nReducing dimensionality using principal component analysis (PCA) dates back to @pearson-pca and @hotelling-pca, and @joliffe2016 provides a current overview. The goal is to find a smaller set of variables, $q (< p)$, that contain as much information as the original as possible. The new set of variables, known as principal components (PCs), are linear combinations of the original variables. The PCs can be used to represent the data in a lower-dimensional space.\n\nThe process is essentially an optimisation procedure, although PCA has an analytical solution. It solves the problem of \n\n$$\n\\max_{a_k} ~\\text{Var} (Xa_k),\n$$\nwhere $X$ is the $n \\times p$ data matrix, $a_k (k=1, ..., p)$ is a 1D projection vector, called an eigenvector, and the $\\text{Var} (Xa_k)$ is called an eigenvalue. So PCA is a sequential process, that will find the direction in the high-dimensional space (as given by the first eigenvector) where the data is most varied, and then find the second most varied direction, and so on. 
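\n\nThe sequential optimisation can be verified numerically: the first eigenvector of the sample covariance matrix attains the maximum of $\\text{Var}(Xa)$ over all unit vectors $a$. A minimal sketch, assuming the `plane` data from the `mulgar` package used later in this chapter:\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: the first eigenvector maximises the variance of\n# the projected data; no random unit vector can do better\nlibrary(mulgar)\ndata(plane)\nX <- as.matrix(plane)\na1 <- eigen(cov(X))$vectors[, 1]\nv1 <- var(X %*% a1) # equals the first eigenvalue\nset.seed(853) # arbitrary seed, an assumption\na <- rnorm(ncol(X))\na <- a / sqrt(sum(a^2)) # scale to a unit vector\nvar(X %*% a) <= v1 # should always be TRUE\n```\n:::\n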
The eigenvectors define the combination of the original variables, and the eigenvalues define the amount of variance explained by the reduced number of variables.\n\\index{dimension reduction!eigenvalue}\n\\index{dimension reduction!eigenvector}\n\nPCA is very broadly useful for summarising linear association by using combinations of variables that are highly correlated. However, high correlation can also occur when there are outliers, or clustering. PCA is commonly used to detect these patterns also.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nWith visualisation we want to assess whether it is appropriate to use PCA to summarise any linear association by using combinations of variables that are highly correlated. It can help to detect other patterns that might affect the PCA results such as outliers, clustering or non-linear dependence. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{With visualisation we want to assess whether it is appropriate to use PCA to summarise any linear association by using combinations of variables that are highly correlated. It can help to detect other patterns that might affect the PCA results such as outliers, clustering or non-linear dependence.}\n:::\n\n\nPCA is not very effective when the distribution of the variables is highly skewed, so it can be helpful to transform variables to make them more symmetrically distributed before conducting PCA. It is also possible to summarise different types of structure by generalising the optimisation criteria to any function of projected data, $f(XA)$, which is called *projection pursuit* (PP). PP has a long history (@Kr64a, @FT74, @DF84, @JS87, @Hu85), and there are regularly new developments (e.g. @Lee2009, @perisic2009, @Lee2013, @loperfido, @bickel2018, @zhang2023). \n\n## Determining how many dimensions\n\nWe would start by examining the data using a grand tour. 
The goal is to check whether there might be potential issues for PCA, such as skewness, outliers or clustering, or even non-linear dependencies.\n\nWe'll start by showing PCA on the simulated data from @sec-dimension-overview. The scree plots show that PCA supports the data being 2D, 3D and 5D respectively.\n\n\\index{dimension reduction!scree plot}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# Conduct PCA and make the scree plot for \n# the 2-, 3- and 5-D planar data\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(mulgar)\ndata(plane)\ndata(box)\nlibrary(geozoo)\ncube5d <- data.frame(cube.solid.random(p=5, n=300)$points)\ncolnames(cube5d) <- paste0(\"x\", 1:5)\ncube5d <- data.frame(apply(cube5d, 2, \n function(x) (x-mean(x))/sd(x)))\np_pca <- prcomp(plane)\nb_pca <- prcomp(box)\nc_pca <- prcomp(cube5d)\np_scree <- ggscree(p_pca, q = 5) + theme_minimal()\n\nb_scree <- ggscree(b_pca, q = 5) + theme_minimal()\nc_scree <- ggscree(c_pca, q = 5) + theme_minimal()\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Scree plots for the three simulated data sets shown in Figure 3.2. The 2D in 5D is clearly recognised by PCA to be 2D because the variance drops substantially between 2-3 principal components. The 3D in 5D is possibly 3D because the variance drops from 3-4 principal components. The fully 5D data has no drop in variance, and all values are close to the typical value one would observe if the data was fully 5D.](4-pca_files/figure-pdf/fig-2D-pca-1.pdf){#fig-2D-pca width=100%}\n:::\n:::\n\n\n\n\nThe next step is to look at the coefficients for the selected number of PCs. @tbl-plane-pcs shows the coefficients for the first two PCs of the `plane` data. All five variables contribute, with `x1`, `x2`, `x3` contributing more to `PC1`, and `x4`, `x5` contributing more to `PC2`. @tbl-box-pcs shows the coefficients for the first three PCs. 
Variables `x1`, `x2`, `x3` contribute strongly to `PC1`; `PC2` has contributions from all variables except `x3`; and variables `x4` and `x5` contribute strongly to `PC3`. \n\n\\index{dimension reduction!coefficients}\n\\index{dimension reduction!principal components}\n\n\n\n\n::: {#tbl-plane-pcs .cell tbl-cap='Coefficients for the first two PCs for the plane data.'}\n::: {.cell-output-display}\n\\begin{longtable}{lrr}\n\\toprule\nVariable & PC1 & PC2 \\\\ \n\\midrule\nx1 & $0.58$ & $-0.06$ \\\\ \nx2 & $-0.55$ & $0.21$ \\\\ \nx3 & $0.47$ & $-0.41$ \\\\ \nx4 & $0.25$ & $0.64$ \\\\ \nx5 & $-0.29$ & $-0.62$ \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n::: {#tbl-box-pcs .cell tbl-cap='Coefficients for the first three PCs for the box data.'}\n::: {.cell-output-display}\n\\begin{longtable}{lrrr}\n\\toprule\nVariable & PC1 & PC2 & PC3 \\\\ \n\\midrule\nx1 & $-0.51$ & $0.46$ & $0.11$ \\\\ \nx2 & $0.51$ & $0.46$ & $0.00$ \\\\ \nx3 & $-0.65$ & $-0.09$ & $0.23$ \\\\ \nx4 & $-0.22$ & $0.36$ & $-0.87$ \\\\ \nx5 & $0.02$ & $0.66$ & $0.43$ \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n\nIn each of these simulated data sets, all five variables contributed to the dimension reduction. If we added two purely noise variables to the plane data, as done in @sec-dimension-overview, the scree plot would indicate that the data is now 4D, and we would get a different interpretation of the coefficients from the PCA. We see that `PC1` and `PC2` are approximately the same as before, with the main variables being (`x1`, `x2`, `x3`) and (`x4`, `x5`) respectively. `PC3` and `PC4` are both combinations of `x6` and `x7`. 
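\n\nThe construction of such a data set can be sketched as follows (the seed and the standardisation here are assumptions; the exact construction is given in @sec-dimension-overview):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: append two pure noise variables to the plane\n# data, standardise, and re-fit the PCA\nlibrary(ggplot2)\nlibrary(mulgar)\ndata(plane)\nset.seed(1129) # arbitrary seed, an assumption\nplane_noise <- plane\nplane_noise$x6 <- rnorm(nrow(plane))\nplane_noise$x7 <- rnorm(nrow(plane))\nplane_noise <- data.frame(apply(plane_noise, 2, \n function(x) (x-mean(x))/sd(x)))\nplane_noise_pca <- prcomp(plane_noise)\nggscree(plane_noise_pca, q = 7) + theme_minimal()\n```\n:::\n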
\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Additional noise variables expands the data to 4D.](4-pca_files/figure-pdf/fig-plane-noise-scree-1.pdf){#fig-plane-noise-scree width=80%}\n:::\n:::\n\n::: {#tbl-plane-noise-pcs .cell tbl-cap='Coefficients for the first four PCs for the box data.'}\n::: {.cell-output-display}\n\\begin{longtable}{lrrrr}\n\\toprule\nVariable & PC1 & PC2 & PC3 & PC4 \\\\ \n\\midrule\nx1 & $0.58$ & $0.04$ & $0.01$ & $0.00$ \\\\ \nx2 & $-0.55$ & $-0.18$ & $-0.03$ & $0.07$ \\\\ \nx3 & $0.47$ & $0.37$ & $0.05$ & $-0.20$ \\\\ \nx4 & $0.24$ & $-0.62$ & $-0.06$ & $0.17$ \\\\ \nx5 & $-0.28$ & $0.60$ & $0.07$ & $-0.14$ \\\\ \nx6 & $0.05$ & $0.29$ & $-0.58$ & $0.76$ \\\\ \nx7 & $-0.02$ & $-0.08$ & $-0.81$ & $-0.58$ \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n\n### Example: pisa\n\\index{data!pisa}\n\nThe `pisa` data contains simulated data from math, reading and science scores, totalling 30 variables. PCA is used here to examine the association. We might expect that it is 3D, but what we see suggests it is primarily 1D. This means that a student that scores well in math, will also score well in reading and science. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndata(pisa)\npisa_std <- pisa %>%\n filter(CNT == \"Australia\") %>%\n select(-CNT) %>%\n mutate_all(mulgar:::scale2)\npisa_pca <- prcomp(pisa_std)\npisa_scree <- ggscree(pisa_pca, q = 15) + theme_minimal()\n```\n:::\n\n\n\nThe scree plot in @fig-pisa-pca-pdf shows a big drop from one to two PCs in the amount of variance explained. 
A grand tour on the 30 variables can be run using `animate_xy()`: \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nanimate_xy(pisa_std, half_range=1)\n```\n:::\n\n\n\nor rendered as an animated gif using `render_gif()`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nrender_gif(pisa_std, \n grand_tour(), \n display_xy(half_range=0.9),\n gif_file=\"gifs/pisa_gt.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\n```\n:::\n\n\n\nand we can see that the data is elliptical in most projections, sometimes shrinking to a small circle. This pattern strongly indicates that there is one primary direction of variation in the data, with only small variation in any direction away from it. Shrinking to the small circle is analogous to how *a pencil or cigar or water bottle in 3D looks from some angles*.\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-pisa-pca-html fig-align=\"center\" layout-ncol=2}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Scree plot](4-pca_files/figure-pdf/fig-pisa-scree-html-1.pdf){#fig-pisa-scree-html width=80%}\n:::\n:::\n\n\n\n![Grand tour](gifs/pisa_gt.gif){#fig-pisa-gt fig-alt=\"Tour showing lots of linear projections of the pisa data. You can see strong linear dependence.\" fig.align=\"center\"}\n\nScree plot and tour of the `pisa` data, with 30 variables being the plausible scores for Australian students. In combination, these suggest that the data is effectively 1D.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-pisa-pca-pdf fig-align=\"center\" layout-ncol=2}\n\n![Scree plot](images/fig-pisa-scree-1.png){#fig-pisa-scree-pdf fig-alt=\"FIX ME\"}\n\n![Grand tour frame](images/pisa_gt_249.png){#fig-pisa-gt fig-alt=\"Selected linear projection of the pisa data from a grand tour. You can see strong linear dependence.\" fig.align=\"center\"}\n\nScree plot and a frame from a tour of the `pisa` data, with 30 variables being the plausible scores for Australian students. 
In combination, these suggest that the data is effectively 1D.\n:::\n:::\n\nThe coefficients of the first PC (first eigenvector) are roughly equal in magnitude (as shown below), which tells us that all variables roughly contribute. Interestingly, they are all negative, which is not actually meaningful. With different software these could easily have been all positive. The sign of the coefficients can be reversed, as long as all are reversed, which is the same as an arrow pointing one way, changing and pointing the other way. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to print PC coefficients\"}\nround(pisa_pca$rotation[,1], 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n PV1MATH PV2MATH PV3MATH PV4MATH PV5MATH PV6MATH \n -0.18 -0.18 -0.18 -0.18 -0.18 -0.18 \n PV7MATH PV8MATH PV9MATH PV10MATH PV1READ PV2READ \n -0.18 -0.18 -0.18 -0.18 -0.19 -0.18 \n PV3READ PV4READ PV5READ PV6READ PV7READ PV8READ \n -0.19 -0.19 -0.19 -0.19 -0.19 -0.19 \n PV9READ PV10READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE \n -0.19 -0.19 -0.18 -0.18 -0.19 -0.18 \n PV5SCIE PV6SCIE PV7SCIE PV8SCIE PV9SCIE PV10SCIE \n -0.19 -0.18 -0.19 -0.18 -0.19 -0.18 \n```\n\n\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe tour verifies that the `pisa` data is primarily 1D, indicating that a student who scores well in math, probably scores well in reading and science, too. More interestingly, the regular shape of the data strongly indicates that it is \"synthetic\", simulated rather than observed.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{The tour verifies that the `pisa` data is primarily 1D, indicating that a student who scores well in math, probably scores well in reading and science, too. 
More interestingly, the regular shape of the data strongly indicates that it is \"synthetic\", simulated rather than observed.}\n:::\n\n### Example: aflw\n\\index{data!aflw}\n\nThis data has player statistics for all the matches in the 2021 season. We would be interested to know which variables contain similar information, and thus might be combined into single variables. We would expect many statistics to group into a few small sets, such as offensive and defensive skills. We might also expect that some of the statistics are skewed: most players have low values, and just a handful of players are stellar. It is also possible that there are some extreme values. These are interesting features, but they will distract from the main purpose of grouping the statistics. Thus the tour is used to check for potential problems with the data prior to conducting PCA.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(tourr)\ndata(aflw)\naflw_std <- aflw %>%\n mutate_if(is.numeric, function(x) (x-\n mean(x, na.rm=TRUE))/\n sd(x, na.rm=TRUE))\n```\n:::\n\n\n\nWe look at all 29 player statistics in a grand tour in @fig-aflw-gt-pdf.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate tour\"}\nanimate_xy(aflw_std[,7:35], half_range=0.9)\nrender_gif(aflw_std[,7:35], \n grand_tour(), \n display_xy(half_range=0.9),\n gif_file=\"gifs/aflw_gt.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-aflw-gt-html}\n\n![](gifs/aflw_gt.gif){fig-alt=\"Tour showing lots of linear projections of the aflw data. You can see linear dependence, and some outliers.\" fig.align=\"center\"}\n\nGrand tour of the AFLW player statistics. Most player statistics concentrate near the centre, indicating most players are \"average\"! There are a few outliers appearing in different combinations of the skills, which one would expect to be the star players for particular skill sets. 
\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-aflw-gt-pdf fig.align=\"center\" fig-alt=\"Example linear projection of the aflw data from a grand tour. You can see linear dependence, and some outliers.\" layout-ncol=2}\n\n![](images/aflw_gt_70.png){width=228}\n\n![](images/aflw_gt_329.png){width=228}\n\nTwo frames from a grand tour of the AFLW player statistics. Most player statistics concentrate near the centre, indicating most players are \"average\"! There are a few outliers appearing in different combinations of the skills, which one would expect to be the star players for particular skill sets. \n:::\n:::\n\nNo major surprises! There is a small amount of skewness, and there are no major outliers. Skewness indicates that most players have reasonably similar skills (bunching of points), except for some key players (the moderate outliers). The skewness could be reduced by applying a log or square root transformation to some variables prior to running the PCA. However, we elect not to do this because the moderate outliers are of interest. These correspond to talented players that we'd like to explore further with the analysis.\n\nBelow we have the conventional summary of the PCA, a scree plot showing the reduction in variance to be explained when each additional PC is considered. It is also conventional to look at a table summarising the proportions of variance explained by PCs, but with almost 30 variables it is easier to make some decision on the number of PCs needed based on the scree plot.\n\n\n\n::: {.cell alt-text='Scree plot showing variance vertically against PC number horizontally. 
Variance drops from close to 10 for PC 1 to about 1.2 for PC 4 then slowly decays through to PC 29'}\n\n```{.r .cell-code code-summary=\"Code to make screeplot\"}\naflw_pca <- prcomp(aflw_std[,7:35], \n scale = FALSE, \n retx=TRUE)\n\nggscree(aflw_pca, q = 29) + theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scree plot showing decay in variance of PCs.](4-pca_files/figure-pdf/fig-aflw-pca-1.pdf){#fig-aflw-pca fig-pos='H' width=80%}\n:::\n:::\n\n\n\n\\index{dimension reduction!scree plot}\n\nFrom the scree plot in @fig-aflw-pca, we see a sharp drop from one to two PCs, another from two to three, and then smaller drops. After four PCs there is a further drop at six PCs, and then the variance gradually decays. We will choose four PCs to examine more closely. This explains 67.2% of the variance.\n\n\n\n::: {#tbl-aflw-pcs .cell tbl-cap='Coefficients for the first four PCs.'}\n::: {.cell-output-display}\n\\begin{longtable}{lrrrr}\n\\toprule\nVariable & PC1 & PC2 & PC3 & PC4 \\\\ \n\\midrule\ndisposals & $0.31$ & $-0.05$ & $-0.03$ & $0.07$ \\\\ \npossessions & $0.31$ & $-0.03$ & $-0.07$ & $0.09$ \\\\ \nkicks & $0.29$ & $-0.04$ & $0.09$ & $-0.12$ \\\\ \nmetres & $0.28$ & $-0.03$ & $0.10$ & $-0.15$ \\\\ \ncontested & $0.28$ & $0.01$ & $-0.12$ & $0.23$ \\\\ \nuncontested & $0.28$ & $-0.06$ & $-0.01$ & $-0.05$ \\\\ \nturnovers & $0.27$ & $-0.01$ & $-0.01$ & $-0.29$ \\\\ \nclearances & $0.23$ & $0.00$ & $-0.29$ & $0.19$ \\\\ \nclangers & $0.23$ & $-0.02$ & $-0.06$ & $-0.33$ \\\\ \nhandballs & $0.23$ & $-0.04$ & $-0.19$ & $0.31$ \\\\ \nfrees\\_for & $0.21$ & $0.02$ & $-0.13$ & $0.18$ \\\\ \nmarks & $0.21$ & $0.03$ & $0.32$ & $0.02$ \\\\ \ntackles & $0.20$ & $0.01$ & $-0.28$ & $0.09$ \\\\ \ntime\\_pct & $0.16$ & $-0.04$ & $0.35$ & $-0.02$ \\\\ \nintercepts & $0.13$ & $-0.28$ & $0.24$ & $0.03$ \\\\ \nrebounds\\_in50 & $0.13$ & $-0.28$ & $0.24$ & $-0.06$ \\\\ \nfrees\\_against & $0.13$ & $0.03$ & $-0.16$ & $-0.23$ \\\\ \nassists & $0.09$ & $0.23$ & $0.00$ & $0.05$ \\\\ \nbounces & $0.09$ & $0.03$ & 
$0.02$ & $-0.28$ \\\\ \nbehinds & $0.09$ & $0.32$ & $0.08$ & $-0.02$ \\\\ \nshots & $0.08$ & $0.38$ & $0.12$ & $-0.03$ \\\\ \ntackles\\_in50 & $0.07$ & $0.27$ & $-0.18$ & $0.03$ \\\\ \nmarks\\_in50 & $0.06$ & $0.34$ & $0.18$ & $0.04$ \\\\ \ncontested\\_marks & $0.05$ & $0.16$ & $0.34$ & $0.15$ \\\\ \ngoals & $0.04$ & $0.37$ & $0.16$ & $0.03$ \\\\ \naccuracy & $0.04$ & $0.34$ & $0.10$ & $0.06$ \\\\ \none\\_pct & $0.03$ & $-0.21$ & $0.33$ & $0.08$ \\\\ \ndisposal & $0.02$ & $-0.13$ & $0.20$ & $0.50$ \\\\ \nhitouts & $-0.04$ & $0.00$ & $-0.03$ & $0.32$ \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n\nWhen there are as many variables as this, it can be hard to digest the combinations of variables contributing most to each PC. Rearranging the table by sorting on a selected PC can help. @tbl-aflw-pcs has been sorted according to the PC 1 coefficients.\n\nPC 1 is primarily composed of `disposals`, `possessions`, `kicks`, `metres`, `uncontested`, `contested`, .... Actually almost all variables positively contribute, albeit in different amounts! It is quite common in PCA for the first PC to be a combination of all variables, although the contributions are often closer to equal, and it tells us that there is one main direction of variation in the data. For PC 1 in the `aflw` data, PCA is telling us that the primary variation is through a combination of skills, and this maps to basic football playing skills, where some skills (e.g. disposals, possessions, kicks, ...) are more important.\n\nThus the second PC might be the more interesting one. PC 2 is primarily a combination of `shots`, `goals`, `marks_in50`, `accuracy`, and `behinds` contrasted against `rebounds_in50` and `intercepts`. The positive coefficients are primarily offensive skills and the negative coefficients are defensive skills. 
This PC is a reasonable measure of the offensive vs defensive skills of a player.\n\n\\index{dimension reduction!interpretation}\n\nWe would continue to interpret each PC by examining large coefficients to help decide how many PCs are a suitable summary of the information in the data. Briefly, PC 3 is a measure of the worth of a player because `time_pct` has a large coefficient, so players who are on the field longer will contribute strongly to this new variable. It also has large (and opposite) contributions from `clearances`, `tackles`, `contested_marks`. PC 4 appears to be related to aggressive play with `clangers`, `turnovers`, `bounces` and `frees_against` featuring. So all four PCs have useful information. (Note: if we had continued to examine large coefficients on PC 5, we would find that all variables already have reasonably large coefficients on PCs 1-4, which supports restricting attention to the first four.)\n\n\n\nIdeally, when we tour the four PCs, we'd like to be able to stop and identify players. This involves creating a pre-computed animation, with additional mouse-over, made possible by `plotly`. This is only feasible with a small number of observations, like the `aflw` data, because all of the animation frames are constructed in a single object. This object gets large very quickly!\n\n::: {.content-visible when-format=\"html\"}\nThe result is shown in @fig-aflw-pcatour. 
We can see that the shape of the four PCs is similar to that of all the variables, with bunching of points in the centre and a lot of moderate outliers.\n:::\n\n\n::: {.content-visible when-format=\"pdf\"}\nThe code to make this animation, and the interactive plot, are in the online version of the book.\n:::\n\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-aflw-pcatour}\n\n\n\nAnimation of four PCs of the aflw data with interactive labelling.\n:::\n:::\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplotly(pg18, width=500, height=500)\n```\n\n::: {.cell-output-display}\n![Frame 18 re-plotted so that players can be identified on mouse-over.](4-pca_files/figure-pdf/fig-aflw-pcaplots-html-1.pdf){#fig-aflw-pcaplots-html fig-pos='H' width=80%}\n:::\n:::\n\n\n\nFor any particular frame, like 18 re-plotted in @fig-aflw-pcaplots-html, we can investigate further. Here there is a branching pattern, where the branch points in the direction of PC 1. Mouse-over the players at the tip of this branch and we find players like Alyce Parker, Brittany Bonnici, Dana Hooker, Kiara Bowers. If you look up the bios of these players you'll find they all have generally good player descriptions like \"elite disposals\", \"powerful left foot\", \"hard-running midfielder\", \"best and fairest\".\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Frame 18 re-plotted so that players can be identified, ideally using mouse-over. Here some points are labelled.](4-pca_files/figure-pdf/fig-aflw-pcaplots-pdf-1.pdf){#fig-aflw-pcaplots-pdf width=80%}\n:::\n:::\n\n\n\nFor any particular frame, like 18 re-plotted in @fig-aflw-pcaplots-pdf, we can investigate further. Here there is a branching pattern, where the branch points in the direction of PC 1. 
Mouse-over the players at the tip of this branch and we find players like Alyce Parker, Brittany Bonnici, Dana Hooker, Kiara Bowers. If you look up the bios of these players you'll find they all have generally good player descriptions like \"elite disposals\", \"powerful left foot\", \"hard-running midfielder\", \"best and fairest\".\n:::\n\nIn the direction of PC 2, you'll find players like Lauren Ahrens and Stacey Livingstone, who are star defenders. Players at this end of PC 2 have high scores on `intercepts` and `rebounds_in50`.\n\nAnother interesting frame for inspecting PC 2 is 59. One end of PC 2 has players with high goal-scoring skills, and the other has players with good defending skills. So mousing over the other end of PC 2 finds players like Gemma Houghton and Katie Brennan, who are known for their goal scoring. The branch pattern is an interesting one, because it tells us there is some combination of skills that is lacking among all players; primarily there appears to be some distinction between defenders' skills and general playing skills. It's not as simple as this because the branching is only visible when PC 1 and PC 2 are examined with PC 3.\n\nPCA is useful for getting a sense of the variation in a high-dimensional data set. Interpreting the principal components is often useful, but it can be discombobulating. For the `aflw` data it would be good to think about it as a guide to the main directions of variation and to follow with a more direct engineering of variables into interesting player characteristics. For example, calculate offensive skill as an equal combination of goals, accuracy, shots, behinds. A set of new variables specifically computed to measure particular skills would make explaining an analysis easier.\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe tour verifies that PCA on the `aflw` data is complicated and doesn't capture all of the variation. However, it does provide useful insights. 
It detected outstanding players, and indicated the different skill sets of top goal scorers and top defensive players.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{The tour verifies that PCA on the `aflw` data is complicated and doesn't capture all of the variation. However, it does provide useful insights. It detected outstanding players, and indicated the different skill sets of top goal scorers and top defensive players.}\n:::\n\n## Examining the PCA model in the data space\n\\index{model-in-the-data-space}\n\nWhen you choose a smaller number of PCs $(k)$ than the number of original variables, this is essentially producing a model for the data. The model is the lower dimensional $k$-D space. It is analogous to a linear regression model, except that the residuals from the model are $(p-k)$-D. \n\nIt is common to show the model, that is, the data projected into the $k$-D model space. When $k=2$ this is called a \"biplot\". For the `plane` and `plane_noise` data the biplots are shown in @fig-plane-biplot. This is useful for checking which variables contribute most to the new principal component variables, and also to check for any problems that might have affected the fit, such as outliers, clusters or non-linearity. Interestingly, biplots are typically only made in 2D, even if the data should be summarised by more than two PCs. Occasionally you will see the biplot made for PC $j$ vs PC $k$ also. With the `pca_tour()` function in the `tourr` package you can view a $k$-D biplot. This will display the $k$ PCs with axes displaying the original variables, so that you can see their contributions to the PCs. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Biplots of the plane (a) and plane + noise (b) data. All five variables contribute strongly to the two principal components in (a): PC1 is primarily `x1`, `x2` and `x3` and PC2 is primarily `x4` and `x5`. 
In (b) the same five variables contribute in almost the same way, with variables `x6` and `x7` contributing very little. The data was constructed this way, with these two variables being purely noise.](4-pca_files/figure-pdf/fig-plane-biplot-1.pdf){#fig-plane-biplot width=100%}\n:::\n:::\n\n\n\n\nIt can be useful to examine this model using the tour. The model is simply a plane in high dimensions. This would be considered to be the model in the data space. The reason to do this is to check how well the model fits the data. The plane corresponding to the model should be oriented along the main direction of the points, and the spread of points around the plane should be small. We should also be able to see if there has been any strong non-linear relationship missed by the model, or outliers and clusters.\n\nThe function `pca_model()` from the `mulgar` package can be used to represent the model as a $k$-D wire-frame plane. @fig-plane-box-model-pdf shows the models for the `plane` and `box` data, 2D and 3D respectively.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nWe look at the model in the data space to check how well the model fits the data. If it fits well, the points will cluster tightly around the model representation, with little spread in other directions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{We look at the model in the data space to check how well the model fits the data. 
If it fits well, the points will cluster tightly around the model representation, with little spread in other directions.}\n:::\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-plane-box-model-html fig-align=\"center\" layout-ncol=2}\n\n![Model for the 2D in 5D data.](gifs/plane_model.gif){#fig-plane-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\n![Model for the 3D in 5D data.](gifs/box_model.gif){#fig-box-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\nPCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-plane-box-model-pdf fig-align=\"center\" layout-ncol=2}\n\n![Model for the 2D in 5D data.](images/plane_model_55.png){#fig-plane-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\n![Model for the 3D in 5D data.](images/box_model_13.png){#fig-box-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\nPCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. \n:::\n:::\n\n### Example: pisa\n\\index{data!pisa}\n\nThe model for the `pisa` data is a 1D vector, shown in @fig-pisa-model-pdf. \n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-pisa-model-html}\n\n![](gifs/pisa_model.gif){fig-alt=\"Something here\" fig.align=\"center\"}\n\nPCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-pisa-model-pdf fig-alt=\"Something here\" fig.align=\"center\"}\n\n![](images/pisa_model_17.png){width=300}\n\nPCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model. \n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe `pisa` data fits fairly closely to the 1D PCA model. 
The variance of points away from the model is symmetric and relatively small. This suggests the 1D model is a reasonable summary of the test scores.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\insightbox{The `pisa` data fits fairly closely to the 1D PCA model. The variance of points away from the model is symmetric and relatively small. This suggests the 1D model is a reasonable summary of the test scores.}\n:::\n\n### Example: aflw\n\index{data!aflw}\n\nIt is less useful to examine the PCA model for the `aflw` data, because the main patterns that were of interest were the exceptional players. However, we will do it anyway! @fig-aflw-model-pdf shows the 4D PCA model overlaid on the data. Even though the distribution of points is not as symmetric and balanced as the other examples, we can see that the cube structure mirrors the variation. We can see that the relationships between variables are not strictly linear, because the spread extends unevenly away from the box. \n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-aflw-model-html}\n\n![](gifs/aflw_model.gif){fig-alt=\"Something here\" fig.align=\"center\"}\n\nPCA model of the `aflw` data. The linear model is not ideal for this data, which has other patterns like outliers, and some branching. However, the model roughly captures the linear associations, and leaves unexplained and unequal variation in different directions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-aflw-model-pdf fig-alt=\"Something here\" fig.align=\"center\"}\n\n![](images/aflw_model_70.png){width=300}\n\nPCA model of the `aflw` data. The linear model is not ideal for this data, which has other patterns like outliers, and some branching. 
However, the model roughly captures the linear associations, and leaves unexplained variation in different directions.\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nFrom the tour we see that the 4D model leaves substantial variation unexplained. It is also not symmetric, and there is larger variation away from the model in some combinations of variables than others. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\insightbox{From the tour we see that the 4D model leaves substantial variation unexplained. It is also not symmetric, and there is larger variation away from the model in some combinations of variables than others.}\n:::\n\n## When relationships are not linear \n\n### Example: outliers\n\index{outliers}\n\n@fig-plane-n-o-scree shows the scree plot for the planar data with noise and outliers. It is very similar to the scree plot of the data without the outliers (@fig-plane-noise-scree). However, what we see from @fig-p-o-pca-pdf is that PCA loses the outliers. The animation in (a) shows the full data, and the outliers, marked by colour and labelled 1 and 2, are clearly unusual in some projections. When we examine the tour of the first four PCs (as suggested by the scree plot) the outliers are not unusual. They are almost contained in the point cloud. The reason is clear when all the PCs are plotted: the outliers are clearly detected only in PC5, PC6 and PC7. \n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nplane_n_o_pca <- prcomp(plane_noise_outliers)\nggscree(plane_n_o_pca, q = 7) + theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scree plot of the planar data with noise and outliers. 
It is almost the same as the data without the outliers.](4-pca_files/figure-pdf/fig-plane-n-o-scree-1.pdf){#fig-plane-n-o-scree fig-pos='H' width=80%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-o-pca-html fig-align=\"center\" layout-ncol=2}\n\n![Outliers clearly visible](gifs/plane_n_o_clr.gif){#fig-plane-n-o-clr width=250}\n\n![Outliers not clearly visible in PC1-4](gifs/plane_n_o_pca.gif){#fig-plane-n-o-pca width=250}\n\nExamining the handling of outliers in the PCA of the planar data with noise variables and two outliers. PCA has lost these two extreme values.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-o-pca-pdf fig-align=\"center\" layout-ncol=2}\n\n![Outliers clearly visible](images/plane_n_o_clr_181.png){#fig-plane-n-o-clr width=230}\n\n![Outliers not clearly visible in PC1-4](images/plane_n_o_pca_181.png){#fig-plane-n-o-pca width=230}\n\nExamining the handling of outliers in the PCA of the planar data with noise variables and two outliers. PCA has lost these two extreme values.\n:::\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![From the scatterplot matrix we can see that the outliers are present in PC5, PC6 and PC7. That means that by reducing the dimensionality to the first four PCs the model has missed some important characteristics of the data.](4-pca_files/figure-pdf/fig-plane-o-n-pairs-1.pdf){#fig-plane-o-n-pairs width=80%}\n:::\n:::\n\n\n\n### Example: Non-linear associations\n\index{nonlinearity}\n\n@fig-plane-nonlin-pdf shows the tour of the full 5D data containing non-linear relationships in comparison with a tour of the first three PCs, as recommended by the scree plot (@fig-plane-nonlin-scree). The PCs capture a clear and very clean non-linear relationship, but they appear to have missed some of the complexities of the relationships. 
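\n\nThese views can be generated along the following lines (a sketch: we assume the `plane_nonlin` data is available from the `mulgar` package, and use the `tourr` and `GGally` packages):\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(mulgar)\nlibrary(tourr)\nlibrary(GGally)\ndata(plane_nonlin)\nplane_nonlin_pca <- prcomp(plane_nonlin)\n# scree plot, tour of the first three PCs, and all five PCs pairwise\nggscree(plane_nonlin_pca, q = 5)\nanimate_xy(plane_nonlin_pca$x[, 1:3])\nggscatmat(data.frame(plane_nonlin_pca$x))\n```\n:::\n\n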
The scatterplot matrix of all 5 PCs (@fig-plane-nonlin-pairs) shows that PC4 and PC5 contain interesting features: more non-linearity, and curiously an outlier.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Scree plot of the non-linear data suggests three PCs.](4-pca_files/figure-pdf/fig-plane-nonlin-scree-1.pdf){#fig-plane-nonlin-scree width=80%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-plane-nonlin-html fig-align=\"center\" layout-ncol=2}\n\n![All five variables](gifs/plane_nonlin.gif){#fig-nonlinear2 width=230}\n\n![First three PCs](gifs/plane_nonlin_pca.gif){#fig-plane-nonlin-pca width=230}\n\nComparison of the full data and first three principal components. Non-linear relationships between several variables can be seen in a tour on all five variables. The first three principal components reveal a strong non-linear relationship. Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-plane-nonlin-pdf fig-align=\"center\" layout-ncol=2}\n\n![All five variables](images/plane_nonlin_61.png){#fig-nonlinear2 width=250}\n\n![First three PCs](images/plane_nonlin_pca_129.png){#fig-plane-nonlin-pca width=250}\n\nComparison of the full data and first three principal components. Non-linear relationships between several variables can be seen in a tour on all five variables. The first three principal components reveal a strong non-linear relationship. Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities. \n:::\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![From the scatterplot matrix we can see that there is a non-linear relationship visible in PC1 and PC2, with perhaps a small contribution from PC3. 
However, we can see that when the data is reduced to three PCs, some of the non-linear relationships are missed, and interestingly there also appears to be an unusual observation.](4-pca_files/figure-pdf/fig-plane-nonlin-pairs-1.pdf){#fig-plane-nonlin-pairs width=80%}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nOne of the dangers of PCA is that interesting and curious details of the data only emerge in the lowest PCs, which are usually discarded. The tour, and examining the smaller PCs, can help to discover them.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\infobox{One of the dangers of PCA is that interesting and curious details of the data only emerge in the lowest PCs, which are usually discarded. The tour, and examining the smaller PCs, can help to discover them.}\n:::\n\n## Exercises {-}\n\n1. Make a scatterplot matrix of the first four PCs of the `aflw` data. Is the branch pattern visible in any pair?\n2. Construct five new variables to measure these skills: offense, defense, playing time, ball movement, and errors. Using the tour, examine the relationships between these variables. Map out how a few players could be characterised based on these directions of skills.\n3. Symmetrise any `aflw` variables that have skewed distributions using a log or square root transformation. Then re-do the PCA. What do we learn that is different about the associations between the skill variables?\n4. Examine the `bushfires` data using a grand tour on the numeric variables, ignoring the `cause` (class) variable. Note any issues such as outliers or skewness that might affect PCA. How many principal components would be recommended by the scree plot? Examine this PCA model with the data, and explain how well it does or doesn't fit.\n5. Use `pca_tour()` to examine the first five PCs of the `bushfires` data. How do all of the variables contribute to this reduced space?\n6. Reduce the dimension of the `sketches` data to 12 PCs. 
How much variation does this explain? Is there any obvious clustering in this lower-dimensional space?\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n## Project {-}\n\nLinear dimension reduction can optimise for other criteria, and here we will explore one example: the algorithm implemented in the `dobin` package finds a basis in which the first few directions are optimised for the detection of outliers in the data. We will examine how it performs for the `plane_noise_outliers` data (the example where outliers were hidden in the first four principal components).\n\n1. Start by looking up the documentation of `dobin::dobin`. How many parameters does the method depend on?\n2. We first apply the function to the `plane_noise_outliers` data using default values for all parameters.\n3. Recall that the outliers were added in rows 101 and 102 of the data. Make scatterplots showing the projections onto the first, second and third components, using colour to highlight the outliers. Are they visible as outliers with three components?\n4. Adjust the `frac` parameter of the `dobin` function to `frac = 0.99` and repeat the graphical evaluation from point 3. How does it compare to the previous solution?\n\n\n\n::: {.cell}\n\n:::\n\n\n## Principal component analysis \n\index{dimension reduction!principal component analysis (PCA)}\n\nReducing dimensionality using principal component analysis (PCA) dates back to @pearson-pca and @hotelling-pca, and @joliffe2016 provides a current overview. The goal is to find a smaller set of variables, $q (< p)$, that contain as much of the information in the original variables as possible. The new set of variables, known as principal components (PCs), are linear combinations of the original variables. The PCs can be used to represent the data in a lower-dimensional space.\n\nThe process is essentially an optimisation procedure, although PCA has an analytical solution. 
It solves the problem of \n\n$$\n\max_{a_k} ~\text{Var} (Xa_k),\n$$\nwhere $X$ is the $n \times p$ data matrix, $a_k (k=1, ..., p)$ is a 1D projection vector, called an eigenvector, and $\text{Var} (Xa_k)$ is called an eigenvalue. PCA is thus a sequential process: it finds the direction in the high-dimensional space (as given by the first eigenvector) where the data is most varied, then the second most varied direction, and so on. The eigenvectors define the combination of the original variables, and the eigenvalues define the amount of variance explained by the reduced number of variables.\n\index{dimension reduction!eigenvalue}\n\index{dimension reduction!eigenvector}\n\nPCA is very broadly useful for summarising linear association by using combinations of variables that are highly correlated. However, high correlation can also occur when there are outliers or clustering, so PCA is commonly used to detect these patterns too.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nWith visualisation we want to assess whether it is appropriate to use PCA to summarise any linear association by using combinations of variables that are highly correlated. It can help to detect other patterns that might affect the PCA results such as outliers, clustering or non-linear dependence. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\infobox{With visualisation we want to assess whether it is appropriate to use PCA to summarise any linear association by using combinations of variables that are highly correlated. It can help to detect other patterns that might affect the PCA results such as outliers, clustering or non-linear dependence.}\n:::\n\n\nPCA is not very effective when the distribution of the variables is highly skewed, so it can be helpful to transform variables to make them more symmetrically distributed before conducting PCA. 
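\n\nA minimal sketch of such a transformation step (the data frame `d` and the choice of log transformation are hypothetical, to be adapted to the variables at hand):\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\n# illustrative only: symmetrise a right-skewed variable, then standardise\nd <- data.frame(x1 = rexp(100), x2 = rnorm(100))\nd$x1 <- log(d$x1)  # sqrt() is an alternative for milder skewness\nd_std <- data.frame(apply(d, 2, function(x) (x - mean(x)) / sd(x)))\nd_pca <- prcomp(d_std)\n```\n:::\n\n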
It is also possible to summarise different types of structure by generalising the optimisation criteria to any function of projected data, $f(XA)$, which is called *projection pursuit* (PP). PP has a long history (@Kr64a, @FT74, @DF84, @JS87, @Hu85), and there are regularly new developments (e.g. @Lee2009, @perisic2009, @Lee2013, @loperfido, @bickel2018, @zhang2023). \n\n## Determining how many dimensions\n\nWe would start by examining the data using a grand tour. The goal is to check whether there might be potential issues for PCA, such as skewness, outliers or clustering, or even non-linear dependencies.\n\nWe'll start by showing PCA on the simulated data from @sec-dimension-overview. The scree plots are produced using the `mulgar::ggscree()` function, and include a grey guideline to help decide how many PCs are sufficient. This guideline is generated by taking the median of the eigenvalues from PCA applied to 100 samples from a standard multivariate normal distribution. Any values much lower than this line would indicate that those PCs are not contributing to the explanation of variation. For these three simulated examples, the scree plots support the data being 2D, 3D and 5D respectively.\n\n\index{dimension reduction!scree plot}\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Scree plots for the three simulated data sets shown in Figure 3.2. The 2D in 5D is clearly recognised by PCA to be 2D because the variance drops substantially between 2-3 principal components. The 3D in 5D is possibly 3D because the variance drops from 3-4 principal components. The fully 5D data has no drop in variance, and all values are close to the typical value one would observe if the data was fully 5D.](4-pca_files/figure-pdf/fig-2D-pca-1.pdf){#fig-2D-pca width=100%}\n:::\n:::\n\n\n\n\nThe next step is to look at the coefficients for the selected number of PCs. 
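\n\nThe coefficients are stored in the `rotation` component of the `prcomp` output; a sketch, assuming the `plane` data from the `mulgar` package as before:\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(mulgar)\ndata(plane)\nplane_pca <- prcomp(plane)\n# coefficients (eigenvectors) for the first two PCs\nround(plane_pca$rotation[, 1:2], 2)\n```\n:::\n\n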
@tbl-plane-pcs shows the coefficients for the first two PCs of the `plane` data. All five variables contribute, with `x1`, `x2`, `x3` contributing more to `PC1`, and `x4`, `x5` contributing more to `PC2`. @tbl-box-pcs shows the coefficients for the first three PCs of the `box` data. Variables `x1`, `x2`, `x3` contribute strongly to `PC1`, `PC2` has contributions from all variables except `x3`, and variables `x4` and `x5` contribute strongly to `PC3`. \n\n\index{dimension reduction!coefficients}\n\index{dimension reduction!principal components}\n\n\n\n\n::: {#tbl-plane-pcs .cell tbl-cap='Coefficients for the first two PCs for the plane data.'}\n::: {.cell-output-display}\n\begin{longtable}{lrr}\n\toprule\nVariable & PC1 & PC2 \\ \n\midrule\nx1 & $0.58$ & $-0.06$ \\ \nx2 & $-0.55$ & $0.21$ \\ \nx3 & $0.47$ & $-0.41$ \\ \nx4 & $0.25$ & $0.64$ \\ \nx5 & $-0.29$ & $-0.62$ \\ \n\bottomrule\n\end{longtable}\n\n:::\n:::\n\n::: {#tbl-box-pcs .cell tbl-cap='Coefficients for the first three PCs for the box data.'}\n::: {.cell-output-display}\n\begin{longtable}{lrrr}\n\toprule\nVariable & PC1 & PC2 & PC3 \\ \n\midrule\nx1 & $-0.51$ & $0.46$ & $0.11$ \\ \nx2 & $0.51$ & $0.46$ & $0.00$ \\ \nx3 & $-0.65$ & $-0.09$ & $0.23$ \\ \nx4 & $-0.22$ & $0.36$ & $-0.87$ \\ \nx5 & $0.02$ & $0.66$ & $0.43$ \\ \n\bottomrule\n\end{longtable}\n\n:::\n:::\n\n\n\nIn each of these simulated data sets, all five variables contributed to the dimension reduction. If we added two purely noise variables to the plane data, as done in @sec-dimension-overview, the scree plot in @fig-plane-noise-scree would indicate that the data is now 4D, and we would get a different interpretation of the coefficients from the PCA (see @tbl-plane-noise-pcs). We see that `PC1` and `PC2` are approximately the same as before, with the main variables being (`x1`, `x2`, `x3`) and (`x4`, `x5`) respectively. `PC3` and `PC4` are combinations of `x6` and `x7`. 
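\n\nThis comparison can be reproduced along the following lines (a sketch: the construction of `plane_noise` follows @sec-dimension-overview, and the seed here is a placeholder, so exact values may differ):\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(mulgar)\ndata(plane)\n# add two pure noise variables to the plane data, then standardise\nset.seed(1071)\nplane_noise <- plane\nplane_noise$x6 <- rnorm(nrow(plane))\nplane_noise$x7 <- rnorm(nrow(plane))\nplane_noise <- data.frame(apply(plane_noise, 2,\n                                function(x) (x - mean(x)) / sd(x)))\nplane_noise_pca <- prcomp(plane_noise)\nggscree(plane_noise_pca, q = 7)\nround(plane_noise_pca$rotation[, 1:4], 2)\n```\n:::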
\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Additional noise variables expand the plane data to 4D.](4-pca_files/figure-pdf/fig-plane-noise-scree-1.pdf){#fig-plane-noise-scree width=80%}\n:::\n:::\n\n::: {#tbl-plane-noise-pcs .cell tbl-cap='Coefficients for PCs 1-4 of the plane plus noise data.'}\n::: {.cell-output-display}\n\begin{longtable}{lrrrr}\n\toprule\nVariable & PC1 & PC2 & PC3 & PC4 \\ \n\midrule\nx1 & $0.58$ & $0.04$ & $0.01$ & $0.00$ \\ \nx2 & $-0.55$ & $-0.18$ & $-0.03$ & $0.07$ \\ \nx3 & $0.47$ & $0.37$ & $0.05$ & $-0.20$ \\ \nx4 & $0.24$ & $-0.62$ & $-0.06$ & $0.17$ \\ \nx5 & $-0.28$ & $0.60$ & $0.07$ & $-0.14$ \\ \nx6 & $0.05$ & $0.29$ & $-0.58$ & $0.76$ \\ \nx7 & $-0.02$ & $-0.08$ & $-0.81$ & $-0.58$ \\ \n\bottomrule\n\end{longtable}\n\n:::\n:::\n\n\n\n### Example: pisa\n\index{data!pisa}\n\nThe `pisa` data contains simulated data from math, reading and science scores, totalling 30 variables. PCA is used here to examine the associations. We might expect that it is 3D, but what we see suggests it is primarily 1D. This means that a student who scores well in math will also score well in reading and science. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndata(pisa)\npisa_std <- pisa %>%\n filter(CNT == \"Australia\") %>%\n select(-CNT) %>%\n mutate_all(mulgar:::scale2)\npisa_pca <- prcomp(pisa_std)\npisa_scree <- ggscree(pisa_pca, q = 15) + theme_minimal()\n```\n:::\n\n\n\nThe scree plot in @fig-pisa-pca-pdf shows a big drop from one to two PCs in the amount of variance explained. 
A grand tour on the 30 variables can be run using `animate_xy()`: \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nanimate_xy(pisa_std, half_range=1)\n```\n:::\n\n\n\nor rendered as an animated gif using `render_gif()`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nrender_gif(pisa_std, \n grand_tour(), \n display_xy(half_range=0.9),\n gif_file=\"gifs/pisa_gt.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\n```\n:::\n\n\n\nand we can see that the data is elliptical in most projections, sometimes shrinking to be a small circle. This pattern strongly indicates that there is one primary direction of variation in the data, with only small variation in any direction away from it. Shrinking to the small circle is analogous to how *a pencil or cigar or water bottle in 3D looks from some angles*.\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-pisa-pca-html fig-align=\"center\" layout-ncol=2}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Scree plot](4-pca_files/figure-pdf/fig-pisa-scree-html-1.pdf){#fig-pisa-scree-html width=80%}\n:::\n:::\n\n\n\n![Grand tour](gifs/pisa_gt.gif){#fig-pisa-gt fig-alt=\"Tour showing lots of linear projections of the pisa data. You can see strong linear dependence.\" fig.align=\"center\"}\n\nScree plot and tour of the `pisa` data, with 30 variables being the plausible scores for Australian students. In combination, these suggest that the data is effectively 1D.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-pisa-pca-pdf fig-align=\"center\" layout-ncol=2}\n\n![Scree plot](images/fig-pisa-scree-1.png){#fig-pisa-scree-pdf fig-alt=\"FIX ME\"}\n\n![Grand tour frame](images/pisa_gt_249.png){#fig-pisa-gt fig-alt=\"Selected linear projection of the pisa data from a grand tour. You can see strong linear dependence.\" fig.align=\"center\"}\n\nScree plot and a frame from a tour of the `pisa` data, with 30 variables being the plausible scores for Australian students. 
In combination, these suggest that the data is effectively 1D. {{< fa play-circle >}}\n:::\n:::\n\nThe coefficients of the first PC (first eigenvector) are roughly equal in magnitude (as shown below), which tells us that all variables contribute roughly equally. Interestingly, they are all negative, but this is not actually meaningful. With different software these could easily have been all positive. The sign of the coefficients can be reversed, as long as all are reversed; it is like an arrow that points one way being flipped to point the other way. \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to print PC coefficients\"}\nround(pisa_pca$rotation[,1], 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n PV1MATH PV2MATH PV3MATH PV4MATH PV5MATH PV6MATH \n -0.18 -0.18 -0.18 -0.18 -0.18 -0.18 \n PV7MATH PV8MATH PV9MATH PV10MATH PV1READ PV2READ \n -0.18 -0.18 -0.18 -0.18 -0.19 -0.18 \n PV3READ PV4READ PV5READ PV6READ PV7READ PV8READ \n -0.19 -0.19 -0.19 -0.19 -0.19 -0.19 \n PV9READ PV10READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE \n -0.19 -0.19 -0.18 -0.18 -0.19 -0.18 \n PV5SCIE PV6SCIE PV7SCIE PV8SCIE PV9SCIE PV10SCIE \n -0.19 -0.18 -0.19 -0.18 -0.19 -0.18 \n```\n\n\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe tour verifies that the `pisa` data is primarily 1D, indicating that a student who scores well in math probably scores well in reading and science, too. More interestingly, the regular shape of the data strongly indicates that it is \"synthetic\", simulated rather than observed.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\insightbox{The tour verifies that the `pisa` data is primarily 1D, indicating that a student who scores well in math probably scores well in reading and science, too. 
More interestingly, the regular shape of the data strongly indicates that it is \"synthetic\", simulated rather than observed.}\n:::\n\n### Example: aflw\n\index{data!aflw}\n\nThis data has player statistics for all the matches in the 2021 season. We would be interested to know which variables contain similar information, and thus might be combined into single variables. We would expect that many statistics group into a few small sets, such as offensive and defensive skills. We might also expect that some of the statistics are skewed, with most players having low values and just a handful of players being stellar. It is also possible that there are some extreme values. These are interesting features, but they will distract from the main purpose of grouping the statistics. Thus the tour is used to check for potential problems with the data prior to conducting PCA.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(tourr)\ndata(aflw)\naflw_std <- aflw %>%\n mutate_if(is.numeric, function(x) (x-\n mean(x, na.rm=TRUE))/\n sd(x, na.rm=TRUE))\n```\n:::\n\n\n\nWe look at all of the 29 player statistics in a grand tour in @fig-aflw-gt-pdf.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate tour\"}\nanimate_xy(aflw_std[,7:35], half_range=0.9)\nrender_gif(aflw_std[,7:35], \n grand_tour(), \n display_xy(half_range=0.9),\n gif_file=\"gifs/aflw_gt.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-aflw-gt-html}\n\n![](gifs/aflw_gt.gif){fig-alt=\"Tour showing lots of linear projections of the aflw data. You can see linear dependence, and some outliers.\" fig.align=\"center\"}\n\nGrand tour of the AFLW player statistics. Most player statistics concentrate near the centre, indicating most players are \"average\"! There are a few outliers appearing in different combinations of the skills, which one would expect to be the star players for particular skill sets. 
\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-aflw-gt-pdf fig.align=\"center\" fig-alt=\"Example linear projection of the aflw data from a grand tour. You can see linear dependence, and some outliers.\" layout-ncol=2}\n\n![](images/aflw_gt_70.png){width=228}\n\n![](images/aflw_gt_329.png){width=228}\n\nTwo frames from a grand tour of the AFLW player statistics. Most player statistics concentrate near the centre, indicating most players are \"average\"! There are a few outliers appearing in different combinations of the skills, which one would expect to be the star players for particular skill sets. {{< fa play-circle >}}\n:::\n:::\n\nNo major surprises! There is a small amount of skewness, and there are no major outliers. Skewness indicates that most players have reasonably similar skills (bunching of points), except for some key players (the moderate outliers). The skewness could be reduced by applying a log or square root transformation to some variables prior to running the PCA. However, we elect not to do this because the moderate outliers are of interest. These correspond to talented players that we'd like to explore further with the analysis.\n\nBelow we have the conventional summary of the PCA, a scree plot showing the reduction in variance to be explained when each additional PC is considered. It is also conventional to look at a table summarising the proportions of variance explained by PCs, but with almost 30 variables it is easier to make some decision on the number of PCs needed based on the scree plot.\n\n\n\n::: {.cell alt-text='Scree plot showing variance vertically against PC number horizontally. Variance drops from close to 10 for PC 1 to about 1.2 for PC 4 then slowly decays through to PC 29.'}\n::: {.cell-output-display}\n![Scree plot showing decay in variance of PCs. 
There are sharp drops for the first four PCs, and then smaller declines.](4-pca_files/figure-pdf/fig-aflw-pca-1.pdf){#fig-aflw-pca width=80%}\n:::\n:::\n\n\n\n\index{dimension reduction!scree plot}\n\nFrom the scree plot in @fig-aflw-pca, we see sharp drops from one to two and from two to three PCs, and then smaller drops. After four PCs the variance drops again at six PCs, and then gradually decays. We will choose four PCs to examine more closely. This explains 67.2% of the variance.\n\n\n\n::: {#tbl-aflw-pcs .cell tbl-cap='Coefficients for the first four PCs. PC 2 contrasts with PC 1, with the first having large coefficients primarily on field play statistics, and the second having large coefficients on the scoring statistics.'}\n::: {.cell-output-display}\n\begin{longtable}{lrrrr}\n\toprule\nVariable & PC1 & PC2 & PC3 & PC4 \\ \n\midrule\ndisposals & $0.31$ & $-0.05$ & $-0.03$ & $0.07$ \\ \npossessions & $0.31$ & $-0.03$ & $-0.07$ & $0.09$ \\ \nkicks & $0.29$ & $-0.04$ & $0.09$ & $-0.12$ \\ \nmetres & $0.28$ & $-0.03$ & $0.10$ & $-0.15$ \\ \ncontested & $0.28$ & $0.01$ & $-0.12$ & $0.23$ \\ \nuncontested & $0.28$ & $-0.06$ & $-0.01$ & $-0.05$ \\ \nturnovers & $0.27$ & $-0.01$ & $-0.01$ & $-0.29$ \\ \nclearances & $0.23$ & $0.00$ & $-0.29$ & $0.19$ \\ \nclangers & $0.23$ & $-0.02$ & $-0.06$ & $-0.33$ \\ \nhandballs & $0.23$ & $-0.04$ & $-0.19$ & $0.31$ \\ \nfrees\_for & $0.21$ & $0.02$ & $-0.13$ & $0.18$ \\ \nmarks & $0.21$ & $0.03$ & $0.32$ & $0.02$ \\ \ntackles & $0.20$ & $0.01$ & $-0.28$ & $0.09$ \\ \ntime\_pct & $0.16$ & $-0.04$ & $0.35$ & $-0.02$ \\ \nintercepts & $0.13$ & $-0.28$ & $0.24$ & $0.03$ \\ \nrebounds\_in50 & $0.13$ & $-0.28$ & $0.24$ & $-0.06$ \\ \nfrees\_against & $0.13$ & $0.03$ & $-0.16$ & $-0.23$ \\ \nassists & $0.09$ & $0.23$ & $0.00$ & $0.05$ \\ \nbounces & $0.09$ & $0.03$ & $0.02$ & $-0.28$ \\ \nbehinds & $0.09$ & $0.32$ & $0.08$ & $-0.02$ \\ \nshots & $0.08$ & $0.38$ & $0.12$ & $-0.03$ \\ 
\ntackles\_in50 & $0.07$ & $0.27$ & $-0.18$ & $0.03$ \\ \nmarks\_in50 & $0.06$ & $0.34$ & $0.18$ & $0.04$ \\ \ncontested\_marks & $0.05$ & $0.16$ & $0.34$ & $0.15$ \\ \ngoals & $0.04$ & $0.37$ & $0.16$ & $0.03$ \\ \naccuracy & $0.04$ & $0.34$ & $0.10$ & $0.06$ \\ \none\_pct & $0.03$ & $-0.21$ & $0.33$ & $0.08$ \\ \ndisposal & $0.02$ & $-0.13$ & $0.20$ & $0.50$ \\ \nhitouts & $-0.04$ & $0.00$ & $-0.03$ & $0.32$ \\ \n\bottomrule\n\end{longtable}\n\n:::\n:::\n\n\n\nWhen there are as many variables as this, it can be hard to digest the combinations of variables most contributing to each PC. Rearranging the table by sorting on a selected PC can help. @tbl-aflw-pcs has been sorted according to the PC 1 coefficients.\n\nPC 1 is primarily composed of `disposals`, `possessions`, `kicks`, `metres`, `uncontested`, `contested`, ...: primarily the field play statistics! It is quite common in PCA for the first PC to be a combination of all variables, which suggests that there is one main direction of variation in the data. Here it is not quite that. PCA suggests that the primary variation is through a combination of field skills, or basic football playing skills.\n\nThe second PC contrasts with the first, because it is primarily a combination of `shots`, `goals`, `marks_in50`, `accuracy`, and `behinds` contrasted against `rebounds_in50` and `intercepts`. The positive coefficients are primary offensive skills and the negative coefficients are defensive skills. This PC is a reasonable measure of the offensive vs defensive skills of a player.\n\n\index{dimension reduction!interpretation}\n\nWe could continue to interpret each PC by examining large coefficients to help decide how many PCs are a suitable summary of the information in the data. Briefly, PC 3 is mixed, but it is possibly a measure of the worth of the player, because `time_pct` has a large coefficient, so players that are on the field longer will contribute strongly to this new variable. 
It also has large (and opposite) contributions from `clearances`, `tackles`, `contested_marks`. PC 4 appears to be related to aggressive play with `clangers`, `turnovers`, `bounces` and `frees_against` featuring. All four PCs have useful information. \n\nFor deeper exploration, when we tour the four PCs, we'd like to be able to stop and identify players. This can be done by creating a pre-computed animation, with additional mouse-over, made using `plotly`. However, it is not size-efficient, and is only feasible with a small number of observations, because all of the animation frames, each containing the fully projected data, are composed into a single object, which grows large very quickly.\n\n::: {.content-visible when-format=\"html\"}\nThe result is shown in @fig-aflw-pcatour. We can see that the shape of the four PCs is similar to that of all the variables, bunching of points in the centre with a lot of moderate outliers.\n:::\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-aflw-pcatour}\n\n\n\nAnimation of four PCs of the aflw data with interactive labelling.\n:::\n:::\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(plotly)\nggplotly(pg18, width=500, height=500)\n```\n:::\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Frame 18 re-plotted so that players can be identified. Here some players are labelled, but ideally this plot is interactive and any player can be identified. {{< fa play-circle >}}](4-pca_files/figure-pdf/fig-aflw-pcaplots-pdf-1.pdf){#fig-aflw-pcaplots-pdf width=60%}\n:::\n:::\n\n\n:::\n\nFor any particular frame, like frame 18 re-plotted in @fig-aflw-pcaplots-pdf, we can investigate further. Here there is a branching pattern, where the branch points in the direction of PC 1. 
Mousing over the players at the tip of this branch, we find players like Alyce Parker, Brittany Bonnici, Dana Hooker, Kiara Bowers. If you look up the bios of these players you'll find they all have generally good player descriptions like \"elite disposals\", \"powerful left foot\", \"hard-running midfielder\", \"best and fairest\".\n\nIn the direction of PC 2, you'll find players like Lauren Ahrens, Stacey Livingstone who are star defenders. Players at this end of PC 2 have high scores on `intercepts` and `rebounds_in50`.\n\nAnother interesting frame for inspecting PC 2 is frame 59. At one end of PC 2 are players with high goal-scoring skills, and at the other, players with good defensive skills. So mousing over the other end of PC 2 finds players like Gemma Houghton and Katie Brennan, who are known for their goal scoring. The branch pattern is an interesting one, because it tells us there is some combination of skills that is lacking among all players; primarily this appears to be a distinction between defensive skills and general playing skills. It's not as simple as this, though, because the branching is only visible when PC 1 and PC 2 are examined with PC 3.\n\nPCA is useful for getting a sense of the variation in a high-dimensional data set. Interpreting the principal components is often useful, but it can be discombobulating. For the `aflw` data it would be good to think about it as a guide to the main directions of variation and to follow with a more direct engineering of variables into interesting player characteristics. For example, calculate offensive skill as an equal combination of goals, accuracy, shots, behinds. A set of new variables specifically computed to measure particular skills would make explaining an analysis easier.\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe tour verifies that PCA on the `aflw` data is complicated and doesn't capture all of the variation. However, it does provide useful insights. 
It detected outstanding players, and indicated the different skill sets of top goal scorers and top defensive players.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{The tour verifies that PCA on the `aflw` data is complicated and doesn't capture all of the variation. However, it does provide useful insights. It detected outstanding players, and indicated the different skill sets of top goal scorers and top defensive players.}\n:::\n\n## Examining the PCA model in the data space\n\\index{model-in-the-data-space}\n\nWhen you choose a smaller number of PCs $(k)$ than the number of original variables, this is essentially producing a model for the data. The model is the lower-dimensional $k$-D space. It is analogous to a linear regression model, except that the residuals from the model are $(p-k)$-D. \n\nIt is common to show the model, that is, the data projected into the $k$-D model space. When $k=2$ this is called a \"biplot\". For the `plane` and `plane_noise` data the biplots are shown in @fig-plane-biplot. This is useful for checking which variables contribute most to the new principal component variables, and also for checking for any problems that might have affected the fit, such as outliers, clusters or non-linearity. Interestingly, biplots are typically only made in 2D, even if the data should be summarised by more than two PCs. Occasionally you will see the biplot made for PC $j$ vs PC $k$ also. With the `pca_tour()` function in the `tourr` package you can view a $k$-D biplot. This will display the $k$ PCs with the axes displaying the original variables, and thus show their contribution to the PCs. \n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Biplots of the plane (a) and plane + noise (b) data. All five variables contribute strongly to the two principal components in (a): PC1 is primarily `x1`, `x2` and `x3` and PC2 is primarily `x4` and `x5`. 
In (b) the same five variables contribute in almost the same way, with variables `x6` and `x7` contributing very little. The data was constructed this way, with these two variables being purely noise.](4-pca_files/figure-pdf/fig-plane-biplot-1.pdf){#fig-plane-biplot width=100%}\n:::\n:::\n\n\n\n\nIt can be useful to examine this model using the tour. The model is simply a plane in high dimensions. This would be considered to be the model in the data space. The reason to do this is to check how well the model fits the data. The plane corresponding to the model should be oriented along the main direction of the points, and the spread of points around the plane should be small. We should also be able to see if there has been any strong non-linear relationship missed by the model, or outliers and clusters.\n\nThe function `pca_model()` from the `mulgar` package can be used to represent the model as a $k$-D wire-frame plane. @fig-plane-box-model-pdf shows the models for the `plane` and `box` data, 2D and 3D respectively.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nWe look at the model in the data space to check how well the model fits the data. If it fits well, the points will cluster tightly around the model representation, with little spread in other directions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{We look at the model in the data space to check how well the model fits the data. 
If it fits well, the points will cluster tightly around the model representation, with little spread in other directions.}\n:::\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-plane-box-model-html fig-align=\"center\" layout-ncol=2}\n\n![Model for the 2D in 5D data.](gifs/plane_model.gif){#fig-plane-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\n![Model for the 3D in 5D data.](gifs/box_model.gif){#fig-box-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\nPCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-plane-box-model-pdf fig-align=\"center\" layout-ncol=2}\n\n![Model for the 2D in 5D data.](images/plane_model_55.png){#fig-plane-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\n![Model for the 3D in 5D data.](images/box_model_13.png){#fig-box-model fig-alt=\"FIX ME.\" fig.align=\"center\"}\n\nPCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. {{< fa play-circle >}}\n:::\n:::\n\n### Example: pisa\n\\index{data!pisa}\n\nThe model for the `pisa` data is a 1D vector, shown in @fig-pisa-model-pdf. In this example there is a good agreement between the model and the data.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-pisa-model-html}\n\n![](gifs/pisa_model.gif){fig-alt=\"Something here\" fig.align=\"center\"}\n\nPCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-pisa-model-pdf fig-alt=\"Something here\" fig.align=\"center\"}\n\n![](images/pisa_model_17.png){width=300}\n\nPCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model. 
{{< fa play-circle >}}\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe `pisa` data fits fairly closely to the 1D PCA model. The variance of points away from the model is symmetric and relatively small. This suggests the 1D model is a reasonable summary of the test scores.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{The `pisa` data fits fairly closely to the 1D PCA model. The variance of points away from the model is symmetric and relatively small. This suggests the 1D model is a reasonable summary of the test scores.}\n:::\n\n### Example: aflw\n\\index{data!aflw}\n\nIt is less useful to examine the PCA model for the `aflw` data, because the main patterns that were of interest were the exceptional players. However, we will do it anyway! @fig-aflw-model-pdf shows the 4D PCA model overlain on the data. Even though the distribution of points is not as symmetric and balanced as the other examples, we can see that the cube structure mirrors the variation. We can see that the relationships between variables are not strictly linear, because the spread extends unevenly away from the box. \n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-aflw-model-html}\n\n![](gifs/aflw_model.gif){fig-alt=\"Something here\" fig.align=\"center\"}\n\nPCA model of the `aflw` data. The linear model is not ideal for this data, which has other patterns like outliers, and some branching. However, the model roughly captures the linear associations, and leaves unexplained and unequal variation in different directions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-aflw-model-pdf fig-alt=\"Something here\" fig.align=\"center\"}\n\n![](images/aflw_model_70.png){width=300}\n\nPCA model of the `aflw` data. The linear model is not ideal for this data, which has other patterns like outliers, and some branching. 
However, the model roughly captures the linear associations, and leaves unexplained variation in different directions. {{< fa play-circle >}}\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nFrom the tour we see that the 4D model leaves substantial variation unexplained. It is also not symmetric, and there is larger variation away from the model in some combinations of variables than in others. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{From the tour we see that the 4D model leaves substantial variation unexplained. It is also not symmetric, and there is larger variation away from the model in some combinations of variables than in others.}\n:::\n\n## When relationships are not linear \n\n### Example: outliers\n\\index{outliers}\n\n@fig-plane-n-o-scree shows the scree plot for the planar data with noise and outliers. It is very similar to the scree plot on the data without the outliers (@fig-plane-noise-scree). However, what we see from @fig-p-o-pca-pdf is that PCA loses the outliers. The animation in (a) shows the full data, and the outliers, marked by colour and labelled 1 and 2, are clearly unusual in some projections. When we examine the tour of the first four PCs (as suggested by the scree plot) the outliers are not unusual. They are almost contained in the point cloud. The reason is clear when all the PCs are plotted, and the outliers can be seen to be clearly detected only in PC5, PC6 and PC7. \n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for screeplot\"}\nplane_n_o_pca <- prcomp(plane_noise_outliers)\nggscree(plane_n_o_pca, q = 7) + theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scree plot of the planar data with noise and two outliers. 
It is almost the same as the scree plot for the data without the outliers.](4-pca_files/figure-pdf/fig-plane-n-o-scree-1.pdf){#fig-plane-n-o-scree fig-pos='H' width=80%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-p-o-pca-html fig-align=\"center\" layout-ncol=2}\n\n![Outliers clearly visible](gifs/plane_n_o_clr.gif){#fig-plane-n-o-clr width=250}\n\n![Outliers not clearly visible in PC1-4](gifs/plane_n_o_pca.gif){#fig-plane-n-o-pca width=250}\n\nExamining the handling of outliers in the PCA of the planar data with noise variables and two outliers. PCA has lost these two extreme values.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-p-o-pca-pdf fig-align=\"center\" layout-ncol=2}\n\n![Outliers clearly visible](images/plane_n_o_clr_181.png){#fig-plane-n-o-clr width=210}\n\n![Outliers not clearly visible in PC1-4](images/plane_n_o_pca_181.png){#fig-plane-n-o-pca width=210}\n\nExamining the handling of outliers in the PCA of the planar data with noise variables and two outliers. PCA has lost these two extreme values. {{< fa play-circle >}}\n:::\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![From the scatterplot matrix we can see that the outliers are present in PC5, PC6 and PC7. That means that by reducing the dimensionality to the first four PCs the model has missed some important characteristics of the data.](4-pca_files/figure-pdf/fig-plane-o-n-pairs-1.pdf){#fig-plane-o-n-pairs width=80%}\n:::\n:::\n\n\n\n### Example: Non-linear associations\n\\index{nonlinearity}\n\n@fig-plane-nonlin-pdf shows the tour of the full 5D data containing non-linear relationships in comparison with a tour of the first three PCs, as recommended by the scree plot (@fig-plane-nonlin-scree). The PCs capture a clear and very clean non-linear relationship, but it looks like they have missed some of the complexities of the relationships. 
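This kind of check can be sketched in a few lines of base R. The data here is hypothetical (a quadratic association between two variables plus pure noise variables, not the simulated data used in this chapter), and is only meant to illustrate plotting *all* of the PCs, so that structure hiding in the low-variance PCs is not overlooked:

```r
# Hypothetical 5D data with a hidden non-linear association
# (an assumption for illustration, not the book's data).
set.seed(1)
n <- 200
x1 <- runif(n, -1, 1)
d <- data.frame(x1 = x1,
                x2 = x1^2 + rnorm(n, sd = 0.05),  # non-linear in x1
                x3 = rnorm(n, sd = 0.1),
                x4 = rnorm(n, sd = 0.1),
                x5 = rnorm(n, sd = 0.1))
d_pca <- prcomp(d, scale. = TRUE)
summary(d_pca)   # variance explained by each PC
pairs(d_pca$x)   # scatterplot matrix of ALL PCs, not just the first few
```

Whatever the scree plot recommends, the pairs plot of every PC is cheap to make and is where features like the curve and the outlier in this example tend to surface.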
The scatterplot matrix of all 5 PCs (@fig-plane-nonlin-pairs) shows that PC4 and PC5 contain interesting features: more non-linearity, and curiously, an outlier.\n \n\n\n::: {.cell}\n::: {.cell-output-display}\n![Scree plot of the non-linear data suggests three PCs.](4-pca_files/figure-pdf/fig-plane-nonlin-scree-1.pdf){#fig-plane-nonlin-scree width=80%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-plane-nonlin-html fig-align=\"center\" layout-ncol=2}\n\n![All five variables](gifs/plane_nonlin.gif){#fig-nonlinear2 width=230}\n\n![First three PCs](gifs/plane_nonlin_pca.gif){#fig-plane-nonlin-pca width=230}\n\nComparison of the full data and first three principal components. Non-linear relationships between several variables can be seen in a tour on all five variables. The first three principal components reveal a strong non-linear relationship. Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-plane-nonlin-pdf fig-align=\"center\" layout-ncol=2}\n\n![All five variables](images/plane_nonlin_61.png){#fig-nonlinear2 width=210}\n\n![First three PCs](images/plane_nonlin_pca_129.png){#fig-plane-nonlin-pca width=210}\n\nComparison of the full data and first three principal components. Non-linear relationships between several variables can be seen in a tour on all five variables. The first three principal components reveal a strong non-linear relationship. Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities. {{< fa play-circle >}}\n:::\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![From the scatterplot matrix we can see that there is a non-linear relationship visible in PC1 and PC2, with perhaps a small contribution from PC3. 
However, we can see that when the data is reduced to three PCs, it misses some of the non-linear relationships, and interestingly, it seems that there is also an unusual observation.](4-pca_files/figure-pdf/fig-plane-nonlin-pairs-1.pdf){#fig-plane-nonlin-pairs width=80%}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nOne of the dangers of PCA is that interesting and curious details of the data only emerge in the lowest PCs, which are usually discarded. The tour, and examining the smaller PCs, can help to discover them.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{One of the dangers of PCA is that interesting and curious details of the data only emerge in the lowest PCs, which are usually discarded. The tour, and examining the smaller PCs, can help to discover them.}\n:::\n\n## Exercises {-}\n\n1. Make a scatterplot matrix of the first four PCs of the `aflw` data. Is the branch pattern visible in any pair?\n2. Construct five new variables to measure these skills: offense, defense, playing time, ball movement, errors. Using the tour, examine the relationship between these variables. Map out how a few players could be characterised based on these directions of skills.\n3. Symmetrise any `aflw` variables that have skewed distributions using a log or square root transformation. Then re-do the PCA. What do we learn that is different about associations between the skill variables?\n4. Examine the `bushfires` data using a grand tour on the numeric variables, ignoring the `cause` (class) variable. Note any issues such as outliers, or skewness that might affect PCA. How many principal components would be recommended by the scree plot? Examine this PCA model with the data, and explain how well it does or doesn't fit.\n5. Use the `pca_tour` to examine the first five PCs of the `bushfires` data. How do all of the variables contribute to this reduced space?\n6. Reduce the dimension of the `sketches` data to 12 PCs. 
How much variation does this explain? Is there any obvious clustering in this lower dimensional space?\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n## Project {-}\n\nLinear dimension reduction can optimise for other criteria, and here we will explore one example: the algorithm implemented in the `dobin` package finds a basis in which the first few directions are optimised for the detection of outliers in the data. We will examine how it performs for the `plane_noise_outliers` data (the example where outliers were hidden in the first four principal components).\n\n1. Start by looking up the documentation of `dobin::dobin`. How many parameters does the method depend on?\n2. We first apply the function to the `plane_noise_outliers` data using default values for all parameters.\n3. Recall that the outliers were added in rows 101 and 102 of the data. Make scatterplots showing the projection onto the first, second and third components found by `dobin`, using colour to highlight the outliers. Are they visible as outliers with three components?\n4. Adjust the `frac` parameter of the `dobin` function to `frac = 0.99` and repeat the graphical evaluation from point 3. 
How does it compare to the previous solution?\n\n\n\n::: {.cell}\n\n:::\n
"hash": "9ba4a5cf846206a549b5fc0d85e4ecb1", + "result": { + "engine": "knitr", + "markdown": "## Non-linear dimension reduction\n\n## Explanation of NLDR methods\n\nNon-linear dimension reduction (NLDR) aims to find a low-dimensional representation of the high-dimensional data that shows the main features of the data. In statistics, it dates back to @Kr64a's work on multidimensional scaling (MDS). Some techniques only require an interpoint similarity or distance matrix as the main ingredient, rather than the full data. We'll focus on when the full data is available here, so we can also compare structure perceived using the tour on the high-dimensional space, relative to structure revealed in the low-dimensional embedding.\n\nThere are many methods available for generating non-linear low dimensional representations of the data. MDS is a classical technique that minimises the difference between two interpoint distance matrices, the distance between points in the high-dimensions, and in the low-dimensional representations. A good resource for learning about MDS is @BG05.\n\n\\index{dimension reduction!t-SNE}\n\\index{dimension reduction!UMAP}\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate the 2D non-linear representation\"}\nlibrary(mulgar)\nlibrary(Rtsne)\nlibrary(uwot)\nlibrary(ggplot2)\nlibrary(patchwork)\nset.seed(42)\ncnl_tsne <- Rtsne(clusters_nonlin)\ncnl_umap <- umap(clusters_nonlin)\nn1 <- ggplot(as.data.frame(cnl_tsne$Y), aes(x=V1, y=V2)) +\n geom_point() + \n ggtitle(\"(a) t-SNE\") +\n theme_minimal() + \n theme(aspect.ratio=1)\nn2 <- ggplot(as.data.frame(cnl_umap), aes(x=V1, y=V2)) +\n geom_point() + \n ggtitle(\"(b) UMAP\") +\n theme_minimal() + \n theme(aspect.ratio=1)\nn1 + n2\n```\n\n::: {.cell-output-display}\n![Two non-linear embeddings of the non-linear clusters data: (a) t-SNE, (b) UMAP. 
Both suggest four clusters, with two being non-linear in some form.](5-nldr_files/figure-html/fig-nldr-clusters-1.png){#fig-nldr-clusters fig-alt='FIXME' width=768}\n:::\n:::\n\n\n@fig-nldr-clusters shows two NLDR views of the `clusters_nonlin` data set from the `mulgar` package. Both suggest that there are four clusters, and that some clusters are non-linearly shaped. They disagree on the type of non-linear pattern: t-SNE represents one cluster as a wavy shape, while in UMAP both non-linear clusters have a simple parabolic shape. Popular methods in current use include t-SNE [@Maaten2008], UMAP [@McInnes2018] and PHATE [@Moon2019].\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to create animated gif\"}\nlibrary(tourr)\nrender_gif(clusters_nonlin, \n grand_tour(),\n display_xy(),\n gif_file = \"gifs/clusters_nonlin.gif\",\n frames = 500,\n width = 300, \n height = 300)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n![Grand tour of the nonlinear clusters data set, showing four clusters. Two are very small and spherical in shape. One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape.](gifs/clusters_nonlin.gif){#fig-clusters-nonlin-html}\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-clusters-nonlin-pdf layout-ncol=2}\n\n![](images/clusters_nonlin_60.png){width=250}\n\n![](images/clusters_nonlin_233.png){width=250}\n\nTwo frames from a grand tour of the nonlinear clusters data set, showing four clusters. Two are very small and spherical in shape. One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape.\n:::\n\nThe full 4D data is shown with a grand tour in @fig-clusters-nonlin-html. The four clusters suggested by the NLDR methods can be seen. We also get a better sense of the relative size and proximity of the clusters. There are two small spherical clusters, one quite close to the end of the large sine wave cluster. 
The fourth cluster is relatively small, and has a slight curve, like a bent rod. The t-SNE representation is slightly more accurate than the UMAP representation. We would expect that the wavy cluster is the sine wave seen in the tour. \n\n\n\n::: info\nNLDR can provide useful low-dimensional summaries of high-dimensional structure but you need to check whether it is a sensible and accurate representation by comparing it with what is perceived from a tour.\n:::\n\n## Assessing reliability of the NLDR representation \n\nNLDR can produce useful low-dimensional summaries of structure in high-dimensional data, like those shown in @fig-nldr-clusters. However, there are numerous pitfalls. The fitting procedure can produce very different representations depending on the parameter choices, and even the random seed used in the fit. (You can check this by changing the `set.seed` in the code above, and by changing from the default parameters.) Also, it may not be possible to represent the high-dimensional structures faithfully in low dimensions. For these reasons, one needs to connect the NLDR view with a tour of the data, to help assess its usefulness and accuracy. For example, with this data, we would want to know which of the two curved clusters in the UMAP representation corresponds to the sine wave cluster. \n\n### Using `liminal`\n\\index{liminal}\n\n@fig-liminal-clusters-nonlin shows how the NLDR plot can be linked to a tour view, using the `liminal` package, to better understand how well the structure of the data is represented. Here we learn that the smile in the UMAP embedding is the small bent rod cluster, and that the unibrow is the sine wave. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(liminal)\numap_df <- data.frame(umapX = cnl_umap[, 1],\n umapY = cnl_umap[, 2])\nlimn_tour_link(\n umap_df,\n clusters_nonlin,\n cols = x1:x4\n)\n```\n:::\n\n\n::: {#fig-liminal-clusters-nonlin layout-ncol=1}\n\n![Smile matches bent rod.](images/liminal-clusters-nonlin1.png){#fig-smile}\n\n![Unibrow matches sine wave.](images/liminal-clusters-nonlin2.png){#fig-unibrow}\n\nTwo screenshots from liminal showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster. The unibrow matches to the sine wave cluster.\n:::\n\n### Using `detourr`\n\\index{detourr}\n\n@fig-detourr-clusters-nonlin shows how the linking is achieved using `detourr`. It uses a shared data object, as made possible by the `crosstalk` package, and the UMAP view is made interactive using `plotly`. \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(detourr)\nlibrary(dplyr)\nlibrary(crosstalk)\nlibrary(plotly)\numap_df <- data.frame(umapX = cnl_umap[, 1],\n umapY = cnl_umap[, 2])\ncnl_df <- bind_cols(clusters_nonlin, umap_df)\nshared_cnl <- SharedData$new(cnl_df)\n\ndetour_plot <- detour(shared_cnl, tour_aes(\n projection = starts_with(\"x\"))) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.7, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\numap_plot <- plot_ly(shared_cnl,\n x = ~umapX, \n y = ~umapY,\n color = I(\"black\"),\n height = 450) %>%\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n\nbscols(\n detour_plot, umap_plot,\n widths = c(5, 6)\n )\n```\n:::\n\n\n\n![Screenshot from detourr showing which clusters match between the UMAP representation and the tour animation. 
The smile corresponds to the small bent rod cluster.](images/detourr-clusters-nonlin.png){#fig-detourr-clusters-nonlin}\n\n## Example: `fake_trees`\n\\index{data!fake trees}\n\n@fig-liminal-trees shows a more complex example, using the `fake_trees` data. We know that the 10D data has a main branch, and 9 branches (clusters) attached to it, based on our explorations in the earlier chapters. The t-SNE view, where points are coloured by the known branch ids, is very helpful for seeing the linear branch structure. \n\nWhat we can't tell is that there is a main branch from which all of the others extend. We also can't tell which of the clusters corresponds to this branch. Linking the plot with a tour helps with this. Although not shown in the sequence of snapshots in @fig-liminal-trees, the main branch is actually the dark blue cluster, which is separated into three pieces by t-SNE.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to run liminal on the fake trees data\"}\nlibrary(liminal)\nlibrary(Rtsne)\ndata(fake_trees)\nset.seed(2020)\ntsne <- Rtsne::Rtsne(dplyr::select(fake_trees, dplyr::starts_with(\"dim\")))\ntsne_df <- data.frame(tsneX = tsne$Y[, 1],\n tsneY = tsne$Y[, 2])\nlimn_tour_link(\n tsne_df,\n fake_trees,\n cols = dim1:dim10,\n color = branches\n)\n```\n:::\n\n\n::: {#fig-liminal-trees layout-ncol=1}\n\n![Linked views of t-SNE dimension reduction with a tour of the fake trees data. The t-SNE view clearly shows ten 1D non-linear clusters, while the tour of the full 100 variables suggests a lot more variation in the data, and less difference between clusters. ](images/fake_trees1.png){#fig-trees1}\n\n\n![Focus on the green cluster which is split by t-SNE. The shape as viewed in many linear projections shown by the tour shows that it is a single curved cluster. The split is an artifact of the t-SNE mapping.](images/fake_trees2.png){#fig-trees2}\n\n\n\n![Focus on the purple cluster which splits the green cluster in the t-SNE view. 
The tour shows that these two clusters are distinct, but are close in one neighbourhood of the 100D space. The close proximity in the t-SNE view is reasonable, though.](images/fake_trees3.png){#fig-trees3}\n\nThree snapshots of using the `liminal` linked views to explore how t-SNE has summarised the `fake_trees` data in 2D.\n:::\n\n\n::: insight\nThe t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches. \n:::\n\n\n## Exercises {-}\n\n1. Using the `penguins_sub` data, generate a 2D representation using t-SNE. Plot the points mapping the colour to species. What is most surprising? (Hint: Are the three species represented by three distinct clusters?)\n2. Re-do the t-SNE representation with different parameter choices. Are the results different each time, or could they be considered to be equivalent?\n3. Use `liminal` or `detourr` to link the t-SNE representation to a tour of the penguins. Highlight the points that have been placed in an awkward position by t-SNE from others in their species. Watch them relative to the others in their species in the tour view, and think about whether there is any rationale for the awkward placement.\n4. Use UMAP to make the 2D representation, and use `liminal` or `detourr` to link with a tour to explore the result.\n5. Conduct your best t-SNE and UMAP representations of the `aflw` data. Compare and contrast what is learned relative to a tour or the principal component analysis. 
\n\n\n::: {.cell}\n\n:::\n\n\n\n", + "supporting": [ + "5-nldr_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/5-nldr/execute-results/tex.json b/_freeze/5-nldr/execute-results/tex.json index 267e56b..05ecb78 100644 --- a/_freeze/5-nldr/execute-results/tex.json +++ b/_freeze/5-nldr/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "d5a50a67c35ba4d1a1a5245f666bc1a6", + "hash": "c0159d931d6caeeb17042cb34c0d50c4", "result": { "engine": "knitr", - "markdown": "## Non-linear dimension reduction\n\n## Explanation of NLDR methods\n\nNon-linear dimension reduction (NLDR) aims to find a low-dimensional representation of the high-dimensional data that shows the main features of the data. In statistics, it dates back to @Kr64a's work on multidimensional scaling (MDS). Some techniques only require an interpoint similarity or distance matrix as the main ingredient, rather than the full data. We'll focus on when the full data is available here, so we can also compare structure perceived using the tour on the high-dimensional space, relative to structure revealed in the low-dimensional embedding.\n\nThere are many methods available for generating non-linear low dimensional representations of the data. MDS is a classical technique that minimises the difference between two interpoint distance matrices, the distance between points in the high-dimensions, and in the low-dimensional representations. 
A good resource for learning about MDS is @BG05.\n\n\\index{dimension reduction!t-SNE}\n\\index{dimension reduction!UMAP}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to generate the 2D non-linear representation\"}\nlibrary(mulgar)\nlibrary(Rtsne)\nlibrary(uwot)\nlibrary(ggplot2)\nlibrary(patchwork)\nset.seed(42)\ncnl_tsne <- Rtsne(clusters_nonlin)\ncnl_umap <- umap(clusters_nonlin)\nn1 <- ggplot(as.data.frame(cnl_tsne$Y), aes(x=V1, y=V2)) +\n geom_point() + \n ggtitle(\"(a) t-SNE\") +\n theme_minimal() + \n theme(aspect.ratio=1)\nn2 <- ggplot(as.data.frame(cnl_umap), aes(x=V1, y=V2)) +\n geom_point() + \n ggtitle(\"(b) UMAP\") +\n theme_minimal() + \n theme(aspect.ratio=1)\nn1 + n2\n```\n\n::: {.cell-output-display}\n![Two non-linear embeddings of the non-linear clusters data: (a) t-SNE, (b) UMAP. Both suggest four clusters, with two being non-linear in some form.](5-nldr_files/figure-pdf/fig-nldr-clusters-1.pdf){#fig-nldr-clusters fig-pos='H' fig-alt='FIXME' width=80%}\n:::\n:::\n\n\n\n@fig-nldr-clusters show two NLDR views of the `clusters_nonlin` data set from the `mulgar` package. Both suggest that there are four clusters, and that some clusters are non-linearly shaped. They disagree on the type of non-linear pattern, where t-SNE represents one cluster as a wavy-shape and UMAP both have a simple parabolic shape. Popular methods in current use include t-SNE [@Maaten2008], UMAP [@McInnes2018] and PHATE [@Moon2019].\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to create animated gif\"}\nlibrary(tourr)\nrender_gif(clusters_nonlin, \n grand_tour(),\n display_xy(),\n gif_file = \"gifs/clusters_nonlin.gif\",\n frames = 500,\n width = 300, \n height = 300)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n![Grand tour of the nonlinear clusters data set, shows four clusters. Two are very small and spherical in shape. 
One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape.](gifs/clusters_nonlin.gif){#fig-clusters-nonlin-html}\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-clusters-nonlin-pdf layout-ncol=2}\n\n![](images/clusters_nonlin_60.png){width=250}\n\n![](images/clusters_nonlin_233.png){width=250}\n\nTwo frames from a grand tour of the nonlinear clusters data set, shows four clusters. Two are very small and spherical in shape. One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape.\n:::\n:::\n\nThe full 4D data is shown with a grand tour in @fig-clusters-nonlin-pdf @. The four clusters suggested by the NLDR methods can be seen. We also get a better sense of the relative size and proximity of the clusters. There are two small spherical clusters, one quite close to the end of the large sine wave cluster. The fourth cluster is relatively small, and has a slight curve, like a bent rod. The t-SNE representation is slightly more accurate than the UMAP representation. We would expect that the wavy cluster is the sine wave seen in the tour. \n\n\n\n::: info\nNLDR can provide useful low-dimensional summaries of high-dimensional structure but you need to check whether it is a sensible and accurate representation by comparing with what is perceived from a tour.\n:::\n\n## Assessing reliability of the NLDR representation \n\nNLDR can produce useful low-dimensional summaries of structure in high-dimensional data, like those shown in @fig-nldr-clusters. However, there are numerous pitfalls. The fitting procedure can produce very different representations depending on the parameter choices, and even the random number seeding the fit. (You can check this by changing the `set.seed` in the code above, and by changing from the default parameters.) Also, it may not be possible to represent the high-dimensional structures faithfully low dimensions. 
For these reasons, one needs to connect the NLDR view with a tour of the data, to help assess its usefulness and accuracy. For example, with this data, we would want to know which of the two curved clusters in the UMAP representation correspond to the sine wave cluster. \n\n### Using `liminal`\n\\index{liminal}\n\n@fig-liminal-clusters-nonlin shows how the NLDR plot can be linked to a tour view, using the `liminal` package, to better understand how well the structure of the data is represented. Here we see learn that the smile in the UMAP embedding is the small bent rod cluster, and that the unibrow is the sine wave. \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(liminal)\numap_df <- data.frame(umapX = cnl_umap[, 1],\n umapY = cnl_umap[, 2])\nlimn_tour_link(\n umap_df,\n clusters_nonlin,\n cols = x1:x4\n)\n```\n:::\n\n\n\n::: {#fig-liminal-clusters-nonlin layout-ncol=1}\n\n![Smile matches bent rod.](images/liminal-clusters-nonlin1.png){#fig-smile}\n\n![Unibrow matches sine wave.](images/liminal-clusters-nonlin2.png){#fig-unibrow}\n\nTwo screenshots from liminal showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster. The unibrow matches to the sine wave cluster.\n:::\n\n### Using `detourr`\n\\index{detourr}\n\n@fig-detourr-clusters-nonlin shows how the linking is achieved using `detourr`. It uses a shared data object, as made possible by the `crosstalk` package, and the UMAP view is made interactive using `plotly`. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(detourr)\nlibrary(dplyr)\nlibrary(crosstalk)\nlibrary(plotly)\numap_df <- data.frame(umapX = cnl_umap[, 1],\n umapY = cnl_umap[, 2])\ncnl_df <- bind_cols(clusters_nonlin, umap_df)\nshared_cnl <- SharedData$new(cnl_df)\n\ndetour_plot <- detour(shared_cnl, tour_aes(\n projection = starts_with(\"x\"))) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.7, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\numap_plot <- plot_ly(shared_cnl,\n x = ~umapX, \n y = ~umapY,\n color = I(\"black\"),\n height = 450) %>%\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n\nbscols(\n detour_plot, umap_plot,\n widths = c(5, 6)\n )\n```\n:::\n\n\n\n\n![Screenshot from detourr showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster.](images/detourr-clusters-nonlin.png){#fig-detourr-clusters-nonlin}\n\n## Example: `fake_trees`\n\\index{data!fake trees}\n\n@fig-liminal-trees shows a more complex example, using the `fake_trees` data. We know that the 10D data has a main branch, and 9 branches (clusters) attached to it, based on our explorations in the earlier chapters. The t-SNE view, where points are coloured by the known branch ids, is very helpful for seeing the linear branch structure. \n\nWhat we can't tell is that there is a main branch from which all of the others extend. We also can't tell which of the clusters corresponds to this branch. Linking the plot with a tour helps with this. 
Although, not shown in the sequence of snapshots in @fig-liminal-trees, the main branch is actually the dark blue cluster, which is separated into three pieces by t-SNE.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to run liminal on the fake trees data\"}\nlibrary(liminal)\nlibrary(Rtsne)\ndata(fake_trees)\nset.seed(2020)\ntsne <- Rtsne::Rtsne(\n dplyr::select(fake_trees,\n dplyr::starts_with(\"dim\")))\ntsne_df <- data.frame(tsneX = tsne$Y[, 1],\n tsneY = tsne$Y[, 2])\nlimn_tour_link(\n tsne_df,\n fake_trees,\n cols = dim1:dim10,\n color = branches\n)\n```\n:::\n\n\n\n::: {#fig-liminal-trees layout-ncol=1}\n\n![Linked views of t-SNE dimension reduction with a tour of the fake trees data. The t-SNE view clearly shows ten 1D non-linear clusters, while the tour of the full 100 variables suggests a lot more variation in the data, and less difference between clusters. ](images/fake_trees1.png){#fig-trees1}\n\n\n![Focus on the green cluster which is split by t-SNE. The shape as viewed in many linear projections shown by the tour shows that it is a single curved cluster. The split is an artifact of the t-SNE mapping.](images/fake_trees2.png){#fig-trees2}\n\n\n\n![Focus on the purple cluster which splits the green cluster in the t-SNE view. The tour shows that these two clusters are distinct, but are close in one neighbourhood of the 100D space. The close proximity in the t-SNE view is reasonable, though.](images/fake_trees3.png){#fig-trees3}\n\nThree snapshots of using the `liminal` linked views to explore how t-SNE has summarised the `fake_trees` data in 2D.\n:::\n\n\n::: insight\nThe t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches. \n:::\n\n\n## Exercises {-}\n\n1. Using the `penguins_sub` data generate a 2D representation using t-SNE. Plot the points mapping the colour to species. What is most surprising? 
(Hint: Are the three species represented by three distinct clusters?)\n2. Re-do the t-SNE representation with different parameter choices. Are the results different each time, or could they be considered to be equivalent?\n3. Use `liminal` or `detourr` to link the t-SNE representation to a tour of the penguins. Highlight the points that have been placed in an awkward position by t-SNE from others in their species. Watch them relative to the others in their species in the tour view, and think about whether there is any rationale for the awkward placement.\n4. Use UMAP to make the 2D representation, and use `liminal` or `detourr` to link with a tour to explore the result.\n5. Conduct your best t-SNE and UMAP representations of the `aflw` data. Compare and contrast what is learned relative to a tour on the principal component analysis. \n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n", + "markdown": "## Non-linear dimension reduction\n\n## Explanation of NLDR methods\n\nNon-linear dimension reduction (NLDR) aims to find a low-dimensional representation of the high-dimensional data that shows the main features of the data. In statistics, it dates back to the work of @Kr64a on multidimensional scaling (MDS). Some techniques only require an interpoint similarity or distance matrix as the main ingredient, rather than the full data. We'll focus on when the full data is available here, so we can also compare structure perceived using the tour on the high-dimensional space, relative to structure revealed in the low-dimensional embedding.\n\nThere are many methods available for generating non-linear low dimensional representations of the data. Classically, MDS minimises some function of the difference between two interpoint distance matrices, the distance between points in the high-dimensions, and in the low-dimensional representations. 
\n\n$$\n\\mbox{Stress}_D(x_1, ..., x_n) = \\left(\\sum_{i, j=1; i\\neq j}^n (d_{ij} - d_k(i,j))^2\\right)^{1/2}\n$$\nwhere $D$ is an $n\\times n$ matrix of distances $(d_{ij})$ between all pairs of points, and $d_k(i,j)$ is the distance between the points in the low-dimensional space. PCA is a special case of MDS. The result from PCA is a linear projection, but generally MDS can provide non-linear transformations to represent unusual high-dimensional patterns. A good resource for learning about MDS is @BG05.\n\n\\index{dimension reduction!t-SNE}\n\\index{dimension reduction!UMAP}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two non-linear embeddings of the non-linear clusters data: (a) t-SNE, (b) UMAP. Both suggest four clusters, with two being non-linear in some form.](5-nldr_files/figure-pdf/fig-nldr-clusters-1.pdf){#fig-nldr-clusters fig-alt='FIXME' width=80%}\n:::\n:::\n\n\n\nPopular methods in current use for NLDR include t-SNE [@Maaten2008] and UMAP [@McInnes2018]. The approach of t-SNE is to compare interpoint distances with a standard probability distribution (e.g. a $t$-distribution) to exaggerate local neighbourhood differences. UMAP compares the interpoint distances with what might be expected if the data were uniformly distributed in high dimensions.\n\n@fig-nldr-clusters shows two NLDR views of the `clusters_nonlin` data set from the `mulgar` package. Both suggest that there are four clusters, and that some clusters are non-linearly shaped. They disagree on the type of non-linear pattern, where t-SNE represents one cluster as a wavy shape, whereas in UMAP both have a simple parabolic shape.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n![Grand tour of the nonlinear clusters data set, showing four clusters. Two are very small and spherical in shape. 
One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape.](gifs/clusters_nonlin.gif){#fig-clusters-nonlin-html}\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#fig-clusters-nonlin-pdf layout-ncol=2}\n\n![](images/clusters_nonlin_60.png){width=220}\n\n![](images/clusters_nonlin_233.png){width=220}\n\nTwo frames from a grand tour of the nonlinear clusters data set, showing four clusters. Two are very small and spherical in shape. One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape. {{< fa play-circle >}}\n:::\n:::\n\nThe full 4D data is shown with a grand tour in @fig-clusters-nonlin-pdf. The four clusters suggested by the NLDR methods can be seen. We also get a better sense of the relative size and proximity of the clusters. There are two small spherical clusters, one quite close to the end of the large sine wave cluster. The fourth cluster is relatively small, and has a slight curve, like a bent rod. The t-SNE representation is slightly more accurate than the UMAP representation. We would expect that the wavy cluster is the sine wave seen in the tour. \n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nNLDR can provide useful low-dimensional summaries of high-dimensional structure but you need to check whether it is a sensible and accurate representation by comparing with what is perceived from a tour.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{NLDR can provide useful low-dimensional summaries of high-dimensional structure but you need to check whether it is a sensible and accurate representation by comparing with what is perceived from a tour.}\n:::\n\n## Assessing reliability of the NLDR representation \n\nNLDR can produce useful low-dimensional summaries of structure in high-dimensional data, like those shown in @fig-nldr-clusters. However, there are numerous pitfalls. 
The fitting procedure can produce very different representations depending on the parameter choices, and even the random number seeding the fit. (You can check this by changing the `set.seed` in the code above, and by changing from the default parameters.) Also, it may not be possible to represent the high-dimensional structures faithfully in low dimensions. For these reasons, one needs to connect the NLDR view with a tour of the data, to help assess its usefulness and accuracy. For example, with this data, we would want to know which of the two curved clusters in the UMAP representation correspond to the sine wave cluster. \n\n### Using `liminal`\n\\index{liminal}\n\n@fig-liminal-clusters-nonlin shows how the NLDR plot can be linked to a tour view, using the `liminal` package, to better understand how well the structure of the data is represented. Here we learn that the smile in the UMAP embedding is the small bent rod cluster, and that the unibrow is the sine wave. \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(liminal)\numap_df <- data.frame(umapX = cnl_umap[, 1],\n umapY = cnl_umap[, 2])\nlimn_tour_link(\n umap_df,\n clusters_nonlin,\n cols = x1:x4\n)\n```\n:::\n\n\n\n::: {#fig-liminal-clusters-nonlin layout-ncol=1}\n\n![Smile matches bent rod.](images/liminal-clusters-nonlin1.png){#fig-smile}\n\n![Unibrow matches sine wave.](images/liminal-clusters-nonlin2.png){#fig-unibrow}\n\nTwo screenshots from liminal showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster. The unibrow matches to the sine wave cluster.\n:::\n\n### Using `detourr`\n\\index{detourr}\n\n@fig-detourr-clusters-nonlin shows how the linking is achieved using `detourr`. It uses a shared data object, as made possible by the `crosstalk` package, and the UMAP view is made interactive using `plotly`. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(detourr)\nlibrary(dplyr)\nlibrary(crosstalk)\nlibrary(plotly)\numap_df <- data.frame(umapX = cnl_umap[, 1],\n umapY = cnl_umap[, 2])\ncnl_df <- bind_cols(clusters_nonlin, umap_df)\nshared_cnl <- SharedData$new(cnl_df)\n\ndetour_plot <- detour(shared_cnl, tour_aes(\n projection = starts_with(\"x\"))) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.7, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\numap_plot <- plot_ly(shared_cnl,\n x = ~umapX, \n y = ~umapY,\n color = I(\"black\"),\n height = 450) %>%\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n\nbscols(\n detour_plot, umap_plot,\n widths = c(5, 6)\n )\n```\n:::\n\n\n\n\n![Screenshot from detourr showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster.](images/detourr-clusters-nonlin.png){#fig-detourr-clusters-nonlin}\n\n## Example: `fake_trees`\n\\index{data!fake trees}\n\n@fig-liminal-trees shows a more complex example, using the `fake_trees` data. We know that the 10D data has a main branch, and 9 branches (clusters) attached to it, based on our explorations in the earlier chapters. The t-SNE view, where points are coloured by the known branch ids, is very helpful for seeing the linear branch structure. \n\nWhat we can't tell is that there is a main branch from which all of the others extend. We also can't tell which of the clusters corresponds to this branch. Linking the plot with a tour helps with this. 
Although not shown in the sequence of snapshots in @fig-liminal-trees, the main branch is actually the dark blue cluster, which is separated into three pieces by t-SNE.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to run liminal on the fake trees data\"}\nlibrary(liminal)\nlibrary(Rtsne)\ndata(fake_trees)\nset.seed(2020)\ntsne <- Rtsne::Rtsne(\n dplyr::select(fake_trees,\n dplyr::starts_with(\"dim\")))\ntsne_df <- data.frame(tsneX = tsne$Y[, 1],\n tsneY = tsne$Y[, 2])\nlimn_tour_link(\n tsne_df,\n fake_trees,\n cols = dim1:dim10,\n color = branches\n)\n```\n:::\n\n\n\n::: {#fig-liminal-trees layout-ncol=1}\n\n![Linked views of t-SNE dimension reduction with a tour of the fake trees data. The t-SNE view clearly shows ten 1D non-linear clusters, while the tour of the full 100 variables suggests a lot more variation in the data, and less difference between clusters. ](images/fake_trees1.png){#fig-trees1 width=300}\n\n\n![Focus on the green cluster which is split by t-SNE. The shape as viewed in many linear projections shown by the tour shows that it is a single curved cluster. The split is an artifact of the t-SNE mapping.](images/fake_trees2.png){#fig-trees2 width=300}\n\n\n\n![Focus on the purple cluster which splits the green cluster in the t-SNE view. The tour shows that these two clusters are distinct, but are close in one neighbourhood of the 100D space. The close proximity in the t-SNE view is reasonable, though.](images/fake_trees3.png){#fig-trees3 width=300}\n\nThree snapshots of using the `liminal` linked views to explore how t-SNE has summarised the `fake_trees` data in 2D.\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nThe t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches. 
\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{The t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches. }\n:::\n\n## Exercises {-}\n\n1. This question uses the `penguins_sub` data. \n\na. Generate a 2D representation using t-SNE. Plot the points mapping the colour to species. What is most surprising? (Hint: Are the three species represented by three distinct clusters?)\nb. Re-do the t-SNE representation with different parameter choices, including using different random seeds. Are the results different each time, or do you think that they could be considered to be equivalent?\nc. Use `liminal` or `detourr` to link the t-SNE representation to a tour of the penguins. Highlight the points that have been placed in an awkward position by t-SNE from others in their species. Watch them relative to the others in their species in the tour view, and think about whether there is any rationale for the awkward placement.\nd. Try again using UMAP to make the 2D representation, and use `liminal` or `detourr` to link with a tour to explore the result.\n2. Conduct your best t-SNE and UMAP representations of the `aflw` data. Compare and contrast what is learned relative to a tour or the principal component analysis. \n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Project {-}\n\nGene expression measurements (scRNA-Seq) of 2622 human peripheral blood mononuclear cells are available from the `Seurat` R package [@seurat1; @seurat2; @seurat3; @seurat4]. The paper web site has code to extract and pre-process the data, which follows the tutorial at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html. The processed data, containing the first 50 PCs, is provided with the book, as `pbmc_pca_50.rds`. \n\nThe original paper [@chen2023] used UMAP on the first 15 PCs to find a representation of the data to illustrate the clustering. 
They used the default settings of the `RunUMAP()` function in `Seurat`, without setting a seed.\n\nGenerate the t-SNE and UMAP representations of the first 9 PCs of data, using their default settings. They should be quite different. (We use 9 PCs because the scree plot in the data pre-processing suggests that 15 is too many.) Based on your examination of the data in a tour, which method yields the more accurate representation? Explain what the structure in the 2D is relative to that seen in the tour.\n\n\n\n\n::: {.cell}\n\n:::\n", "supporting": [ "5-nldr_files/figure-pdf" ], diff --git a/_freeze/5-nldr/figure-html/fig-nldr-clusters-1.png b/_freeze/5-nldr/figure-html/fig-nldr-clusters-1.png new file mode 100644 index 0000000..cbd7c96 Binary files /dev/null and b/_freeze/5-nldr/figure-html/fig-nldr-clusters-1.png differ diff --git a/_freeze/5-nldr/figure-pdf/fig-nldr-clusters-1.pdf b/_freeze/5-nldr/figure-pdf/fig-nldr-clusters-1.pdf index 3cfb3ce..956139c 100644 Binary files a/_freeze/5-nldr/figure-pdf/fig-nldr-clusters-1.pdf and b/_freeze/5-nldr/figure-pdf/fig-nldr-clusters-1.pdf differ diff --git a/_freeze/6-intro-clust/execute-results/html.json b/_freeze/6-intro-clust/execute-results/html.json new file mode 100644 index 0000000..a75c114 --- /dev/null +++ b/_freeze/6-intro-clust/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "ef7cc61b3240aa37cc08d4c1f9c7844d", + "result": { + "engine": "knitr", + "markdown": "# Overview\n\nUnsupervised classification, or cluster analysis, organizes observations into similar groups. Cluster analysis is a commonly used, appealing, and conceptually intuitive statistical method. Some of its uses include market segmentation, where customers are grouped into clusters with similar attributes for targeted marketing; gene expression analysis, where genes with similar expression patterns are grouped together; and the creation of taxonomies for animals, insects, or plants. 
Clustering can be used as a way of reducing a massive amount of data because observations within a cluster can be summarized by its centre. Also, clustering effectively subsets the data, thus simplifying analysis, because observations in each cluster can be analyzed separately.\n\n## What are clusters?\n\nOrganizing objects into groups is a common task to help make sense of the world around us. Perhaps this is why it is an appealing method of data analysis. However, cluster analysis is more complex than it initially appears. Many people imagine that it will produce neatly separated clusters like those in @fig-ideal-clusters(a), but it almost never does. Such ideal clusters are rarely encountered in real data, so we often need to modify our objective from *find the natural clusters in this data* to instead *organize the cases into groups that are similar in some way*. Even though this may seem disappointing when compared with the ideal, it is still often an effective means of simplifying and understanding a dataset.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nKnowing what shapes are in your data helps to decide on the best method and to diagnose the result. For example, if the clusters are elliptical, model-based clustering is recommended.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Knowing what shapes are in your data helps to decide on the best method and to diagnose the result. For example, if the clusters are elliptical, model-based clustering is recommended.}\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Different structures in data impact cluster analysis. When there are well-separated groups (a), it is simple to group similar observations. Even when there are not, partitioning observations into groups may still be useful. 
There may be nuisance observations (b) or nuisance variables (c) that affect the interpoint distance calculations and distract the clustering algorithm, and there may be oddly shaped clusters (d) which are hard to numerically describe.](6-intro-clust_files/figure-html/fig-ideal-clusters-1.png){#fig-ideal-clusters width=100%}\n:::\n:::\n\n\nAt the heart of the clustering process is the work of discovering which variables are most important for defining the groups. It is often true that we only require a subset of the variables for finding clusters, whereas another subset (called *nuisance variables*) has no impact. In the bottom left plot of @fig-ideal-clusters, it is clear that the variable plotted horizontally is important for splitting this data into two clusters, whereas the variable plotted vertically is a nuisance variable. Nuisance is an apt term for these variables, because they can radically change the interpoint distances and impair the clustering\nprocess. \\index{cluster analysis!interpoint distance}\n\\index{cluster analysis!nuisance variable}\n\nDynamic graphical methods help us to find and understand the cluster structure in high dimensions. With the tools in our toolbox, primarily tours, along with linked scatterplots and parallel coordinate plots, we can see clusters in high-dimensional spaces. We can detect gaps between clusters, the shape and relative positions of clusters, and the presence of nuisance variables. We can even find unusually shaped clusters, like those in the bottom right plot in @fig-ideal-clusters. In simple\nsituations we can use graphics alone to group observations into clusters, using a \"spin and brush\" method. In more difficult data problems, we can assess and refine numerical solutions using graphics.\\index{brushing!persistent}\n\\index{cluster analysis!spin-and-brush}\n\nThis part of the book discusses the use of interactive and dynamic graphics in the clustering of data. 
@sec-clust-bg introduces cluster analysis, focusing on interpoint distance measures. @sec-clust-graphics describes an example of a purely graphical approach to cluster analysis, the spin and brush method. In the example shown in that section, we were able to find simplifications of the data that had not been found using numerical clustering methods, and to find a variety of structures in high-dimensional space. @sec-hclust describes methods for reducing the interpoint distance matrix to an intercluster distance matrix using hierarchical algorithms, @sec-kmeans covers the $k$-means algorithm, @sec-mclust covers model-based clustering, and @sec-som describes clustering with self-organising maps. Each of these chapters shows how graphical tools can be used to assess the results of numerical methods. @sec-clust-compare summarizes these chapters and revisits the data analysis strategies used in the examples. Additional references that provide good companions to the material presented in these chapters are @VR02, @HOML, @hennig, @giordani, @kassambara, and the CRAN Task View [@ctv-clustering].\n\n## The importance of defining similar {#sec-clust-bg}\n\nBefore we can begin finding groups of cases that are similar[^3], we need to decide how to define or measure whether they are close together or far apart. Consider a dataset with three cases $(a_1, a_2, a_3)$ and four variables $(V_1, V_2, V_3, V_4)$, described in matrix format as\n\n[^3]: Both *similarity* and *dissimilarity* measures are used for defining how similar cases are. It can be confusing! They measure similarity in opposite directions. With a dissimilarity measure, a smaller number means the cases are closer, as in a distance metric. 
A similarity measure usually ranges between 0 and 1, with 1 indicating that the cases are closer, for example, correlation.\n\n\\begin{align*}\nX = \\begin{bmatrix}\n& {\\color{grey} V_1} & {\\color{grey} V_2} & {\\color{grey} V_3} & {\\color{grey} V_4} \\\\\\hline\n{\\color{grey} a_1} | & x_{11} & x_{12} & x_{13} & x_{14} \\\\\n{\\color{grey} a_2} | & x_{21} & x_{22} & x_{23} & x_{24} \\\\\n{\\color{grey} a_3} | & x_{31} & x_{32} & x_{33} & x_{34} \n\\end{bmatrix}\n= \\begin{bmatrix}\n& {\\color{grey} V_1} & {\\color{grey} V_2} & {\\color{grey} V_3} & {\\color{grey} V_4} \\\\\\hline\n{\\color{grey} a_1} | & 7.3 & 7.6 & 7.7 & 8.0 \\\\\n{\\color{grey} a_2} | & 7.4 & 7.2 & 7.3 & 7.2 \\\\\n{\\color{grey} a_3} | & 4.1 & 4.6 & 4.6 & 4.8 \n\\end{bmatrix}\n\\end{align*}\n\n\n\\noindent which is plotted in @fig-similarity1. The Euclidean distance between two cases (rows of the matrix) with $p$ elements is defined as\n\n\\begin{align*}\nd_{\\rm Euc}(a_i,a_j) &=& ||a_i-a_j|| %\\\\\n% &=& \\sqrt{(x_{i1}-x_{j1})^2+\\dots + (x_{ip}-x_{jp})^2},\n~~~~~~i,j=1,\\dots, n,\n\\end{align*}\n\n\\noindent where $||x_i||=\\sqrt{x_{i1}^2+x_{i2}^2+\\dots +x_{ip}^2}$. 
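The arithmetic in the worked example that follows can be verified with a short standalone sketch. The book's own computations use R, where `dist()` on the rows of the data matrix returns the same values; plain Python is used here only to make the calculation explicit:

```python
from math import sqrt

# The three cases (rows of the example matrix X)
a1 = [7.3, 7.6, 7.7, 8.0]
a2 = [7.4, 7.2, 7.3, 7.2]
a3 = [4.1, 4.6, 4.6, 4.8]

def d_euc(ai, aj):
    # Euclidean distance ||a_i - a_j|| between two cases with p elements
    return sqrt(sum((x - y) ** 2 for x, y in zip(ai, aj)))

print(f"{d_euc(a1, a2):.1f} {d_euc(a1, a3):.1f} {d_euc(a2, a3):.1f}")
# 1.0 6.3 5.5
```

The rounded values agree with the interpoint Euclidean distance matrix shown below.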
For example, the Euclidean distance between cases 1 and 2 in the above data, is\n\n\\begin{align*}\nd_{\\rm Euc}(a_1,a_2) &= \\sqrt{(7.3-7.4)^2+(7.6-7.2)^2+ (7.7-7.3)^2+(8.0-7.2)^2} \\\\\n&= 1.0 \n\\end{align*}\n\n\\index{cluster analysis!interpoint distance}\n\n\\noindent For the three cases, the interpoint Euclidean distance matrix is\n\n::: {.content-visible when-format=\"html\"}\n::: {.hidden}\n $$\n \\require{mathtools}\n \\definecolor{grey}{RGB}{192, 192, 192}\n $$\n:::\n:::\n\n\n\\begin{align*}\nd_{\\rm Euc} = \\begin{bmatrix}\n& {\\color{grey} a_1} & {\\color{grey} a_2} & {\\color{grey} a_3} \\\\\\hline\n{\\color{grey} a_1} | & 0.0 & 1.0 & 6.3 \\\\\n{\\color{grey} a_2} | & 1.0 & 0.0 & 5.5 \\\\\n{\\color{grey} a_3} | & 6.3 & 5.5 & 0.0\n\\end{bmatrix}\n\\end{align*}\n\n::: {#fig-similarity1 layout-ncol=2}\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for plot\"}\nx <- data.frame(V1 = c(7.3, 7.4, 4.1),\n V2 = c(7.6, 7.2, 4.6),\n V3 = c(7.7, 7.3, 4.6),\n V4 = c(8.0, 7.2, 4.8),\n point = factor(c(\"a1\", \"a2\", \"a3\")))\nlibrary(GGally)\nlibrary(colorspace)\nlibrary(gridExtra)\npscat <- ggpairs(x, columns=1:4,\n upper=list(continuous=\"points\"),\n diag=list(continuous=\"blankDiag\"),\n axisLabels=\"internal\",\n ggplot2::aes(colour=point)) +\n scale_colour_discrete_divergingx(\n palette = \"Zissou 1\", nmax=4) +\n xlim(3.7, 8.5) + ylim(3.7, 8.5) + \n theme_minimal() +\n theme(aspect.ratio=1)\npscat\n```\n\n::: {.cell-output-display}\n![](6-intro-clust_files/figure-html/unnamed-chunk-2-1.png){width=384}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for plot\"}\nppar <- ggparcoord(x, columns=1:4, \n groupColumn = 5, \n scale = \"globalminmax\") +\n scale_colour_discrete_divergingx(\n palette = \"Zissou 1\", nmax=4) +\n xlab(\"\") + ylab(\"\") + \n theme_minimal() + \n theme(axis.ticks.y = element_blank(),\n axis.text.y = element_blank(),\n legend.title = element_blank())\nppar\n```\n\n::: 
{.cell-output-display}\n![](6-intro-clust_files/figure-html/unnamed-chunk-3-1.png){width=288}\n:::\n:::\n\n\nThe scatterplot matrix (left) shows that cases $a_1$ and $a_2$ have similar values. The parallel coordinate plot (right) allows a comparison of other structure, which shows the similarity in the trend of the profiles on cases $a_1$ and $a_3$. \n:::\n\n\n\\noindent Cases $a_1$ and $a_2$ are more similar to each other than they are to case $a_3$, because the Euclidean distance between cases $a_1$ and $a_2$ is much smaller than the distance between cases $a_1$ and $a_3$ and between cases $a_2$ and $a_3$.\n\nThere are many different ways to calculate similarity. Similarity measures based on correlation distance can be useful. They are typically used where similarity of structure or shape is more important than similarity in magnitude.\n\n\\index{parallel coordinate plot}\n\nAs an example, see the parallel coordinate plot of the sample data at the right of @fig-similarity1. Cases $a_1$ and $a_3$ are widely separated, but their shapes are similar (low, medium, medium, high). Case $a_2$, although\noverlapping with case $a_1$, has a very different shape (high, medium, medium, low). The Pearson correlation between two cases, $\\rho(a_i,a_j)$, is defined as\n\n\\begin{align*}\n\\rho(a_i,a_j) = \\frac{(a_i-c_i)^\\top(a_j-c_j)}\n{\\sqrt{(a_i-c_i)^\\top(a_i-c_i)} \\sqrt{(a_j-c_j)^\\top(a_j-c_j)}}\n\\label{corc}\n\\end{align*}\n\n\\noindent Typically, $c_i, c_j$ are the sample means of each case, $\\bar{a}_i,\\bar{a}_j$. For these three observations, $c_1=\\bar{a}_1=7.650, c_2=\\bar{a}_2=7.275, c_3=\\bar{a}_3=4.525$. An interesting geometric fact is that if $c_i, c_j$ are set to be 0, as is commonly done, $\\rho$ is a generalized correlation that describes the angle between the two data vectors. 
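These quantities are easy to check numerically. Here is a minimal sketch (in the book's R workflow, `cor(t(x))` gives the same correlations; plain Python is used here, with the conversion to the correlation distance $d_{\rm Cor}=\sqrt{2(1-\rho)}$ described next included as well):

```python
from math import sqrt

a1 = [7.3, 7.6, 7.7, 8.0]
a2 = [7.4, 7.2, 7.3, 7.2]
a3 = [4.1, 4.6, 4.6, 4.8]

def rho(ai, aj):
    # Pearson correlation between two cases, centring each by its own mean
    ci, cj = sum(ai) / len(ai), sum(aj) / len(aj)
    num = sum((x - ci) * (y - cj) for x, y in zip(ai, aj))
    den = (sqrt(sum((x - ci) ** 2 for x in ai)) *
           sqrt(sum((y - cj) ** 2 for y in aj)))
    return num / den

def d_cor(ai, aj):
    # Correlation distance: strongly negatively correlated cases are most distant
    return sqrt(2 * (1 - rho(ai, aj)))

print(f"{rho(a1, a2):.2f} {rho(a1, a3):.2f} {rho(a2, a3):.2f}")
# -0.78 0.95 -0.90
print(f"{d_cor(a1, a2):.1f} {d_cor(a1, a3):.1f} {d_cor(a2, a3):.1f}")
# 1.9 0.3 2.0
```

Cases $a_1$ and $a_3$, which have similar profiles, are strongly positively correlated and so end up with the smallest correlation distance, even though their Euclidean distance is the largest.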
The correlation is then converted to a distance metric, with one possibility being as follows:\n\n\n\\begin{align*}\nd_{\\rm Cor}(a_i,a_j) = \\sqrt{2(1-\\rho(a_i,a_j))}\n\\end{align*}\n\nThis distance metric will treat cases that are strongly negatively correlated as the most distant. If you want to consider strong negative correlation as close, then you could take the absolute value of $\\rho(a_i,a_j)$ in the above equation, and remove the multiplication by 2.\n\nThe interpoint distance matrix for the sample data using $d_{\\rm Cor}$ and the Pearson correlation coefficient is\n\n\\begin{align*}\nd_{\\rm Cor} = \\begin{bmatrix}\n& {\\color{grey} a_1} & {\\color{grey} a_2} & {\\color{grey} a_3} \\\\\\hline\n{\\color{grey} a_1} | & 0.0 & 1.9 & 0.3 \\\\\n{\\color{grey} a_2} | & 1.9 & 0.0 & 2.0 \\\\\n{\\color{grey} a_3} | & 0.3 & 2.0 & 0.0\n\\end{bmatrix}\n\\end{align*}\n\n\\noindent By this metric, cases $a_1$ and $a_3$ are the most similar, because the correlation distance is smaller between these two cases than the other pairs of cases. \\index{cluster analysis!interpoint distance}\n\nNote that these interpoint distances differ dramatically from those for Euclidean distance. As a consequence, the way the cases would be clustered is also very different. Choosing the appropriate distance measure is an important part of a cluster analysis.\n\nAfter a distance metric has been chosen and a cluster analysis has been performed, the analyst must evaluate the results, and this is a difficult task. A cluster analysis does not generate $p$-values or other numerical criteria, and the process tends to produce hypotheses rather than testing them. Even the most determined attempts to produce the \"best\" results using modeling and validation techniques may result in clusters that, although seemingly significant, are useless for practical purposes. 
As a result, cluster analysis is best thought of as an exploratory technique, and it can be quite useful despite the lack of formal validation because of its power in data simplification.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nDefining an appropriate distance metric from the context of the problem is a most important decision. For example, if your variables are all numeric, and on the same scale then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Defining an appropriate distance metric from the context of the problem is a most important decision. For example, if your variables are all numeric, and on the same scale then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance.}\n:::\n\nThe context in which the data arises is the key to assessing the results. If the clusters can be characterized in a sensible manner, and they increase our knowledge of the data, then we are on the right track. To use an even more pragmatic criterion, if a company can gain an economic advantage by using a particular clustering method to carve up their customer database, then that is the method they should use.\n\n## Exercises {-}\n\nUse the following data to answer these questions:\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n x1 x2 x3\na1 0.13 0.21 0.09\na2 0.91 0.95 0.85\na3 0.62 0.73 0.65\na4 0.21 0.92 0.43\n```\n\n\n:::\n:::\n\n\n1. Compute the Euclidean distance between cases `a1`, `a2`, `a3`, `a4`.\n\n2. Compute the correlation distance (as defined above) between cases `a1`, `a2`, `a3`, `a4`.\n\n\n3. 
Which two points have the (a) biggest (b) smallest Mahalanobis (statistical) distance, assuming that the covariance matrix is:\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n x1 x2 x3\nx1 1.0 0.8 0.8\nx2 0.8 1.0 0.8\nx3 0.8 0.8 1.0\n```\n\n\n:::\n:::\n\n\n(The function `mahalanobis` will calculate this in R. Technically this gives distance between each case and the mean vector.)\n\n4. Is the ordering of distance between cases the same if Manhattan distance is used instead of Euclidean?\n\n5. Compute the Chebychev distance between cases `a1`, `a2`, `a3`, `a4`.\n\n6. Compute Bray-Curtis distance between cases `a1`, `a2`, `a3`, `a4`.\n\n7. Make a plot of the data, and write a paragraph describing how the different distance metrics agree and disagree on how close or far the cases are from each other. \n\n\n::: {.cell}\n\n:::", + "supporting": [ + "6-intro-clust_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/6-intro-clust/execute-results/tex.json b/_freeze/6-intro-clust/execute-results/tex.json index 36d69b3..92326e0 100644 --- a/_freeze/6-intro-clust/execute-results/tex.json +++ b/_freeze/6-intro-clust/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "736433817ec76afab78e7f270ef23529", + "hash": "cb2bf3de58ee34e16a56a98937431042", "result": { "engine": "knitr", - "markdown": "# Introduction to clustering \n\nUnsupervised classification, or cluster analysis, organizes observations into similar groups. Cluster analysis is a commonly used, appealing, and conceptually intuitive statistical method. Some of its uses include market segmentation, where customers are grouped into clusters with similar attributes for targeted marketing; gene expression analysis, where genes with similar expression patterns are grouped together; and the creation of taxonomies for animals, insects, or plants. 
Clustering can be used as a way of reducing a massive amount of data because observations within a cluster can be summarized by its centre. Also, clustering effectively subsets the data thus simplifying analysis because observations in each cluster can be analyzed separately.\n\n## What are clusters?\n\nOrganizing objects into groups is a common task to help make sense of the world around us. Perhaps this is why it is an appealing method of data analysis. However, cluster analysis is more complex than it initially appears. Many people imagine that it will produce neatly separated clusters like those in @fig-ideal-clusters(a), but it almost never does. Such ideal clusters are rarely encountered in real data, so we often need to modify our objective from *find the natural clusters in this data*. Instead, we need to organize the *cases into groups that are similar in some way*. Even though this may seem disappointing when compared with the ideal, it is still often an effective means of simplifying and understanding a dataset.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nKnowing what shapes are in your data helps to decide on the best method and to diagnose the result. For example, if the clusters are elliptical model-based clustering is recommended.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Knowing what shapes are in your data helps to decide on the best method and to diagnose the result. For example, if the clusters are elliptical model-based clustering is recommended.}\n:::\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Different structures in data impact cluster analysis. When there are well-separated groups (a), it is simple to group similar observations. Even when there are not, partitioning observations into groups may still be useful. 
There may be nuisance observations (b) or nuisance variables (c) that affect the interpoint distance calculations and distract the clustering algorithm, and there may oddly shaped clusters (d) which are hard to numerically describe.](6-intro-clust_files/figure-pdf/fig-ideal-clusters-1.pdf){#fig-ideal-clusters width=100%}\n:::\n:::\n\n\n\nAt the heart of the clustering process is the work of discovering which variables are most important for defining the groups. It is often true that we only require a subset of the variables for finding clusters, whereas another subset (called *nuisance variables*) has no impact. In the bottom left plot of @fig-ideal-clusters, it is clear that the variable plotted horizontally is important for splitting this data into two clusters, whereas the variable plotted vertically is a nuisance variable. Nuisance is an apt term for these variables, because they can radically change the interpoint distances and impair the clustering\nprocess. \\index{cluster analysis!interpoint distance}\n\\index{cluster analysis!nuisance variable}\n\nDynamic graphical methods help us to find and understand the cluster structure in high dimensions. With the tools in our toolbox, primarily tours, along with linked scatterplots and parallel coordinate plots, we can see clusters in high-dimensional spaces. We can detect gaps between clusters, the shape and relative positions of clusters, and the presence of nuisance variables. We can even find unusually shaped clusters, like those in the bottom right plot in @fig-ideal-clusters. In simple\nsituations we can use graphics alone to group observations into clusters, using a \"spin and brush\" method. In more difficult data problems, we can assess and refine numerical solutions using graphics.\\index{brushing!persistent}\n\\index{cluster analysis!spin-and-brush}\n\nThis part of the book discusses the use of interactive and dynamic graphics in the clustering of data. 
@sec-clust-bg introduces cluster analysis, focusing on interpoint distance measures. @sec-clust-graphics describes an example of a purely graphical approach to cluster analysis, the spin and brush method. In the example shown in that section, we were able to find simplifications of the data that had not been found using numerical clustering methods, and to find a variety of structures in high-dimensional space. @sec-hclust describes methods for reducing the interpoint distance matrix to an intercluster distance matrix using hierarchical algorithms, @sec-mclust covers model-based clustering, and @sec-som described clustering with self-organising maps. Each of these chapters shows how graphical tools can be used to assess the results of numerical methods. @sec-clust-compare summarizes the chapter and revisits the data analysis strategies used in the examples. Additional references that provide good companions to the material presented in these chapters are @VR02, @HOML, @hennig, @giordani, @kassambara, and the CRAN Task View [@ctv-clustering]. @sec-clust-compare summarizes the chapter and revisits the data analysis strategies used in the examples.\n\n## The importance of defining similar {#sec-clust-bg}\n\nBefore we can begin finding groups of cases that are similar[^3], we need to decide how to define or measure whether they are close together or far apart. Consider a dataset with three cases $(a_1, a_2, a_3)$ and four variables $(V_1, V_2, V_3, V_4)$, described in matrix format as\n\n[^3]: Both *similarity* and *dissimilarity* measures are used for defining how similar cases are. It can be confusing! They measure similar in opposite directions. With a dissimilarity measure, a smaller number means the cases are closer, as in a distance metric. 
A similarity measure usually ranges between 0 and 1, with 1 indicating that the cases are closer, for example, correlation.\n\n\\begin{align*}\nX = \\begin{bmatrix}\n& {\\color{grey} V_1} & {\\color{grey} V_2} & {\\color{grey} V_3} & {\\color{grey} V_4} \\\\\\hline\n{\\color{grey} a_1} | & x_{11} & x_{12} & x_{13} & x_{14} \\\\\n{\\color{grey} a_2} | & x_{21} & x_{22} & x_{23} & x_{24} \\\\\n{\\color{grey} a_3} | & x_{31} & x_{32} & x_{33} & x_{34} \n\\end{bmatrix}\n= \\begin{bmatrix}\n& {\\color{grey} V_1} & {\\color{grey} V_2} & {\\color{grey} V_3} & {\\color{grey} V_4} \\\\\\hline\n{\\color{grey} a_1} | & 7.3 & 7.6 & 7.7 & 8.0 \\\\\n{\\color{grey} a_2} | & 7.4 & 7.2 & 7.3 & 7.2 \\\\\n{\\color{grey} a_3} | & 4.1 & 4.6 & 4.6 & 4.8 \n\\end{bmatrix}\n\\end{align*}\n\n\n\\noindent which is plotted in @fig-similarity1. The Euclidean distance between two cases (rows of the matrix) with $p$ elements is defined as\n\n\\begin{align*}\nd_{\\rm Euc}(a_i,a_j) &=& ||a_i-a_j|| %\\\\\n% &=& \\sqrt{(x_{i1}-x_{j1})^2+\\dots + (x_{ip}-x_{jp})^2},\n~~~~~~i,j=1,\\dots, n,\n\\end{align*}\n\n\\noindent where $||x_i||=\\sqrt{x_{i1}^2+x_{i2}^2+\\dots +x_{ip}^2}$. 
For example, the Euclidean distance between cases 1 and 2 in the above data, is\n\n\\begin{align*}\nd_{\\rm Euc}(a_1,a_2) &= \\sqrt{(7.3-7.4)^2+(7.6-7.2)^2+ (7.7-7.3)^2+(8.0-7.2)^2} \\\\\n&= 1.0 \n\\end{align*}\n\n\\index{cluster analysis!interpoint distance}\n\n\\noindent For the three cases, the interpoint Euclidean distance matrix is\n\n::: {.content-visible when-format=\"html\"}\n::: {.hidden}\n $$\n \\require{mathtools}\n \\definecolor{grey}{RGB}{192, 192, 192}\n $$\n:::\n:::\n\n\n\\begin{align*}\nd_{\\rm Euc} = \\begin{bmatrix}\n& {\\color{grey} a_1} & {\\color{grey} a_2} & {\\color{grey} a_3} \\\\\\hline\n{\\color{grey} a_1} | & 0.0 & 1.0 & 6.3 \\\\\n{\\color{grey} a_2} | & 1.0 & 0.0 & 5.5 \\\\\n{\\color{grey} a_3} | & 6.3 & 5.5 & 0.0\n\\end{bmatrix}\n\\end{align*}\n\n::: {#fig-similarity1 layout-ncol=2}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](6-intro-clust_files/figure-pdf/unnamed-chunk-2-1.pdf){width=80%}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](6-intro-clust_files/figure-pdf/unnamed-chunk-3-1.pdf){width=80%}\n:::\n:::\n\n\n\nThe scatterplot matrix (left) shows that cases $a_1$ and $a_2$ have similar values. The parallel coordinate plot (right) allows a comparison of other structure, which shows the similarity in the trend of the profiles on cases $a_1$ and $a_3$. \n:::\n\n\n\\noindent Cases $a_1$ and $a_2$ are more similar to each other than they are to case $a_3$, because the Euclidean distance between cases $a_1$ and $a_2$ is much smaller than the distance between cases $a_1$ and $a_3$ and between cases $a_2$ and $a_3$.\n\nThere are many different ways to calculate similarity. Similarity measures based on correlation distance can be useful. It is typically used where similarity of structure or shape is more important than similarity in magnitude.\n\n\\index{parallel coordinate plot}\n\nAs an example, see the parallel coordinate plot of the sample data at the right of @fig-similarity1. 
Cases $a_1$ and $a_3$ are widely separated, but their shapes are similar (low, medium, medium, high). Case $a_2$, although\noverlapping with case $a_1$, has a very different shape (high, medium, medium, low). The Pearson correlation between two cases, $\\rho(a_i,a_j)$, is defined as\n\n\\begin{align*}\n\\rho(a_i,a_j) = \\frac{(a_i-c_i)^\\top(a_j-c_j)}\n{\\sqrt(a_i-c_i)^\\top(a_i-c_i) \\sqrt(a_j-c_j)^\\top(a_j-c_j)}\n\\label{corc}\n\\end{align*}\n\n\\noindent Typically, $c_i, c_j$ are the sample means of each case, $\\bar{a}_i,\\bar{a}_j$. For these three observations, $c_1=\\bar{a}_1=7.650, c_2=\\bar{a}_2=7.275, c_3=\\bar{a}_3=4.525$. An interesting geometric fact, is that if $c_i, c_j$ are set to be 0, as is commonly done, $\\rho$ is a generalized correlation that describes the angle between the two data vectors. The correlation is then converted to a distance metric, with one possibility being as follows:\n\n\n\\begin{align*}\nd_{\\rm Cor}(a_i,a_j) = \\sqrt{2(1-\\rho(a_i,a_j))}\n\\end{align*}\n\nThis distance metric will treat cases that are strongly negatively correlated as the most distant. If you want to consider strong negative correlation as close, then you could take the absolute value of $\\rho(a_i,a_j)$ in the above equation, and remove the multiplication by 2.\n\nThe interpoint distance matrix for the sample data using $d_{\\rm Cor}$ and the Pearson correlation coefficient is\n\n\\begin{align*}\nd_{\\rm Cor} = \\begin{bmatrix}\n& {\\color{grey} a_1} & {\\color{grey} a_2} & {\\color{grey} a_3} \\\\\\hline\n{\\color{grey} a_1} | & 0.0 & 3.6 & 0.1 \\\\\n{\\color{grey} a_2} | & 3.6 & 0.0 & 3.8 \\\\\n{\\color{grey} a_3} | & 0.1 & 3.8 & 0.0\n\\end{bmatrix}\n\\end{align*}\n\n\\noindent By this metric, cases $a_1$ and $a_3$ are the most similar, because the correlation distance is smaller between these two cases than the other pairs of cases. 
\\index{cluster analysis!interpoint distance}\n\nNote that these interpoint distances differ dramatically from those for Euclidean distance. As a consequence, the way the cases would be clustered is also very different. Choosing the appropriate distance measure is an important part of a cluster analysis.\n\nAfter a distance metric has been chosen and a cluster analysis has been performed, the analyst must evaluate the results, and this is actually a difficult task. A cluster analysis does not generate $p$-values or other numerical criteria, and the process tends to produce hypotheses rather than testing them. Even the most determined attempts to produce the \"best\" results using modeling and validation techniques may result in clusters that, although seemingly significant, are useless for practical purposes. As a result, cluster analysis is best thought of as an exploratory technique, and it can be quite useful despite the lack of formal validation because of its power in data simplification.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nDefining an appropriate distance metric from the context of the problem is a most important decision. For example, if your variables are all numeric, and on the same scale then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Defining an appropriate distance metric from the context of the problem is a most important decision. For example, if your variables are all numeric, and on the same scale then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance.}\n:::\n\nThe context in which the data arises is the key to assessing the results. If the clusters can be characterized in a sensible manner, and they increase our knowledge of the data, then we are on the right track. 
To use an even more pragmatic criterion, if a company can gain an economic advantage by using a particular clustering method to carve up their customer database, then that is the method they should use.\n\n## Exercises {-}\n\nUse the following data to answer these questions:\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n x1 x2 x3\na1 0.13 0.21 0.09\na2 0.91 0.95 0.85\na3 0.62 0.73 0.65\na4 0.21 0.92 0.43\n```\n\n\n:::\n:::\n\n\n\n1. Compute the Euclidean distance between cases `a1`, `a2`, `a3`, `a4`.\n\n2. Compute the correlation distance (as defined above) between cases `a1`, `a2`, `a3`, `a4`.\n\n\n3. Which two points have the (a) biggest (b) smallest Mahalanobis (statistical) distance, assuming that the covariance matrix is:\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n x1 x2 x3\nx1 1.0 0.8 0.8\nx2 0.8 1.0 0.8\nx3 0.8 0.8 1.0\n```\n\n\n:::\n:::\n\n\n\n(The function `mahalanobis` will calculate this in R. Technically this gives distance between each case and the mean vector.)\n\n4. Is the ordering of distance between cases the same if Manhattan distance is used instead of Euclidean?\n\n5. Compute the Chebychev distance between cases `a1`, `a2`, `a3`, `a4`.\n\n6. Compute Bray-Curtis distance between cases `a1`, `a2`, `a3`, `a4`.\n\n7. Make a plot of the data, and write a paragraph describing how the different distance metrics agree and disagree on how close or far the cases are from each other. \n\n\n\n::: {.cell}\n\n:::", + "markdown": "# Introduction to clustering \n\nUnsupervised classification, or cluster analysis, organizes observations into similar groups. Cluster analysis is a commonly used, appealing, and conceptually intuitive statistical method. 
Some of its uses include market segmentation, where customers are grouped into clusters with similar attributes for targeted marketing; gene expression analysis, where genes with similar expression patterns are grouped together; and the creation of taxonomies for animals, insects, or plants. Clustering can be used as a way of reducing a massive amount of data because observations within a cluster can be summarized by its centre. Also, clustering effectively subsets the data thus simplifying analysis because observations in each cluster can be analyzed separately.\n\n## What are clusters?\n\nOrganizing objects into groups is a common task to help make sense of the world around us. Perhaps this is why it is an appealing method of data analysis. However, cluster analysis is more complex than it initially appears. Many people imagine that it will produce neatly separated clusters like those in @fig-ideal-clusters(a), but it almost never does. Such ideal clusters are rarely encountered in real data, so we often need to modify our objective from *find the natural clusters in this data*. Instead, we need to organize the *cases into groups that are similar in some way*. Even though this may seem disappointing when compared with the ideal, it is still often an effective means of simplifying and understanding a dataset.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nKnowing what shapes are in your data helps to decide on the best method and to diagnose the result. For example, if the clusters are elliptical model-based clustering is recommended.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Knowing what shapes are in your data helps to decide on the best method and to diagnose the result. For example, if the clusters are elliptical model-based clustering is recommended.}\n:::\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Different structures in data impact cluster analysis. 
When there are well-separated groups (a), it is simple to group similar observations. Even when there are not, partitioning observations into groups may still be useful. There may be nuisance observations (b) or nuisance variables (c) that affect the interpoint distance calculations and distract the clustering algorithm, and there may be oddly shaped clusters (d) which are hard to numerically describe.](6-intro-clust_files/figure-pdf/fig-ideal-clusters-1.pdf){#fig-ideal-clusters width=100%}\n:::\n:::\n\n\n\nAt the heart of the clustering process is the work of discovering which variables are most important for defining the groups. It is often true that we only require a subset of the variables for finding clusters, whereas another subset (called *nuisance variables*) has no impact. In the bottom left plot of @fig-ideal-clusters, it is clear that the variable plotted horizontally is important for splitting this data into two clusters, whereas the variable plotted vertically is a nuisance variable. Nuisance is an apt term for these variables, because they can radically change the interpoint distances and impair the clustering\nprocess. \\index{cluster analysis!interpoint distance}\n\\index{cluster analysis!nuisance variable}\n\nDynamic graphical methods help us to find and understand the cluster structure in high dimensions. With the tools in our toolbox, primarily tours, along with linked scatterplots and parallel coordinate plots, we can see clusters in high-dimensional spaces. We can detect gaps between clusters, the shape and relative positions of clusters, and the presence of nuisance variables. We can even find unusually shaped clusters, like those in the bottom right plot in @fig-ideal-clusters. In simple\nsituations we can use graphics alone to group observations into clusters, using a \"spin and brush\" method. 
In more difficult data problems, we can assess and refine numerical solutions using graphics.\\index{brushing!persistent}\n\\index{cluster analysis!spin-and-brush}\n\nThis part of the book discusses the use of interactive and dynamic graphics in the clustering of data. @sec-clust-bg introduces cluster analysis, focusing on interpoint distance measures. @sec-clust-graphics describes an example of a purely graphical approach to cluster analysis, the spin and brush method. In the example shown in that section, we were able to find simplifications of the data that had not been found using numerical clustering methods, and to find a variety of structures in high-dimensional space. @sec-hclust describes methods for reducing the interpoint distance matrix to an intercluster distance matrix using hierarchical algorithms, @sec-kmeans covers the $k$-means algorithm, @sec-mclust covers model-based clustering, and @sec-som describes clustering with self-organising maps. Each of these chapters shows how graphical tools can be used to assess the results of numerical methods. @sec-clust-compare summarizes these chapters and revisits the data analysis strategies used in the examples. Additional references that provide good companions to the material presented in these chapters are @VR02, @HOML, @hennig, @giordani, @kassambara, and the CRAN Task View [@ctv-clustering].\n\n## The importance of defining similar {#sec-clust-bg}\n\nBefore we can begin finding groups of cases that are similar[^3], we need to decide how to define or measure whether they are close together or far apart. Consider a dataset with three cases $(a_1, a_2, a_3)$ and four variables $(V_1, V_2, V_3, V_4)$, described in matrix format as\n\n[^3]: Both *similarity* and *dissimilarity* measures are used for defining how similar cases are. It can be confusing! They measure similarity in opposite directions. With a dissimilarity measure, a smaller number means the cases are closer, as in a distance metric. 
A similarity measure usually ranges between 0 and 1, with 1 indicating that the cases are closer, for example, correlation.\n\n\\begin{align*}\nX = \\begin{bmatrix}\n& {\\color{grey} V_1} & {\\color{grey} V_2} & {\\color{grey} V_3} & {\\color{grey} V_4} \\\\\\hline\n{\\color{grey} a_1} | & x_{11} & x_{12} & x_{13} & x_{14} \\\\\n{\\color{grey} a_2} | & x_{21} & x_{22} & x_{23} & x_{24} \\\\\n{\\color{grey} a_3} | & x_{31} & x_{32} & x_{33} & x_{34} \n\\end{bmatrix}\n= \\begin{bmatrix}\n& {\\color{grey} V_1} & {\\color{grey} V_2} & {\\color{grey} V_3} & {\\color{grey} V_4} \\\\\\hline\n{\\color{grey} a_1} | & 7.3 & 7.6 & 7.7 & 8.0 \\\\\n{\\color{grey} a_2} | & 7.4 & 7.2 & 7.3 & 7.2 \\\\\n{\\color{grey} a_3} | & 4.1 & 4.6 & 4.6 & 4.8 \n\\end{bmatrix}\n\\end{align*}\n\n\n\\noindent which is plotted in @fig-similarity1. The Euclidean distance between two cases (rows of the matrix) with $p$ elements is defined as\n\n\\begin{align*}\nd_{\\rm Euc}(a_i,a_j) &=& ||a_i-a_j|| %\\\\\n% &=& \\sqrt{(x_{i1}-x_{j1})^2+\\dots + (x_{ip}-x_{jp})^2},\n~~~~~~i,j=1,\\dots, n,\n\\end{align*}\n\n\\noindent where $||x_i||=\\sqrt{x_{i1}^2+x_{i2}^2+\\dots +x_{ip}^2}$. 
For example, the Euclidean distance between cases $a_1$ and $a_2$ in the above data is\n\n\\begin{align*}\nd_{\\rm Euc}(a_1,a_2) &= \\sqrt{(7.3-7.4)^2+(7.6-7.2)^2+ (7.7-7.3)^2+(8.0-7.2)^2} \\\\\n&\\approx 1.0 \n\\end{align*}\n\n\\index{cluster analysis!interpoint distance}\n\n\\noindent For the three cases, the interpoint Euclidean distance matrix is\n\n::: {.content-visible when-format=\"html\"}\n::: {.hidden}\n $$\n \\require{mathtools}\n \\definecolor{grey}{RGB}{192, 192, 192}\n $$\n:::\n:::\n\n\n\\begin{align*}\nd_{\\rm Euc} = \\begin{bmatrix}\n& {\\color{grey} a_1} & {\\color{grey} a_2} & {\\color{grey} a_3} \\\\\\hline\n{\\color{grey} a_1} | & 0.0 & 1.0 & 6.3 \\\\\n{\\color{grey} a_2} | & 1.0 & 0.0 & 5.5 \\\\\n{\\color{grey} a_3} | & 6.3 & 5.5 & 0.0\n\\end{bmatrix}\n\\end{align*}\n\n::: {#fig-similarity1 layout-ncol=2}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](6-intro-clust_files/figure-pdf/unnamed-chunk-2-1.pdf){width=80%}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](6-intro-clust_files/figure-pdf/unnamed-chunk-3-1.pdf){width=80%}\n:::\n:::\n\n\n\nThe scatterplot matrix (left) shows that cases $a_1$ and $a_2$ have similar values. The parallel coordinate plot (right) allows comparison of other structure, here showing the similarity in the trend of the profiles of cases $a_1$ and $a_3$. \n:::\n\n\n\\noindent Cases $a_1$ and $a_2$ are more similar to each other than they are to case $a_3$, because the Euclidean distance between cases $a_1$ and $a_2$ is much smaller than the distance between cases $a_1$ and $a_3$ and between cases $a_2$ and $a_3$.\n\nThere are many different ways to calculate similarity. A similarity measure based on correlation distance can be useful. It is typically used where similarity of structure or shape is more important than similarity in magnitude.\n\n\\index{parallel coordinate plot}\n\nAs an example, see the parallel coordinate plot of the sample data at the right of @fig-similarity1. 
Cases $a_1$ and $a_3$ are widely separated, but their shapes are similar (low, medium, medium, high). Case $a_2$, although\noverlapping with case $a_1$, has a very different shape (high, medium, medium, low). The Pearson correlation between two cases, $\\rho(a_i,a_j)$, is defined as\n\n\\begin{align*}\n\\rho(a_i,a_j) = \\frac{(a_i-c_i)^\\top(a_j-c_j)}\n{\\sqrt{(a_i-c_i)^\\top(a_i-c_i)} \\sqrt{(a_j-c_j)^\\top(a_j-c_j)}}\n\\label{corc}\n\\end{align*}\n\n\\noindent Typically, $c_i, c_j$ are the sample means of each case, $\\bar{a}_i,\\bar{a}_j$. For these three observations, $c_1=\\bar{a}_1=7.650, c_2=\\bar{a}_2=7.275, c_3=\\bar{a}_3=4.525$. An interesting geometric fact is that if $c_i, c_j$ are set to be 0, as is commonly done, $\\rho$ is a generalized correlation that describes the angle between the two data vectors. The correlation is then converted to a distance metric, with one possibility being as follows:\n\n\n\\begin{align*}\nd_{\\rm Cor}(a_i,a_j) = \\sqrt{2(1-\\rho(a_i,a_j))}\n\\end{align*}\n\nThis distance metric will treat cases that are strongly negatively correlated as the most distant. If you want to consider strong negative correlation as close, then you could take the absolute value of $\\rho(a_i,a_j)$ in the above equation, and remove the multiplication by 2.\n\nThe interpoint distance matrix for the sample data using $d_{\\rm Cor}$ and the Pearson correlation coefficient is\n\n\\begin{align*}\nd_{\\rm Cor} = \\begin{bmatrix}\n& {\\color{grey} a_1} & {\\color{grey} a_2} & {\\color{grey} a_3} \\\\\\hline\n{\\color{grey} a_1} | & 0.0 & 1.9 & 0.3 \\\\\n{\\color{grey} a_2} | & 1.9 & 0.0 & 2.0 \\\\\n{\\color{grey} a_3} | & 0.3 & 2.0 & 0.0\n\\end{bmatrix}\n\\end{align*}\n\n\\noindent By this metric, cases $a_1$ and $a_3$ are the most similar, because the correlation distance is smaller between these two cases than between the other pairs of cases. 
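The two interpoint distance matrices above come straight from the definitions, so they are easy to check. The following sketch computes $d_{\rm Euc}$ and $d_{\rm Cor}$ for the three cases directly from the formulas (in plain Python rather than the book's R, purely as an independent check):

```python
import math

# The three cases from the text
cases = {
    "a1": [7.3, 7.6, 7.7, 8.0],
    "a2": [7.4, 7.2, 7.3, 7.2],
    "a3": [4.1, 4.6, 4.6, 4.8],
}

def d_euc(x, y):
    # Euclidean distance: square root of the summed squared differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def rho(x, y):
    # Pearson correlation between two cases, centring each by its own mean
    cx, cy = sum(x) / len(x), sum(y) / len(y)
    xc = [xi - cx for xi in x]
    yc = [yi - cy for yi in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return num / den

def d_cor(x, y):
    # Correlation distance as defined in the text: sqrt(2(1 - rho))
    return math.sqrt(2 * (1 - rho(x, y)))

for i, j in [("a1", "a2"), ("a1", "a3"), ("a2", "a3")]:
    print(i, j,
          round(d_euc(cases[i], cases[j]), 1),
          round(d_cor(cases[i], cases[j]), 1))
```

The two metrics disagree, as discussed: $a_1$ and $a_2$ are closest in Euclidean distance, while $a_1$ and $a_3$ are closest in correlation distance.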
\\index{cluster analysis!interpoint distance}\n\nNote that these interpoint distances differ dramatically from those for Euclidean distance. As a consequence, the way the cases would be clustered is also very different. Choosing the appropriate distance measure is an important part of a cluster analysis.\n\nAfter a distance metric has been chosen and a cluster analysis has been performed, the analyst must evaluate the results, and this is a difficult task. A cluster analysis does not generate $p$-values or other numerical criteria, and the process tends to produce hypotheses rather than testing them. Even the most determined attempts to produce the \"best\" results using modeling and validation techniques may result in clusters that, although seemingly significant, are useless for practical purposes. As a result, cluster analysis is best thought of as an exploratory technique, and it can be quite useful despite the lack of formal validation because of its power in data simplification.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nDefining an appropriate distance metric from the context of the problem is a most important decision. For example, if your variables are all numeric and on the same scale, then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Defining an appropriate distance metric from the context of the problem is a most important decision. For example, if your variables are all numeric and on the same scale, then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance.}\n:::\n\nThe context in which the data arises is the key to assessing the results. If the clusters can be characterized in a sensible manner, and they increase our knowledge of the data, then we are on the right track. 
To use an even more pragmatic criterion, if a company can gain an economic advantage by using a particular clustering method to carve up their customer database, then that is the method they should use.\n\n## Exercises {-}\n\nUse the following data to answer these questions:\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n x1 x2 x3\na1 0.13 0.21 0.09\na2 0.91 0.95 0.85\na3 0.62 0.73 0.65\na4 0.21 0.92 0.43\n```\n\n\n:::\n:::\n\n\n\n1. Compute the Euclidean distance between cases `a1`, `a2`, `a3`, `a4`.\n\n2. Compute the correlation distance (as defined above) between cases `a1`, `a2`, `a3`, `a4`.\n\n\n3. Which two points have the (a) biggest and (b) smallest Mahalanobis (statistical) distance, assuming that the covariance matrix is:\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n x1 x2 x3\nx1 1.0 0.8 0.8\nx2 0.8 1.0 0.8\nx3 0.8 0.8 1.0\n```\n\n\n:::\n:::\n\n\n\n(The base function `mahalanobis()` will calculate this in R. Technically this gives the distance between each case and the mean vector.)\n\n4. Is the ordering of distance between cases the same if Manhattan distance is used instead of Euclidean?\n\n5. Compute the Chebyshev distance between cases `a1`, `a2`, `a3`, `a4`.\n\n6. Compute the Bray-Curtis distance between cases `a1`, `a2`, `a3`, `a4`.\n\n7. Make a plot of the data, and write a paragraph describing how the different distance metrics agree and disagree on how close or far the cases are from each other. 
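Several of the metrics in these exercises (Manhattan, Chebyshev, Bray-Curtis) are not defined in the text above. As a starting point, here is a sketch of their standard definitions (in plain Python rather than the book's R); the values printed use only the first two exercise cases:

```python
def manhattan(x, y):
    # Manhattan (city-block) distance: sum of absolute differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def chebyshev(x, y):
    # Chebyshev distance: largest absolute difference across the variables
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def bray_curtis(x, y):
    # Bray-Curtis dissimilarity: summed absolute differences scaled by
    # the summed values (assumes non-negative data, as here)
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / \
        sum(xi + yi for xi, yi in zip(x, y))

a1 = [0.13, 0.21, 0.09]
a2 = [0.91, 0.95, 0.85]
print(round(manhattan(a1, a2), 2),
      round(chebyshev(a1, a2), 2),
      round(bray_curtis(a1, a2), 2))
```

Note that Manhattan and Chebyshev are metrics, while Bray-Curtis is a bounded dissimilarity (between 0 and 1 for non-negative data) and does not satisfy the triangle inequality.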
\n\n\n\n::: {.cell}\n\n:::", "supporting": [ "6-intro-clust_files/figure-pdf" ], diff --git a/_freeze/6-intro-clust/figure-html/fig-ideal-clusters-1.png b/_freeze/6-intro-clust/figure-html/fig-ideal-clusters-1.png new file mode 100644 index 0000000..3f8248e Binary files /dev/null and b/_freeze/6-intro-clust/figure-html/fig-ideal-clusters-1.png differ diff --git a/_freeze/6-intro-clust/figure-html/unnamed-chunk-2-1.png b/_freeze/6-intro-clust/figure-html/unnamed-chunk-2-1.png new file mode 100644 index 0000000..1f1df15 Binary files /dev/null and b/_freeze/6-intro-clust/figure-html/unnamed-chunk-2-1.png differ diff --git a/_freeze/6-intro-clust/figure-html/unnamed-chunk-3-1.png b/_freeze/6-intro-clust/figure-html/unnamed-chunk-3-1.png new file mode 100644 index 0000000..6207978 Binary files /dev/null and b/_freeze/6-intro-clust/figure-html/unnamed-chunk-3-1.png differ diff --git a/_freeze/6-intro-clust/figure-pdf/fig-ideal-clusters-1.pdf b/_freeze/6-intro-clust/figure-pdf/fig-ideal-clusters-1.pdf index 6ef486d..139d4c4 100644 Binary files a/_freeze/6-intro-clust/figure-pdf/fig-ideal-clusters-1.pdf and b/_freeze/6-intro-clust/figure-pdf/fig-ideal-clusters-1.pdf differ diff --git a/_freeze/6-intro-clust/figure-pdf/unnamed-chunk-2-1.pdf b/_freeze/6-intro-clust/figure-pdf/unnamed-chunk-2-1.pdf index 86ef62d..e70f553 100644 Binary files a/_freeze/6-intro-clust/figure-pdf/unnamed-chunk-2-1.pdf and b/_freeze/6-intro-clust/figure-pdf/unnamed-chunk-2-1.pdf differ diff --git a/_freeze/6-intro-clust/figure-pdf/unnamed-chunk-3-1.pdf b/_freeze/6-intro-clust/figure-pdf/unnamed-chunk-3-1.pdf index 9eb3f9b..725d9a4 100644 Binary files a/_freeze/6-intro-clust/figure-pdf/unnamed-chunk-3-1.pdf and b/_freeze/6-intro-clust/figure-pdf/unnamed-chunk-3-1.pdf differ diff --git a/_freeze/7-spin-and-brush/execute-results/html.json b/_freeze/7-spin-and-brush/execute-results/html.json new file mode 100644 index 0000000..91f2872 --- /dev/null +++ 
b/_freeze/7-spin-and-brush/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "ffc2d272f1b79c4104977850d2b30cb8", + "result": { + "engine": "knitr", + "markdown": "## Spin-and-brush approach {#sec-clust-graphics}\n\n\\index{brushing!persistent} \\index{tour}\n\\index{cluster analysis!spin-and-brush}\n\nSeveral examples of the spin-and-brush approach are documented in the literature, such as @CBCH95 and @WWS99. The steps are:\n\n1. Run the (grand) tour.\n2. Stop when you see a separated cluster of points.\n3. Paint the cluster a chosen colour.\n4. Repeat steps 1-3 until the data is grouped and no other separated cluster is visible in any projection. You may need to re-paint some points if they appear to be grouped incorrectly in a different projection, or paint additional points that, after more spinning, most likely belong to an existing group.\n\nSpin-and-brush is useful for exploring clustering when the data is numeric and contains well-separated clusters. Patterns that adversely affect numerical techniques, such as nuisance variables or cases, or differences in variance or shape between clusters, don't pose any problems for spin-and-brush. It is also effective if the data has connected low-dimensional (1D or 2D) clusters in high dimensions. \n\nIt will not work very well when there are no distinct clusters and the purpose of clustering is to partition the data into subsets. Here, you could begin with a solution provided by some numerical clustering algorithm, and use visual tools to evaluate it, with the goal of refining the results.\n\nWith a complex problem where there are many clusters, one can work sequentially, removing each cluster after it is brushed to de-clutter the display and make further clusters easier to find.\n\n\n\nSpin-and-brush is best achieved using a fully interactive graphics system like that in the `detourr` package, where the results can be saved for further analysis. 
The code is short, and all the controls are interactive.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(detourr)\ngrDevices::hcl.colors(3, palette=\"Zissou 1\")\ndetour(penguins_sub[,1:4], \n tour_aes(projection = bl:bm)) |>\n tour_path(grand_tour(2), fps = 60, \n max_bases=20) |>\n show_scatter(alpha = 0.7, \n axes = FALSE)\n```\n:::\n\n\n- `tour_aes(projection = bl:bm)` is `ggplot`-style syntax for specifying the variables `bl:bm` to include in the tour. \n- `tour_path(grand_tour(2), fps = 60, max_bases=20)` specifies a 2D grand tour path; `max_bases=20` sets a longer than default path, and the `fps` argument sets the smoothness.\n- Brush interaction is set by choosing the square icon (4th from top), so that when the cursor is moved over the window, points are selected.\n- You can choose specific colours to brush from the colour palette, using hex colours to match your favourite palette. Here we've used colours from the Zissou palette.\n- The paintbrush icon sets the selected points to the current colour.\n- Save the final colour labels using the download icon. \n\n::: {#fig-penguins-bs-detourr fig-align=\"center\" layout-ncol=2}\n\n\n![One cluster painted](images/penguins-bs6.png){#fig-penguins-bs3 fig-alt=\"Projected view where one cluster can be distinguished and is brushed in blue.\" fig.align=\"center\" width=270}\n\n![Another cluster painted](images/penguins-bs7.png){#fig-penguins-bs4 fig-alt=\"Projection where a second cluster can be distinguished and is brushed in red.\" fig.align=\"center\" width=270}\n\n\nScreenshots of the spin-and-brush approach using `detourr` on the penguins data. \n:::\n\n@fig-penguins-bs-detourr shows the stages of spin-and-brush on the penguins data using `detourr`. The final results can be examined and used for later analysis. 
Because this data came with a class variable, the penguin species, it is interesting to see how close the spin-and-brush clustering approach came to recovering these: \n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make confusion matrix\"}\nlibrary(readr)\nload(\"data/penguins_sub.rda\")\ndetourr_penguins <- read_csv(\"data/detourr_penguins.csv\")\ntable(penguins_sub$species, detourr_penguins$colour)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n \n 000000 3e9eb6 f5191c\n Adelie 143 0 3\n Chinstrap 6 0 62\n Gentoo 2 117 0\n```\n\n\n:::\n:::\n\n\nIt's quite close! All but two of the 119 Gentoo penguins were identified as a cluster (labelled as \"3e9eb6\" from the chosen light blue hex colour), and all but three of the 146 Adelie penguins were identified as a cluster (labelled as \"000000\", which is the unbrushed black group). Most of the Chinstrap penguins were also recovered (labelled as \"f5191c\" for the red hex colour).\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n## Exercises {-}\n\n1. Use the spin-and-brush approach to identify the three clusters in the `mulgar::clusters` data set.\n2. Use the spin-and-brush approach to identify the six clusters in the `mulgar::multicluster` data set. (The code below using `detourr` could be useful.)\n3. Use spin-and-brush on the challenge data sets, `c1`-`c7` from the `mulgar` package. How many clusters do you detect in each?\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(detourr)\n\n# Use a random starting basis because the first two variables make it too easy\nstrt <- tourr::basis_random(10, 2)\ndetour(multicluster, \n tour_aes(projection = -group)) |>\n tour_path(grand_tour(2), start=strt, fps = 60) |>\n show_scatter(alpha = 0.7, axes = FALSE)\n```\n:::\n\n\n4. Use the spin-and-brush technique to identify the branches of the `fake_trees` data. 
The result should look something like this:\n\n![Example solution after spin-and-brush on fake trees data.](images/fake_trees_sb.png){#fig-fake-trees-sb fig-alt=\"Projection where some clusters extend in different direction, with point colors indicating the user-identified clusters.\"}\n\nYou can use the download button to save the data with the colours. Tabulate the `branches` id variable in the original data with the `colour` groups created from brushing, to see how closely you have recovered the original classes.\n\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "7-spin-and-brush_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/7-spin-and-brush/execute-results/tex.json b/_freeze/7-spin-and-brush/execute-results/tex.json index b4d0014..97157fd 100644 --- a/_freeze/7-spin-and-brush/execute-results/tex.json +++ b/_freeze/7-spin-and-brush/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "ffc2d272f1b79c4104977850d2b30cb8", + "hash": "f5d78b818101dd46ba759478592119ab", "result": { "engine": "knitr", - "markdown": "## Spin-and-brush approach {#sec-clust-graphics}\n\n\\index{brushing!persistent} \\index{tour}\n\\index{cluster analysis!spin-and-brush}\n\nSeveral examples of the spin-and-brush approach are documented in the literature, such as @CBCH95 and @WWS99. The steps are:\n\n1. Run the (grand) tour.\n2. Stop when you see a separated cluster of points.\n3. Paint the cluster a chosen colour.\n4. Repeat 1-2 until the data is grouped, and when no other separated cluster is visible in any projection. You may need to re-paint some points if they appear to be grouped incorrectly in a different projection, or paint more points that after spinning most likely belong to an existing group.\n\nSpin-and-brush is useful for exploring clustering when the data is numeric, and contains well-separated clusters. 
Patterns that adversely affect numerical techniques, such as nuisance variables or cases, differences in variances or shapes between clusters, don't pose any problems for spin-and-brush. It is also effective if the data has connected low-dimensional (1D or 2D) clusters in high dimensions. \n\nIt will not work very well when there are no distinct clusters and the purpose of clustering is to partition the data into subsets. Here, you could begin with a solution provided by some numerical clustering algorithm, and to use visual tools to evaluate it, with goal of refining the results.\n\nWith a complex problem where there are many clusters, one can work sequentially, and remove each cluster after it is brushed, to de-clutter the display, in order to find more clusters.\n\n\n\nSpin-and-brush is best achieved using a fully interactive graphics system like in the `detourr` package, where the results can be saved for further analysis. The code is very easy, and then all the controls are interactive.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(detourr)\ngrDevices::hcl.colors(3, palette=\"Zissou 1\")\ndetour(penguins_sub[,1:4], \n tour_aes(projection = bl:bm)) |>\n tour_path(grand_tour(2), fps = 60, \n max_bases=20) |>\n show_scatter(alpha = 0.7, \n axes = FALSE)\n```\n:::\n\n\n\n- `tour_aes(projection = bl:bm))` is `ggplot`-style syntax for specifying the variables `bl:bm` to include in the tour. \n- `tour_path(grand_tour(2), fps = 60, max_bases=20)` specifies 2D grand tour path, with a longer than default path set by `max_bases=20` and the `fps` argument sets the smoothness.\n- Brush interaction is set by choosing the square icon (4th from top), so when the cursor is moved over the window points are selected.\n- You can choose specific colours to brush, from the colour palette by using hexcolours to match your favourite palette. 
Here we've used colours from the Zissou palette.\n- The paintbrush icon sets the selected points to the current colour.\n- Save the final colour labels using the download icon. \n\n::: {#fig-penguins-bs-detourr fig-align=\"center\" layout-ncol=2}\n\n\n![One cluster painted](images/penguins-bs6.png){#fig-penguins-bs3 fig-alt=\"Projected view where one cluster can be distinguished and is brushed in blue.\" fig.align=\"center\" width=270}\n\n![Another cluster painted](images/penguins-bs7.png){#fig-penguins-bs4 fig-alt=\"Projection where a second cluster can be distinguished and is brushed in red.\" fig.align=\"center\" width=270}\n\n\nScreenshots of the spin-and-brush approach using `detourr` on the penguins data. \n:::\n\n@fig-penguins-bs-detourr shows the stages of spin-and-brush on the penguins data using detourr. The final results can be examined and used for later analysis. Because this data came with a class variable, the penguin species, it is interesting to see how close the spin-and-brush clustering approach came to recovering these: \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make confusion matrix\"}\nlibrary(readr)\nload(\"data/penguins_sub.rda\")\ndetourr_penguins <- read_csv(\"data/detourr_penguins.csv\")\ntable(penguins_sub$species, detourr_penguins$colour)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n \n 000000 3e9eb6 f5191c\n Adelie 143 0 3\n Chinstrap 6 0 62\n Gentoo 2 117 0\n```\n\n\n:::\n:::\n\n\n\nIt's quite close! All but two of the 119 Gentoo penguins were identified as a cluster (labelled as \"3e9eb6\" from the chosen light blue hex colour), and all but three of the 146 Adelie penguins were identified as a cluster, (labelled as \"000000\" which is the unbrushed black group). Most of the Chinstrap species were recovered also (labelled as \"f5191c\" for the red hex colour).\n\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n\n## Exercises {-}\n\n1. 
Use the spin-and-brush approach to identify the three clusters in the `mulgar::clusters` data set.\n2. Use the spin-and-brush approach to identify the six clusters in the `mulgar::multicluster` data set. (The code below using detourr could be useful.)\n3. Use spin-and-brush on the challenge data sets, `c1`-`c7` from the `mulgar` package. How many clusters do you detect in each?\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(detourr)\n\n# Use a random starting basis because the first two variables make it too easy\nstrt <- tourr::basis_random(10, 2)\ndetour(multicluster, \n tour_aes(projection = -group)) |>\n tour_path(grand_tour(2), start=strt, fps = 60) |>\n show_scatter(alpha = 0.7, axes = FALSE)\n```\n:::\n\n\n\n3. Use the spin-and-brush technique to identify the branches of the `fake_trees` data. The result should look something like this:\n\n![Example solution after spin-and-brush on fake trees data.](images/fake_trees_sb.png){#fig-fake-trees-sb fig-alt=\"Projection where some clusters extend in different direction, with point colors indicating the user-identified clusters.\"}\n\nYou can use the download button to save the data with the colours. Tabulate the `branches` id variable in the original data with the `colour` groups created from brushing, to see how closely you have recovered the original classes.\n\n\n\n::: {.cell}\n\n:::\n", + "markdown": "## Spin-and-brush approach {#sec-clust-graphics}\n\n\\index{brushing!persistent} \\index{tour}\n\\index{cluster analysis!spin-and-brush}\n\nSeveral examples of the spin-and-brush approach are documented in the literature, such as @CBCH95 and @WWS99. The steps are:\n\n1. Run the (grand) tour.\n2. Stop when you see a separated cluster of points.\n3. Paint the cluster a chosen colour.\n4. Repeat 1-2 until the data is grouped, and when no other separated cluster is visible in any projection. 
You may need to re-paint some points if they appear to be grouped incorrectly in a different projection, or paint more points that after spinning most likely belong to an existing group.\n\nSpin-and-brush is useful for exploring clustering when the data is numeric, and contains well-separated clusters. Patterns that adversely affect numerical techniques, such as nuisance variables or cases, differences in variances or shapes between clusters, don't pose any problems for spin-and-brush. It is also effective if the data has connected low-dimensional (1D or 2D) clusters in high dimensions. \n\nIt will not work very well when there are no distinct clusters and the purpose of clustering is to partition the data into subsets. Here, you could begin with a solution provided by some numerical clustering algorithm, and to use visual tools to evaluate it, with goal of refining the results.\n\nWith a complex problem where there are many clusters, one can work sequentially, and remove each cluster after it is brushed, to de-clutter the display, in order to find more clusters.\n\n\nSpin-and-brush is best achieved using a fully interactive graphics system like in the `detourr` package, where the results can be saved for further analysis. The code is very easy, and then all the controls are interactive.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(detourr)\ngrDevices::hcl.colors(3, palette=\"Zissou 1\")\ndetour(penguins_sub[,1:4], \n tour_aes(projection = bl:bm)) |>\n tour_path(grand_tour(2), fps = 60, \n max_bases=20) |>\n show_scatter(alpha = 0.7, \n axes = FALSE)\n```\n:::\n\n\n\n- `tour_aes(projection = bl:bm))` is `ggplot`-style syntax for specifying the variables `bl:bm` to include in the tour. 
\n- `tour_path(grand_tour(2), fps = 60, max_bases=20)` specifies 2D grand tour path, with a longer than default path set by `max_bases=20` and the `fps` argument sets the smoothness.\n- Brush interaction is set by choosing the square icon (4th from top), so when the cursor is moved over the window points are selected.\n- You can choose specific colours to brush, from the colour palette by using hexcolours to match your favourite palette. Here we've used colours from the Zissou palette.\n- The paintbrush icon sets the selected points to the current colour.\n- Save the final colour labels using the download icon. \n\n::: {#fig-penguins-bs-detourr fig-align=\"center\" layout-ncol=2}\n\n\n![One cluster painted](images/penguins-bs6.png){#fig-penguins-bs3 fig-alt=\"Projected view where one cluster can be distinguished and is brushed in blue.\" fig.align=\"center\" width=270}\n\n![Another cluster painted](images/penguins-bs7.png){#fig-penguins-bs4 fig-alt=\"Projection where a second cluster can be distinguished and is brushed in red.\" fig.align=\"center\" width=270}\n\n\nScreenshots of the spin-and-brush approach using `detourr` on the penguins data. \n:::\n\n@fig-penguins-bs-detourr shows the stages of spin-and-brush on the penguins data using detourr. The final results can be examined and used for later analysis. Because this data came with a class variable, the penguin species, it is interesting to see how close the spin-and-brush clustering approach came to recovering these: \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make confusion matrix\"}\nlibrary(readr)\nload(\"data/penguins_sub.rda\")\ndetourr_penguins <- read_csv(\"data/detourr_penguins.csv\")\ntable(penguins_sub$species, detourr_penguins$colour)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n \n 000000 3e9eb6 f5191c\n Adelie 143 0 3\n Chinstrap 6 0 62\n Gentoo 2 117 0\n```\n\n\n:::\n:::\n\n\n\nIt's quite close! 
All but two of the 119 Gentoo penguins were identified as a cluster (labelled as \"3e9eb6\" from the chosen light blue hex colour), and all but three of the 146 Adelie penguins were identified as a cluster, (labelled as \"000000\" which is the unbrushed black group). Most of the Chinstrap species were recovered also (labelled as \"f5191c\" for the red hex colour).\n\n## Exercises {-}\n\n1. Use the spin-and-brush approach to identify the three clusters in the `mulgar::clusters` data set.\n2. Use the spin-and-brush approach to identify the six clusters in the `mulgar::multicluster` data set. (The code below using detourr could be useful.)\n3. Use spin-and-brush on the challenge data sets, `c1`-`c7` from the `mulgar` package. How many clusters do you detect in each?\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(detourr)\n\n# Use a random starting basis because the first two variables make it too easy\nstrt <- tourr::basis_random(10, 2)\ndetour(multicluster, \n tour_aes(projection = -group)) |>\n tour_path(grand_tour(2), \n start=strt, fps = 60) |>\n show_scatter(alpha = 0.7, \n axes = FALSE)\n```\n:::\n\n\n\n4. Use the spin-and-brush technique to identify the branches of the `fake_trees` data. The result should look something like this:\n\n![Example solution after spin-and-brush on fake trees data.](images/fake_trees_sb.png){#fig-fake-trees-sb fig-alt=\"Projection where some clusters extend in different direction, with point colors indicating the user-identified clusters.\"}\n\nYou can use the download button to save the data with the colours. Tabulate the `branches` id variable in the original data with the `colour` groups created from brushing, to see how closely you have recovered the original classes.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Project {-}\n\nThis exercise continues from the project in Chapter 5, to check your choice of NLDR representation. 
Using your best NLDR representation, cluster the data into as many clusters or clumps as you can see. Save the clusters. Now use spin-and-brush in `detourr` to colour as many clusters in the high dimensions as you can find. Save your clusters. How closely do these two approaches agree? \n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/8-hierarchical/execute-results/html.json b/_freeze/8-hierarchical/execute-results/html.json new file mode 100644 index 0000000..2766d19 --- /dev/null +++ b/_freeze/8-hierarchical/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "113646cbba3e3abb77e2c4fde9f0076a", + "result": { + "engine": "knitr", + "markdown": "# Hierarchical clustering {#sec-hclust}\n\n\\index{cluster analysis!algorithms}\n\n\\index{cluster analysis!hierarchical} \\index{cluster analysis!intercluster distance (linkage)}\n\n## Overview\n\nHierarchical cluster algorithms sequentially fuse neighboring points to form ever-larger clusters, starting from a full interpoint distance matrix. *Distance between clusters* is described by a \"linkage method\", of which there are many. For example, single linkage measures the distance between clusters by the smallest interpoint distance between the members of the two clusters, complete linkage uses the maximum interpoint distance, and average linkage uses the average of the interpoint distances. Ward's linkage, which usually produces the best clustering solutions, defines the distance as the reduction in the within-group variance. A good discussion of cluster analysis and linkage can be found in @HOML, on [Wikipedia](https://en.wikipedia.org/wiki/Cluster_analysis), or in any multivariate textbook.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nHierarchical clustering is summarised by a dendrogram, which sequentially shows points being joined to form a cluster, with the corresponding distances. Breaking the data into clusters is done by cutting the dendrogram at the long edges. 
\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Hierarchical clustering is summarised by a dendrogram, which sequentially shows points being joined to form a cluster, with the corresponding distances. Breaking the data into clusters is done by cutting the dendrogram at the long edges.}\n:::\n\n\nHere we will take a look at hierarchical clustering, using Ward's linkage, on the `simple_clusters` data. The steps taken are to:\n\n1. Plot the data to check for presence of clusters and their shape.\n2. Compute the hierarchical clustering.\n3. Plot the dendrogram to help decide on an appropriate number of clusters, using the `dendro_data` function from the `ggdendro` package.\n4. Show the dendrogram overlaid on the data, calculated by the `hierfly` function in `mulgar`.\n5. Plot the clustering result, by colouring points in the plot of the data.\n\n\\index{cluster analysis!dendrogram}\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Load libraries\"}\nlibrary(ggplot2)\nlibrary(mulgar)\nlibrary(ggdendro)\nlibrary(dplyr)\nlibrary(patchwork)\nlibrary(tourr)\nlibrary(plotly)\nlibrary(htmlwidgets)\nlibrary(colorspace)\nlibrary(GGally)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndata(simple_clusters)\n\n# Compute hierarchical clustering with Ward's linkage\ncl_hw <- hclust(dist(simple_clusters[,1:2]),\n method=\"ward.D2\")\ncl_ggd <- dendro_data(cl_hw, type = \"triangle\")\n\n# Compute dendrogram in the data\ncl_hfly <- hierfly(simple_clusters, cl_hw, scale=FALSE)\n\n# Show result\nsimple_clusters <- simple_clusters %>%\n mutate(clw = factor(cutree(cl_hw, 2)))\n```\n:::\n\n\n@fig-hc-sim illustrates the hierarchical clustering approach for a simple simulated data set (a) with two well-separated clusters in 2D. The dendrogram (b) is a representation of the order that points are joined into clusters. 
The dendrogram strongly indicates two clusters because the two branches representing the last join are much longer than all of the other branches. \n\nAlthough the dendrogram is usually a good summary of the steps taken by the algorithm, it can be misleading. The dendrogram might indicate a clear clustering (big differences in heights of branches) but the result may be awful. You need to check this by examining the result on the data, called model-in-the-data space by @wickham2015. \n\nPlot (c) shows the dendrogram in 2D, overlaid on the data. The segments show how the points are joined to make clusters. In order to represent the dendrogram this way, new points (represented by a \"+\" here) need to be added corresponding to the centroid of groups of points that have been joined. These are used to draw the segments between other points and other clusters. We can see that the two longest edges stretch across the gap between the two clusters. This corresponds to the top of the dendrogram, the two long branches where we would cut it to make the two-cluster solution. 
This two-cluster solution is shown in plot (d).\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make the four plots\"}\n# Plot the data\npd <- ggplot(simple_clusters, aes(x=x1, y=x2)) +\n geom_point(colour=\"#3B99B1\", size=2, alpha=0.8) +\n ggtitle(\"(a)\") + \n theme_minimal() +\n theme(aspect.ratio=1) \n\n# Plot the dendrogram\nph <- ggplot() +\n geom_segment(data=cl_ggd$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=cl_ggd$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(b)\") + \n theme_minimal() +\n theme_dendro()\n\n# Plot the dendrogram on the data\npdh <- ggplot() +\n geom_segment(data=cl_hfly$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=cl_hfly$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n xlab(\"x1\") + ylab(\"x2\") +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(c)\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Plot the resulting clusters\npc <- ggplot(simple_clusters) +\n geom_point(aes(x=x1, y=x2, colour=clw), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=5, rev=TRUE) +\n ggtitle(\"(d)\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\npd + ph + pdh + pc + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Hierarchical clustering on simulated data: (a) data, (b) dendrogram, (c) dendrogram on the data, and (d) two cluster solution. The extra points corresponding to nodes of the dendrogram are indicated by + in (c). The last join in the dendrogram (b), can be seen to correspond to the edges connecting the gap, when displayed with the data (c). 
The other joins can be seen to be pulling together points within each clump.](8-hierarchical_files/figure-html/fig-hc-sim-1.png){#fig-hc-sim width=100%}\n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nPlotting the dendrogram in the data space can help you understand how the hierarchical clustering has collected the points together into clusters. You can learn if the algorithm has been confused by nuisance patterns in the data, and how different choices of linkage method affect the result. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Plotting the dendrogram in the data space can help you understand how the hierarchical clustering has collected the points together into clusters. You can learn if the algorithm has been confused by nuisance patterns in the data, and how different choices of linkage method affect the result.}\n:::\n\n## Common patterns which confuse clustering algorithms\n\n@fig-problems shows two examples of structure in data that will confuse hierarchical clustering: nuisance variables and nuisance cases. We usually do not know that these problems exist prior to clustering the data. Discovering these iteratively as you conduct a clustering analysis is important for generating useful results. 
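\n\nA quick way to gauge how much the linkage choice matters for a given data set is to cut the same distance matrix under several linkage methods and cross-tabulate the cluster assignments; large disagreements signal sensitivity to the linkage. A minimal sketch (assuming `d1`, the nuisance-observations data simulated below, is available):\n\n```r\nd_mat <- dist(d1[,1:2])\n# Two-cluster cut under three different linkage methods\ncuts <- sapply(c(\"single\", \"complete\", \"ward.D2\"),\n function(m) cutree(hclust(d_mat, method=m), k=2))\n# Cross-tabulate single linkage against Wards linkage assignments\ntable(single=cuts[,\"single\"], ward=cuts[,\"ward.D2\"])\n```\n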
\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plots\"}\n# Nuisance observations\nset.seed(20190514)\nx <- (runif(20)-0.5)*4\ny <- x\nd1 <- data.frame(x1 = c(rnorm(50, -3), \n rnorm(50, 3), x),\n x2 = c(rnorm(50, -3), \n rnorm(50, 3), y),\n cl = factor(c(rep(\"A\", 50), \n rep(\"B\", 70))))\nd1 <- d1 %>% \n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\npd1 <- ggplot(data=d1, aes(x=x1, y=x2)) + \n geom_point() +\n ggtitle(\"Nuisance observations\") + \n theme_minimal() +\n theme(aspect.ratio=1) \n\n# Nuisance variables\nset.seed(20190512)\nd2 <- data.frame(x1=c(rnorm(50, -4), \n rnorm(50, 4)),\n x2=c(rnorm(100)),\n cl = factor(c(rep(\"A\", 50), \n rep(\"B\", 50))))\nd2 <- d2 %>% \n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\npd2 <- ggplot(data=d2, aes(x=x1, y=x2)) + \n geom_point() +\n ggtitle(\"Nuisance variables\") + \n theme_minimal() +\n theme(aspect.ratio=1)\n\npd1 + pd2 + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Two examples of data structure that causes problems for hierarchical clustering. Nuisance observations can cause problems because the close observations between the two clusters can cause some chaining in the hierarchical joining of observations. Nuisance variables can cause problems because observations across the gap can seem closer than observations at the end of each cluster.](8-hierarchical_files/figure-html/fig-problems-1.png){#fig-problems width=768}\n:::\n:::\n\n\nIf an outlier is a point that is extreme relative to other observations, an \"inlier\" is a point that is extreme relative to a cluster, but inside the domain of all of the observations. Nuisance observations are inliers, cases that occur between larger groups of points. If they were excluded there might be a gap between clusters. These can cause problems for clustering when distances between clusters are measured, and can be very problematic when single linkage hierarchical clustering is used. 
@fig-d1-s shows how nuisance observations affect single linkage but not Wards linkage hierarchical clustering.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plots\"}\n# Compute single linkage\nd1_hs <- hclust(dist(d1[,1:2]),\n method=\"single\")\nd1_ggds <- dendro_data(d1_hs, type = \"triangle\")\npd1s <- ggplot() +\n geom_segment(data=d1_ggds$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d1_ggds$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n theme_minimal() +\n ggtitle(\"(a) Single linkage dendrogram\") +\n theme_dendro()\n\n# Compute dendrogram in data\nd1_hflys <- hierfly(d1, d1_hs, scale=FALSE)\n\npd1hs <- ggplot() +\n geom_segment(data=d1_hflys$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d1_hflys$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(b) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd1 <- d1 %>%\n mutate(cls = factor(cutree(d1_hs, 2)))\npc_d1s <- ggplot(d1) +\n geom_point(aes(x=x1, y=x2, colour=cls), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(c) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Compute Wards linkage\nd1_hw <- hclust(dist(d1[,1:2]),\n method=\"ward.D2\")\nd1_ggdw <- dendro_data(d1_hw, type = \"triangle\")\npd1w <- ggplot() +\n geom_segment(data=d1_ggdw$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d1_ggdw$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(d) Ward's linkage dendrogram\") +\n theme_minimal() +\n theme_dendro()\n\n# Compute dendrogram in data\nd1_hflyw <- hierfly(d1, d1_hw, 
scale=FALSE)\n\npd1hw <- ggplot() +\n geom_segment(data=d1_hflyw$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d1_hflyw$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(e) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd1 <- d1 %>%\n mutate(clw = factor(cutree(d1_hw, 2)))\npc_d1w <- ggplot(d1) +\n geom_point(aes(x=x1, y=x2, colour=clw), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(f) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\npd1s + pd1hs + pc_d1s + \n pd1w + pd1hw + pc_d1w +\n plot_layout(ncol=3)\n```\n\n::: {.cell-output-display}\n![The effect of nuisance observations on single linkage (a, b, c) and Ward's linkage hierarchical clustering (d, e, f). The single linkage dendrogram is very different to the Wards linkage dendrogram. When plotted with the data (b) we can see a pin cushion or asterisk pattern, where points are joined to others through a place in the middle of the line of nuisance observations. This results in a bad two-cluster solution: a singleton cluster, and all the rest. Conversely, Ward's dendrogram (d) strongly suggests two clusters; although the final join corresponds to just a small gap when shown on the data (e), it results in two sensible clusters.](8-hierarchical_files/figure-html/fig-d1-s-1.png){#fig-d1-s width=864}\n:::\n:::\n\n\nNuisance variables are ones that do not contribute to the clustering, such as `x2` here. When we look at this data we see a gap between two elliptically shaped clusters, with the gap being only in the horizontal direction, `x1`. 
When we compute the distances between points, in order to start clustering, without knowing that `x2` is a nuisance variable, points across the gap might be considered to be closer than points within the same cluster. @fig-d2-c shows how nuisance variables affect complete linkage but not Wards linkage hierarchical clustering. (Wards linkage can be affected but it isn't for this data.) Interestingly, the dendrogram for complete linkage looks ideal, in that it suggests two clusters. It is not until you examine the resulting clusters in the data that you can see the error: it has clustered across the gap.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Compute complete linkage\nd2_hc <- hclust(dist(d2[,1:2]),\n method=\"complete\")\nd2_ggdc <- dendro_data(d2_hc, type = \"triangle\")\npd2c <- ggplot() +\n geom_segment(data=d2_ggdc$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d2_ggdc$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(a) Complete linkage dendrogram\") +\n theme_minimal() +\n theme_dendro()\n\n# Compute dendrogram in data\nd2_hflyc <- hierfly(d2, d2_hc, scale=FALSE)\n\npd2hc <- ggplot() +\n geom_segment(data=d2_hflyc$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d2_hflyc$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(b) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd2 <- d2 %>%\n mutate(clc = factor(cutree(d2_hc, 2)))\npc_d2c <- ggplot(d2) +\n geom_point(aes(x=x1, y=x2, colour=clc), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(c) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Compute 
Wards linkage\nd2_hw <- hclust(dist(d2[,1:2]),\n method=\"ward.D2\")\nd2_ggdw <- dendro_data(d2_hw, type = \"triangle\")\npd2w <- ggplot() +\n geom_segment(data=d2_ggdw$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d2_ggdw$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(d) Ward's linkage dendrogram\") +\n theme_minimal() +\n theme_dendro()\n\n# Compute dendrogram in data\nd2_hflyw <- hierfly(d2, d2_hw, scale=FALSE)\n\npd2hw <- ggplot() +\n geom_segment(data=d2_hflyw$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d2_hflyw$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(e) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd2 <- d2 %>%\n mutate(clw = factor(cutree(d2_hw, 2)))\npc_d2w <- ggplot(d2) +\n geom_point(aes(x=x1, y=x2, colour=clw), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(f) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\npd2c + pd2hc + pc_d2c + \n pd2w + pd2hw + pc_d2w +\n plot_layout(ncol=3)\n```\n\n::: {.cell-output-display}\n![Complete linkage clustering (a, b, c) on nuisance variables in comparison to Ward's linkage (d, e, f). The two dendrograms (a, d) look similar but when plotted on the data (b, e) we can see they are very different solutions. The complete linkage result breaks the data into clusters across the gap (c), which is a bad solution. It has been distracted by the nuisance variables. 
Conversely, the Wards linkage two-cluster solution does as hoped, dividing the data into two clusters separated by the gap (f).](8-hierarchical_files/figure-html/fig-d2-c-1.png){#fig-d2-c width=864}\n:::\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nTwo dendrograms might look similar but the resulting clustering can be very different. They can also look very different but correspond to very similar clusterings. Plotting the dendrogram in the data space is important for understanding how the algorithm operated when grouping observations, even more so for high dimensions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Two dendrograms might look similar but the resulting clustering can be very different. They can also look very different but correspond to very similar clusterings. Plotting the dendrogram in the data space is important for understanding how the algorithm operated when grouping observations, even more so for high dimensions.}\n:::\n\n## Dendrograms in high-dimensions\n\nThe first step with any clustering of high-dimensional data is also to check the data. You typically don't know whether there are clusters, or what shape they might be, or if there are nuisance observations or variables. A pairs plot like in @fig-penguins-pairs is a nice complement to using the tour (@fig-penguins-gt-html) for this. Here you can see three elliptical clusters, with one further from the others.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for scatterplot matrix\"}\nload(\"data/penguins_sub.rda\")\nggscatmat(penguins_sub[,1:4]) + \n theme_minimal() +\n xlab(\"\") + ylab(\"\")\n```\n\n::: {.cell-output-display}\n![Make a scatterplot matrix to check for the presence of clustering, shape of clusters and presence of nuisance observations and variables. 
In the penguins it appears that there might be three elliptically shaped clusters, with some nuisance observations.](8-hierarchical_files/figure-html/fig-penguins-pairs-1.png){#fig-penguins-pairs width=768}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to create tour\"}\nset.seed(20230329)\nb <- basis_random(4,2)\npt1 <- save_history(penguins_sub[,1:4], \n max_bases = 500, \n start = b)\nsave(pt1, file=\"data/penguins_tour_path.rda\")\n\n# To re-create the gifs\nload(\"data/penguins_tour_path.rda\")\nanimate_xy(penguins_sub[,1:4], \n tour_path = planned_tour(pt1), \n axes=\"off\", rescale=FALSE, \n half_range = 3.5)\n\nrender_gif(penguins_sub[,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, axes=\"off\"),\n gif_file=\"gifs/penguins_gt.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-gt-html}\n\n![](gifs/penguins_gt.gif){fig-alt=\"Tour of many linear projections of the penguins data. You can see three elliptical clusters, one further apart from the other two.\" fig.align=\"center\"}\n\nUse a grand tour of your data to check for clusters, the shape of clusters and for nuisance observations and variables. Here the penguins data looks like it has possibly three elliptical clusters, one more separated than the other two, with some nuisance observations.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n![One frame from a grand tour being used to check for clusters, the shape of clusters and for nuisance observations and variables. Here the penguins data looks like it has possibly three elliptical clusters, one more separated than the other two, with some nuisance observations.](images/penguins_gt_59.png){#fig-penguins-gt-pdf fig-alt=\"A scatterplot of a 2D projection. You can see three elliptical clusters, one further apart from the other two.\" fig.align=\"center\"}\n:::\n\n\nThe process is the same as for the simpler example. 
We compute and draw the dendrogram in 2D, compute it in $p$-D and view with a tour. Here we have also chosen to examine the three cluster solution for single linkage and wards linkage clustering.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\np_dist <- dist(penguins_sub[,1:4])\np_hcw <- hclust(p_dist, method=\"ward.D2\")\np_hcs <- hclust(p_dist, method=\"single\")\n\np_clw <- penguins_sub %>% \n mutate(cl = factor(cutree(p_hcw, 3))) %>%\n as.data.frame()\np_cls <- penguins_sub %>% \n mutate(cl = factor(cutree(p_hcs, 3))) %>%\n as.data.frame()\n\np_w_hfly <- hierfly(p_clw, p_hcw, scale=FALSE)\np_s_hfly <- hierfly(p_cls, p_hcs, scale=FALSE)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to draw dendrograms\"}\n# Generate the dendrograms in 2D\np_hcw_dd <- dendro_data(p_hcw)\npw_dd <- ggplot() +\n geom_segment(data=p_hcw_dd$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=p_hcw_dd$labels, aes(x=x, y=y),\n alpha=0.8) +\n theme_dendro()\n\np_hcs_dd <- dendro_data(p_hcs)\nps_dd <- ggplot() +\n geom_segment(data=p_hcs_dd$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=p_hcs_dd$labels, aes(x=x, y=y),\n alpha=0.8) +\n theme_dendro()\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to create tours of dendrogram in data\"}\nload(\"data/penguins_tour_path.rda\")\nglyphs <- c(16, 46)\npchw <- glyphs[p_w_hfly$data$node+1]\npchs <- glyphs[p_s_hfly$data$node+1]\n\nanimate_xy(p_w_hfly$data[,1:4], \n #col=colw, \n tour_path = planned_tour(pt1),\n pch = pchw,\n edges=p_w_hfly$edges, \n axes=\"bottomleft\")\n\nanimate_xy(p_s_hfly$data[,1:4], \n #col=colw, \n tour_path = planned_tour(pt1),\n pch = pchs,\n edges=p_s_hfly$edges, \n axes=\"bottomleft\")\n\nrender_gif(p_w_hfly$data[,1:4], \n planned_tour(pt1),\n display_xy(half_range=0.9, \n pch = pchw,\n edges = p_w_hfly$edges,\n axes = \"off\"),\n gif_file=\"gifs/penguins_hflyw.gif\",\n frames=500,\n 
loop=FALSE)\n\nrender_gif(p_s_hfly$data[,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, \n pch = pchs,\n edges = p_s_hfly$edges,\n axes = \"off\"),\n gif_file=\"gifs/penguins_hflys.gif\",\n frames=500,\n loop=FALSE)\n\n# Show three cluster solutions\nclrs <- hcl.colors(3, \"Zissou 1\")\nw3_col <- clrs[p_w_hfly$data$cl[p_w_hfly$data$node == 0]]\nrender_gif(p_w_hfly$data[p_w_hfly$data$node == 0, 1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, \n col=w3_col,\n axes = \"off\"),\n gif_file=\"gifs/penguins_w3.gif\",\n frames=500,\n loop=FALSE)\n\ns3_col <- clrs[p_s_hfly$data$cl[p_w_hfly$data$node == 0]]\nrender_gif(p_s_hfly$data[p_w_hfly$data$node == 0,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, \n col=s3_col,\n axes = \"off\"),\n gif_file=\"gifs/penguins_s3.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n@fig-penguins-ddw and @fig-penguins-hfly-html show results for single linkage and Wards linkage clustering of the penguins data. @fig-penguins-ddw shows the 2D dendrograms, which are very different. Wards linkage produces a clearer indication of clusters, with a suggestion of three, or possibly four or five clusters. The dendrogram for single linkage suggests two clusters, and has the classical waterfall appearance that is often seen with this type of linkage. (If you look carefully, though, you will see it is actually a three cluster solution. At the very top of the dendrogram there is another branch connecting one observation to the other two clusters.)\n\n@fig-penguins-hfly-html (a) and (b) show the dendrograms in 4D overlaid on the data. The two are starkly different. The single linkage clustering is like pins pointing to (three) centres, with some long extra edges.\n\nPlots (c) and (d) show the three cluster solutions, with Wards linkage almost recovering the clusters of the three species. Single linkage has two big clusters, and the singleton cluster. 
Although Wards linkage produces the best result, single linkage does provide some interesting and useful information about the data. That singleton cluster is an outlier, an unusually-sized penguin. We can see it as an outlier just from the tour in @fig-penguins-gt-html but single linkage emphasizes it, bringing it more strongly to our attention. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![Wards linkage (left) and single linkage (right).](8-hierarchical_files/figure-html/fig-penguins-ddw-1.png){#fig-penguins-ddw width=672}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-hfly-html layout-ncol=2}\n![Wards linkage](gifs/penguins_hflyw.gif){#fig-penguins-hflyw fig-alt=\"Tour showing the dendrogram for Wards linkage clustering on the penguins data in 4D. You can see that it connects points within each clump and then connects between clusters.\"}\n\n![Single linkage](gifs/penguins_hflys.gif){#fig-penguins-hflys fig-alt=\"Tour showing the dendrogram for single linkage clustering on the penguins data in 4D. You can see that the connections are like asterisks, connecting towards the center of each clump and there are a couple of long connections between clusters.\"}\n\n![Wards linkage](gifs/penguins_w3.gif){#fig-penguins-w3}\n\n![Single linkage](gifs/penguins_s3.gif){#fig-penguins-s3}\n\nDendrograms for Wards and single linkage of the penguins data, shown in 2D (top) and in 4D (middle), and the three-cluster solution of each.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-penguins-hfly-pdf layout-ncol=2}\n\n![Wards linkage](images/penguins_hflyw_59.png){#fig-penguins-hflyw fig-alt=\"Tour showing the dendrogram for Wards linkage clustering on the penguins data in 4D. 
You can see that it connects points within each clump and then connects between clusters.\"}\n\n![Single linkage](images/penguins_hflys_59.png){#fig-penguins-hflys fig-alt=\"Tour showing the dendrogram for single linkage clustering on the penguins data in 4D. You can see that the connections are like asterisks, connecting towards the center of each clump and there are a couple of long connections between clusters.\"}\n\n![Wards linkage](images/penguins_w3_59.png){#fig-penguins-w3}\n\n![Single linkage](images/penguins_s3_59.png){#fig-penguins-s3}\n\nDendrograms for Wards and single linkage of the penguins data, shown in 2D (top) and in 4D (middle), and the three-cluster solution of each.\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nSingle linkage on the penguins has a very different joining pattern to Wards! While Wards provides the better result, single linkage provides useful information about the data, such as emphasizing the outlier.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{Single linkage on the penguins has a very different joining pattern to Wards! While Wards provides the better result, single linkage provides useful information about the data, such as emphasizing the outlier.}\n:::\n\n::: {.content-visible when-format=\"html\"}\n@fig-penguins-hfly-plotly provides HTML objects of the dendrograms, so that they can be directly compared. 
The same tour path is used, so the sliders allow setting the view to the same projection in each plot.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make html objects of the dendrogram in 4D\"}\nload(\"data/penguins_tour_path.rda\")\n# Create a smaller one, for space concerns\npt1i <- interpolate(pt1[,,1:5], 0.1)\npw_anim <- render_anim(p_w_hfly$data,\n vars=1:4,\n frames=pt1i, \n edges = p_w_hfly$edges,\n obs_labels=paste0(1:nrow(p_w_hfly$data),\n p_w_hfly$data$cl))\n\npw_gp <- ggplot() +\n geom_segment(data=pw_anim$edges, \n aes(x=x, xend=xend,\n y=y, yend=yend,\n frame=frame)) +\n geom_point(data=pw_anim$frames, \n aes(x=P1, y=P2, \n frame=frame, \n shape=factor(node),\n label=obs_labels), \n alpha=0.8, size=1) +\n xlim(-1,1) + ylim(-1,1) +\n scale_shape_manual(values=c(16, 46)) +\n coord_equal() +\n theme_bw() +\n theme(legend.position=\"none\", \n axis.text=element_blank(),\n axis.title=element_blank(),\n axis.ticks=element_blank(),\n panel.grid=element_blank())\n\npwg <- ggplotly(pw_gp, width=450, height=500,\n tooltip=\"label\") %>%\n animation_button(label=\"Go\") %>%\n animation_slider(len=0.8, x=0.5,\n xanchor=\"center\") %>%\n animation_opts(easing=\"linear\", transition = 0)\nhtmlwidgets::saveWidget(pwg,\n file=\"html/penguins_cl_ward.html\",\n selfcontained = TRUE)\n\n# Single\nps_anim <- render_anim(p_s_hfly$data, vars=1:4,\n frames=pt1i, \n edges = p_s_hfly$edges,\n obs_labels=paste0(1:nrow(p_s_hfly$data),\n p_s_hfly$data$cl))\n\nps_gp <- ggplot() +\n geom_segment(data=ps_anim$edges, \n aes(x=x, xend=xend,\n y=y, yend=yend,\n frame=frame)) +\n geom_point(data=ps_anim$frames, \n aes(x=P1, y=P2, \n frame=frame, \n shape=factor(node),\n label=obs_labels), \n alpha=0.8, size=1) +\n xlim(-1,1) + ylim(-1,1) +\n scale_shape_manual(values=c(16, 46)) +\n coord_equal() +\n theme_bw() +\n theme(legend.position=\"none\", \n axis.text=element_blank(),\n axis.title=element_blank(),\n axis.ticks=element_blank(),\n 
panel.grid=element_blank())\n\npsg <- ggplotly(ps_gp, width=450, height=500,\n tooltip=\"label\") %>%\n animation_button(label=\"Go\") %>%\n animation_slider(len=0.8, x=0.5,\n xanchor=\"center\") %>%\n animation_opts(easing=\"linear\", transition = 0)\nhtmlwidgets::saveWidget(psg,\n file=\"html/penguins_cl_single.html\",\n selfcontained = TRUE)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-hfly-plotly fig-align=\"center\"}\n\n\n\n\n\nAnimation of dendrogram from Wards (top) and single (bottom) linkage clustering of the penguins data.\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nViewing the dendrograms in high dimensions provides insight into how the observations have been joined into clusters. For example, single linkage often has edges leading to a single focal point, which might not yield a useful clustering but might help to identify outliers. If the edges point to multiple focal points, with long edges bridging gaps in the data, the result is more likely to yield a useful clustering.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Viewing the dendrograms in high dimensions provides insight into how the observations have been joined into clusters. For example, single linkage often has edges leading to a single focal point, which might not yield a useful clustering but might help to identify outliers. If the edges point to multiple focal points, with long edges bridging gaps in the data, the result is more likely to yield a useful clustering.}\n:::\n\n## Exercises {-}\n\n1. Compute complete linkage clustering for the **nuisance observations** data set. Does it perform more similarly to single linkage or Wards linkage?\n2. Compute single linkage clustering for the **nuisance variables** data. Does it perform more similarly to complete linkage or Wards linkage?\n3. 
Use hierarchical clustering with Euclidean distance and Wards linkage to split the `clusters_nonlin` data into four clusters. Look at the dendrogram in 2D and 4D. In 4D you can also include the cluster assignment as colour. Does this look like a good solution?\n4. Repeat the same exercise using single linkage instead of Wards linkage. How does this solution compare to what we have found with Wards linkage? Does the solution match how you would cluster the data in a spin-and-brush approach?\n5. Argue why single linkage might not perform well for the `fake_trees` data. Which method do you think will work best with this data? Conduct hierarchical clustering with your choice of linkage method. Does the 2D dendrogram suggest 10 clusters for the 10 branches? Take a look at the high-dimensional representation of the dendrogram. Has your chosen method captured the branches well or not? Explain what you think worked well or poorly.\n6. What would a useful clustering of the first four PCs of the `aflw` data be? What linkage method would you expect to work best to cluster it this way? Conduct the clustering. Examine the 2D dendrogram and decide on how many clusters should be used. Examine the cluster solution using a tour with points coloured by cluster. \n7. Based on your assessment of the cluster structure in the challenge data sets, `c1`-`c7`, from the `mulgar` package, which linkage method would you recommend? 
Use your suggested linkage method to cluster each data set, and summarise how well it performed in detecting the clusters that you have seen.\n\n\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "8-hierarchical_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/8-hierarchical/execute-results/tex.json b/_freeze/8-hierarchical/execute-results/tex.json index 76fcd51..ac3eed7 100644 --- a/_freeze/8-hierarchical/execute-results/tex.json +++ b/_freeze/8-hierarchical/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "e4c2ece62891404e37e30d61625cf5a9", + "hash": "cce50506e568ee8fb9c7dd16bdad8ab2", "result": { "engine": "knitr", - "markdown": "# Hierarchical clustering {#sec-hclust}\n\n\\index{cluster analysis!algorithms}\n\n\\index{cluster analysis!hierarchical} \\index{cluster analysis!intercluster distance (linkage)}\n\n## Overview\n\nHierarchical cluster algorithms sequentially fuse neighboring points to form ever-larger clusters, starting from a full interpoint distance matrix. *Distance between clusters* is described by a \"linkage method\", of which there are many. For example, single linkage measures the distance between clusters by the smallest interpoint distance between the members of the two clusters, complete linkage uses the maximum interpoint distance, and average linkage uses the average of the interpoint distances. Wards linkage, which usually produces the best clustering solutions, defines the distance as the reduction in the within-group variance. 
A good discussion on cluster analysis and linkage can be found in @HOML, on [Wikipedia](https://en.wikipedia.org/wiki/Cluster_analysis) or any multivariate textbook.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nHierarchical clustering is summarised by a dendrogram, which sequentially shows points being joined to form a cluster, with the corresponding distances. Breaking the data into clusters is done by cutting the dendrogram at the long edges. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Hierarchical clustering is summarised by a dendrogram, which sequentially shows points being joined to form a cluster, with the corresponding distances. Breaking the data into clusters is done by cutting the dendrogram at the long edges.}\n:::\n\n\nHere we will take a look at hierarchical clustering, using Wards linkage, on the `simple_clusters` data. The steps taken are to:\n\n1. Plot the data to check for presence of clusters and their shape.\n2. Compute the hierarchical clustering.\n3. Plot the dendrogram to help decide on an appropriate number of clusters, using the `dendro_data` function from the `ggdendro` package.\n4. Show the dendrogram overlaid on the data, calculated by the `hierfly` function in `mulgar`.\n5. 
Plot the clustering result, by colouring points in the plot of the data.\n\n\\index{cluster analysis!dendrogram}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Load libraries\"}\nlibrary(ggplot2)\nlibrary(mulgar)\nlibrary(ggdendro)\nlibrary(dplyr)\nlibrary(patchwork)\nlibrary(tourr)\nlibrary(plotly)\nlibrary(htmlwidgets)\nlibrary(colorspace)\nlibrary(GGally)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndata(simple_clusters)\n\n# Compute hierarchical clustering with Ward's linkage\ncl_hw <- hclust(dist(simple_clusters[,1:2]),\n method=\"ward.D2\")\ncl_ggd <- dendro_data(cl_hw, type = \"triangle\")\n\n# Compute dendrogram in the data\ncl_hfly <- hierfly(simple_clusters, cl_hw, scale=FALSE)\n\n# Show result\nsimple_clusters <- simple_clusters %>%\n mutate(clw = factor(cutree(cl_hw, 2)))\n```\n:::\n\n\n\n@fig-hc-sim illustrates the hierarchical clustering approach for a simple simulated data set (a) with two well-separated clusters in 2D. The dendrogram (b) is a representation of the order in which points are joined into clusters. The dendrogram strongly indicates two clusters because the two branches representing the last join are much longer than all of the other branches. \n\nAlthough the dendrogram is usually a good summary of the steps taken by the algorithm, it can be misleading. The dendrogram might indicate a clear clustering (big differences in heights of branches) but the result may be awful. You need to check this by examining the result on the data, called model-in-the-data space by @wickham2015. \n\nPlot (c) shows the dendrogram in 2D, overlaid on the data. The segments show how the points are joined to make clusters. In order to represent the dendrogram this way, new points (represented by a \"+\" here) need to be added corresponding to the centroid of groups of points that have been joined. These are used to draw the segments between other points and other clusters. 
We can see that the two longest edges stretch across the gap between the two clusters. This corresponds to the top of the dendrogram, the two long branches where we would cut it to make the two-cluster solution. This two-cluster solution is shown in plot (d).\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make the four plots\"}\n# Plot the data\npd <- ggplot(simple_clusters, aes(x=x1, y=x2)) +\n geom_point(colour=\"#3B99B1\", size=2, alpha=0.8) +\n ggtitle(\"(a)\") + \n theme_minimal() +\n theme(aspect.ratio=1) \n\n# Plot the dendrogram\nph <- ggplot() +\n geom_segment(data=cl_ggd$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=cl_ggd$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(b)\") + \n theme_minimal() +\n theme_dendro()\n\n# Plot the dendrogram on the data\npdh <- ggplot() +\n geom_segment(data=cl_hfly$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=cl_hfly$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n xlab(\"x1\") + ylab(\"x2\") +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(c)\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Plot the resulting clusters\npc <- ggplot(simple_clusters) +\n geom_point(aes(x=x1, y=x2, colour=clw), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=5, rev=TRUE) +\n ggtitle(\"(d)\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\npd + ph + pdh + pc + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Hierarchical clustering on simulated data: (a) data, (b) dendrogram, (c) dendrogram on the data, and (d) two-cluster solution. The extra points corresponding to nodes of the dendrogram are indicated by + in (c). 
The last join in the dendrogram (b) can be seen to correspond to the edges connecting the gap, when displayed with the data (c). The other joins can be seen to be pulling together points within each clump.](8-hierarchical_files/figure-pdf/fig-hc-sim-1.pdf){#fig-hc-sim fig-pos='H' width=100%}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nPlotting the dendrogram in the data space can help you understand how the hierarchical clustering has collected the points together into clusters. You can learn if the algorithm has been confused by nuisance patterns in the data, and how different choices of linkage method affect the result. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Plotting the dendrogram in the data space can help you understand how the hierarchical clustering has collected the points together into clusters. You can learn if the algorithm has been confused by nuisance patterns in the data, and how different choices of linkage method affect the result.}\n:::\n\n## Common patterns which confuse clustering algorithms\n\n@fig-problems shows two examples of structure in data that will confuse hierarchical clustering: nuisance variables and nuisance cases. We usually do not know that these problems exist prior to clustering the data. Discovering these iteratively as you conduct a clustering analysis is important for generating useful results. 
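To make the linkage definitions from the overview concrete, here is a minimal sketch (the five-point toy coordinates are made up for illustration) of how the linkage method is specified in base R's `hclust()`, and how `cutree()` breaks the dendrogram into a fixed number of clusters:

```r
# Toy data: two tight pairs of points, with an inlier sitting between them
x <- data.frame(x1 = c(0, 0.5, 4, 4.5, 2.2),
                x2 = c(0, 0.2, 0.1, 0.3, 0.1))
d <- dist(x)  # full interpoint distance matrix

# The same distances, fused under three different linkage methods,
# each cut into a two-cluster solution
for (m in c("single", "complete", "ward.D2")) {
  cl <- cutree(hclust(d, method = m), k = 2)
  cat(m, ":", cl, "\n")
}
```

Comparing the cluster labels from each linkage method on your own data, alongside the dendrogram drawn in the data space, is a quick way to see when the choice of linkage matters.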
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plots\"}\n# Nuisance observations\nset.seed(20190514)\nx <- (runif(20)-0.5)*4\ny <- x\nd1 <- data.frame(x1 = c(rnorm(50, -3), \n rnorm(50, 3), x),\n x2 = c(rnorm(50, -3), \n rnorm(50, 3), y),\n cl = factor(c(rep(\"A\", 50), \n rep(\"B\", 70))))\nd1 <- d1 %>% \n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\npd1 <- ggplot(data=d1, aes(x=x1, y=x2)) + \n geom_point() +\n ggtitle(\"Nuisance observations\") + \n theme_minimal() +\n theme(aspect.ratio=1) \n\n# Nuisance variables\nset.seed(20190512)\nd2 <- data.frame(x1=c(rnorm(50, -4), \n rnorm(50, 4)),\n x2=c(rnorm(100)),\n cl = factor(c(rep(\"A\", 50), \n rep(\"B\", 50))))\nd2 <- d2 %>% \n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\npd2 <- ggplot(data=d2, aes(x=x1, y=x2)) + \n geom_point() +\n ggtitle(\"Nuisance variables\") + \n theme_minimal() +\n theme(aspect.ratio=1)\n\npd1 + pd2 + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Two examples of data structure that causes problems for hierarchical clustering. Nuisance observations can cause problems because the close observations between the two clusters can cause some chaining in the hierarchical joining of observations. Nuisance variables can cause problems because observations across the gap can seem closer than observations at the end of each cluster.](8-hierarchical_files/figure-pdf/fig-problems-1.pdf){#fig-problems fig-pos='H' width=80%}\n:::\n:::\n\n\n\nIf an outlier is a point that is extreme relative to other observations, an \"inlier\" is a point that is extreme relative to a cluster, but inside the domain of all of the observations. Nuisance observations are inliers, cases that occur between larger groups of points. If they were excluded there might be a gap between clusters. These can cause problems for clustering when distances between clusters are measured, and can be very problematic when single linkage hierarchical clustering is used. 
@fig-d1-s shows how nuisance observations affect single linkage but not Wards linkage hierarchical clustering.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plots\"}\n# Compute single linkage\nd1_hs <- hclust(dist(d1[,1:2]),\n method=\"single\")\nd1_ggds <- dendro_data(d1_hs, type = \"triangle\")\npd1s <- ggplot() +\n geom_segment(data=d1_ggds$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d1_ggds$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n theme_minimal() +\n ggtitle(\"(a) Single linkage dendrogram\") +\n theme_dendro()\n\n# Compute dendrogram in data\nd1_hflys <- hierfly(d1, d1_hs, scale=FALSE)\n\npd1hs <- ggplot() +\n geom_segment(data=d1_hflys$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d1_hflys$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(b) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd1 <- d1 %>%\n mutate(cls = factor(cutree(d1_hs, 2)))\npc_d1s <- ggplot(d1) +\n geom_point(aes(x=x1, y=x2, colour=cls), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(c) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Compute Wards linkage\nd1_hw <- hclust(dist(d1[,1:2]),\n method=\"ward.D2\")\nd1_ggdw <- dendro_data(d1_hw, type = \"triangle\")\npd1w <- ggplot() +\n geom_segment(data=d1_ggdw$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d1_ggdw$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(d) Ward's linkage dendrogram\") +\n theme_minimal() +\n theme_dendro()\n\n# Compute dendrogram in data\nd1_hflyw <- hierfly(d1, d1_hw, 
scale=FALSE)\n\npd1hw <- ggplot() +\n geom_segment(data=d1_hflyw$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d1_hflyw$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(e) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd1 <- d1 %>%\n mutate(clw = factor(cutree(d1_hw, 2)))\npc_d1w <- ggplot(d1) +\n geom_point(aes(x=x1, y=x2, colour=clw), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(f) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\npd1s + pd1hs + pc_d1s + \n pd1w + pd1hw + pc_d1w +\n plot_layout(ncol=3)\n```\n\n::: {.cell-output-display}\n![The effect of nuisance observations on single linkage (a, b, c) and Ward's linkage hierarchical clustering (d, e, f). The single linkage dendrogram is very different to the Wards linkage dendrogram. When plotted with the data (b) we can see a pin cushion or asterisk pattern, where points are joined to others through a place in the middle of the line of nuisance observations. This results in a bad two-cluster solution: a singleton cluster, and all the rest. Conversely, Ward's dendrogram (d) strongly suggests two clusters; although the final join corresponds to just a small gap when shown on the data (e), it results in two sensible clusters.](8-hierarchical_files/figure-pdf/fig-d1-s-1.pdf){#fig-d1-s fig-pos='H' width=80%}\n:::\n:::\n\n\n\nNuisance variables are ones that do not contribute to the clustering, such as `x2` here. When we look at this data we see a gap between two elliptically shaped clusters, with the gap being only in the horizontal direction, `x1`. 
When we compute the distances between points, in order to start clustering, without knowing that `x2` is a nuisance variable, points across the gap might be considered to be closer than points within the same cluster. @fig-d2-c shows how nuisance variables affect complete linkage but not Wards linkage hierarchical clustering. (Wards linkage can be affected but it isn't for this data.) Interestingly, the dendrogram for complete linkage looks ideal, in that it suggests two clusters. It is not until you examine the resulting clusters in the data that you can see the error: it has clustered across the gap.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Compute complete linkage\nd2_hc <- hclust(dist(d2[,1:2]),\n method=\"complete\")\nd2_ggdc <- dendro_data(d2_hc, type = \"triangle\")\npd2c <- ggplot() +\n geom_segment(data=d2_ggdc$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d2_ggdc$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(a) Complete linkage dendrogram\") +\n theme_minimal() +\n theme_dendro()\n\n# Compute dendrogram in data\nd2_hflyc <- hierfly(d2, d2_hc, scale=FALSE)\n\npd2hc <- ggplot() +\n geom_segment(data=d2_hflyc$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d2_hflyc$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(b) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd2 <- d2 %>%\n mutate(clc = factor(cutree(d2_hc, 2)))\npc_d2c <- ggplot(d2) +\n geom_point(aes(x=x1, y=x2, colour=clc), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(c) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Compute 
Wards linkage\nd2_hw <- hclust(dist(d2[,1:2]),\n method=\"ward.D2\")\nd2_ggdw <- dendro_data(d2_hw, type = \"triangle\")\npd2w <- ggplot() +\n geom_segment(data=d2_ggdw$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d2_ggdw$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(d) Ward's linkage dendrogram\") +\n theme_minimal() +\n theme_dendro()\n\n# Compute dendrogram in data\nd2_hflyw <- hierfly(d2, d2_hw, scale=FALSE)\n\npd2hw <- ggplot() +\n geom_segment(data=d2_hflyw$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d2_hflyw$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(e) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd2 <- d2 %>%\n mutate(clw = factor(cutree(d2_hw, 2)))\npc_d2w <- ggplot(d2) +\n geom_point(aes(x=x1, y=x2, colour=clw), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(f) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\npd2c + pd2hc + pc_d2c + \n pd2w + pd2hw + pc_d2w +\n plot_layout(ncol=3)\n```\n\n::: {.cell-output-display}\n![Complete linkage clustering (a, b, c) on nuisance variables in comparison to Ward's linkage (d, e, f). The two dendrograms (a, d) look similar but when plotted on the data (b, e) we can see they are very different solutions. The complete linkage result breaks the data into clusters across the gap (c), which is a bad solution. It has been distracted by the nuisance variables. 
Conversely, the Wards linkage two-cluster solution does as hoped, dividing the data into two clusters separated by the gap (f).](8-hierarchical_files/figure-pdf/fig-d2-c-1.pdf){#fig-d2-c fig-pos='H' width=80%}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nTwo dendrograms might look similar but the resulting clustering can be very different. They can also look very different but correspond to very similar clusterings. Plotting the dendrogram in the data space is important for understanding how the algorithm operated when grouping observations, even more so for high dimensions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Two dendrograms might look similar but the resulting clustering can be very different. They can also look very different but correspond to very similar clusterings. Plotting the dendrogram in the data space is important for understanding how the algorithm operated when grouping observations, even more so for high dimensions.}\n:::\n\n## Dendrograms in high-dimensions\n\nThe first step in any clustering of high-dimensional data is also to check the data. You typically don't know whether there are clusters, or what shape they might be, or if there are nuisance observations or variables. A pairs plot like in @fig-penguins-pairs is a nice complement to using the tour (@fig-penguins-gt-pdf) for this. Here you can see three elliptical clusters, with one further from the others.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for scatterplot matrix\"}\nload(\"data/penguins_sub.rda\")\nggscatmat(penguins_sub[,1:4]) + \n theme_minimal() +\n xlab(\"\") + ylab(\"\")\n```\n\n::: {.cell-output-display}\n![Make a scatterplot matrix to check for the presence of clustering, shape of clusters and presence of nuisance observations and variables. 
In the penguins it appears that there might be three elliptically shaped clusters, with some nuisance observations.](8-hierarchical_files/figure-pdf/fig-penguins-pairs-1.pdf){#fig-penguins-pairs fig-pos='H' width=80%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to create tour\"}\nset.seed(20230329)\nb <- basis_random(4,2)\npt1 <- save_history(penguins_sub[,1:4], \n max_bases = 500, \n start = b)\nsave(pt1, file=\"data/penguins_tour_path.rda\")\n\n# To re-create the gifs\nload(\"data/penguins_tour_path.rda\")\nanimate_xy(penguins_sub[,1:4], \n tour_path = planned_tour(pt1), \n axes=\"off\", rescale=FALSE, \n half_range = 3.5)\n\nrender_gif(penguins_sub[,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, axes=\"off\"),\n gif_file=\"gifs/penguins_gt.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-gt-html}\n\n![](gifs/penguins_gt.gif){fig-alt=\"Tour of many linear projections of the penguins data. You can see three elliptical clusters, one further apart from the other two.\" fig.align=\"center\"}\n\nUse a grand tour of your data to check for clusters, the shape of clusters and for nuisance observations and variables. Here the penguins data looks like it has possibly three elliptical clusters, one more separated than the other two, with some nuisance observations.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n![One frame from a grand tour being used to check for clusters, the shape of clusters and for nuisance observations and variables. Here the penguins data looks like it has possibly three elliptical clusters, one more separated than the other two, with some nuisance observations.](images/penguins_gt_59.png){#fig-penguins-gt-pdf fig-alt=\"A scatterplot of a 2D projection. You can see three elliptical clusters, one further apart from the other two.\" fig.align=\"center\"}\n:::\n\n\nThe process is the same as for the simpler example. 
We compute and draw the dendrogram in 2D, compute it in $p$-D and view with a tour. Here we have also chosen to examine the three-cluster solution for single linkage and Wards linkage clustering.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\np_dist <- dist(penguins_sub[,1:4])\np_hcw <- hclust(p_dist, method=\"ward.D2\")\np_hcs <- hclust(p_dist, method=\"single\")\n\np_clw <- penguins_sub %>% \n mutate(cl = factor(cutree(p_hcw, 3))) %>%\n as.data.frame()\np_cls <- penguins_sub %>% \n mutate(cl = factor(cutree(p_hcs, 3))) %>%\n as.data.frame()\n\np_w_hfly <- hierfly(p_clw, p_hcw, scale=FALSE)\np_s_hfly <- hierfly(p_cls, p_hcs, scale=FALSE)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to draw dendrograms\"}\n# Generate the dendrograms in 2D\np_hcw_dd <- dendro_data(p_hcw)\npw_dd <- ggplot() +\n geom_segment(data=p_hcw_dd$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=p_hcw_dd$labels, aes(x=x, y=y),\n alpha=0.8) +\n theme_dendro()\n\np_hcs_dd <- dendro_data(p_hcs)\nps_dd <- ggplot() +\n geom_segment(data=p_hcs_dd$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=p_hcs_dd$labels, aes(x=x, y=y),\n alpha=0.8) +\n theme_dendro()\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to create tours of dendrogram in data\"}\nload(\"data/penguins_tour_path.rda\")\nglyphs <- c(16, 46)\npchw <- glyphs[p_w_hfly$data$node+1]\npchs <- glyphs[p_s_hfly$data$node+1]\n\nanimate_xy(p_w_hfly$data[,1:4], \n #col=colw, \n tour_path = planned_tour(pt1),\n pch = pchw,\n edges=p_w_hfly$edges, \n axes=\"bottomleft\")\n\nanimate_xy(p_s_hfly$data[,1:4], \n #col=colw, \n tour_path = planned_tour(pt1),\n pch = pchs,\n edges=p_s_hfly$edges, \n axes=\"bottomleft\")\n\nrender_gif(p_w_hfly$data[,1:4], \n planned_tour(pt1),\n display_xy(half_range=0.9, \n pch = pchw,\n edges = p_w_hfly$edges,\n axes = \"off\"),\n gif_file=\"gifs/penguins_hflyw.gif\",\n frames=500,\n 
loop=FALSE)\n\nrender_gif(p_s_hfly$data[,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, \n pch = pchs,\n edges = p_s_hfly$edges,\n axes = \"off\"),\n gif_file=\"gifs/penguins_hflys.gif\",\n frames=500,\n loop=FALSE)\n\n# Show three cluster solutions\nclrs <- hcl.colors(3, \"Zissou 1\")\nw3_col <- clrs[p_w_hfly$data$cl[p_w_hfly$data$node == 0]]\nrender_gif(p_w_hfly$data[p_w_hfly$data$node == 0, 1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, \n col=w3_col,\n axes = \"off\"),\n gif_file=\"gifs/penguins_w3.gif\",\n frames=500,\n loop=FALSE)\n\ns3_col <- clrs[p_s_hfly$data$cl[p_s_hfly$data$node == 0]]\nrender_gif(p_s_hfly$data[p_s_hfly$data$node == 0,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, \n col=s3_col,\n axes = \"off\"),\n gif_file=\"gifs/penguins_s3.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n\n@fig-penguins-ddw and @fig-penguins-hfly-pdf show results for single linkage and Wards linkage clustering of the penguins data. The 2D dendrograms are very different. Wards linkage produces a clearer indication of clusters, with a suggestion of three, or possibly four or five clusters. The dendrogram for single linkage suggests two clusters, and has the classical waterfall appearance that is often seen with this type of linkage. (If you look carefully, though, you will see it is actually a three-cluster solution. At the very top of the dendrogram there is another branch connecting one observation to the other two clusters.)\n\n@fig-penguins-hfly-pdf (a) and (b) show the dendrograms in 4D overlaid on the data. The two are starkly different. The single linkage clustering is like pins pointing to (three) centres, with some long extra edges.\n\nPlots (c) and (d) show the three-cluster solutions, with Wards linkage almost recovering the clusters of the three species. Single linkage has two big clusters, and the singleton cluster. 
Although the Wards linkage produces the best result, single linkage does provide some interesting and useful information about the data. That singleton cluster is an outlier, an unusually sized penguin. We can see it as an outlier just from the tour in @fig-penguins-gt-pdf but single linkage emphasizes it, bringing it more strongly to our attention. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Wards linkage (left) and single linkage (right).](8-hierarchical_files/figure-pdf/fig-penguins-ddw-1.pdf){#fig-penguins-ddw width=80%}\n:::\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-hfly-html layout-ncol=2}\n![Wards linkage](gifs/penguins_hflyw.gif){#fig-penguins-hflyw fig-alt=\"Tour showing the dendrogram for Wards linkage clustering on the penguins data in 4D. You can see that it connects points within each clump and then connects between clusters.\"}\n\n![Single linkage](gifs/penguins_hflys.gif){#fig-penguins-hflys fig-alt=\"Tour showing the dendrogram for single linkage clustering on the penguins data in 4D. You can see that the connections are like asterisks, connecting towards the center of each clump and there are a couple of long connections between clusters.\"}\n\n![Wards linkage](gifs/penguins_w3.gif){#fig-penguins-w3}\n\n![Single linkage](gifs/penguins_s3.gif){#fig-penguins-s3}\n\nDendrograms for Wards and single linkage of the penguins data, shown in 2D (top) and in 4D (middle), and the three-cluster solution of each.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-penguins-hfly-pdf layout-ncol=2}\n\n![Wards linkage](images/penguins_hflyw_59.png){#fig-penguins-hflyw fig-alt=\"Tour showing the dendrogram for Wards linkage clustering on the penguins data in 4D. 
You can see that it connects points within each clump and then connects between clusters.\"}\n\n![Single linkage](images/penguins_hflys_59.png){#fig-penguins-hflys fig-alt=\"Tour showing the dendrogram for single linkage clustering on the penguins data in 4D. You can see that the connections are like asterisks, connecting towards the center of each clump and there are a couple of long connections between clusters.\"}\n\n![Wards linkage](images/penguins_w3_59.png){#fig-penguins-w3}\n\n![Single linkage](images/penguins_s3_59.png){#fig-penguins-s3}\n\nDendrograms for Wards and single linkage of the penguins data, shown in 2D (top) and in 4D (middle), and the three-cluster solution of each.\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nSingle linkage on the penguins has a very different joining pattern to Wards! While Wards provides the better result, single linkage provides useful information about the data, such as emphasizing the outlier.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{Single linkage on the penguins has a very different joining pattern to Wards! While Wards provides the better result, single linkage provides useful information about the data, such as emphasizing the outlier.}\n:::\n\n::: {.content-visible when-format=\"html\"}\n@fig-penguins-hfly-plotly provides HTML objects of the dendrograms, so that they can be directly compared. 
The same tour path is used, so the sliders allow setting the view to the same projection in each plot.\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make html objects of the dendrogram in 4D\"}\nload(\"data/penguins_tour_path.rda\")\n# Create a smaller one, for space concerns\npt1i <- interpolate(pt1[,,1:5], 0.1)\npw_anim <- render_anim(p_w_hfly$data,\n vars=1:4,\n frames=pt1i, \n edges = p_w_hfly$edges,\n obs_labels=paste0(1:nrow(p_w_hfly$data),\n p_w_hfly$data$cl))\n\npw_gp <- ggplot() +\n geom_segment(data=pw_anim$edges, \n aes(x=x, xend=xend,\n y=y, yend=yend,\n frame=frame)) +\n geom_point(data=pw_anim$frames, \n aes(x=P1, y=P2, \n frame=frame, \n shape=factor(node),\n label=obs_labels), \n alpha=0.8, size=1) +\n xlim(-1,1) + ylim(-1,1) +\n scale_shape_manual(values=c(16, 46)) +\n coord_equal() +\n theme_bw() +\n theme(legend.position=\"none\", \n axis.text=element_blank(),\n axis.title=element_blank(),\n axis.ticks=element_blank(),\n panel.grid=element_blank())\n\npwg <- ggplotly(pw_gp, width=450, height=500,\n tooltip=\"label\") %>%\n animation_button(label=\"Go\") %>%\n animation_slider(len=0.8, x=0.5,\n xanchor=\"center\") %>%\n animation_opts(easing=\"linear\", transition = 0)\nhtmlwidgets::saveWidget(pwg,\n file=\"html/penguins_cl_ward.html\",\n selfcontained = TRUE)\n\n# Single\nps_anim <- render_anim(p_s_hfly$data, vars=1:4,\n frames=pt1i, \n edges = p_s_hfly$edges,\n obs_labels=paste0(1:nrow(p_s_hfly$data),\n p_s_hfly$data$cl))\n\nps_gp <- ggplot() +\n geom_segment(data=ps_anim$edges, \n aes(x=x, xend=xend,\n y=y, yend=yend,\n frame=frame)) +\n geom_point(data=ps_anim$frames, \n aes(x=P1, y=P2, \n frame=frame, \n shape=factor(node),\n label=obs_labels), \n alpha=0.8, size=1) +\n xlim(-1,1) + ylim(-1,1) +\n scale_shape_manual(values=c(16, 46)) +\n coord_equal() +\n theme_bw() +\n theme(legend.position=\"none\", \n axis.text=element_blank(),\n axis.title=element_blank(),\n axis.ticks=element_blank(),\n 
panel.grid=element_blank())\n\npsg <- ggplotly(ps_gp, width=450, height=500,\n tooltip=\"label\") %>%\n animation_button(label=\"Go\") %>%\n animation_slider(len=0.8, x=0.5,\n xanchor=\"center\") %>%\n animation_opts(easing=\"linear\", transition = 0)\nhtmlwidgets::saveWidget(psg,\n file=\"html/penguins_cl_single.html\",\n selfcontained = TRUE)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-hfly-plotly fig-align=\"center\"}\n\n\n\n\n\nAnimation of dendrogram from Wards (top) and single (bottom) linkage clustering of the penguins data.\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nViewing the dendrograms in high dimensions provides insight into how the observations have been joined into clusters. For example, single linkage often has edges leading to a single focal point, which might not yield a useful clustering but might help to identify outliers. If the edges point to multiple focal points, with long edges bridging gaps in the data, the result is more likely to yield a useful clustering.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Viewing the dendrograms in high dimensions provides insight into how the observations have been joined into clusters. For example, single linkage often has edges leading to a single focal point, which might not yield a useful clustering but might help to identify outliers. If the edges point to multiple focal points, with long edges bridging gaps in the data, the result is more likely to yield a useful clustering.}\n:::\n\n## Exercises {-}\n\n1. Compute complete linkage clustering for the **nuisance observations** data set. Does it perform more similarly to single linkage or Wards linkage?\n2. Compute single linkage clustering for the **nuisance variables** data. Does it perform more similarly to complete linkage or Wards linkage?\n3. 
Use hierarchical clustering with Euclidean distance and Wards linkage to split the `clusters_nonlin` data into four clusters. Look at the dendrogram in 2D and 4D. In 4D you can also include the cluster assignment as color. Does this look like a good solution?\n4. Repeat the same exercise using single linkage instead of Wards linkage. How does this solution compare to what we have found with Wards linkage? Does the solution match how you would cluster the data in a spin-and-brush approach?\n5. Argue why single linkage might not perform well for the `fake_trees` data. Which method do you think will work best with this data? Conduct hierarchical clustering with your choice of linkage method. Does the 2D dendrogram suggest 10 clusters for the 10 branches? Take a look at the high-dimensional representation of the dendrogram. Has your chosen method captured the branches well or not? Explain what you think worked well or poorly.\n6. What would a useful clustering of the first four PCs of the `aflw` data be? What linkage method would you expect to work best to cluster it this way? Conduct the clustering. Examine the 2D dendrogram and decide on how many clusters should be used. Examine the cluster solution using a tour with points coloured by cluster. \n7. Based on your assessment of the cluster structure in the challenge data sets, `c1`-`c7`, from the `mulgar` package, which linkage method would you recommend? Use your suggested linkage method to cluster each data set, and summarise how well it performed in detecting the clusters that you have seen.\n\n\n\n\n::: {.cell}\n\n:::\n", + "markdown": "# Hierarchical clustering {#sec-hclust}\n\n\\index{cluster analysis!algorithms}\n\n\\index{cluster analysis!hierarchical} \\index{cluster analysis!intercluster distance (linkage)}\n\n## Overview\n\nHierarchical cluster algorithms sequentially fuse neighboring points to form ever-larger clusters, starting from a full interpoint distance matrix. 
*Distance between clusters* is described by a \"linkage method\", of which there are many. For example, single linkage measures the distance between clusters by the smallest interpoint distance between the members of the two clusters, complete linkage uses the maximum interpoint distance, and average linkage uses the average of the interpoint distances. Wards linkage, which usually produces the best clustering solutions, defines the distance as the reduction in the within-group variance. A good discussion on cluster analysis and linkage can be found in @HOML, on [Wikipedia](https://en.wikipedia.org/wiki/Cluster_analysis) or any multivariate textbook.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nHierarchical clustering is summarised by a dendrogram, which sequentially shows points being joined to form a cluster, with the corresponding distances. Breaking the data into clusters is done by cutting the dendrogram at the long edges. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Hierarchical clustering is summarised by a dendrogram, which sequentially shows points being joined to form a cluster, with the corresponding distances. Breaking the data into clusters is done by cutting the dendrogram at the long edges.}\n:::\n\n\nHere we will take a look at hierarchical clustering, using Wards linkage, on the `simple_clusters` data. The steps taken are to:\n\n1. Plot the data to check for presence of clusters and their shape.\n2. Compute the hierarchical clustering.\n3. Plot the dendrogram to help decide on an appropriate number of clusters, using the `dendro_data()` function from the `ggdendro` package.\n4. Show the dendrogram overlaid on the data, calculated by the `hierfly()` function in `mulgar`.\n5. 
Plot the clustering result, by colouring points in the plot of the data.\n\n\\index{cluster analysis!dendrogram}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Load libraries\"}\nlibrary(ggplot2)\nlibrary(mulgar)\nlibrary(ggdendro)\nlibrary(dplyr)\nlibrary(patchwork)\nlibrary(tourr)\nlibrary(plotly)\nlibrary(htmlwidgets)\nlibrary(colorspace)\nlibrary(GGally)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ndata(simple_clusters)\n\n# Compute hierarchical clustering with Ward's linkage\ncl_hw <- hclust(dist(simple_clusters[,1:2]),\n method=\"ward.D2\")\ncl_ggd <- dendro_data(cl_hw, type = \"triangle\")\n\n# Compute dendrogram in the data\ncl_hfly <- hierfly(simple_clusters, cl_hw, scale=FALSE)\n\n# Show result\nsimple_clusters <- simple_clusters %>%\n mutate(clw = factor(cutree(cl_hw, 2)))\n```\n:::\n\n\n\n@fig-hc-sim illustrates the hierarchical clustering approach for a simple simulated data set (a) with two well-separated clusters in 2D. The dendrogram (b) is a representation of the order that points are joined into clusters. The dendrogram strongly indicates two clusters because the two branches representing the last join are much longer than all of the other branches. \n\nAlthough, the dendrogram is usually a good summary of the steps taken by the algorithm, it can be misleading. The dendrogram might indicate a clear clustering (big differences in heights of branches) but the result may be awful. You need to check this by examining the result on the data, called model-in-the-data space by @wickham2015. \n\nPlot (c) shows the dendrogram in 2D, overlaid on the data. The segments show how the points are joined to make clusters. In order to represent the dendrogram this way, new points (represented by a \"+\" here) need to be added corresponding to the centroid of groups of points that have been joined. These are used to draw the segments between other points and other clusters. 
We can see that the two longest edges stretch across the gap between the two clusters. This corresponds to the top of the dendrogram, the two long branches where we would cut it to make the two-cluster solution. This two-cluster solution is shown in plot (d).\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make the four plots\"}\n# Plot the data\npd <- ggplot(simple_clusters, aes(x=x1, y=x2)) +\n geom_point(colour=\"#3B99B1\", size=2, alpha=0.8) +\n ggtitle(\"(a)\") + \n theme_minimal() +\n theme(aspect.ratio=1) \n\n# Plot the dendrogram\nph <- ggplot() +\n geom_segment(data=cl_ggd$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=cl_ggd$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(b)\") + \n theme_minimal() +\n theme_dendro()\n\n# Plot the dendrogram on the data\npdh <- ggplot() +\n geom_segment(data=cl_hfly$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=cl_hfly$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n xlab(\"x1\") + ylab(\"x2\") +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(c)\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Plot the resulting clusters\npc <- ggplot(simple_clusters) +\n geom_point(aes(x=x1, y=x2, colour=clw), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=5, rev=TRUE) +\n ggtitle(\"(d)\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\npd + ph + pdh + pc + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Hierarchical clustering on simulated data: (a) data, (b) dendrogram, (c) dendrogram on the data, and (d) two cluster solution. The extra points corresponding to nodes of the dendrogram are indicated by + in (c). 
The last join in the dendrogram (b) can be seen to correspond to the edges connecting the gap, when displayed with the data (c). The other joins can be seen to be pulling together points within each clump.](8-hierarchical_files/figure-pdf/fig-hc-sim-1.pdf){#fig-hc-sim fig-pos='H' width=100%}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nPlotting the dendrogram in the data space can help you understand how the hierarchical clustering has collected the points together into clusters. You can learn if the algorithm has been confused by nuisance patterns in the data, and how different choices of linkage method affect the result. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Plotting the dendrogram in the data space can help you understand how the hierarchical clustering has collected the points together into clusters. You can learn if the algorithm has been confused by nuisance patterns in the data, and how different choices of linkage method affect the result.}\n:::\n\n## Common patterns which confuse clustering algorithms\n\n@fig-problems shows two examples of structure in data that will confuse hierarchical clustering: nuisance variables and nuisance cases. We usually do not know that these problems exist prior to clustering the data. Discovering these iteratively as you conduct a clustering analysis is important for generating useful results. 
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plots\"}\n# Nuisance observations\nset.seed(20190514)\nx <- (runif(20)-0.5)*4\ny <- x\nd1 <- data.frame(x1 = c(rnorm(50, -3), \n rnorm(50, 3), x),\n x2 = c(rnorm(50, -3), \n rnorm(50, 3), y),\n cl = factor(c(rep(\"A\", 50), \n rep(\"B\", 70))))\nd1 <- d1 %>% \n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\npd1 <- ggplot(data=d1, aes(x=x1, y=x2)) + \n geom_point() +\n ggtitle(\"Nuisance observations\") + \n theme_minimal() +\n theme(aspect.ratio=1) \n\n# Nuisance variables\nset.seed(20190512)\nd2 <- data.frame(x1=c(rnorm(50, -4), \n rnorm(50, 4)),\n x2=c(rnorm(100)),\n cl = factor(c(rep(\"A\", 50), \n rep(\"B\", 50))))\nd2 <- d2 %>% \n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\npd2 <- ggplot(data=d2, aes(x=x1, y=x2)) + \n geom_point() +\n ggtitle(\"Nuisance variables\") + \n theme_minimal() +\n theme(aspect.ratio=1)\n\npd1 + pd2 + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Two examples of data structure that causes problems for hierarchical clustering. Nuisance observations can cause problems because the close observations between the two clusters can cause some chaining in the hierarchical joining of observations. Nuisance variables can cause problems because observations across the gap can seem closer than observations at the end of each cluster.](8-hierarchical_files/figure-pdf/fig-problems-1.pdf){#fig-problems fig-pos='H' width=80%}\n:::\n:::\n\n\n\nIf an outlier is a point that is extreme relative to other observations, an \"inlier\" is a point that is extreme relative to a cluster, but inside the domain of all of the observations. Nuisance observations are inliers, cases that occur between larger groups of points. If they were excluded there might be a gap between clusters. These can cause problems for clustering when distances between clusters are measured, and can be very problematic when single linkage hierarchical clustering is used. 
@fig-d1-s shows how nuisance observations affect single linkage but not Wards linkage hierarchical clustering.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plots\"}\n# Compute single linkage\nd1_hs <- hclust(dist(d1[,1:2]),\n method=\"single\")\nd1_ggds <- dendro_data(d1_hs, type = \"triangle\")\npd1s <- ggplot() +\n geom_segment(data=d1_ggds$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d1_ggds$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n theme_minimal() +\n ggtitle(\"(a) Single linkage dendrogram\") +\n theme_dendro()\n\n# Compute dendrogram in data\nd1_hflys <- hierfly(d1, d1_hs, scale=FALSE)\n\npd1hs <- ggplot() +\n geom_segment(data=d1_hflys$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d1_hflys$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(b) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd1 <- d1 %>%\n mutate(cls = factor(cutree(d1_hs, 2)))\npc_d1s <- ggplot(d1) +\n geom_point(aes(x=x1, y=x2, colour=cls), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(c) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Compute Wards linkage\nd1_hw <- hclust(dist(d1[,1:2]),\n method=\"ward.D2\")\nd1_ggdw <- dendro_data(d1_hw, type = \"triangle\")\npd1w <- ggplot() +\n geom_segment(data=d1_ggdw$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d1_ggdw$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(d) Ward's linkage dendrogram\") +\n theme_minimal() +\n theme_dendro()\n\n# Compute dendrogram in data\nd1_hflyw <- hierfly(d1, d1_hw, 
scale=FALSE)\n\npd1hw <- ggplot() +\n geom_segment(data=d1_hflyw$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d1_hflyw$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(e) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd1 <- d1 %>%\n mutate(clw = factor(cutree(d1_hw, 2)))\npc_d1w <- ggplot(d1) +\n geom_point(aes(x=x1, y=x2, colour=clw), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(f) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\npd1s + pd1hs + pc_d1s + \n pd1w + pd1hw + pc_d1w +\n plot_layout(ncol=3)\n```\n\n::: {.cell-output-display}\n![The effect of nuisance observations on single linkage (a, b, c) and Ward's linkage hierarchical clustering (d, e, f). The single linkage dendrogram is very different to the Wards linkage dendrogram. When plotted with the data (b) we can see a pin cushion or asterisk pattern, where points are joined to others through a place in the middle of the line of nuisance observations. This results in a bad two-cluster solution: a singleton cluster, and all the rest. Conversely, Ward's dendrogram (d) strongly suggests two clusters; although the final join corresponds to just a small gap when shown on the data (e), it results in two sensible clusters.](8-hierarchical_files/figure-pdf/fig-d1-s-1.pdf){#fig-d1-s fig-pos='H' width=80%}\n:::\n:::\n\n\n\nNuisance variables are ones that do not contribute to the clustering, such as `x2` here. When we look at this data we see a gap between two elliptically shaped clusters, with the gap being only in the horizontal direction, `x1`. 
When we compute the distances between points, in order to start clustering, without knowing that `x2` is a nuisance variable, points across the gap might be considered to be closer than points within the same cluster. @fig-d2-c shows how nuisance variables affect complete linkage but not Wards linkage hierarchical clustering. (Wards linkage can be affected but it isn't for this data.) Interestingly, the dendrogram for complete linkage looks ideal, in that it suggests two clusters. It is not until you examine the resulting clusters in the data that you can see the error: it has clustered across the gap.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Compute complete linkage\nd2_hc <- hclust(dist(d2[,1:2]),\n method=\"complete\")\nd2_ggdc <- dendro_data(d2_hc, type = \"triangle\")\npd2c <- ggplot() +\n geom_segment(data=d2_ggdc$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d2_ggdc$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(a) Complete linkage dendrogram\") +\n theme_minimal() +\n theme_dendro()\n\n# Compute dendrogram in data\nd2_hflyc <- hierfly(d2, d2_hc, scale=FALSE)\n\npd2hc <- ggplot() +\n geom_segment(data=d2_hflyc$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d2_hflyc$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(b) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd2 <- d2 %>%\n mutate(clc = factor(cutree(d2_hc, 2)))\npc_d2c <- ggplot(d2) +\n geom_point(aes(x=x1, y=x2, colour=clc), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(c) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Compute 
Wards linkage\nd2_hw <- hclust(dist(d2[,1:2]),\n method=\"ward.D2\")\nd2_ggdw <- dendro_data(d2_hw, type = \"triangle\")\npd2w <- ggplot() +\n geom_segment(data=d2_ggdw$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=d2_ggdw$labels, aes(x=x, y=y),\n colour=\"#3B99B1\", alpha=0.8) +\n ggtitle(\"(d) Ward's linkage dendrogram\") +\n theme_minimal() +\n theme_dendro()\n\n# Compute dendrogram in data\nd2_hflyw <- hierfly(d2, d2_hw, scale=FALSE)\n\npd2hw <- ggplot() +\n geom_segment(data=d2_hflyw$segments, \n aes(x=x, xend=xend,\n y=y, yend=yend)) +\n geom_point(data=d2_hflyw$data, \n aes(x=x1, y=x2,\n shape=factor(node),\n colour=factor(node),\n size=1-node), alpha=0.8) +\n scale_shape_manual(values = c(16, 3)) +\n scale_colour_manual(values = c(\"#3B99B1\", \"black\")) +\n scale_size(limits=c(0,17)) +\n ggtitle(\"(e) Dendrogram in data space\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\n# Show result\nd2 <- d2 %>%\n mutate(clw = factor(cutree(d2_hw, 2)))\npc_d2w <- ggplot(d2) +\n geom_point(aes(x=x1, y=x2, colour=clw), \n size=2, alpha=0.8) +\n scale_colour_discrete_divergingx(palette = \"Zissou 1\",\n nmax=4, rev=TRUE) +\n ggtitle(\"(f) Two-cluster solution\") + \n theme_minimal() +\n theme(aspect.ratio=1, legend.position=\"none\")\n\npd2c + pd2hc + pc_d2c + \n pd2w + pd2hw + pc_d2w +\n plot_layout(ncol=3)\n```\n\n::: {.cell-output-display}\n![Complete linkage clustering (a, b, c) on nuisance variables in comparison to Ward's linkage (d, e, f). The two dendrograms (a, d) look similar but when plotted on the data (b, e) we can see they are very different solutions. The complete linkage result breaks the data into clusters across the gap (c), which is a bad solution. It has been distracted by the nuisance variables. 
Conversely, the Wards linkage two-cluster solution does as hoped, dividing the data into two clusters separated by the gap (f).](8-hierarchical_files/figure-pdf/fig-d2-c-1.pdf){#fig-d2-c fig-pos='H' width=80%}\n:::\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: info\nTwo dendrograms might look similar but the resulting clustering can be very different. They can also look very different but correspond to very similar clusterings. Plotting the dendrogram in the data space is important for understanding how the algorithm operated when grouping observations, even more so for high dimensions.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Two dendrograms might look similar but the resulting clustering can be very different. They can also look very different but correspond to very similar clusterings. Plotting the dendrogram in the data space is important for understanding how the algorithm operated when grouping observations, even more so for high dimensions.}\n:::\n\n## Dendrograms in high-dimensions\n\nThe first step with any clustering of high-dimensional data is also to check the data. You typically don't know whether there are clusters, or what shape they might be, or if there are nuisance observations or variables. A pairs plot like in @fig-penguins-pairs is a nice complement to using the tour (@fig-penguins-gt-pdf) for this. Here you can see three elliptical clusters, with one further from the others.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code for scatterplot matrix\"}\nload(\"data/penguins_sub.rda\")\nggscatmat(penguins_sub[,1:4]) + \n theme_minimal() +\n xlab(\"\") + ylab(\"\")\n```\n\n::: {.cell-output-display}\n![Make a scatterplot matrix to check for the presence of clustering, shape of clusters and presence of nuisance observations and variables. 
In the penguins it appears that there might be three elliptically shaped clusters, with some nuisance observations.](8-hierarchical_files/figure-pdf/fig-penguins-pairs-1.pdf){#fig-penguins-pairs fig-pos='H' width=80%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to create tour\"}\nset.seed(20230329)\nb <- basis_random(4,2)\npt1 <- save_history(penguins_sub[,1:4], \n max_bases = 500, \n start = b)\nsave(pt1, file=\"data/penguins_tour_path.rda\")\n\n# To re-create the gifs\nload(\"data/penguins_tour_path.rda\")\nanimate_xy(penguins_sub[,1:4], \n tour_path = planned_tour(pt1), \n axes=\"off\", rescale=FALSE, \n half_range = 3.5)\n\nrender_gif(penguins_sub[,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, axes=\"off\"),\n gif_file=\"gifs/penguins_gt.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n::: {#fig-penguins-gt-html}\n\n![](gifs/penguins_gt.gif){fig-alt=\"Tour of many linear projections of the penguins data. You can see three elliptical clusters, one further apart from the other two.\" fig.align=\"center\"}\n\nUse a grand tour of your data to check for clusters, the shape of clusters and for nuisance observations and variables. Here the penguins data looks like it has possibly three elliptical clusters, one more separated than the other two, with some nuisance observations.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n![One frame from a grand tour being used to check for clusters, the shape of clusters and for nuisance observations and variables. Here the penguins data looks like it has possibly three elliptical clusters, one more separated than the other two, with some nuisance observations.](images/penguins_gt_59.png){#fig-penguins-gt-pdf fig-alt=\"A scatterplot of a 2D projection. You can see three elliptical clusters, one further apart from the other two.\" fig.align=\"center\"}\n:::\n\n\nThe process is the same as for the simpler example. 
We compute and draw the dendrogram in 2D, compute it in $p$-D and view with a tour. Here we have also chosen to examine the three cluster solution for single linkage and wards linkage clustering.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\np_dist <- dist(penguins_sub[,1:4])\np_hcw <- hclust(p_dist, method=\"ward.D2\")\np_hcs <- hclust(p_dist, method=\"single\")\n\np_clw <- penguins_sub %>% \n mutate(cl = factor(cutree(p_hcw, 3))) %>%\n as.data.frame()\np_cls <- penguins_sub %>% \n mutate(cl = factor(cutree(p_hcs, 3))) %>%\n as.data.frame()\n\np_w_hfly <- hierfly(p_clw, p_hcw, scale=FALSE)\np_s_hfly <- hierfly(p_cls, p_hcs, scale=FALSE)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to draw dendrograms\"}\n# Generate the dendrograms in 2D\np_hcw_dd <- dendro_data(p_hcw)\npw_dd <- ggplot() +\n geom_segment(data=p_hcw_dd$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=p_hcw_dd$labels, aes(x=x, y=y),\n alpha=0.8) +\n theme_dendro()\n\np_hcs_dd <- dendro_data(p_hcs)\nps_dd <- ggplot() +\n geom_segment(data=p_hcs_dd$segments, \n aes(x = x, y = y, \n xend = xend, yend = yend)) + \n geom_point(data=p_hcs_dd$labels, aes(x=x, y=y),\n alpha=0.8) +\n theme_dendro()\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to create tours of dendrogram in data\"}\nload(\"data/penguins_tour_path.rda\")\nglyphs <- c(16, 46)\npchw <- glyphs[p_w_hfly$data$node+1]\npchs <- glyphs[p_s_hfly$data$node+1]\n\nanimate_xy(p_w_hfly$data[,1:4], \n #col=colw, \n tour_path = planned_tour(pt1),\n pch = pchw,\n edges=p_w_hfly$edges, \n axes=\"bottomleft\")\n\nanimate_xy(p_s_hfly$data[,1:4], \n #col=colw, \n tour_path = planned_tour(pt1),\n pch = pchs,\n edges=p_s_hfly$edges, \n axes=\"bottomleft\")\n\nrender_gif(p_w_hfly$data[,1:4], \n planned_tour(pt1),\n display_xy(half_range=0.9, \n pch = pchw,\n edges = p_w_hfly$edges,\n axes = \"off\"),\n gif_file=\"gifs/penguins_hflyw.gif\",\n frames=500,\n 
loop=FALSE)\n\nrender_gif(p_s_hfly$data[,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, \n pch = pchs,\n edges = p_s_hfly$edges,\n axes = \"off\"),\n gif_file=\"gifs/penguins_hflys.gif\",\n frames=500,\n loop=FALSE)\n\n# Show three cluster solutions\nclrs <- hcl.colors(3, \"Zissou 1\")\nw3_col <- clrs[p_w_hfly$data$cl[p_w_hfly$data$node == 0]]\nrender_gif(p_w_hfly$data[p_w_hfly$data$node == 0, 1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, \n col=w3_col,\n axes = \"off\"),\n gif_file=\"gifs/penguins_w3.gif\",\n frames=500,\n loop=FALSE)\n\ns3_col <- clrs[p_s_hfly$data$cl[p_w_hfly$data$node == 0]]\nrender_gif(p_s_hfly$data[p_w_hfly$data$node == 0,1:4], \n planned_tour(pt1), \n display_xy(half_range=0.9, \n col=s3_col,\n axes = \"off\"),\n gif_file=\"gifs/penguins_s3.gif\",\n frames=500,\n loop=FALSE)\n```\n:::\n\n\n\n@fig-penguins-ddw and @fig-penguins-hfly-pdf show results for single linkage and wards linkage clustering of the penguins data. The 2D dendrograms are very different. Wards linkage produces a clearer indication of clusters, with a suggestion of three, or possibly four or five clusters. The dendrogram for single linkage suggests two clusters, and has the classical waterfall appearance that is often seen with this type of linkage. (If you look carefully, though, you will see it is actually a three cluster solution. At the very top of the dendrogram there is another branch connecting one observation to the other two clusters.)\n\n@fig-penguins-hfly-pdf (a) and (b) show the dendrograms in 4D overlaid on the data. The two are starkly different. The single linkage clustering is like pins pointing to (three) centres, with some long extra edges.\n\nPlots (c) and (d) show the three cluster solutions, with Wards linkage almost recovering the clusters of the three species. Single linkage has two big clusters, and the singleton cluster. 
Although Wards linkage produces the best result, single linkage does provide some interesting and useful information about the data. That singleton cluster is an outlier, an unusually-sized penguin. We can see it as an outlier just from the tour in @fig-penguins-gt-pdf, but single linkage emphasizes it, bringing it more strongly to our attention. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Wards linkage (left) and single linkage (right).](8-hierarchical_files/figure-pdf/fig-penguins-ddw-1.pdf){#fig-penguins-ddw width=80%}\n:::\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-hfly-html layout-ncol=2}\n![Wards linkage](gifs/penguins_hflyw.gif){#fig-penguins-hflyw fig-alt=\"Tour showing the dendrogram for Wards linkage clustering on the penguins data in 4D. You can see that it connects points within each clump and then connects between clusters.\"}\n\n![Single linkage](gifs/penguins_hflys.gif){#fig-penguins-hflys fig-alt=\"Tour showing the dendrogram for single linkage clustering on the penguins data in 4D. You can see that the connections are like asterisks, connecting towards the center of each clump and there are a couple of long connections between clusters.\"}\n\n![Wards linkage](gifs/penguins_w3.gif){#fig-penguins-w3}\n\n![Single linkage](gifs/penguins_s3.gif){#fig-penguins-s3}\n\nDendrograms for Wards and single linkage of the penguins data, shown in 2D (top) and in 4D (middle), and the three-cluster solution of each.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n::: {#fig-penguins-hfly-pdf layout-ncol=2}\n\n![Wards linkage](images/penguins_hflyw_59.png){#fig-penguins-hflyw fig-alt=\"Tour showing the dendrogram for Wards linkage clustering on the penguins data in 4D. 
You can see that it connects points within each clump and then connects between clusters.\"}\n\n![Single linkage](images/penguins_hflys_59.png){#fig-penguins-hflys fig-alt=\"Tour showing the dendrogram for single linkage clustering on the penguins data in 4D. You can see that the connections are like asterisks, connecting towards the center of each clump and there are a couple of long connections between clusters.\"}\n\n![Wards linkage](images/penguins_w3_59.png){#fig-penguins-w3}\n\n![Single linkage](images/penguins_s3_59.png){#fig-penguins-s3}\n\nDendrograms for Wards and single linkage of the penguins data, shown in 2D (top) and in 4D (middle), and the three-cluster solution of each.\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: insight\nSingle linkage on the penguins has a very different joining pattern to Wards! While Wards provides the better result, single linkage provides useful information about the data, such as emphasizing the outlier.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\insightbox{Single linkage on the penguins has a very different joining pattern to Wards! While Wards provides the better result, single linkage provides useful information about the data, such as emphasizing the outlier.}\n:::\n\n::: {.content-visible when-format=\"html\"}\n@fig-penguins-hfly-plotly provides HTML objects of the dendrograms, so that they can be directly compared. 
The same tour path is used, so the sliders allow setting the view to the same projection in each plot.\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make html objects of the dendrogram in 4D\"}\nload(\"data/penguins_tour_path.rda\")\n# Create a smaller one, for space concerns\npt1i <- interpolate(pt1[,,1:5], 0.1)\npw_anim <- render_anim(p_w_hfly$data,\n vars=1:4,\n frames=pt1i, \n edges = p_w_hfly$edges,\n obs_labels=paste0(1:nrow(p_w_hfly$data),\n p_w_hfly$data$cl))\n\npw_gp <- ggplot() +\n geom_segment(data=pw_anim$edges, \n aes(x=x, xend=xend,\n y=y, yend=yend,\n frame=frame)) +\n geom_point(data=pw_anim$frames, \n aes(x=P1, y=P2, \n frame=frame, \n shape=factor(node),\n label=obs_labels), \n alpha=0.8, size=1) +\n xlim(-1,1) + ylim(-1,1) +\n scale_shape_manual(values=c(16, 46)) +\n coord_equal() +\n theme_bw() +\n theme(legend.position=\"none\", \n axis.text=element_blank(),\n axis.title=element_blank(),\n axis.ticks=element_blank(),\n panel.grid=element_blank())\n\npwg <- ggplotly(pw_gp, width=450, height=500,\n tooltip=\"label\") %>%\n animation_button(label=\"Go\") %>%\n animation_slider(len=0.8, x=0.5,\n xanchor=\"center\") %>%\n animation_opts(easing=\"linear\", transition = 0)\nhtmlwidgets::saveWidget(pwg,\n file=\"html/penguins_cl_ward.html\",\n selfcontained = TRUE)\n\n# Single\nps_anim <- render_anim(p_s_hfly$data, vars=1:4,\n frames=pt1i, \n edges = p_s_hfly$edges,\n obs_labels=paste0(1:nrow(p_s_hfly$data),\n p_s_hfly$data$cl))\n\nps_gp <- ggplot() +\n geom_segment(data=ps_anim$edges, \n aes(x=x, xend=xend,\n y=y, yend=yend,\n frame=frame)) +\n geom_point(data=ps_anim$frames, \n aes(x=P1, y=P2, \n frame=frame, \n shape=factor(node),\n label=obs_labels), \n alpha=0.8, size=1) +\n xlim(-1,1) + ylim(-1,1) +\n scale_shape_manual(values=c(16, 46)) +\n coord_equal() +\n theme_bw() +\n theme(legend.position=\"none\", \n axis.text=element_blank(),\n axis.title=element_blank(),\n axis.ticks=element_blank(),\n 
panel.grid=element_blank())\n\npsg <- ggplotly(ps_gp, width=450, height=500,\n tooltip=\"label\") %>%\n animation_button(label=\"Go\") %>%\n animation_slider(len=0.8, x=0.5,\n xanchor=\"center\") %>%\n animation_opts(easing=\"linear\", transition = 0)\nhtmlwidgets::saveWidget(psg,\n file=\"html/penguins_cl_single.html\",\n selfcontained = TRUE)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n::: {#fig-penguins-hfly-plotly fig-align=\"center\"}\n\n\n\n\n\nAnimation of dendrogram from Wards (top) and single (bottom) linkage clustering of the penguins data.\n:::\n:::\n\n::: {.content-visible when-format=\"html\"}\n::: info\nViewing the dendrograms in high-dimensions provides insight into how the algorithm has joined observations into clusters. For example, single linkage often has edges leading to a single focal point, which might not yield a useful clustering but might help to identify outliers. If the edges point to multiple focal points, with long edges bridging gaps in the data, the result is more likely to yield a useful clustering.\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{Viewing the dendrograms in high-dimensions provides insight into how the algorithm has joined observations into clusters. For example, single linkage often has edges leading to a single focal point, which might not yield a useful clustering but might help to identify outliers. If the edges point to multiple focal points, with long edges bridging gaps in the data, the result is more likely to yield a useful clustering.}\n:::\n\n## Exercises {-}\n\n1. Compute complete linkage clustering for the **nuisance observations** data set. Does it perform more similarly to single linkage or Wards linkage?\n2. Compute single linkage clustering for the **nuisance variables** data. Does it perform more similarly to complete linkage or Wards linkage?\n3. 
Use hierarchical clustering with Euclidean distance and Wards linkage to split the `clusters_nonlin` data into four clusters. Look at the dendrogram in 2D and 4D. In 4D you can also include the cluster assignment as color. Does this look like a good solution?\n4. Repeat the same exercise using single linkage instead of Wards linkage. How does this solution compare to what we have found with Wards linkage? Does the solution match how you would cluster the data in a spin-and-brush approach?\n5. Argue why single linkage might not perform well for the `fake_trees` data. Which method do you think will work best with this data? Conduct hierarchical clustering with your choice of linkage method. Does the 2D dendrogram suggest 10 clusters for the 10 branches? Take a look at the high-dimensional representation of the dendrogram. Has your chosen method captured the branches well or not? Explain what you think worked well or poorly.\n6. What would a useful clustering of the first four PCs of the `aflw` data be? What linkage method would you expect works best to cluster it this way? Conduct the clustering. Examine the 2D dendrogram and decide on how many clusters should be used. Examine the cluster solution using a tour with points coloured by cluster. \n7. Based on your assessment of the cluster structure in the challenge data sets, `c1`-`c7`, from the `mulgar` package, which linkage method would you recommend? 
Use your suggested linkage method to cluster each data set, and summarise how well it performed in detecting the clusters that you have seen.\n\n\n\n\n::: {.cell}\n\n:::\n", "supporting": [ "8-hierarchical_files/figure-pdf" ], diff --git a/_freeze/8-hierarchical/figure-html/fig-d1-s-1.png b/_freeze/8-hierarchical/figure-html/fig-d1-s-1.png new file mode 100644 index 0000000..78d9d0f Binary files /dev/null and b/_freeze/8-hierarchical/figure-html/fig-d1-s-1.png differ diff --git a/_freeze/8-hierarchical/figure-html/fig-d2-c-1.png b/_freeze/8-hierarchical/figure-html/fig-d2-c-1.png new file mode 100644 index 0000000..c1dc83b Binary files /dev/null and b/_freeze/8-hierarchical/figure-html/fig-d2-c-1.png differ diff --git a/_freeze/8-hierarchical/figure-html/fig-hc-sim-1.png b/_freeze/8-hierarchical/figure-html/fig-hc-sim-1.png new file mode 100644 index 0000000..95e5c6e Binary files /dev/null and b/_freeze/8-hierarchical/figure-html/fig-hc-sim-1.png differ diff --git a/_freeze/8-hierarchical/figure-html/fig-penguins-ddw-1.png b/_freeze/8-hierarchical/figure-html/fig-penguins-ddw-1.png new file mode 100644 index 0000000..08ad0e8 Binary files /dev/null and b/_freeze/8-hierarchical/figure-html/fig-penguins-ddw-1.png differ diff --git a/_freeze/8-hierarchical/figure-html/fig-penguins-pairs-1.png b/_freeze/8-hierarchical/figure-html/fig-penguins-pairs-1.png new file mode 100644 index 0000000..4b7b35f Binary files /dev/null and b/_freeze/8-hierarchical/figure-html/fig-penguins-pairs-1.png differ diff --git a/_freeze/8-hierarchical/figure-html/fig-problems-1.png b/_freeze/8-hierarchical/figure-html/fig-problems-1.png new file mode 100644 index 0000000..96e28c4 Binary files /dev/null and b/_freeze/8-hierarchical/figure-html/fig-problems-1.png differ diff --git a/_freeze/8-hierarchical/figure-pdf/fig-d1-s-1.pdf b/_freeze/8-hierarchical/figure-pdf/fig-d1-s-1.pdf index f06093c..9d23b27 100644 Binary files a/_freeze/8-hierarchical/figure-pdf/fig-d1-s-1.pdf and 
b/_freeze/8-hierarchical/figure-pdf/fig-d1-s-1.pdf differ diff --git a/_freeze/8-hierarchical/figure-pdf/fig-d2-c-1.pdf b/_freeze/8-hierarchical/figure-pdf/fig-d2-c-1.pdf index 42c373d..95b7141 100644 Binary files a/_freeze/8-hierarchical/figure-pdf/fig-d2-c-1.pdf and b/_freeze/8-hierarchical/figure-pdf/fig-d2-c-1.pdf differ diff --git a/_freeze/8-hierarchical/figure-pdf/fig-hc-sim-1.pdf b/_freeze/8-hierarchical/figure-pdf/fig-hc-sim-1.pdf index b241523..900803e 100644 Binary files a/_freeze/8-hierarchical/figure-pdf/fig-hc-sim-1.pdf and b/_freeze/8-hierarchical/figure-pdf/fig-hc-sim-1.pdf differ diff --git a/_freeze/8-hierarchical/figure-pdf/fig-penguins-ddw-1.pdf b/_freeze/8-hierarchical/figure-pdf/fig-penguins-ddw-1.pdf index cdc2b17..153581b 100644 Binary files a/_freeze/8-hierarchical/figure-pdf/fig-penguins-ddw-1.pdf and b/_freeze/8-hierarchical/figure-pdf/fig-penguins-ddw-1.pdf differ diff --git a/_freeze/8-hierarchical/figure-pdf/fig-penguins-pairs-1.pdf b/_freeze/8-hierarchical/figure-pdf/fig-penguins-pairs-1.pdf index 3eb9073..4916f53 100644 Binary files a/_freeze/8-hierarchical/figure-pdf/fig-penguins-pairs-1.pdf and b/_freeze/8-hierarchical/figure-pdf/fig-penguins-pairs-1.pdf differ diff --git a/_freeze/8-hierarchical/figure-pdf/fig-problems-1.pdf b/_freeze/8-hierarchical/figure-pdf/fig-problems-1.pdf index aee7e28..3d42f65 100644 Binary files a/_freeze/8-hierarchical/figure-pdf/fig-problems-1.pdf and b/_freeze/8-hierarchical/figure-pdf/fig-problems-1.pdf differ diff --git a/_freeze/9-kmeans/execute-results/html.json b/_freeze/9-kmeans/execute-results/html.json new file mode 100644 index 0000000..41fa0ce --- /dev/null +++ b/_freeze/9-kmeans/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "47e09c461b31eb53ff57d81517f53e40", + "result": { + "engine": "knitr", + "markdown": "# $k$-means clustering {#sec-kmeans}\n\\index{cluster analysis!k-means}\n\\index{cluster analysis!algorithms}\n\nOne of the simplest and efficient techniques for clustering 
data is the $k$-means algorithm. The algorithm begins with a choice for $k$, the number of clusters to divide the data into. It is seeded with $k$ initial means, and sequentially iterates through the observations, assigning them to the nearest mean, and re-calculating the $k$ means. It stops at a given number of iterations or when points no longer change clusters. The algorithm will tend to segment the data into roughly equal-sized, spherical clusters, and thus will work well if the clusters are separated and roughly spherical in shape. \n\nA good place to learn more about the $k$-means algorithm is Chapter 20 of @HOML. The algorithm has been in use for a long time! It was named $k$-means by @MacQueen1967, but developed by Lloyd in 1957 (as described in @Lloyd1982) and separately by @forgy65, and perhaps others as it is a very simple procedure.\n\n::: {.content-visible when-format=\"html\"}\n::: info\nThe key elements to examine in a k-means clustering algorithm result are:\n\n- means \n- boundaries\n\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\\infobox{The key elements to examine in a k-means clustering algorithm result are:\n\\begin{itemize}\n\\item means \n\\item boundaries \n\\end{itemize}}\n:::\n\n## Examining results in 2D\n\n@fig-km-2D shows the results of $k$-means clustering on the 2D `simple_clusters` data and two variables of the penguins data. We can see that it works well when the clusters are spherical, but for the penguins data it fails because the shape of the clusters is elliptical. 
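The assign-and-update iteration described above can be sketched in a few lines of code. Here is a minimal illustration (in Python with NumPy, purely to show the mechanics; it is not the implementation behind R's `kmeans()`):

```python
import numpy as np

def kmeans(X, k, init, n_iter=50):
    """Minimal k-means: assign each point to its nearest mean, then update the means."""
    means = X[init].astype(float)              # seed with k chosen observations
    for _ in range(n_iter):
        # distances from every observation to every current mean
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        cl = d.argmin(axis=1)                  # assign to the nearest mean
        new = np.array([X[cl == j].mean(axis=0) if np.any(cl == j) else means[j]
                        for j in range(k)])    # re-calculate the k means
        if np.allclose(new, means):            # stop when the means no longer move
            break
        means = new
    return cl, means

# two separated, spherical clusters: the situation where k-means works well
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
cl, means = kmeans(X, k=2, init=[0, 99])       # one seed from each cluster
tot_withinss = sum(((X[cl == j] - means[j]) ** 2).sum() for j in range(2))
```

With elongated, elliptical clusters the nearest-mean rule cuts across the long axis of a cluster, which is exactly the failure seen with the penguins data. The total within-cluster sum of squares computed at the end is the quantity reported as `tot.withinss` by R's `kmeans()`.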
It actually makes a mistake that would not be made if we simply visually clustered: cluster 3 has grouped points across a gap, a divide that visually we would all agree should form a separation.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(mulgar)\nlibrary(ggplot2)\nlibrary(dplyr)\nlibrary(colorspace)\nlibrary(patchwork)\ndata(\"simple_clusters\")\nload(\"data/penguins_sub.rda\")\n\nset.seed(202305)\nsc_bl_bd_km <- kmeans(simple_clusters[,1:2], centers=2, \n iter.max = 50, nstart = 5)\nsc_bl_bd_km_means <- data.frame(sc_bl_bd_km$centers) %>%\n mutate(cl = factor(rownames(sc_bl_bd_km$centers)))\nsc_bl_bd_km_d <- simple_clusters[,1:2] %>% \n mutate(cl = factor(sc_bl_bd_km$cluster))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make plots\"}\nsc_bl_bd_km_p <- ggplot() +\n geom_point(data=sc_bl_bd_km_d, \n aes(x=x1, y=x2, colour=cl), \n shape=16, alpha=0.4) +\n geom_point(data=sc_bl_bd_km_means, \n aes(x=x1, y=x2, colour=cl), \n shape=3, size=5) +\n scale_color_discrete_divergingx(\"Zissou 1\") +\n ggtitle(\"(a)\") +\n theme_minimal() +\n theme(aspect.ratio = 1, \n legend.position = \"bottom\",\n legend.title = element_blank()) \n\np_bl_bd_km <- kmeans(penguins_sub[,1:2], centers=3, \n iter.max = 50, nstart = 5)\np_bl_bd_km_means <- data.frame(p_bl_bd_km$centers) %>%\n mutate(cl = factor(rownames(p_bl_bd_km$centers)))\np_bl_bd_km_d <- penguins_sub[,1:2] %>% \n mutate(cl = factor(p_bl_bd_km$cluster))\n\np_bl_bd_km_p <- ggplot() +\n geom_point(data=p_bl_bd_km_d, \n aes(x=bl, y=bd, colour=cl), \n shape=16, alpha=0.4) +\n geom_point(data=p_bl_bd_km_means, \n aes(x=bl, y=bd, colour=cl), \n shape=3, size=5) +\n scale_color_discrete_divergingx(\"Zissou 1\") +\n ggtitle(\"(b)\") +\n theme_minimal() +\n theme(aspect.ratio = 1, \n legend.position = \"bottom\",\n legend.title = element_blank()) \n\nsc_bl_bd_km_p + p_bl_bd_km_p + plot_layout(ncol=2)\n```\n\n::: {.cell-output-display}\n![Examining $k$-means clustering results for 
simple clusters (a) and two variables of the penguins data (b). The means are indicated by a $+$. The results are perfect for the simple clusters but not for the penguins data. The penguin clusters are elliptically shaped which is not captured by $k$-means. Cluster 3 has observations grouped across a gap in the data.](9-kmeans_files/figure-html/fig-km-2D-1.png){#fig-km-2D width=576}\n:::\n:::\n\n\n## Examining results in high dimensions\n\nThis approach extends to high-dimensions. One colours observations by the cluster label, and overlays the final cluster means. If we see gaps in points in a single cluster it would mean that $k$-means fails to see important cluster structure. This is what happens with the 4D penguins data as shown in @fig-p-km-html.\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\np_km <- kmeans(penguins_sub[,1:4], centers=3, \n iter.max = 50, nstart = 5)\np_km_means <- data.frame(p_km$centers) %>%\n mutate(cl = factor(rownames(p_km$centers)))\np_km_d <- penguins_sub[,1:4] %>% \n mutate(cl = factor(p_km$cluster))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make animated gifs\"}\nlibrary(tourr)\np_km_means <- p_km_means %>%\n mutate(type = \"mean\")\np_km_d <- p_km_d %>%\n mutate(type = \"data\")\np_km_all <- bind_rows(p_km_means, p_km_d)\np_km_all$type <- factor(p_km_all$type, levels=c(\"mean\", \"data\"))\np_pch <- c(3, 20)[as.numeric(p_km_all$type)]\np_cex <- c(3, 1)[as.numeric(p_km_all$type)]\nanimate_xy(p_km_all[,1:4], col=p_km_all$cl, \n pch=p_pch, cex=p_cex, axes=\"bottomleft\")\nrender_gif(p_km_all[,1:4],\n grand_tour(),\n display_xy(col=p_km_all$cl, \n pch=p_pch, \n cex=p_cex, \n axes=\"bottomleft\"),\n gif_file=\"gifs/p_km.gif\",\n width=400,\n height=400,\n frames=500)\n```\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n![Exploring the k-means clustering result for the 4D penguins data. You can see cluster 2 clearly separated from the other observations. 
Cluster 3, like in the 2D example, is a mix of observations across a gap. Even the mean of the cluster is almost in the gap. ](gifs/p_km.gif){#fig-p-km-html fig-alt=\"Animated sequence of 2D projections of the 4D penguins data, with points coloured by the three k-means clusters and each cluster mean marked by a cross.\" width=400}\n\nGenerally, there is no need to choose $k$ ahead of time. One would re-fit $k$-means with various choices of $k$, and compare the `tot.withinss` and examine the clusters visually, to decide on the optimal final value of $k$. This can be assessed in a similar way to the scree plot for PCA. \n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n![Exploring the k-means clustering result for the 4D penguins data. You can see cluster 2 clearly separated from the other observations. Cluster 3, like in the 2D example, is a mix of observations across a gap. Even the mean of the cluster is almost in the gap. ](images/p_km_71.png){#fig-p-km-pdf fig-alt=\"A single 2D projection of the 4D penguins data, with points coloured by the three k-means clusters and each cluster mean marked by a cross.\" width=400}\n\nGenerally, there is no need to choose $k$ ahead of time. One would re-fit $k$-means with various choices of $k$, and compare the `tot.withinss` and examine the clusters visually, to decide on the optimal final value of $k$. This can be assessed in a similar way to the scree plot for PCA. \n:::\n\n## Exercises {-}\n\n1. Compute a $k$-means clustering for the `fake_trees` data, varying $k$ to about 20. Choose your best $k$, and examine the solution using the first 10 PCs on the data. It should capture the data quite nicely, although it will break up each branch into multiple clusters.\n2. Compute a $k$-means clustering of the first four PCs of the `aflw` data. Examine the best solution (you choose which $k$), and describe how it divides the data. By examining the means, can you tell if it extracts clusters of offensive vs defensive vs midfield players? Or does it break the data into high skills vs low skills?\n3. Use $k$-means clustering on the challenge data sets, `c1`-`c7` from the `mulgar` package. 
Explain what choice of $k$ is best for each data set, and why or why not the cluster structure, as you have described it from earlier chapters, is detected or not.\n", + "supporting": [ + "9-kmeans_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/9-kmeans/figure-html/fig-km-2D-1.png b/_freeze/9-kmeans/figure-html/fig-km-2D-1.png new file mode 100644 index 0000000..3bc3045 Binary files /dev/null and b/_freeze/9-kmeans/figure-html/fig-km-2D-1.png differ diff --git a/_freeze/A1-toolbox/execute-results/html.json b/_freeze/A1-toolbox/execute-results/html.json new file mode 100644 index 0000000..2ff5919 --- /dev/null +++ b/_freeze/A1-toolbox/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "6436d8fab29fccde7133cdacb725eaa6", + "result": { + "engine": "knitr", + "markdown": "# Toolbox {#toolbox}\n\n\n## Using tours in the `tourr` package\n\n### Installation\n\nYou can install the released version of `tourr` from [CRAN](https://CRAN.R-project.org) with:\n\n``` r\ninstall.packages(\"tourr\")\n```\n\nand the development version from the [GitHub repo](https://github.com/ggobi/tourr) with:\n\n``` r\n# install.packages(\"remotes\")\nremotes::install_github(\"ggobi/tourr\")\n```\n\n### Getting started\n\nTo run a tour in R, use one of the animate functions. The following code will show a 2D tour displayed as a scatterplot on a 6D data set with three labelled classes. \n\n``` r\nanimate_xy(flea[,-7], col=flea$species)\n```\n\n@tourr remains a good reference for learning more about this package. The package [website](http://ggobi.github.io/tourr/) has a list of current functionality. \n\n### Different tours\n\nThe two main components of the tour algorithm are the projection dimension which affects the choice of display to use, and the algorithm that delivers the projection sequence. 
The primary functions for these two parts are:\n\n1. For display of different projection dimensions:\n\n- `display_dist()`: choice of density, histogram or average shifted histogram (ash) display of the 1D projections.\n- `display_xy()`, `display_density2d()`, `display_groupxy()`, `display_pca()`, `display_sage()`, `display_slice()`, `display_trails()`: choices in display of 2D projections.\n- `display_depth()`, `display_stereo()`: choices to display 3D projections. \n- `display_pcp()`, `display_scatmat()`, `display_stars()`, `display_faces()`: choices for displaying three or more variables.\n- `display_image()`: to use with multispectral images, where different combinations of spectral bands are displayed. See @WPS98 and @Symanzik2002NewAO for applications.\n- `display_andrews()`: 1D projections as Andrews curves.\n\n2. To change the way projections are delivered:\n\n- `grand_tour()`: Smooth sequence of random projections to view all possible projections as quickly as possible. Good for getting an overview of the high-dimensional data, especially when you don't know what you are looking for. \n- `guided_tour()`: Follow a projection pursuit optimisation to find projections that have particular patterns. This is used when you want to learn if the data has particular patterns, such as clustering or outliers. Use the `holes()` index to find projections with gaps that allow one to see clusters, or `lda_pp()` or `pda_pp()` when class labels are known and you want to find the projections where the clusters are separated. \n- `little_tour()`: Smoothly interpolate between pairs of variables, to show all the marginal views of the data.\n- `local_tour()`: Makes small movements around a chosen projection to explore a small neighbourhood. Very useful to learn if small distances away from a projection change the pattern substantially or not.\n- `radial_tour()`: Interpolates a chosen variable out of the projection, and then back into the projection. 
This is useful for assessing the importance of variables to the pattern in a projection. If the pattern changes a lot when the variable is rotated out, then the variable is important for producing it.\n- `dependence_tour()`: Delivers two sequences of 1D grand tours, to examine associations between two sets of variables. This is useful for displaying two groups of variables as in multiple regression, multivariate regression, or canonical correlation analysis, as two independent 1D projections.\n- `frozen_tour()`: This is an interesting one! It allows the coefficients for some variables to be fixed, and others to vary.\n\n### The importance of scale\n\nScaling of multivariate data is really important in many ways. It affects most model fitting, and can affect the perception of patterns when data is visualised. Here we describe a few scaling issues to take control of when using tours.\n\n\n**Pre-processing data**\n\nIt is generally useful to standardise your data to have mean 0 and variance-covariance equal to the identity matrix before using the tour. We use the tour to discover associations between variables. Characteristics of single variables should be examined and understood before embarking on looking for high-dimensional structure. \n\nThe `rescale` parameter in the `animate()` function will scale all variables to range between 0 and 1, prior to starting the tour. This will force all to have the same range. It is not the default, and without it, data with different ranges across variables may have some strange patterns. You should set `rescale=TRUE`. If you have already scaled the data yourself, even if using a different scaling such as using standardised variables, then the default `rescale=FALSE` is best.\n\nA more severe transformation that can be useful prior to starting a tour is to **sphere** the data. This is also an option in the `animate()` function, but is `FALSE` by default. Sphering is the same as conducting a principal component analysis, and using the principal components as the variables. 
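To make the effect of sphering concrete, here is a small sketch (in Python with NumPy, for illustration only; it is not the exact computation `animate()` performs): centre the data, rotate it onto the eigenvectors of its covariance matrix, and rescale each component by the inverse square root of its eigenvalue.

```python
import numpy as np

# correlated 3D data, standing in for raw multivariate measurements
rng = np.random.default_rng(42)
A = np.array([[1.0, 0.0, 0.0],
              [0.8, 0.5, 0.0],
              [0.3, 0.3, 0.9]])
X = rng.normal(size=(500, 3)) @ A.T

# sphering = the PCA transform: centre, rotate onto the principal
# directions, and scale each component to unit variance
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_sphered = Xc @ evecs / np.sqrt(evals)

# the sphered data has identity variance-covariance: no linear association remains
assert np.allclose(np.cov(X_sphered, rowvar=False), np.eye(3))
```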
It removes all linear association between variables! This can be especially useful if you want to focus on finding non-linear associations, including clusters, and outliers. \n\n**Scaling to fit into plot region**\n\nThe `half_range` parameter in most of the display types sets the range used to scale the data into the plot. It is estimated when a tour is started, but you may need to change it if you find that the data keeps escaping the plot window or is not fully using the space. Space expands exponentially as dimension increases, and the estimation takes this into account. However, different distributions of data points lead to different variance of observations in high-dimensional space. A skewed distribution will be more varied than a normal distribution. It is hard to estimate precisely how the data should be scaled so that it fits nicely into the plot space for all projections viewed. \n\nThe `center` parameter is used to centre each projection by setting the mean to be at the middle of the plot space. With different distributions the mean of the data can vary around the plot region, and this can be distracting. Fixing the mean of each projection to always be at the center of the plot space makes it easier to focus on other patterns.\n\n### Saving your tour\n\nThe functions `save_history()` and `planned_tour()` allow the tour path to be pre-computed, and re-played in your chosen way. The tour path is saved as a list of projection vectors, which can also be passed to external software for displaying tours easily. 
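To give a sense of what a saved path contains, consider 1D projections, where each saved frame is a unit vector. Here is a sketch of filling in the frames between two saved projections by moving along the geodesic on the unit sphere (in Python, for illustration only; the `interpolate()` function in `tourr` handles the general case):

```python
import numpy as np

def slerp(a, b, t):
    """Geodesic (spherical) interpolation between two unit projection vectors."""
    theta = np.arccos(np.clip(a @ b, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return a
    return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

# two saved 1D frames for 3D data, as a saved tour history might record them
f0 = np.array([1.0, 0.0, 0.0])
f1 = np.array([0.0, 1.0, 0.0])

# fill in the path: every in-between frame is still a unit vector,
# so each one is a valid 1D projection of the data
path = [slerp(f0, f1, t) for t in np.linspace(0, 1, 20)]
```

Replaying `path` gives a smooth rotation from the first frame to the second; a full saved history is just a longer list of such frames.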
Only a minimal set of projections is saved, by default, and a full interpolation path of projections can always be generated from it using the `interpolate()` function.\n\nVersions and elements of tours can be saved for publication using a variety of functions:\n\n- `render_gif()`: Save a tour as an animated gif, using the `gifski` package.\n- `render_proj()`: Save an object that can be used to produce a polished rendering of a single projection, possibly with `ggplot`.\n- `render_anim()`: Creates an object containing a sequence of projections that can be used with `plotly()` to produce an HTML animation, with interactive control.\n\n### Understanding your tour path\n\n@fig-tour-paths-html shows tour paths on 3D data spaces. For 1D projections the space of all possible projections is a $p$-dimensional sphere [@fig-tourpaths1d]. For 2D projections the space of all possible projections is a $p\\times 2$-dimensional torus [@fig-tourpaths2d]! The geometry is elegant. \n\nIn these figures, the space is represented by the light colour, and is constructed by simulating a large number of random projections. The two darker colours indicate paths generated by a grand tour and a guided tour. The grand tour will cover the full space of all possible projections if allowed to run for some time. The guided tour will quickly converge to an optimal projection, so will cover only a small part of the overall space. \n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Load libraries\"}\nlibrary(ferrn)\nlibrary(tourr)\nlibrary(geozoo)\nlibrary(dplyr)\nlibrary(purrr)\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"} \n\n::: {#fig-tour-paths-html layout-ncol=2}\n\n![1D tour paths](gifs/tour_paths1d.gif){#fig-tourpaths1d width=40%}\n\n![2D tour paths](gifs/tour_paths2d.gif){#fig-tourpaths2d width=40%}\n\nGrand and guided tour paths of 1D and 2D projections of 3D data. The light points represent the space of all 1D and 2D projections respectively. 
You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection. \n:::\n\n:::\n\n\n::: {.content-visible when-format=\"pdf\"} \n\n::: {#fig-tour-paths-pdf layout-ncol=2}\n\n![1D tour paths](images/tour_paths1d.png){#fig-tourpaths1d width=40%}\n\n![2D tour paths](images/tour_paths2d.png){#fig-tourpaths2d width=40%}\n\nGrand and guided tour paths of 1D and 2D projections of 3D data. The light points represent the space of all 1D and 2D projections respectively. You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection. \n:::\n\n:::\n\n## What not to do\n\n### Discrete and categorical data\n\nTour methods are for numerical data, particularly real-valued measurements. If your data is numerical but discrete, it can look artificially clustered. @fig-discrete-data-html shows an example. The data is numeric but discrete, so it is ok to examine it in a tour. In this example, there will be overplotting of observations and artificial clustering (plot a). It can be helpful to jitter observations, by adding a small amount of noise (plot b). This helps to remove the artificial clustering, but preserves the main pattern, which is the strong linear association. Generally, jittering is a useful tool for working with discrete data, so that you can focus on examining the multivariate association. 
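The amount of jitter matters: the noise should be small relative to the gaps between the distinct values, so that ties are broken without moving points far. A sketch (in Python, loosely mirroring R's `jitter(x, 2)`, which by default scales the noise by `factor` times a fifth of the smallest gap between values):

```python
import numpy as np

rng = np.random.default_rng(430)
# a discrete numeric variable: many exact ties cause overplotting in a tour
x = rng.integers(1, 7, size=107).astype(float)

# jitter: add uniform noise scaled to a fraction of the smallest gap
# between distinct values (gap = 1 for integer-valued data)
gap = np.min(np.diff(np.unique(x)))
amount = 2 * gap / 5                 # roughly R's jitter(x, factor=2)
x_jit = x + rng.uniform(-amount, amount, size=x.size)

# ties are broken, but every point stays close to its original value
assert np.all(np.abs(x_jit - x) <= amount)
```

The noise itself is meaningless; its only job is to separate overplotted points so the linear or non-linear association between variables stays visible.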
If the data is categorical, with no natural ordering of categories, the tour is not advised.\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Discrete data code\"}\nset.seed(430)\ndf <- data.frame(x1 = sample(1:6, 107, replace=TRUE)) %>% \n mutate(x2 = x1 + sample(1:2, 107, replace=TRUE),\n x3 = x1 - sample(1:2, 107, replace=TRUE),\n x4 = sample(1:3, 107, replace=TRUE))\nanimate_xy(df)\nrender_gif(df, \n grand_tour(),\n display_xy(),\n gif_file = \"gifs/discrete_data.gif\",\n frames = 100,\n width = 300, \n height = 300)\n\ndfj <- df %>%\n mutate(x1 = jitter(x1, 2), \n x2 = jitter(x2, 2),\n x3 = jitter(x3, 2),\n x4 = jitter(x4, 2))\nanimate_xy(dfj)\nrender_gif(dfj, \n grand_tour(),\n display_xy(),\n gif_file = \"gifs/jittered_data.gif\",\n frames = 100,\n width = 300, \n height = 300)\n```\n:::\n\n\n\n::: {.content-visible when-format=\"html\"} \n\n::: {#fig-discrete-data-html layout-ncol=2}\n\n![Discrete data](gifs/discrete_data.gif){#fig-discrete width=40%}\n\n![Jittered data](gifs/jittered_data.gif){#fig-jittered width=40%}\n\n\nDiscrete data can look like clusters, which is misleading. Adding a small amount of jitter (random number) can help. The noise is not meaningful but it could allow the viewer to focus on linear or non-linear association between variables without being distracted by artificial clustering. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"} \n\n::: {#fig-discrete-data-pdf layout-ncol=2}\n\n![Discrete data](images/discrete_data.png){#fig-discrete width=40%}\n\n![Jittered data](images/jittered_data.png){#fig-jittered width=40%}\n\n\nDiscrete data can look like clusters, which is misleading. Adding a small amount of jitter (random number) can help. The noise is not meaningful but it could allow the viewer to focus on linear or non-linear association between variables without being distracted by artificial clustering. 
\n:::\n:::\n\n### Missing values\n\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to handle missing values\"}\nlibrary(naniar)\nlibrary(ggplot2)\nlibrary(colorspace)\ndata(\"oceanbuoys\")\nob_p <- oceanbuoys %>%\n filter(year == 1993) %>%\n ggplot(aes(x = air_temp_c,\n y = humidity)) +\n geom_miss_point() +\n scale_color_discrete_divergingx(palette=\"Zissou 1\") +\n theme_minimal() + \n theme(aspect.ratio=1)\nob_nomiss_below <- oceanbuoys %>%\n filter(year == 1993) %>%\n rename(st = sea_temp_c,\n at = air_temp_c,\n hu = humidity) %>%\n select(st, at, hu) %>%\n rowwise() %>%\n mutate(anymiss = factor(ifelse(naniar:::any_na(c(st, at, hu)), TRUE, FALSE))) %>%\n add_shadow(st, at, hu) %>%\n impute_below_if(.predicate = is.numeric) \nob_nomiss_mean <- oceanbuoys %>%\n filter(year == 1993) %>%\n rename(st = sea_temp_c,\n at = air_temp_c,\n hu = humidity) %>%\n select(st, at, hu) %>%\n rowwise() %>%\n mutate(anymiss = factor(ifelse(naniar:::any_na(c(st, at, hu)), TRUE, FALSE))) %>%\n add_shadow(st, at, hu) %>%\n impute_mean_if(.predicate = is.numeric) \nob_p_below <- ob_nomiss_below %>%\n ggplot(aes(x=st, y=hu, colour=anymiss)) +\n geom_point() +\n scale_color_discrete_divergingx(palette=\"Zissou 1\") +\n theme_minimal() + \n theme(aspect.ratio=1, legend.position = \"None\")\nob_p_mean <- ob_nomiss_mean %>%\n ggplot(aes(x=st, y=hu, colour=anymiss)) +\n geom_point() +\n scale_color_discrete_divergingx(palette=\"Zissou 1\") +\n theme_minimal() + \n theme(aspect.ratio=1, legend.position = \"None\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-summary=\"Code to make animation\"}\nanimate_xy(ob_nomiss_below[,1:3], col=ob_nomiss_below$anymiss)\nrender_gif(ob_nomiss_below[,1:3],\n grand_tour(),\n display_xy(col=ob_nomiss_below$anymiss), \n gif_file = \"gifs/missing_values1.gif\",\n frames = 100,\n width = 300, \n height = 300)\nrender_gif(ob_nomiss_mean[,1:3],\n grand_tour(),\n display_xy(col=ob_nomiss_mean$anymiss), \n gif_file = \"gifs/missing_values2.gif\",\n frames 
= 100,\n width = 300, \n height = 300)\n```\n:::\n\n\nMissing values can also pose a problem for high-dimensional visualisation, but they shouldn't just be ignored or removed. Methods used in 2D to display missings, such as placing them below the complete data as done in the `naniar` package [@naniar], don't translate well to high dimensions. \n@fig-missings-html illustrates this. It leads to artificial clustering of observations [@fig-below-highD]. It is better to impute the values, and mark them with colour when plotting. The cases are then included in the visualisation so we can assess the multivariate relationships, and also obtain some sense of how these cases should be handled, or imputed. In the example in @fig-imputed-highD we imputed the values simply, using the mean of the complete cases. We can see this is not an ideal approach for imputation for this data because some of the imputed values are outside the domain of the complete cases. \n\n::: {.content-visible when-format=\"html\"} \n\n::: {#fig-missings-html layout-ncol=2}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Missings below in 2D](A1-toolbox_files/figure-html/fig-missings-below-2D-html-1.png){#fig-missings-below-2D-html width=288}\n:::\n:::\n\n\n![Missings below in high-D](gifs/missing_values1.gif){#fig-below-highD width=40%}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Missings imputed in 2D](A1-toolbox_files/figure-html/fig-missings-mean-2D-html-1.png){#fig-missings-mean-2D-html width=288}\n:::\n:::\n\n\n![Missings imputed in high-D](gifs/missing_values2.gif){#fig-imputed-highD width=40%}\n\nWays to visualise missings for 2D don't transfer to higher dimensions. When the missings are set at 10% below the complete cases it appears to be clustered data when viewed in a tour (b). 
It is better to impute the value, and use colour to indicate that it is originally a missing value (d).\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"} \n\n::: {#fig-missings-pdf layout-ncol=2}\n\n![Missings below in 2-D](images/fig-missings-below-2D-pdf-1.png){#fig-below-2D-pdf width=40%}\n\n![Missings below in high-D](images/missing_values1.png){#fig-below-highD width=40%}\n\n\n![Missings imputed in 2-D](images/fig-missings-mean-2D-pdf-1.png){#fig-mean-2D-pdf width=40%}\n\n![Missings imputed in high-D](images/missing_values2.png){#fig-imputed-highD width=40%}\n\nWays to visualise missings for 2D don't transfer to higher dimensions. When the missings are set at 10% below the complete cases it appears to be clustered data when viewed in a tour (b). It is better to impute the value, and use colour to indicate that it is originally a missing value (d).\n:::\n:::\n\n### Context such as time and space\n\nWe occasionally hear statements like \"time is the fourth dimension\" or \"space is the fifth dimension\". This is not a useful way to think about dimensionality. \n\nIf you have data with spatial or temporal context, we recommend avoiding using the time index or the spatial coordinates along with the multiple variables in the tour. Time and space are different types of variables, and should not be combined with the multivariate measurements. \n\nFor multivariate temporal data, we recommend using a dependence tour, where one axis is reserved for the time index, and the other axis is used to tour on the multiple variables. For spatial data, we recommend using an image tour, where horizontal and vertical axes are used for spatial coordinates and colour of a tile is used for the tour of multiple variables.\n\n## Tours in other software\n\nThere are tours available in various software packages. For most examples we use the `tourr` package, but the same purpose could be achieved by using other software. 
We also use some of this software in the book, when the `tourr` package is not up to the task. For information about these packages, their websites are the best places to start:\n\n- [liminal](https://sa-lee.github.io/liminal/): to combine tours with (non-linear) dimension reduction algorithms.\n- [detourr](https://casperhart.github.io/detourr/): animations for {tourr} using `htmlwidgets` for performance and portability.\n- [langevitour](https://logarithmic.net/langevitour/): HTML widget that randomly tours projections of a high-dimensional dataset with an animated scatterplot.\n- [woylier](https://numbats.github.io/woylier/): alternative method for generating a tour path by interpolating between d-D frames in p-D space rather than d-D planes.\n- [spinifex](https://nspyrison.github.io/spinifex/): manual control of dynamic projections of numeric multivariate data.\n- [ferrn](https://huizezhang-sherry.github.io/ferrn/): extracts key components in the data object collected by the guided tour optimisation, and produces diagnostic plots.\n\n## Supporting software\n\n- [classifly](https://github.com/hadley/classifly): This package is used heavily for supervised classification. \n\nThe `explore()` function is used to explore the classification model. It will predict the class of a sample of points in the predictor space (`.TYPE=simulated`), and return this in a data frame with the observed data (`.TYPE=actual`). The variable `.BOUNDARY` indicates that a point is within a small distance of the classification boundary, when the value is `FALSE`. 
The variable `.ADVANTAGE` gives an indication of the confidence with which an observation is predicted, so can also be used to select simulated points near the boundary.\n\n", + "supporting": [ + "A1-toolbox_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/A1-toolbox/execute-results/tex.json b/_freeze/A1-toolbox/execute-results/tex.json new file mode 100644 index 0000000..3c50cc6 --- /dev/null +++ b/_freeze/A1-toolbox/execute-results/tex.json @@ -0,0 +1,21 @@ +{ + "hash": "5b5d47a005d48eaf640f11e873921ca9", + "result": { + "engine": "knitr", + "markdown": "# Toolbox {#toolbox}\n\n\n## Using tours in the `tourr` package\n\n### Installation\n\nYou can install the released version of `tourr` from [CRAN](https://CRAN.R-project.org) with:\n\n``` r\ninstall.packages(\"tourr\")\n```\n\nand the development version from the [GitHub repo](https://github.com/ggobi/tourr) with:\n\n``` r\n# install.packages(\"remotes\")\nremotes::install_github(\"ggobi/tourr\")\n```\n\n### Getting started\n\nTo run a tour in R, use one of the animate functions. The following code will show a 2D tour displayed as a scatterplot on a 6D data set with three labelled classes. \n\n``` r\nanimate_xy(flea[,-7], col=flea$species)\n```\n\n@tourr remains a good reference for learning more about this package. The package [website](http://ggobi.github.io/tourr/) has a list of current functionality. \n\n### Different tours\n\nThe two main components of the tour algorithm are the projection dimension which affects the choice of display to use, and the algorithm that delivers the projection sequence. The primary functions for these two parts are \n\n1. 
For display of different projection dimensions:\n\n- `display_dist()`: choice of density, histogram or average shifted histogram (ash) display of the 1D projections.\n- `display_xy()`, `display_density2d()`, `display_groupxy()`, `display_pca()`, `display_sage()`, `display_slice()`, `display_trails()`: choices in display of 2D projections.\n- `display_depth()`, `display_stereo()`: choices to display 3D projections. \n- `display_pcp()`, `display_scatmat()`, `display_stars()`, `display_faces()`: choices for displaying three or more variables.\n- `display_image()`: to use with multispectral images, where different combinations of spectral bands are displayed. See @WPS98 and @Symanzik2002NewAO for applications.\n- `display_andrews()`: 1D projections as Andrews curves.\n\n2. To change the way projections are delivered:\n\n- `grand_tour()`: Smooth sequence of random projections to view all possible projections as quickly as possible. Good for getting an overview of the high-dimensional data, especially when you don't know what you are looking for. \n- `guided_tour()`: Follow a projection pursuit optimisation to find projections that have particular patterns. This is used when you want to learn if the data has particular patterns, such as clustering or outliers. Use the `holes()` index to find projections with gaps that allow one to see clusters, or `lda_pp()` or `pda_pp()` when class labels are known and you want to find the projections where the clusters are separated. \n- `little_tour()`: Smoothly interpolate between pairs of variables, to show all the marginal views of the data.\n- `local_tour()`: Makes small movements around a chosen projection to explore a small neighbourhood. Very useful to learn if small distances away from a projection change the pattern substantially or not.\n- `radial_tour()`: Interpolates a chosen variable out of the projection, and then back into the projection. This is useful for assessing the importance of variables to the pattern in a projection. 
If the pattern changes a lot when the variable is rotated out, then the variable is important for producing it.\n- `dependence_tour()`: Delivers two sequences of 1D grand tours, to examine associations between two sets of variables. This is useful for displaying two groups of variables, as in multiple regression, multivariate regression or canonical correlation analysis, as two independent 1D projections.\n- `frozen_tour()`: This is an interesting one! It allows the coefficients for some variables to be fixed, while the others vary.\n\n### The importance of scale\n\nScaling of multivariate data is really important in many ways. It affects most model fitting, and can affect the perception of patterns when data is visualised. Here we describe a few scaling issues to take control of when using tours.\n\n\n**Pre-processing data**\n\nIt is generally useful to standardise your data to have mean 0 and variance-covariance equal to the identity matrix before using the tour. We use the tour to discover associations between variables. Characteristics of single variables should be examined and understood before embarking on looking for high-dimensional structure. \n\nThe `rescale` parameter in the `animate()` function will scale all variables to range between 0 and 1, prior to starting the tour. This forces all variables to have the same range. It is not the default, and without it, data with different ranges across variables may show some strange patterns. You should set `rescale=TRUE`. If you have already scaled the data yourself, even with a different scaling such as standardising the variables, then the default `rescale=FALSE` is best.\n\nA more severe transformation that can be useful prior to starting a tour is to **sphere** the data. This is also an option in the `animate()` function, but is `FALSE` by default. Sphering is the same as conducting a principal component analysis, and using the principal components as the variables. 
It removes all linear association between variables! This can be especially useful if you want to focus on finding non-linear associations, including clusters, and outliers. \n\n**Scaling to fit into plot region**\n\nThe `half_range` parameter in most of the display types sets the range used to scale the data into the plot. It is estimated when a tour is started, but you may need to change it if you find that the data keeps escaping the plot window or is not fully using the space. Space expands exponentially as dimension increases, and the estimation takes this into account. However, different distributions of data points lead to different variance of observations in high-dimensional space. A skewed distribution will be more varied than a normal distribution. It is hard to estimate precisely how the data should be scaled so that it fits nicely into the plot space for all projections viewed. \n\nThe `center` parameter is used to centre each projection by setting the mean to be at the middle of the plot space. With different distributions the mean of the data can vary around the plot region, and this can be distracting. Fixing the mean of each projection to always be at the center of the plot space makes it easier to focus on other patterns.\n\n### Saving your tour\n\nThe functions `save_history()` and `planned_tour()` allow the tour path to be pre-computed, and re-played in your chosen way. The tour path is saved as a list of projection vectors, which can also be passed to external software for displaying tours easily. 
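As a concrete illustration of saving and replaying, the following sketch pre-computes a short grand tour path and replays it with `planned_tour()`. It assumes the `tourr` package and its built-in `flea` data; the `animate_xy()` call is interactive, so run it in an R session.

```r
library(tourr)

# Save a small set of anchor bases from a grand tour on the
# six numeric variables of the flea data
set.seed(2023)
path <- save_history(flea[, 1:6], grand_tour(), max_bases = 5)

# Replay the saved path; the full sequence of in-between frames is
# generated by interpolation when the planned tour is animated
animate_xy(flea[, 1:6], planned_tour(path))
```

This is a sketch of the workflow described above, not code from the book; `max_bases` controls how many anchor projections are stored.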
Only a minimal set of projections is saved, by default, and a full interpolation path of projections can always be generated from it using the `interpolate()` function.\n\nVersions and elements of tours can be saved for publication using a variety of functions:\n\n- `render_gif()`: Save a tour as an animated gif, using the `gifski` package.\n- `render_proj()`: Save an object that can be used to produce a polished rendering of a single projection, possibly with `ggplot`.\n- `render_anim()`: Creates an object containing a sequence of projections that can be used with `plotly()` to produce an HTML animation, with interactive control.\n\n### Understanding your tour path\n\n@fig-tour-paths-pdf shows tour paths on 3D data spaces. For 1D projections the space of all possible projections is a $p$-dimensional sphere (@fig-tourpaths1d). For 2D projections the space of all possible projections is a $p\\times 2$-dimensional torus (@fig-tourpaths2d)! The geometry is elegant. \n\nIn these figures, the space is represented by the light colour, and is constructed by simulating a large number of random projections. The two darker colours indicate paths generated by a grand tour and a guided tour. The grand tour will cover the full space of all possible projections if allowed to run for some time. The guided tour will quickly converge to an optimal projection, so will cover only a small part of the overall space. \n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"} \n\n::: {#fig-tour-paths-html layout-ncol=2}\n\n![1D tour paths](gifs/tour_paths1d.gif){#fig-tourpaths1d width=48%}\n\n![2D tour paths](gifs/tour_paths2d.gif){#fig-tourpaths2d width=48%}\n\nGrand and guided tour paths of 1D and 2D projections of 3D data. The light points represent the space of all 1D and 2D projections respectively. 
You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection. \n:::\n\n:::\n\n\n::: {.content-visible when-format=\"pdf\"} \n\n::: {#fig-tour-paths-pdf layout-ncol=2}\n\n![1D tour paths](images/tour_paths1d.png){#fig-tourpaths1d width=210}\n\n![2D tour paths](images/tour_paths2d.png){#fig-tourpaths2d width=210}\n\nGrand and guided tour paths of 1D and 2D projections of 3D data. The light points represent the space of all 1D and 2D projections respectively. You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection. {{< fa play-circle >}}\n:::\n\n:::\n\n## What not to do\n\n### Discrete and categorical data\n\nTour methods are for numerical data, particularly real-valued measurements. If your data is numerical but discrete, it can look artificially clustered. @fig-discrete-data-pdf shows an example. The data is numeric but discrete, so it is ok to examine it in a tour. In this example, there is overplotting of observations, which produces the artificial clustering (plot a). It can be helpful to jitter observations, by adding a small amount of noise (plot b). This helps to remove the artificial clustering, but preserves the main pattern, which is the strong linear association. Generally, jittering is a useful tool for working with discrete data, so that you can focus on examining the multivariate association. If the data is categorical, with no natural ordering of categories, the tour is not advised.\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n::: {.content-visible when-format=\"html\"} \n\n::: {#fig-discrete-data-html layout-ncol=2}\n\n![Discrete data](gifs/discrete_data.gif){#fig-discrete width=40%}\n\n![Jittered data](gifs/jittered_data.gif){#fig-jittered width=40%}\n\n\nDiscrete data can look like clusters, which is misleading. 
Adding a small amount of jitter (random number) can help. The noise is not meaningful but it could allow the viewer to focus on linear or non-linear association between variables without being distracted by artificial clustering. \n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"} \n\n::: {#fig-discrete-data-pdf layout-ncol=2}\n\n![Discrete data](images/discrete_data.png){#fig-discrete width=220}\n\n![Jittered data](images/jittered_data.png){#fig-jittered width=220}\n\n\nDiscrete data can look like clusters, which is misleading. Adding a small amount of jitter (random number) can help. The noise is not meaningful but it could allow the viewer to focus on linear or non-linear association between variables without being distracted by artificial clustering. {{< fa play-circle >}}\n:::\n:::\n\n### Missing values\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\nMissing values can also pose a problem for high-dimensional visualisation, but they shouldn't just be ignored or removed. Methods used in 2D to display missings, such as placing them below the range of the complete data as done in the `naniar` package [@naniar], don't translate well to high dimensions. \n@fig-missings-pdf illustrates this. It leads to artificial clustering of observations (@fig-below-highD). It is better to impute the values, and mark them with colour when plotting. The cases are then included in the visualisation so we can assess the multivariate relationships, and also obtain some sense of how these cases should be handled, or imputed. In the example in @fig-imputed-highD we imputed the values simply, using the mean of the complete cases. We can see this is not an ideal approach for imputation for this data because some of the imputed values are outside the domain of the complete cases. 
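The impute-and-flag strategy described above can be sketched in base R. The data here is hypothetical; in practice the flag column would be mapped to colour in the tour display.

```r
# Hypothetical data with a few missing values in x2
set.seed(7)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$x2[c(3, 11)] <- NA

# Record which cases were missing, so they can be coloured in plots,
# then impute with the mean of the complete cases
d$status <- ifelse(is.na(d$x2), "imputed", "observed")
d$x2[is.na(d$x2)] <- mean(d$x2, na.rm = TRUE)

table(d$status)
```

As the text notes, mean imputation is simplistic; the point is to keep the cases in the display, with a flag for colouring, rather than dropping them.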
\n\n::: {.content-visible when-format=\"html\"} \n\n::: {#fig-missings-html layout-ncol=2}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Missings below in 2D](A1-toolbox_files/figure-pdf/fig-missings-below-2D-html-1.pdf){#fig-missings-below-2D-html width=80%}\n:::\n:::\n\n\n\n![Missings below in high-D](gifs/missing_values1.gif){#fig-below-highD width=40%}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Missings imputed in 2D](A1-toolbox_files/figure-pdf/fig-missings-mean-2D-html-1.pdf){#fig-missings-mean-2D-html width=80%}\n:::\n:::\n\n\n\n![Missings imputed in high-D](gifs/missing_values2.gif){#fig-imputed-highD width=40%}\n\nWays to visualise missings for 2D don't transfer to higher dimensions. When the missings are set at 10% below the complete cases it appears to be clustered data when viewed in a tour (b). It is better to impute the value, and use colour to indicate that it is originally a missing value (d).\n:::\n:::\n\n::: {.content-visible when-format=\"pdf\"} \n\n::: {#fig-missings-pdf layout-ncol=2}\n\n![Missings below in 2-D](images/fig-missings-below-2D-pdf-1.png){#fig-below-2D-pdf width=210}\n\n![Missings below in high-D](images/missing_values1.png){#fig-below-highD width=210}\n\n\n![Missings imputed in 2-D](images/fig-missings-mean-2D-pdf-1.png){#fig-mean-2D-pdf width=210}\n\n![Missings imputed in high-D](images/missing_values2.png){#fig-imputed-highD width=210}\n\nWays to visualise missings for 2D don't transfer to higher dimensions. When the missings are set at 10% below the complete cases it appears to be clustered data when viewed in a tour (b). It is better to impute the value, and use colour to indicate that it is originally a missing value (d). {{< fa play-circle >}}\n:::\n:::\n\n### Context such as time and space\n\nWe occasionally hear statements like \"time is the fourth dimension\" or \"space is the fifth dimension\". This is not a useful way to think about dimensionality. 
\n\nIf you have data with spatial or temporal context, we recommend against using the time index or the spatial coordinates along with the multiple variables in the tour. Time and space are different types of variables, and should not be combined with the multivariate measurements. \n\nFor multivariate temporal data, we recommend using a dependence tour, where one axis is reserved for the time index, and the other axis is used to tour on the multiple variables. For spatial data, we recommend using an image tour, where horizontal and vertical axes are used for spatial coordinates and colour of a tile is used for the tour of multiple variables.\n\n## Tours in other software\n\nThere are tours available in various software packages. For most examples we use the `tourr` package, but the same purpose could be achieved by using other software. We also use some of this software in this book, when the `tourr` package is not up for the task. For information about these packages, their websites are the best places to start:\n\n- [liminal](https://sa-lee.github.io/liminal/): to combine tours with (non-linear) dimension reduction algorithms.\n- [detourr](https://casperhart.github.io/detourr/): animations for {tourr} using `htmlwidgets` for performance and portability.\n- [langevitour](https://logarithmic.net/langevitour/): HTML widget that shows tour projections of a high-dimensional dataset with an animated scatterplot.\n- [woylier](https://numbats.github.io/woylier/): alternative method for generating a tour path by interpolating between d-D frames in p-D space rather than d-D planes.\n- [spinifex](https://nspyrison.github.io/spinifex/): manual control of dynamic projections of numeric multivariate data.\n- [ferrn](https://huizezhang-sherry.github.io/ferrn/): extracts key components in the data object collected by the guided tour optimisation, and produces diagnostic plots.\n\n## Supporting software\n\n- [classifly](https://github.com/hadley/classifly): This package is used
heavily for supervised classification. \n\nThe `explore()` function is used to explore the classification model. It will predict the class of a sample of points in the predictor space (`.TYPE=simulated`), and return this in a data frame with the observed data (`.TYPE=actual`). The variable `.BOUNDARY` indicates whether a point is within a small distance of the classification boundary. The variable `.ADVANTAGE` gives an indication of the confidence with which an observation is predicted, so can also be used to select simulated points near the boundary.\n\n", + "supporting": [ + "A1-toolbox_files/figure-pdf" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": { + "knitr": [ + "{\"type\":\"list\",\"attributes\":{},\"value\":[]}" + ] + }, + "preserve": null, + "postProcess": false + } +} \ No newline at end of file diff --git a/_freeze/A1-toolbox/figure-html/fig-missings-below-2D-html-1.png b/_freeze/A1-toolbox/figure-html/fig-missings-below-2D-html-1.png new file mode 100644 index 0000000..67c639d Binary files /dev/null and b/_freeze/A1-toolbox/figure-html/fig-missings-below-2D-html-1.png differ diff --git a/_freeze/A1-toolbox/figure-html/fig-missings-mean-2D-html-1.png b/_freeze/A1-toolbox/figure-html/fig-missings-mean-2D-html-1.png new file mode 100644 index 0000000..f9ce16d Binary files /dev/null and b/_freeze/A1-toolbox/figure-html/fig-missings-mean-2D-html-1.png differ diff --git a/_freeze/A1-toolbox/figure-pdf/fig-missings-below-2D-html-1.pdf b/_freeze/A1-toolbox/figure-pdf/fig-missings-below-2D-html-1.pdf new file mode 100644 index 0000000..418c46e Binary files /dev/null and b/_freeze/A1-toolbox/figure-pdf/fig-missings-below-2D-html-1.pdf differ diff --git a/_freeze/A1-toolbox/figure-pdf/fig-missings-below-2D-pdf-1.pdf b/_freeze/A1-toolbox/figure-pdf/fig-missings-below-2D-pdf-1.pdf new file mode 100644 index 0000000..b331239 Binary files /dev/null and
b/_freeze/A1-toolbox/figure-pdf/fig-missings-below-2D-pdf-1.pdf differ diff --git a/_freeze/A1-toolbox/figure-pdf/fig-missings-mean-2D-html-1.pdf b/_freeze/A1-toolbox/figure-pdf/fig-missings-mean-2D-html-1.pdf new file mode 100644 index 0000000..061a2df Binary files /dev/null and b/_freeze/A1-toolbox/figure-pdf/fig-missings-mean-2D-html-1.pdf differ diff --git a/_freeze/A1-toolbox/figure-pdf/fig-missings-mean-2D-pdf-1.pdf b/_freeze/A1-toolbox/figure-pdf/fig-missings-mean-2D-pdf-1.pdf new file mode 100644 index 0000000..7f27bd5 Binary files /dev/null and b/_freeze/A1-toolbox/figure-pdf/fig-missings-mean-2D-pdf-1.pdf differ diff --git a/_freeze/A2-data/execute-results/html.json b/_freeze/A2-data/execute-results/html.json new file mode 100644 index 0000000..93967b6 --- /dev/null +++ b/_freeze/A2-data/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "df47641a35dde03935e37c145c470ab2", + "result": { + "engine": "knitr", + "markdown": "# Data {#data}\n\nThis chapter describes the datasets used throughout the book as listed in @tbl-datalist-html. \n\n\n::: {.cell}\n\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n::: {#tbl-datalist-html .cell tbl-cap='List of data sets and their sources used in the book examples.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>Name</th><th>Description</th><th>Source</th><th>Analysis</th></tr>
</thead>
<tbody>
<tr><td>aflw</td><td>Player statistics from the AFLW</td><td>mulgar</td><td>clustering, dimension reduction</td></tr>
<tr><td>bushfires</td><td>Multivariate spatio-temporal data for locations of bushfires</td><td>mulgar</td><td>clustering, classification with RF</td></tr>
<tr><td>Australian election data</td><td>Socioeconomic characteristics of Australian electorates</td><td>https://github.com/jforbes14/eechidna-paper</td><td>dimension reduction, multicollinearity</td></tr>
<tr><td>penguins</td><td>Measurements of four physical characteristics of three species of penguins</td><td>https://allisonhorst.github.io/palmerpenguins/</td><td>classification, clustering</td></tr>
<tr><td>pisa</td><td>OECD programme for international student assessment data</td><td>learningtower</td><td>dimension reduction, regression</td></tr>
<tr><td>sketches</td><td>Google's Quickdraw data</td><td>mulgar</td><td>neural networks, classification</td></tr>
<tr><td>multicluster</td><td>Simulated data used to show various cluster examples</td><td>mulgar</td><td>clustering</td></tr>
<tr><td>fake trees</td><td>Simulated data showing branching structure</td><td>mulgar</td><td>clustering, dimension reduction</td></tr>
<tr><td>plane and box</td><td>Simulated data showing hyper-planes</td><td>mulgar</td><td>dimension reduction</td></tr>
<tr><td>cluster</td><td>Simulated data with various clustering</td><td>mulgar</td><td>clustering</td></tr>
<tr><td>c1-c7</td><td>Simulated data with various clustering, challenge data</td><td>mulgar</td><td>clustering</td></tr>
<tr><td>fashion MNIST</td><td>Collection of apparel images</td><td>https://github.com/zalandoresearch/fashion-mnist</td><td>classification</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n::: {#tbl-datalist-pdf .cell tbl-cap='List of data sets and their sources used in the book examples.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{>{\\raggedright\\arraybackslash}p{2cm}>{\\raggedright\\arraybackslash}p{5cm}>{\\raggedright\\arraybackslash}p{5cm}}\n\\toprule\nName & Description & Analysis\\\\\n\\midrule\naflw & Player statistics from the AFLW & clustering, dimension reduction\\\\\nbushfires & Multivariate spatio-temporal data for locations of bushfires & clustering, classification with RF\\\\\nAustralian election data & Socioeconomic characteristics of Australian electorates & dimension reduction, multicollinearity\\\\\npenguins & Measurements of four physical characteristics of three species of penguins & classification, clustering\\\\\npisa & OECD programme for international student assessment data & dimension reduction, regression\\\\\n\\addlinespace\nsketches & Google's Quickdraw data & neural networks, classification\\\\\nmulticluster & Simulated data used to show various cluster examples & clustering\\\\\nfake trees & Simulated data showing branching structure & clustering, dimension reduction\\\\\nplane and box & Simulated data showing hyper-planes & dimension reduction\\\\\ncluster & Simulated data with various clustering & clustering\\\\\n\\addlinespace\nc1-c7 & Simulated data with various clustering, challenge data & clustering\\\\\nfashion MNIST & Collection of apparel images & classification\\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n:::\n\n## Australian Football League Women\n\n### Description {-}\n\nThe `aflw` data is from the 2021 Women's Australian Football League. These are average player statistics across the season, with game statistics provided by the [fitzRoy](https://jimmyday12.github.io/fitzRoy/) package. 
If you are new to the game of AFL, there is a nice explanation on [Wikipedia](https://en.wikipedia.org/wiki/Women%27s_Australian_rules_football). \n\n### Variables {-}\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 381\nColumns: 35\n$ id \"CD_I1001678\", \"CD_I1001679\", \"CD_I1001681\", \"CD_I1001…\n$ given_name \"Jordan\", \"Brianna\", \"Jodie\", \"Ebony\", \"Emma\", \"Pepa\",…\n$ surname \"Zanchetta\", \"Green\", \"Hicks\", \"Antonio\", \"King\", \"Ran…\n$ number 2, 3, 5, 12, 60, 21, 22, 23, 35, 14, 3, 8, 16, 12, 19,…\n$ team \"Brisbane Lions\", \"West Coast Eagles\", \"GWS Giants\", \"…\n$ position \"INT\", \"INT\", \"HFFR\", \"WL\", \"RK\", \"BPL\", \"INT\", \"INT\",…\n$ time_pct 63.00000, 61.25000, 76.50000, 74.90000, 85.10000, 77.4…\n$ goals 0.0000000, 0.0000000, 0.0000000, 0.1000000, 0.6000000,…\n$ behinds 0.0000000, 0.0000000, 0.5000000, 0.4000000, 0.4000000,…\n$ kicks 5.000000, 2.500000, 3.750000, 8.800000, 4.100000, 3.22…\n$ handballs 2.500000, 3.750000, 3.000000, 3.600000, 2.700000, 2.22…\n$ disposals 7.500000, 6.250000, 6.750000, 12.400000, 6.800000, 5.4…\n$ marks 1.5000000, 0.2500000, 1.0000000, 3.7000000, 2.2000000,…\n$ bounces 0.0000000, 0.0000000, 0.0000000, 0.6000000, 0.1000000,…\n$ tackles 3.000000, 2.250000, 2.250000, 3.900000, 2.000000, 1.77…\n$ contested 3.500000, 2.250000, 3.500000, 5.700000, 4.400000, 2.66…\n$ uncontested 3.500000, 4.500000, 3.000000, 7.000000, 2.800000, 1.77…\n$ possessions 7.000000, 6.750000, 6.500000, 12.700000, 7.200000, 4.4…\n$ marks_in50 1.0000000, 0.0000000, 0.2500000, 0.5000000, 0.9000000,…\n$ contested_marks 1.0000000, 0.0000000, 0.0000000, 0.4000000, 1.2000000,…\n$ hitouts 0.0000000, 0.0000000, 0.0000000, 0.0000000, 19.4000000…\n$ one_pct 0.0000000, 1.5000000, 0.5000000, 1.2000000, 2.6000000,…\n$ disposal 60.25000, 67.15000, 37.20000, 65.96000, 61.72000, 66.8…\n$ clangers 2.000000, 0.500000, 2.500000, 3.100000, 2.400000, 1.33…\n$ frees_for 1.0000000, 0.5000000, 
0.2500000, 2.5000000, 0.5000000,…\n$ frees_against 1.0000000, 0.5000000, 1.2500000, 1.3000000, 1.1000000,…\n$ rebounds_in50 0.0000000, 0.5000000, 0.2500000, 1.1000000, 0.0000000,…\n$ assists 0.00000000, 0.00000000, 0.00000000, 0.20000000, 0.2000…\n$ accuracy 0.00000, 0.00000, 0.00000, 5.00000, 30.00000, 0.00000,…\n$ turnovers 1.500000, 1.000000, 2.500000, 4.000000, 1.700000, 1.22…\n$ intercepts 2.0000000, 2.0000000, 0.5000000, 5.3000000, 1.3000000,…\n$ tackles_in50 0.5000000, 0.0000000, 0.7500000, 0.5000000, 0.5000000,…\n$ shots 0.5000000, 0.0000000, 0.7500000, 1.0000000, 1.2000000,…\n$ metres 72.50000, 58.50000, 76.00000, 225.90000, 89.80000, 76.…\n$ clearances 0.5000000, 0.2500000, 1.2500000, 0.4000000, 0.9000000,…\n```\n\n\n:::\n:::\n\n\n### Purpose {-}\n\nThe primary analysis is to summarise the variation using principal component analysis, which gives information about relationships between the statistics or skills sets common in players. One also might be tempted to cluster the players, but there are no obvious clusters so it could be frustrating. At best one could partition the players into groups, while recognising there are no absolutely distinct and separated groups.\n\n### Source {-}\n\nSee the information provided with the [fitzRoy](https://jimmyday12.github.io/fitzRoy/) package.\n\n### Pre-processing {-}\n\nThe code for downloading and pre-processing the data is available at the [mulgar](https://dicook.github.io/mulgar/) website in the `data-raw` folder. The data provided by the fitzRoy package was pre-processed to reduce the variables to only those that relate to player skills and performance. It is possible that using some transformations on the variables would be useful to make them less skewed. \n\n## Bushfires\n\n### Description {-}\n\nThis data was collated by Weihao (Patrick) Li as part of his Honours research at Monash University. 
It contains fire ignitions as detected from satellite hotspots, and processed using the [spotoroo](https://tengmcing.github.io/spotoroo/) package, augmented with measurements on weather, vegetation, proximity to human activity. The cause variable is predicted based on historical fire ignition data collected by County Fire Authority personnel.\n\n### Variables {-}\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 1,021\nColumns: 60\n$ id 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…\n$ lon 141.1300, 141.3000, 141.4800, 147.1600, 148.1050, 144.18…\n$ lat -37.13000, -37.65000, -37.35000, -37.85000, -37.57999, -…\n$ time 2019-10-01, 2019-10-01, 2019-10-02, 2019-10-02, 2019-10…\n$ FOR_CODE 41, 41, 91, 44, 0, 44, 0, 102, 0, 91, 45, 41, 45, 45, 45…\n$ FOR_TYPE \"Eucalypt Medium Woodland\", \"Eucalypt Medium Woodland\", …\n$ FOR_CAT \"Native forest\", \"Native forest\", \"Commercial plantation…\n$ COVER 1, 1, 4, 2, 6, 2, 6, 5, 6, 4, 2, 1, 2, 2, 2, 2, 6, 6, 6,…\n$ HEIGHT 2, 2, 4, 2, 6, 2, 6, 5, 6, 4, 3, 2, 3, 3, 3, 2, 6, 6, 6,…\n$ FOREST 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,…\n$ rf 0.0, 0.0, 15.4, 4.8, 6.0, 11.6, 11.6, 0.6, 0.2, 0.6, 0.0…\n$ arf7 5.0857143, 2.4000000, 2.4000000, 0.7142857, 0.8571429, 1…\n$ arf14 2.8142857, 1.7428571, 1.8000000, 1.6714286, 1.5714286, 1…\n$ arf28 1.9785714, 1.5357143, 1.5357143, 3.7857143, 1.9000000, 1…\n$ arf60 2.3033333, 1.7966667, 1.7966667, 4.0000000, 2.5333333, 1…\n$ arf90 1.2566667, 1.0150000, 1.0150000, 2.9600000, 2.1783333, 1…\n$ arf180 0.9355556, 0.8444444, 0.8444444, 2.3588889, 1.7866667, 1…\n$ arf360 1.3644444, 1.5255556, 1.5255556, 1.7272222, 1.4716667, 1…\n$ arf720 1.3011111, 1.5213889, 1.5213889, 1.7111111, 1.5394444, 1…\n$ se 3.8, 4.6, 14.2, 23.7, 23.8, 16.8, 18.0, 12.9, 14.7, 12.9…\n$ ase7 18.02857, 18.50000, 21.41429, 23.08571, 23.11429, 22.014…\n$ ase14 17.03571, 17.44286, 18.03571, 19.17143, 18.45714, 18.628…\n$ ase28 19.32857, 18.47500, 19.33929, 18.23571, 16.86071, 
19.375…\n$ ase60 20.38644, 19.99153, 20.39492, 19.90847, 19.26780, 20.449…\n$ ase90 22.54118, 21.93193, 22.04370, 20.59328, 20.04538, 21.809…\n$ ase180 20.79106, 19.93966, 19.99385, 19.11006, 18.66760, 19.810…\n$ ase360 15.55153, 14.83259, 14.87883, 14.69276, 14.44318, 14.755…\n$ ase720 15.52350, 14.75049, 14.77427, 14.53463, 14.32656, 14.540…\n$ maxt 21.3, 17.8, 15.4, 20.8, 19.8, 15.8, 19.5, 12.6, 18.8, 12…\n$ amaxt7 22.38571, 20.44286, 22.21429, 24.21429, 23.14286, 21.671…\n$ amaxt14 21.42857, 19.72857, 19.86429, 21.80000, 20.89286, 19.578…\n$ amaxt28 20.71071, 19.10000, 19.18929, 19.75000, 19.05714, 18.885…\n$ amaxt60 24.02667, 22.28000, 22.38667, 22.93167, 22.12000, 21.031…\n$ amaxt90 27.07750, 25.77667, 25.89833, 24.93667, 23.93750, 23.164…\n$ amaxt180 26.92000, 25.92722, 25.98500, 24.84056, 23.95389, 23.343…\n$ amaxt360 21.55389, 20.79778, 20.81333, 20.21972, 19.99389, 19.505…\n$ amaxt720 21.47750, 20.57222, 20.57694, 20.13153, 20.03875, 19.650…\n$ mint 9.6, 9.0, 7.3, 7.7, 8.3, 8.3, 6.1, 5.9, 7.4, 5.9, 6.9, 7…\n$ amint7 9.042857, 7.971429, 9.171429, 10.328571, 11.200000, 10.6…\n$ amint14 9.928571, 9.235714, 9.421429, 10.007143, 10.900000, 10.7…\n$ amint28 8.417857, 7.560714, 7.353571, 8.671429, 9.575000, 10.060…\n$ amint60 11.156667, 9.903333, 9.971667, 10.971667, 11.975000, 12.…\n$ amint90 11.96667, 10.81250, 10.87833, 12.49000, 13.46167, 13.638…\n$ amint180 11.96778, 11.01056, 11.02000, 12.41944, 13.42500, 13.695…\n$ amint360 9.130556, 8.459722, 8.448333, 9.588611, 10.456389, 11.03…\n$ amint720 8.854861, 8.266250, 8.254028, 9.674861, 10.517083, 10.96…\n$ dist_cfa 9442.206, 6322.438, 7957.374, 7790.785, 10692.055, 6054.…\n$ dist_camp 50966.485, 6592.893, 31767.235, 8816.272, 15339.702, 941…\n$ ws 1.263783, 1.263783, 1.456564, 5.424445, 4.219751, 4.1769…\n$ aws_m0 2.644795, 2.644795, 2.644795, 5.008369, 3.947659, 5.2316…\n$ aws_m1 2.559202, 2.559202, 2.559202, 5.229680, 4.027398, 4.9704…\n$ aws_m3 2.446211, 2.446211, 2.446211, 5.386005, 3.708622, 5.3045…\n$ 
aws_m6 2.144843, 2.144843, 2.144843, 5.132617, 3.389890, 5.0355…\n$ aws_m12 2.545008, 2.545008, 2.548953, 5.045297, 3.698736, 5.2341…\n$ aws_m24 2.580671, 2.580671, 2.584047, 5.081100, 3.745286, 5.2522…\n$ dist_road 498.75145, 102.22032, 1217.22446, 281.69151, 215.56176, …\n$ log_dist_cfa 9.152945, 8.751860, 8.981854, 8.960697, 9.277256, 8.7084…\n$ log_dist_camp 10.838924, 8.793748, 10.366191, 9.084354, 9.638200, 9.15…\n$ log_dist_road 6.212108, 4.627130, 7.104329, 5.640813, 5.373247, 5.0047…\n$ cause \"lightning\", \"lightning\", \"lightning\", \"lightning\", \"lig…\n```\n\n\n:::\n:::\n\n\n### Purpose {-}\n\nThe primary goal is to predict the cause of the bushfire using the weather and distance from human activity variables provided. \n\n### Source {-}\n\nCollated data was part of Weihao Li's Honours thesis, which is not publicly available. The hotspots data was collected from @jaxa, climate data was taken from the Australian Bureau of Meteorology using the `bomrang` package [@R-bomrang], wind data from @wind and @windasos, vegetation data from @forest, distance from roads calculated using @OpenStreetMap, CFA stations from @cfa, and campsites from @recreation. The cause was predicted from training data provided by @fireorigin.\n\n### Pre-processing {-}\n\nThe 60 variables are too many to view with a tour, so it should be pre-processed using principal component analysis. The categorical variables of FOR_TYPE and FOR_CAT are removed. It would be possible to keep these if they are converted to dummy (binary variables).\n\n\n## Australian election data\n\n### Description {-}\n\nThis is data from a study on the relationship between voting patterns and socio-demographic characteristics of Australian electorates reported in @eechidna. These are the predictor variables upon which voting percentages are modelled. 
There are two years of data in `oz_election_2001` and `oz_election_2016`.\n\n### Variables {-}\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nload(\"data/oz_election_2001.rda\")\nload(\"data/oz_election_2016.rda\")\nglimpse(oz_election_2001)\n```\n:::\n\n\n### Purpose {-}\n\nThe tour is used to check for multicollinearity between predictors that might adversely affect the linear model fit. \n\n### Source {-}\n\nThe data was compiled from the Australian Electoral Commission (AEC) and the Australian Bureau of Statistics (ABS). Code to construct the data, and the original data are available at https://github.com/jforbes14/eechidna-paper. \n\n### Pre-processing {-}\n\nConsiderable pre-processing was done to produce these data sets. The original data was wrangled into tidy form, some variables were log transformed to reduce skewness, and a subset of variables was chosen. \n\n## Palmer penguins\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(palmerpenguins)\npenguins <- penguins %>%\n na.omit() # 11 observations out of 344 removed\n# use only vars of interest, and standardise\n# them for easier interpretation\npenguins_sub <- penguins %>% \n select(bill_length_mm,\n bill_depth_mm,\n flipper_length_mm,\n body_mass_g,\n species, \n sex) %>% \n mutate(across(where(is.numeric), ~ scale(.)[,1])) %>%\n rename(bl = bill_length_mm,\n bd = bill_depth_mm,\n fl = flipper_length_mm,\n bm = body_mass_g)\nsave(penguins_sub, file=\"data/penguins_sub.rda\")\n```\n:::\n\n\n### Description {-}\n\nThis data measures four physical characteristics of three species of penguins. 
\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`bl` | a number denoting bill length (millimeters) |\n|`bd` | a number denoting bill depth (millimeters) |\n|`fl` | an integer denoting flipper length (millimeters) |\n|`bm` | an integer denoting body mass (grams) |\n|`species` | a factor denoting penguin species (Adélie, Chinstrap and Gentoo) |\n\n### Purpose {-}\n\nThe primary goal is to find a combination of the four variables where the three species are distinct. This is also a useful data set to illustrate cluster analysis.\n\n### Source {-}\n\nDetails of the penguins data can be found at [https://allisonhorst.github.io/palmerpenguins/](https://allisonhorst.github.io/palmerpenguins/), and @R-palmerpenguins is the package source.\n\n### Pre-processing {-}\n\nThe data is loaded from the `palmerpenguins` package. The four physical measurement variables and the species are selected, and the penguins with missing values are removed. Variables are standardised, and their names are shortened.\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(palmerpenguins)\npenguins <- penguins %>%\n na.omit() # 11 observations out of 344 removed\n# use only vars of interest, and standardise\n# them for easier interpretation\npenguins_sub <- penguins[,c(3:6, 1)] %>% \n mutate(across(where(is.numeric), ~ scale(.)[,1])) %>%\n rename(bl = bill_length_mm,\n bd = bill_depth_mm,\n fl = flipper_length_mm,\n bm = body_mass_g) %>%\n as.data.frame()\nsave(penguins_sub, file=\"data/penguins_sub.rda\")\n```\n:::\n\n\n\n## Program for International Student Assessment \n\n### Description {-}\n\nThe `pisa` data contains plausible scores for math, reading and science of Australian and Indonesian students from the 2018 testing cycle. 
The plausible scores are simulated from a model fitted to the original data, to preserve privacy of the students.\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`CNT` | country, either AUS for Australia or IDN for Indonesia |\n|`PV1MATH`-`PV10MATH` | plausible scores for math |\n|`PV1READ`-`PV10READ` | plausible scores for reading |\n|`PV1SCIE`-`PV10SCIE` | plausible scores for science |\n\n### Purpose {-}\n\nPrimarily this data is useful as an example for dimension reduction.\n\n### Source {-}\n\nThe full data is available from https://www.oecd.org/pisa/. There are records of the student test scores, along with survey data from the students, their households and their schools.\n\n### Pre-processing {-}\n\nThe data was reduced to country and the plausible scores, and filtered to the two countries. It may be helpful to know that the SPSS format data was used, and was read into R using the `read_sav()` function in the `haven` package. \n\n## Sketches \n\n### Description {-}\n\nThis data is a subset of images from https://quickdraw.withgoogle.com. The subset was created using the quickdraw R package at https://huizezhang-sherry.github.io/quickdraw/. It has 6 different groups: banana, boomerang, cactus, crab, flip flops, kangaroo. Each image is 28x28 pixels. The `sketches_train` data would be used to train a classification model, and the unlabelled `sketches_test` can be used for prediction. \n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`V1-V784` | grey scale 0-255 |\n|`word` | what the person was asked to draw, NA in the test data |\n|`id` | unique id for each sketch |\n\n### Purpose {-}\n\nPrimarily this data is useful as an example for supervised classification, and also dimension reduction.\n\n### Source {-}\n\nThe full data is available from https://quickdraw.withgoogle.com. \n\n### Pre-processing {-}\n\nIt is typically useful to pre-process this data into principal components. 
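One way this pre-processing might be sketched (an illustrative example rather than the book's exact code; keeping five PCs here is an arbitrary choice) is with `prcomp()`:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(mulgar)\ndata(\"sketches_train\")\n# PCA on the 784 pixel columns; the pixels share a common\n# 0-255 scale, so the default unstandardised PCA is used\nsketches_pca <- prcomp(sketches_train[, 1:784])\n# keep the leading PCs, carrying along the labels\nsketches_pcs <- as.data.frame(sketches_pca$x[, 1:5])\nsketches_pcs$word <- sketches_train$word\n```\n:::\n\n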
This code can also be useful for plotting one of the sketches in a recognisable form:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(mulgar)\nlibrary(ggplot2)\ndata(\"sketches_train\")\nset.seed(77)\nx <- sketches_train[sample(1:nrow(sketches_train), 1), ]\nxm <- data.frame(gry=t(as.matrix(x[,1:784])),\n x=rep(1:28, 28),\n y=rep(28:1, rep(28, 28)))\nggplot(xm, aes(x=x, y=y, fill=gry)) +\n geom_tile() +\n scale_fill_gradientn(colors = gray.colors(256, \n start = 0, \n end = 1, \n rev = TRUE )) +\n ggtitle(x$word) +\n theme_void() + \n theme(legend.position=\"none\")\n```\n\n::: {.cell-output-display}\n![One of the sketches in the subset of training data.](A2-data_files/figure-html/fig-sketches-1.png){#fig-sketches fig-align='center' width=20%}\n:::\n:::\n\n\n## `multicluster`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(mulgar)\ndata(\"multicluster\")\n```\n:::\n\n\n### Description {-}\n\nThis data has 10 numeric variables, and a class variable labelling groups.\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`group` | cluster label |\n|`x1-x10` | numeric variables |\n\n### Purpose {-}\n\nThe primary goal is to find the different clusters. 
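A quick way to look for them (a sketch, assuming the `group` label is the first column as listed above) is to run a grand tour with the `tourr` package:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(mulgar)\nlibrary(tourr)\ndata(\"multicluster\")\n# grand tour of the 10 numeric variables, with points\n# coloured by the known group label\nanimate_xy(multicluster[, 2:11], col = multicluster$group)\n```\n:::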
\n\n### Source {-}\n\nThis data is originally from http://ifs.tuwien.ac.at/dm/download/multiChallenge-matrix.txt, and provided as a challenge for non-linear dimension reduction. It was used as an example in Lee, Laa, Cook (2023) https://doi.org/10.52933/jdssv.v2i3.\n\n## `clusters`, `clusters_nonlin`, `simple_clusters`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(mulgar)\ndata(\"clusters\")\ndata(\"clusters_nonlin\")\ndata(\"simple_clusters\")\n```\n:::\n\n\n### Description {-}\n\nThis data has a varying number of numeric variables, and a class variable labelling the clusters.\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`x1-x5` | numeric variables |\n|`cl` | cluster label |\n\n### Purpose {-}\n\nThe primary goal is to find the different clusters. \n\n### Source {-}\n\nSimulated using the code in the `simulate.R` file of the `data-raw` directory of the `mulgar` package.\n\n## `plane`, `plane_nonlin`, `box`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(mulgar)\ndata(\"plane\")\ndata(\"plane_nonlin\")\ndata(\"box\")\n```\n:::\n\n\n### Description {-}\n\nThis data has a varying number of numeric variables. \n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`x1-x5` | numeric variables |\n\n### Purpose {-}\n\nThe primary goal is to understand how many dimensions the data spreads out in. \n\n### Source {-}\n\nSimulated using the code in the `simulate.R` file of the `data-raw` directory of the `mulgar` package.\n\n## Additional data used in the book\n\n@tbl-datalinks-html lists additional data available on the book web site at https://dicook.github.io/mulgar_book/data. \n\n\n::: {.cell}\n\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n::: {#tbl-datalinks-html .cell tbl-cap='Links to other data sets used in the book.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr><th> Description </th><th> Link </th></tr>\n</thead>\n<tbody>\n
<tr><td> Saved 2D tour path for the aflw data </td><td> <a href='https://dicook.github.io/mulgar_book/data/aflw_pct.rda'> aflw_pct.rda </a> </td></tr>\n
<tr><td> Saved clusters of the penguins data from detourr </td><td> <a href='https://dicook.github.io/mulgar_book/data/detourr_penguins.csv'> detourr_penguins.csv </a> </td></tr>\n
<tr><td> Saved clusters of the fake trees data from detourr </td><td> <a href='https://dicook.github.io/mulgar_book/data/fake_trees_sb.csv'> fake_trees_sb.csv </a> </td></tr>\n
<tr><td> Tidied penguins data </td><td> <a href='https://dicook.github.io/mulgar_book/data/penguins_sub.rda'> penguins_sub.rda </a> </td></tr>\n
<tr><td> Saved 2D tour path for penguins data </td><td> <a href='https://dicook.github.io/mulgar_book/data/penguins_tour_path.rda'> penguins_tour_path.rda </a> </td></tr>\n
<tr><td> risk survey </td><td> <a href='https://dicook.github.io/mulgar_book/data/risk_MSA.rds'> risk_MSA.rds </a> </td></tr>\n
<tr><td> penguins NN model </td><td> <a href='https://dicook.github.io/mulgar_book/data/penguins_cnn'> penguins_cnn </a> </td></tr>\n
<tr><td> fashion MNIST NN model </td><td> <a href='https://dicook.github.io/mulgar_book/data/fashion_cnn'> fashion_cnn </a> </td></tr>\n
<tr><td> penguins SHAP values </td><td> <a href='https://dicook.github.io/mulgar_book/data/p_exp_sv.rda'> p_exp_sv.rda </a> </td></tr>\n
</tbody>\n</table>
\n```\n\n:::\n:::\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#tbl-datalinks-pdf .cell tbl-cap='Links to other data sets used in the book.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{>{\\raggedright\\arraybackslash}p{4cm}>{\\raggedright\\arraybackslash}p{5cm}>{\\raggedright\\arraybackslash}p{2cm}}\n\\toprule\nFilename & Description & Chapter\\\\\n\\midrule\n\\ttfamily{aflw\\_pct.rda} & Saved 2D tour path for the aflw data & 4\\\\\n\\ttfamily{detourr\\_penguins.csv} & Saved clusters of the penguins data from detourr & 7\\\\\n\\ttfamily{fake\\_trees\\_sb.csv} & Saved clusters of the fake trees data from detourr & 7\\\\\n\\ttfamily{penguins\\_sub.rda} & Tidied penguins data & 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17\\\\\n\\ttfamily{penguins\\_tour\\_path.rda} & Saved 2D tour path for penguins data & 1, 8, 10, 13, 14, 15\\\\\n\\addlinespace\n\\ttfamily{risk\\_MSA.rds} & risk survey & 12\\\\\n\\ttfamily{penguins\\_cnn} & penguins NN model & 17\\\\\n\\ttfamily{fashion\\_cnn} & fashion MNIST NN model & 17\\\\\n\\ttfamily{p\\_exp\\_sv.rda} & penguins SHAP values & 17\\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n:::\n\n\n", + "supporting": [ + "A2-data_files/figure-html" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/A2-data/execute-results/tex.json b/_freeze/A2-data/execute-results/tex.json new file mode 100644 index 0000000..5e6b321 --- /dev/null +++ b/_freeze/A2-data/execute-results/tex.json @@ -0,0 +1,21 @@ +{ + "hash": "c852aeeee896e3930ad24b003e0e62a7", + "result": { + "engine": "knitr", + "markdown": "# Data {#data}\n\nThis chapter describes the datasets used throughout the book as listed in @tbl-datalist-pdf. 
\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n::: {#tbl-datalist-html .cell tbl-cap='List of data sets and their sources used in the book examples.'}\n::: {.cell-output-display}\n\\begin{longtable}{llll}\n\\toprule\nName & Description & Source & Analysis \\\\ \n\\midrule\naflw & Player statistics from the AFLW & mulgar & clustering, dimension reduction \\\\ \nbushfires & Multivariate spatio-temporal data for locations of bushfires & mulgar & clustering, classification with RF \\\\ \nAustralian election data & Socioeconomic characteristics of Australian electorates & https://github.com/jforbes14/eechidna-paper & dimension reduction, multicollinearity \\\\ \npenguins & Measurements of four physical characteristics of three species of penguins & https://allisonhorst.github.io/palmerpenguins/ & classification and clustering \\\\ \npisa & OECD programme for international student assessment data & learningtower & dimension reduction, regression \\\\ \nsketches & Google's Quickdraw data & mulgar & neural networks, classification \\\\ \nmulticluster & Simulated data used to show various cluster examples & mulgar & clustering \\\\ \nfake trees & Simulated data showing branching structure & mulgar & clustering, dimension reduction \\\\ \nplane and box & Simulated data showing hyper-planes & mulgar & dimension reduction \\\\ \nclusters, clusters\\_nonlin, simple\\_clusters & Simulated data with various clustering & mulgar & clustering \\\\ \nc1-c7 & Simulated data with various clustering, challenge data & mulgar & clustering \\\\ \nfashion MNIST & Collection of apparel images & https://github.com/zalandoresearch/fashion-mnist & classification \\\\ \nrisk\\_MSA & Six types of risks and responses by survey & https://homepage.boku.ac.at/leisch/MSA/ & clustering \\\\ \npbmc & 50 PCs of scRNA-seq gene expression & https://satijalab.org/seurat/articles/pbmc3k\\_tutorial.html & dimension reduction, clustering \\\\ 
\n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n\n::: {#tbl-datalist-pdf .cell tbl-cap='List of data sets and their sources used in the book examples.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\fontsize{8}{10}\\selectfont\n\\begin{tabular}{>{\\raggedright\\arraybackslash}p{2cm}>{\\raggedright\\arraybackslash}p{5cm}>{\\raggedright\\arraybackslash}p{4cm}}\n\\toprule\nName & Description & Analysis\\\\\n\\midrule\naflw & Player statistics from the AFLW & clustering, dimension reduction\\\\\nbushfires & Multivariate spatio-temporal data for locations of bushfires & clustering, classification with RF\\\\\nAustralian election data & Socioeconomic characteristics of Australian electorates & dimension reduction, multicollinearity\\\\\npenguins & Measurements of four physical characteristics of three species of penguins & classification and clustering\\\\\npisa & OECD programme for international student assessment data & dimension reduction, regression\\\\\n\\addlinespace\nsketches & Google's Quickdraw data & neural networks, classification\\\\\nmulticluster & Simulated data used to show various cluster examples & clustering\\\\\nfake trees & Simulated data showing branching structure & clustering, dimension reduction\\\\\nplane and box & Simulated data showing hyper-planes & dimension reduction\\\\\nclusters, clusters\\_nonlin, simple\\_clusters & Simulated data with various clustering & clustering\\\\\n\\addlinespace\nc1-c7 & Simulated data with various clustering, challenge data & clustering\\\\\nfashion MNIST & Collection of apparel images & classification\\\\\nrisk\\_MSA & Six types of risks and responses by survey & clustering\\\\\npbmc & 50 PCs of scRNA-seq gene expression & dimension reduction, clustering\\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n\n:::\n\n## Australian Football League Women\n\n### Description {-}\n\nThe `aflw` data is from the 2021 Women's Australian Football 
League. These are average player statistics across the season, with game statistics provided by the [fitzRoy](https://jimmyday12.github.io/fitzRoy/) package. If you are new to the game of AFL, there is a nice explanation on [Wikipedia](https://en.wikipedia.org/wiki/Women%27s_Australian_rules_football). \n\n### Variables {-}\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 381\nColumns: 35\n$ id \"CD_I1001678\", \"CD_I1001679\", \"CD_I1~\n$ given_name \"Jordan\", \"Brianna\", \"Jodie\", \"Ebony~\n$ surname \"Zanchetta\", \"Green\", \"Hicks\", \"Anto~\n$ number 2, 3, 5, 12, 60, 21, 22, 23, 35, 14,~\n$ team \"Brisbane Lions\", \"West Coast Eagles~\n$ position \"INT\", \"INT\", \"HFFR\", \"WL\", \"RK\", \"B~\n$ time_pct 63.00000, 61.25000, 76.50000, 74.900~\n$ goals 0.0000000, 0.0000000, 0.0000000, 0.1~\n$ behinds 0.0000000, 0.0000000, 0.5000000, 0.4~\n$ kicks 5.000000, 2.500000, 3.750000, 8.8000~\n$ handballs 2.500000, 3.750000, 3.000000, 3.6000~\n$ disposals 7.500000, 6.250000, 6.750000, 12.400~\n$ marks 1.5000000, 0.2500000, 1.0000000, 3.7~\n$ bounces 0.0000000, 0.0000000, 0.0000000, 0.6~\n$ tackles 3.000000, 2.250000, 2.250000, 3.9000~\n$ contested 3.500000, 2.250000, 3.500000, 5.7000~\n$ uncontested 3.500000, 4.500000, 3.000000, 7.0000~\n$ possessions 7.000000, 6.750000, 6.500000, 12.700~\n$ marks_in50 1.0000000, 0.0000000, 0.2500000, 0.5~\n$ contested_marks 1.0000000, 0.0000000, 0.0000000, 0.4~\n$ hitouts 0.000000, 0.000000, 0.000000, 0.0000~\n$ one_pct 0.0000000, 1.5000000, 0.5000000, 1.2~\n$ disposal 60.25000, 67.15000, 37.20000, 65.960~\n$ clangers 2.000000, 0.500000, 2.500000, 3.1000~\n$ frees_for 1.0000000, 0.5000000, 0.2500000, 2.5~\n$ frees_against 1.0000000, 0.5000000, 1.2500000, 1.3~\n$ rebounds_in50 0.0000000, 0.5000000, 0.2500000, 1.1~\n$ assists 0.00000000, 0.00000000, 0.00000000, ~\n$ accuracy 0.00000, 0.00000, 0.00000, 5.00000, ~\n$ turnovers 1.500000, 1.000000, 2.500000, 4.0000~\n$ intercepts 2.0000000, 
2.0000000, 0.5000000, 5.3~\n$ tackles_in50 0.5000000, 0.0000000, 0.7500000, 0.5~\n$ shots 0.5000000, 0.0000000, 0.7500000, 1.0~\n$ metres 72.50000, 58.50000, 76.00000, 225.90~\n$ clearances 0.5000000, 0.2500000, 1.2500000, 0.4~\n```\n\n\n:::\n:::\n\n\n\n### Purpose {-}\n\nThe primary analysis is to summarise the variation using principal component analysis, which gives information about relationships between the statistics or skill sets common in players. One might also be tempted to cluster the players, but there are no obvious clusters so it could be frustrating. At best one could partition the players into groups, while recognising there are no absolutely distinct and separated groups.\n\n### Source {-}\n\nSee the information provided with the [fitzRoy](https://jimmyday12.github.io/fitzRoy/) package.\n\n### Pre-processing {-}\n\nThe code for downloading and pre-processing the data is available at the [mulgar](https://dicook.github.io/mulgar/) website in the `data-raw` folder. The data provided by the fitzRoy package was pre-processed to reduce the variables to only those that relate to player skills and performance. It is possible that using some transformations on the variables would be useful to make them less skewed. \n\n## Bushfires\n\n### Description {-}\n\nThis data was collated by Weihao (Patrick) Li as part of his Honours research at Monash University. It contains fire ignitions as detected from satellite hotspots, and processed using the [spotoroo](https://tengmcing.github.io/spotoroo/) package, augmented with measurements on weather, vegetation, and proximity to human activity. 
The cause variable is predicted based on historical fire ignition data collected by Country Fire Authority personnel.\n\n### Variables {-}\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 1,021\nColumns: 60\n$ id 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,~\n$ lon 141.1300, 141.3000, 141.4800, 147.1600~\n$ lat -37.13000, -37.65000, -37.35000, -37.8~\n$ time 2019-10-01, 2019-10-01, 2019-10-02, 2~\n$ FOR_CODE 41, 41, 91, 44, 0, 44, 0, 102, 0, 91, ~\n$ FOR_TYPE \"Eucalypt Medium Woodland\", \"Eucalypt ~\n$ FOR_CAT \"Native forest\", \"Native forest\", \"Com~\n$ COVER 1, 1, 4, 2, 6, 2, 6, 5, 6, 4, 2, 1, 2,~\n$ HEIGHT 2, 2, 4, 2, 6, 2, 6, 5, 6, 4, 3, 2, 3,~\n$ FOREST 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1,~\n$ rf 0.0, 0.0, 15.4, 4.8, 6.0, 11.6, 11.6, ~\n$ arf7 5.0857143, 2.4000000, 2.4000000, 0.714~\n$ arf14 2.8142857, 1.7428571, 1.8000000, 1.671~\n$ arf28 1.9785714, 1.5357143, 1.5357143, 3.785~\n$ arf60 2.3033333, 1.7966667, 1.7966667, 4.000~\n$ arf90 1.2566667, 1.0150000, 1.0150000, 2.960~\n$ arf180 0.9355556, 0.8444444, 0.8444444, 2.358~\n$ arf360 1.3644444, 1.5255556, 1.5255556, 1.727~\n$ arf720 1.3011111, 1.5213889, 1.5213889, 1.711~\n$ se 3.8, 4.6, 14.2, 23.7, 23.8, 16.8, 18.0~\n$ ase7 18.02857, 18.50000, 21.41429, 23.08571~\n$ ase14 17.03571, 17.44286, 18.03571, 19.17143~\n$ ase28 19.32857, 18.47500, 19.33929, 18.23571~\n$ ase60 20.38644, 19.99153, 20.39492, 19.90847~\n$ ase90 22.54118, 21.93193, 22.04370, 20.59328~\n$ ase180 20.79106, 19.93966, 19.99385, 19.11006~\n$ ase360 15.55153, 14.83259, 14.87883, 14.69276~\n$ ase720 15.52350, 14.75049, 14.77427, 14.53463~\n$ maxt 21.3, 17.8, 15.4, 20.8, 19.8, 15.8, 19~\n$ amaxt7 22.38571, 20.44286, 22.21429, 24.21429~\n$ amaxt14 21.42857, 19.72857, 19.86429, 21.80000~\n$ amaxt28 20.71071, 19.10000, 19.18929, 19.75000~\n$ amaxt60 24.02667, 22.28000, 22.38667, 22.93167~\n$ amaxt90 27.07750, 25.77667, 25.89833, 24.93667~\n$ amaxt180 26.92000, 25.92722, 25.98500, 24.84056~\n$ amaxt360 21.55389, 20.79778, 
20.81333, 20.21972~\n$ amaxt720 21.47750, 20.57222, 20.57694, 20.13153~\n$ mint 9.6, 9.0, 7.3, 7.7, 8.3, 8.3, 6.1, 5.9~\n$ amint7 9.042857, 7.971429, 9.171429, 10.32857~\n$ amint14 9.928571, 9.235714, 9.421429, 10.00714~\n$ amint28 8.417857, 7.560714, 7.353571, 8.671429~\n$ amint60 11.156667, 9.903333, 9.971667, 10.9716~\n$ amint90 11.96667, 10.81250, 10.87833, 12.49000~\n$ amint180 11.96778, 11.01056, 11.02000, 12.41944~\n$ amint360 9.130556, 8.459722, 8.448333, 9.588611~\n$ amint720 8.854861, 8.266250, 8.254028, 9.674861~\n$ dist_cfa 9442.206, 6322.438, 7957.374, 7790.785~\n$ dist_camp 50966.485, 6592.893, 31767.235, 8816.2~\n$ ws 1.263783, 1.263783, 1.456564, 5.424445~\n$ aws_m0 2.644795, 2.644795, 2.644795, 5.008369~\n$ aws_m1 2.559202, 2.559202, 2.559202, 5.229680~\n$ aws_m3 2.446211, 2.446211, 2.446211, 5.386005~\n$ aws_m6 2.144843, 2.144843, 2.144843, 5.132617~\n$ aws_m12 2.545008, 2.545008, 2.548953, 5.045297~\n$ aws_m24 2.580671, 2.580671, 2.584047, 5.081100~\n$ dist_road 498.75145, 102.22032, 1217.22446, 281.~\n$ log_dist_cfa 9.152945, 8.751860, 8.981854, 8.960697~\n$ log_dist_camp 10.838924, 8.793748, 10.366191, 9.0843~\n$ log_dist_road 6.212108, 4.627130, 7.104329, 5.640813~\n$ cause \"lightning\", \"lightning\", \"lightning\",~\n```\n\n\n:::\n:::\n\n\n\n### Purpose {-}\n\nThe primary goal is to predict the cause of the bushfire using the weather and distance from human activity variables provided. \n\n### Source {-}\n\nCollated data was part of Weihao Li's Honours thesis, which is not publicly available. The hotspots data was collected from @jaxa, climate data was taken from the Australian Bureau of Meteorology using the `bomrang` package [@R-bomrang], wind data from @wind and @windasos, vegetation data from @forest, distance from roads calculated using @OpenStreetMap, CFA stations from @cfa, and campsites from @recreation. 
The cause was predicted from training data provided by @fireorigin.\n\n### Pre-processing {-}\n\nThe 60 variables are too many to view with a tour, so it should be pre-processed using principal component analysis. The categorical variables of FOR_TYPE and FOR_CAT are removed. It would be possible to keep these if they are converted to dummy (binary) variables.\n\n\n## Australian election data\n\n### Description {-}\n\nThis is data from a study on the relationship between voting patterns and socio-demographic characteristics of Australian electorates reported in @eechidna. These are the predictor variables upon which voting percentages are modelled. There are two years of data in `oz_election_2001` and `oz_election_2016`.\n\n### Variables {-}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nload(\"data/oz_election_2001.rda\")\nload(\"data/oz_election_2016.rda\")\nglimpse(oz_election_2001)\n```\n:::\n\n\n\n### Purpose {-}\n\nThe tour is used to check for multicollinearity between predictors that might adversely affect the linear model fit. \n\n### Source {-}\n\nThe data was compiled from the Australian Electoral Commission (AEC) and the Australian Bureau of Statistics (ABS). Code to construct the data, and the original data, are available at https://github.com/jforbes14/eechidna-paper. \n\n### Pre-processing {-}\n\nConsiderable pre-processing was done to produce these data sets. The original data was wrangled into tidy form, some variables were log transformed to reduce skewness, and a subset of variables was chosen. \n\n## Palmer penguins\n\n### Description {-}\n\nThis data measures four physical characteristics of three species of penguins. 
\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`bl` | a number denoting bill length (millimeters) |\n|`bd` | a number denoting bill depth (millimeters) |\n|`fl` | an integer denoting flipper length (millimeters) |\n|`bm` | an integer denoting body mass (grams) |\n|`species` | a factor denoting penguin species (Adélie, Chinstrap and Gentoo) |\n\n### Purpose {-}\n\nThe primary goal is to find a combination of the four variables where the three species are distinct. This is also a useful data set to illustrate cluster analysis.\n\n### Source {-}\n\nDetails of the penguins data can be found at [https://allisonhorst.github.io/palmerpenguins/](https://allisonhorst.github.io/palmerpenguins/), and @R-palmerpenguins is the package source.\n\n### Pre-processing {-}\n\nThe data is loaded from the `palmerpenguins` package. The four physical measurement variables and the species are selected, and the penguins with missing values are removed. Variables are standardised, and their names are shortened.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(palmerpenguins)\npenguins <- penguins %>%\n na.omit() # 11 observations out of 344 removed\n# use only vars of interest, and standardise\n# them for easier interpretation\npenguins_sub <- penguins[,c(3:6, 1)] %>% \n mutate(across(where(is.numeric), ~ scale(.)[,1])) %>%\n rename(bl = bill_length_mm,\n bd = bill_depth_mm,\n fl = flipper_length_mm,\n bm = body_mass_g) %>%\n as.data.frame()\nsave(penguins_sub, file=\"data/penguins_sub.rda\")\n```\n:::\n\n\n\n\n## Program for International Student Assessment \n\n### Description {-}\n\nThe `pisa` data contains plausible scores for math, reading and science of Australian and Indonesian students from the 2018 testing cycle. 
The plausible scores are simulated from a model fitted to the original data, to preserve privacy of the students.\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`CNT` | country, either AUS for Australia or IDN for Indonesia |\n|`PV1MATH`-`PV10MATH` | plausible scores for math |\n|`PV1READ`-`PV10READ` | plausible scores for reading |\n|`PV1SCIE`-`PV10SCIE` | plausible scores for science |\n\n### Purpose {-}\n\nPrimarily this data is useful as an example for dimension reduction.\n\n### Source {-}\n\nThe full data is available from https://www.oecd.org/pisa/. There are records of the student test scores, along with survey data from the students, their households and their schools.\n\n### Pre-processing {-}\n\nThe data was reduced to country and the plausible scores, and filtered to the two countries. It may be helpful to know that the SPSS format data was used, and was read into R using the `read_sav()` function in the `haven` package. \n\n## Sketches \n\n### Description {-}\n\nThis data is a subset of images from https://quickdraw.withgoogle.com. The subset was created using the quickdraw R package at https://huizezhang-sherry.github.io/quickdraw/. It has 6 different groups: banana, boomerang, cactus, crab, flip flops, kangaroo. Each image is 28x28 pixels. The `sketches_train` data would be used to train a classification model, and the unlabelled `sketches_test` can be used for prediction. \n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`V1-V784` | grey scale 0-255 |\n|`word` | what the person was asked to draw, NA in the test data |\n|`id` | unique id for each sketch |\n\n### Purpose {-}\n\nPrimarily this data is useful as an example for supervised classification, and also dimension reduction.\n\n### Source {-}\n\nThe full data is available from https://quickdraw.withgoogle.com. \n\n### Pre-processing {-}\n\nIt is typically useful to pre-process this data into principal components. 
This code can also be useful for plotting one of the sketches in a recognisable form:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"false\"}\nlibrary(mulgar)\nlibrary(ggplot2)\ndata(\"sketches_train\")\nset.seed(77)\nx <- sketches_train[sample(1:nrow(sketches_train), 1), ]\nxm <- data.frame(gry=t(as.matrix(x[,1:784])),\n x=rep(1:28, 28),\n y=rep(28:1, rep(28, 28)))\nggplot(xm, aes(x=x, y=y, fill=gry)) +\n geom_tile() +\n scale_fill_gradientn(colors = gray.colors(256, \n start = 0, \n end = 1, \n rev = TRUE )) +\n ggtitle(x$word) +\n theme_void() + \n theme(legend.position=\"none\")\n```\n\n::: {.cell-output-display}\n![One of the sketches in the subset of training data.](A2-data_files/figure-pdf/fig-sketches-1.pdf){#fig-sketches fig-align='center' fig-pos='H' width=20%}\n:::\n:::\n\n\n\n## `multicluster`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(mulgar)\ndata(\"multicluster\")\n```\n:::\n\n\n\n### Description {-}\n\nThis data has 10 numeric variables, and a class variable labelling groups.\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`group` | cluster label |\n|`x1-x10` | numeric variables |\n\n### Purpose {-}\n\nThe primary goal is to find the different clusters. \n\n### Source {-}\n\nThis data is originally from http://ifs.tuwien.ac.at/dm/download/multiChallenge-matrix.txt, and provided as a challenge for non-linear dimension reduction. 
It was used as an example in Lee, Laa, Cook (2023) https://doi.org/10.52933/jdssv.v2i3.\n\n## `clusters`, `clusters_nonlin`, `simple_clusters`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(mulgar)\ndata(\"clusters\")\ndata(\"clusters_nonlin\")\ndata(\"simple_clusters\")\n```\n:::\n\n\n\n### Description {-}\n\nThis data has a varying number of numeric variables, and a class variable labelling the clusters.\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`x1-x5` | numeric variables |\n|`cl` | cluster label |\n\n### Purpose {-}\n\nThe primary goal is to find the different clusters. \n\n### Source {-}\n\nSimulated using the code in the `simulate.R` file of the `data-raw` directory of the `mulgar` package.\n\n## `plane`, `plane_nonlin`, `box`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(mulgar)\ndata(\"plane\")\ndata(\"plane_nonlin\")\ndata(\"box\")\n```\n:::\n\n\n\n### Description {-}\n\nThis data has a varying number of numeric variables. \n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`x1-x5` | numeric variables |\n\n### Purpose {-}\n\nThe primary goal is to understand how many dimensions the data spreads out in. \n\n### Source {-}\n\nSimulated using the code in the `simulate.R` file of the `data-raw` directory of the `mulgar` package.\n\n## `c1` - `c7`\n\n### Description {-}\n\nThis data has a varying number of numeric variables, and a variety of cluster shapes. \n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`x1`, `x2`, ... | numeric variables |\n\n### Purpose {-}\n\nThe primary goal is to detect the different clusters. \n\n### Source {-}\n\nThese are challenge data sets, so the code to simulate them is not made available.\n\n## Fashion MNIST\n\n### Description {-}\n\nThis data is a subset of images from https://github.com/zalandoresearch/fashion-mnist. Each image is a 28x28 grayscale image. 
The training set has 60,000 images, and the test set has 10,000 images.\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`x` | arrays of grey scale values 0-255 |\n|`y` | label 0-9, corresponding to T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot |\n\n### Purpose {-}\n\nPrimarily this data is useful as an example for neural network modeling, following the tutorial at https://tensorflow.rstudio.com/tutorials/keras/classification.\n\n### Source {-}\n\nThe data is available from https://github.com/zalandoresearch/fashion-mnist. \n\n## `risk_MSA`\n\n### Description {-}\n\nThe data was collected in Australia in 2015 [@risk-survey] and includes six types of risks (recreational, health, career, financial, safety and social) with responses on a scale from 1 (never) to 5 (very often).\n\n### Variables {-}\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 563\nColumns: 6\n$ Recreational 3, 1, 2, 1, 5, 5, 1, 5, 1, 3, 1, 5, 2, ~\n$ Health 2, 1, 2, 1, 4, 2, 1, 5, 1, 5, 1, 5, 2, ~\n$ Career 1, 1, 1, 1, 1, 5, 1, 5, 1, 2, 1, 5, 2, ~\n$ Financial 2, 1, 1, 1, 3, 3, 1, 5, 1, 1, 1, 5, 2, ~\n$ Safety 2, 1, 2, 1, 5, 2, 1, 5, 1, 4, 1, 5, 1, ~\n$ Social 4, 1, 3, 1, 5, 3, 1, 5, 1, 5, 1, 5, 1, ~\n```\n\n\n:::\n:::\n\n\n\n### Purpose {-}\n\nThis data is useful for the demonstration of clustering methods; it was also used in @msabook.\n\n### Source {-}\n\nThe data is available from https://homepage.boku.ac.at/leisch/MSA/. \n\n## Peripheral Blood Mononuclear Cells \n\n### Description {-}\n\nThe data was described in @chen2023, and is available through the R package `Seurat` (@seurat1, @seurat2, @seurat3, @seurat4). 
Here the data has been pre-processed following the tutorial at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html, and the first 50 PCs are made available in the data file `pbmc_pca_50.rds` which is read into R using the `readRDS()` function.\n\n### Variables {-}\n\n|Name | Description |\n|:---------|:----------------|\n|`PC_1`-`PC_50` | Principal component scores |\n\n\n### Purpose {-}\n\nThe purpose is to understand the clustering of cell types, relative to clustering in the gene expression. Here, our purpose is to determine if the low-dimensional representation provided by NLDR is an accurate representation of the clustering, as understood from using the tour on the PCs. We ignore the cell type labels, and focus on the geometric shapes of clumps and clusters in the high dimensions.\n\n### Source {-}\n\nThe data can be downloaded and pre-processed following https://satijalab.org/seurat/articles/pbmc3k_tutorial.html. \n\n\n\n", + "supporting": [ + "A2-data_files/figure-pdf" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": { + "knitr": [ + 
"{\"type\":\"list\",\"attributes\":{\"knit_meta_id\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"tbl-datalist-html\",\"tbl-datalist-html\",\"tbl-datalist-html\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\"]}},\"value\":[{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"caption\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"longtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"longtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\"
,\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"array\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"multirow\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"wrapfig\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"float\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"colortbl\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"pdflscape\
"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"tabu\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"threeparttable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"threeparttablex\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"ulem\"]},{\"type\":\"character\",\"attributes\":{},\"value\":[\"normalem\"]},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"makecell\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"att
ributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"xcolor\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]}]}" + ] + }, + "preserve": null, + "postProcess": false + } +} \ No newline at end of file diff --git a/_freeze/A2-data/figure-html/fig-sketches-1.png b/_freeze/A2-data/figure-html/fig-sketches-1.png new file mode 100644 index 0000000..eb43d4b Binary files /dev/null and b/_freeze/A2-data/figure-html/fig-sketches-1.png differ diff --git a/_freeze/A2-data/figure-pdf/fig-sketches-1.pdf b/_freeze/A2-data/figure-pdf/fig-sketches-1.pdf new file mode 100644 index 0000000..af49301 Binary files /dev/null and b/_freeze/A2-data/figure-pdf/fig-sketches-1.pdf differ diff --git a/_freeze/A3-book-code-and-data/execute-results/html.json b/_freeze/A3-book-code-and-data/execute-results/html.json new file mode 100644 index 0000000..019a8e3 --- /dev/null +++ b/_freeze/A3-book-code-and-data/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "b53cb903334cdb5f58791f333e4ab11a", + "result": { + "engine": "knitr", + "markdown": "# Links to Book Code and Additional Data {#book-data}\n\nThe code and data and an RStudio project can be downloaded with \n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nusethis::use_zip(url=\"https://dicook.github.io/mulgar_book/code_and_data.zip\")\n```\n:::\n\n\nAlternatively, individual files can be downloaded from the links below.\n\n## Additional data\n\n@tbl-datalinks-html lists additional data available on the book web site at https://dicook.github.io/mulgar_book/data. \n\n\n::: {.cell}\n\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n::: {#tbl-datalinks-html .cell tbl-cap='Links to other data sets used on the book.'}\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n \n \n
DescriptionLink
Saved 2D tour path for the aflw data<a href='https://dicook.github.io/mulgar_book/data/aflw_pct.rda'> aflw_pct.rda </a>
Saved clusters of the penguins data from detourr<a href='https://dicook.github.io/mulgar_book/data/detourr_penguins.csv'> detourr_penguins.csv </a>
Saved clusters of the fake trees data from detourr<a href='https://dicook.github.io/mulgar_book/data/fake_trees_sb.csv'> fake_trees_sb.csv </a>
Tidied penguins data<a href='https://dicook.github.io/mulgar_book/data/penguins_sub.rda'> penguins_sub.rda </a>
Saved 2D tour path for penguins data<a href='https://dicook.github.io/mulgar_book/data/penguins_tour_path.rda'> penguins_tour_path.rda </a>
risk survey<a href='https://dicook.github.io/mulgar_book/data/risk_MSA.rds'> risk_MSA.rds </a>
penguins NN model<a href='https://dicook.github.io/mulgar_book/data/penguins_cnn'> penguins_cnn </a>
fashion MNIST NN model<a href='https://dicook.github.io/mulgar_book/data/fashion_cnn'> fashion_cnn </a>
penguins SHAP values<a href='https://dicook.github.io/mulgar_book/data/p_exp_sv.rda'> p_exp_sv.rda </a>
\n
\n```\n\n:::\n:::\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#tbl-datalinks-pdf .cell tbl-cap='Links to other data sets used in the book.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{>{\\raggedright\\arraybackslash}p{4cm}>{\\raggedright\\arraybackslash}p{5cm}>{\\raggedright\\arraybackslash}p{2cm}}\n\\toprule\nFilename & Description & Chapter\\\\\n\\midrule\n\\ttfamily{aflw\\_pct.rda} & Saved 2D tour path for the aflw data & 4\\\\\n\\ttfamily{detourr\\_penguins.csv} & Saved clusters of the penguins data from detourr & 7\\\\\n\\ttfamily{fake\\_trees\\_sb.csv} & Saved clusters of the fake trees data from detourr & 7\\\\\n\\ttfamily{penguins\\_sub.rda} & Tidied penguins data & 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17\\\\\n\\ttfamily{penguins\\_tour\\_path.rda} & Saved 2D tour path for penguins data & 1, 8, 10, 13, 14, 15\\\\\n\\addlinespace\n\\ttfamily{risk\\_MSA.rds} & risk survey & 12\\\\\n\\ttfamily{penguins\\_cnn} & penguins NN model & 17\\\\\n\\ttfamily{fashion\\_cnn} & fashion MNIST NN model & 17\\\\\n\\ttfamily{p\\_exp\\_sv.rda} & penguins SHAP values & 17\\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n:::\n\n## Code files\n\n@tbl-codelinks-html lists the code files available on the book web site at https://dicook.github.io/mulgar_book/code. \n\n\n::: {.cell}\n\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n::: {#tbl-codelinks-html .cell tbl-cap='Links to code used in the book.'}\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n \n \n
ChapterLinkFilename
1 Picturing high dimensions 1-intro.R 1-intro.R
2 Notation conventions and R objects 2-notation.R 2-notation.R
3 Overview of dimension reduction 3-overview-dimred.R 3-overview-dimred.R
4 Principal component analysis 4-pca.R 4-pca.R
5 Non-linear dimension reduction 5-nldr.R 5-nldr.R
6 Overview of clustering 6-overview-clust.R 6-overview-clust.R
7 Spin-and-brush approach 7-spin-and-brush.R 7-spin-and-brush.R
8 Hierarchical clustering 8-hierarchical.R 8-hierarchical.R
9 k-means clustering 9-kmeans.R 9-kmeans.R
10 Model-based clustering 10-model-based.R 10-model-based.R
11 Self-organizing maps 11-som.R 11-som.R
12 Summarising and comparing clustering results 12-summary-clust.R 12-summary-clust.R
13 Overview of supervised classification 13-intro-class.R 13-intro-class.R
14 Linear discriminant analysis 14-lda.R 14-lda.R
15 Trees and forests 15-forests.R 15-forests.R
16 Support vector machines 16-svm.R 16-svm.R
17 Neural networks and deep learning 17-nn.R 17-nn.R
18 Exploring misclassifications 18-summary-class.R 18-summary-class.R
A Toolbox A-toolbox.R A-toolbox.R
\n
\n```\n\n:::\n:::\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n::: {#tbl-codelinks-pdf .cell tbl-cap='Links to code used on the book.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{>{\\raggedright\\arraybackslash}p{4cm}>{\\raggedright\\arraybackslash}p{5cm}}\n\\toprule\nFilename & Chapter\\\\\n\\midrule\n\\ttfamily{1-intro.R} & 1 Picturing high dimensions\\\\\n\\ttfamily{2-notation.R} & 2 Notation conventions and R objects\\\\\n\\ttfamily{3-overview-dimred.R} & 3 Overview of dimension reduction\\\\\n\\ttfamily{4-pca.R} & 4 Principal component analysis\\\\\n\\ttfamily{5-nldr.R} & 5 Non-linear dimension reduction\\\\\n\\addlinespace\n\\ttfamily{6-overview-clust.R} & 6 Overview of clustering\\\\\n\\ttfamily{7-spin-and-brush.R} & 7 Spin-and-brush approach\\\\\n\\ttfamily{8-hierarchical.R} & 8 Hierarchical clustering\\\\\n\\ttfamily{9-kmeans.R} & 9 k-means clustering\\\\\n\\ttfamily{10-model-based.R} & 10 Model-based clustering\\\\\n\\addlinespace\n\\ttfamily{11-som.R} & 11 Self-organizing maps\\\\\n\\ttfamily{12-summary-clust.R} & 12 Summarising and comparing clustering results\\\\\n\\ttfamily{13-intro-class.R} & 13 Overview of supervised classification\\\\\n\\ttfamily{14-lda.R} & 14 Linear discriminant analysis\\\\\n\\ttfamily{15-forests.R} & 15 Trees and forests\\\\\n\\addlinespace\n\\ttfamily{16-svm.R} & 16 Support vector machines\\\\\n\\ttfamily{17-nn.R} & 17 Neural networks and deep learning\\\\\n\\ttfamily{18-summary-class.R} & 18 Exploring misclassifications\\\\\n\\ttfamily{NA} & A Toolbox\\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n:::\n", + "supporting": [ + "A3-book-code-and-data_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/A3-book-code-and-data/execute-results/tex.json b/_freeze/A3-book-code-and-data/execute-results/tex.json new file mode 
100644 index 0000000..2039fb2 --- /dev/null +++ b/_freeze/A3-book-code-and-data/execute-results/tex.json @@ -0,0 +1,19 @@ +{ + "hash": "6538783cc9c1259a3d0b6430b91fd304", + "result": { + "engine": "knitr", + "markdown": "# Links to Book Code and Additional Data {#book-data}\n\nThe code and data and an RStudio project can be downloaded with \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nusethis::use_zip(\n url=\"https://dicook.github.io/mulgar_book/code_and_data.zip\")\n```\n:::\n\n\n\nAlternatively, individual files can be downloaded from the links below.\n\n## Additional data\n\n@tbl-datalinks-pdf lists additional data available on the book web site at https://github.com/dicook/mulgar_book. \n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n::: {#tbl-datalinks-html .cell tbl-cap='Links to other data sets used in the book.'}\n::: {.cell-output-display}\n\\begin{longtable}{ll}\n\\toprule\nDescription & Link \\\\ \n\\midrule\nSaved 2D tour path for the aflw data & aflw\\_pct.rda \\\\ \nSaved clusters of the penguins data from detourr & detourr\\_penguins.csv \\\\ \nSaved clusters of the fake trees data from detourr & fake\\_trees\\_sb.csv \\\\ \nTidied penguins data & penguins\\_sub.rda \\\\ \nSaved 2D tour path for penguins data & penguins\\_tour\\_path.rda \\\\ \nrisk survey & risk\\_MSA.rds \\\\ \npenguins NN model & penguins\\_cnn \\\\ \nfashion MNIST NN model & fashion\\_cnn \\\\ \npenguins SHAP values & p\\_exp\\_sv.rda \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n::: {#tbl-datalinks-pdf .cell tbl-cap='Links to other data sets used in the book.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{>{\\raggedright\\arraybackslash}p{4cm}>{\\raggedright\\arraybackslash}p{5cm}>{\\raggedright\\arraybackslash}p{2cm}}\n\\toprule\nFilename & Description & Chapter\\\\\n\\midrule\n\\ttfamily{aflw\\_pct.rda} & Saved 2D tour path for the aflw data & 4\\\\\n\\ttfamily{detourr\\_penguins.csv} & Saved clusters of the penguins data from detourr & 7\\\\\n\\ttfamily{fake\\_trees\\_sb.csv} & Saved clusters of the fake trees data from detourr & 7\\\\\n\\ttfamily{penguins\\_sub.rda} & Tidied penguins data & 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 
17\\\\\n\\ttfamily{penguins\\_tour\\_path.rda} & Saved 2D tour path for penguins data & 1, 8, 10, 13, 14, 15\\\\\n\\addlinespace\n\\ttfamily{risk\\_MSA.rds} & risk survey & 12\\\\\n\\ttfamily{penguins\\_cnn} & penguins NN model & 17\\\\\n\\ttfamily{fashion\\_cnn} & fashion MNIST NN model & 17\\\\\n\\ttfamily{p\\_exp\\_sv.rda} & penguins SHAP values & 17\\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n\n:::\n\n## Code files\n\n@tbl-codelinks-pdf lists the code files available on the book web site at https://dicook.github.io/mulgar_book/code. \n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n::: {#tbl-codelinks-html .cell tbl-cap='Links to code used in the book.'}\n::: {.cell-output-display}\n\\begin{longtable}{lll}\n\\toprule\nChapter & Link & Filename \\\\ \n\\midrule\n1 Picturing high dimensions & 1-intro.R & 1-intro.R \\\\ \n2 Notation conventions and R objects & 2-notation.R & 2-notation.R \\\\ \n3 Overview of dimension reduction & 3-overview-dimred.R & 3-overview-dimred.R \\\\ \n4 Principal component analysis & 4-pca.R & 4-pca.R \\\\ \n5 Non-linear dimension reduction & 5-nldr.R & 5-nldr.R \\\\ \n6 Overview of clustering & 6-overview-clust.R & 6-overview-clust.R \\\\ \n7 Spin-and-brush approach & 7-spin-and-brush.R & 7-spin-and-brush.R \\\\ \n8 Hierarchical clustering & 8-hierarchical.R & 8-hierarchical.R \\\\ \n9 k-means clustering & 9-kmeans.R & 9-kmeans.R \\\\ \n10 Model-based clustering & 10-model-based.R & 10-model-based.R \\\\ \n11 Self-organizing maps & 11-som.R & 11-som.R \\\\ \n12 Summarising and comparing clustering results & 12-summary-clust.R & 12-summary-clust.R \\\\ \n13 Overview of supervised classification & 13-intro-class.R & 13-intro-class.R \\\\ \n14 Linear discriminant analysis & 14-lda.R & 14-lda.R \\\\ \n15 Trees and forests & 15-forests.R & 15-forests.R \\\\ \n16 Support vector machines & 16-svm.R & 16-svm.R \\\\ \n17 Neural networks and deep learning & 17-nn.R & 17-nn.R \\\\ \n18 
Exploring misclassifications & 18-summary-class.R & 18-summary-class.R \\\\ \nA Toolbox & A-toolbox.R & A-toolbox.R \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n::: {#tbl-codelinks-pdf .cell tbl-cap='Links to code used on the book.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{>{\\raggedright\\arraybackslash}p{4cm}>{\\raggedright\\arraybackslash}p{5cm}}\n\\toprule\nFilename & Chapter\\\\\n\\midrule\n\\ttfamily{1-intro.R} & 1 Picturing high dimensions\\\\\n\\ttfamily{2-notation.R} & 2 Notation conventions and R objects\\\\\n\\ttfamily{3-overview-dimred.R} & 3 Overview of dimension reduction\\\\\n\\ttfamily{4-pca.R} & 4 Principal component analysis\\\\\n\\ttfamily{5-nldr.R} & 5 Non-linear dimension reduction\\\\\n\\addlinespace\n\\ttfamily{6-overview-clust.R} & 6 Overview of clustering\\\\\n\\ttfamily{7-spin-and-brush.R} & 7 Spin-and-brush approach\\\\\n\\ttfamily{8-hierarchical.R} & 8 Hierarchical clustering\\\\\n\\ttfamily{9-kmeans.R} & 9 k-means clustering\\\\\n\\ttfamily{10-model-based.R} & 10 Model-based clustering\\\\\n\\addlinespace\n\\ttfamily{11-som.R} & 11 Self-organizing maps\\\\\n\\ttfamily{12-summary-clust.R} & 12 Summarising and comparing clustering results\\\\\n\\ttfamily{13-intro-class.R} & 13 Overview of supervised classification\\\\\n\\ttfamily{14-lda.R} & 14 Linear discriminant analysis\\\\\n\\ttfamily{15-forests.R} & 15 Trees and forests\\\\\n\\addlinespace\n\\ttfamily{16-svm.R} & 16 Support vector machines\\\\\n\\ttfamily{17-nn.R} & 17 Neural networks and deep learning\\\\\n\\ttfamily{18-summary-class.R} & 18 Exploring misclassifications\\\\\n\\ttfamily{A-toolbox.R} & A Toolbox\\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": { + "knitr": [ + 
"{\"type\":\"list\",\"attributes\":{\"knit_meta_id\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"tbl-datalinks-html\",\"tbl-datalinks-html\",\"tbl-datalinks-html\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"tbl-codelinks-html\",\"tbl-codelinks-html\",\"tbl-codelinks-html\"]}},\"value\":[{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"caption\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"longtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"lo
ngtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"array\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"multirow\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"wrapfig\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"float\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"colortbl\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"va
lue\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"pdflscape\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"tabu\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"threeparttable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"threeparttablex\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"ulem\"]},{\"type\":\"character\",\"attributes\":{},\"value\":[\"normalem\"]},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"makecell\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name
\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"xcolor\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"caption\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"longtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]}]}" + ] + }, + "preserve": null, + "postProcess": false + } +} \ No newline at end of file diff --git a/_freeze/A3-book_code_and_data/execute-results/tex.json b/_freeze/A3-book_code_and_data/execute-results/tex.json new file mode 100644 index 0000000..05bb563 --- /dev/null +++ b/_freeze/A3-book_code_and_data/execute-results/tex.json @@ -0,0 +1,19 @@ +{ + "hash": "94365c68d90a16e30aefb8953315aa40", + "result": { + "engine": "knitr", + "markdown": "# Links to Book Code and Additional Data {#book-data}\n\nThe code and data and an RStudio project can be downloaded with \n\n\n\n::: {.cell}\n\n```{.r .cell-code 
code-fold=\"false\"}\nusethis::use_zip(url=\"https://dicook.github.io/mulgar_book/code_and_data.zip\")\n```\n:::\n\n\n\nAlternatively, individual files can be downloaded from the links below.\n\n## Additional data \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n\\begin{longtable}{lll}\n\\toprule\nChapter & Name & Link \\\\ \n\\midrule\n4 & aflw tour path & aflw_pct.rda \\\\ \n7 & detourr saved clusters & detourr_penguins.csv \\\\ \n7 & detourr saved clusters & fake_trees_sb.csv \\\\ \n1 & 2001 election & oz_election_2001.rda \\\\ \n1 & 2016 election & oz_election_2016.rda \\\\ \n5, 7, 8, 9, 10, 11, 12, 13, 14, 15 & Penguins re-organised & penguins_sub.rda \\\\ \n8, 10, 13, 14 & 2D tour path for penguins & penguins_tour_path.rda \\\\ \n12 & risk survey & risk_MSA.rds \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n\n## Code files\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n\\begin{longtable}{ll}\n\\toprule\nChapter & Link \\\\ \n\\midrule\n1 Picturing high dimensions & 1-intro.R \\\\ \n2 Notation conventions and R objects & 2-notation.R \\\\ \n3 Overview of dimension reduction & 3-overview-dimred.R \\\\ \n4 Principal component analysis & 4-pca.R \\\\ \n5 Non-linear dimension reduction & 5-nldr.R \\\\ \n6 Overview of clustering & 6-overview-clust.R \\\\ \n7 Spin-and-brush approach & 7-spin-and-brush.R \\\\ \n8 Hierarchical clustering & 8-hierarchical.R \\\\ \n9 k-means clustering & 9-kmeans.R \\\\ \n10 Model-based clustering & 10-model-based.R \\\\ \n11 Self-organizing maps & 11-som.R \\\\ \n12 Summarising and comparing clustering results & 12-summary-clust.R \\\\ \n13 Overview of supervised classification & 13-intro-class.R \\\\ \n14 Linear discriminant analysis & 14-lda.R \\\\ \n15 Trees and forests & 15-forests.R \\\\ \n16 Support vector machines & 16-svm.R \\\\ \n17 Neural networks and deep learning & 17-nn.R \\\\ \n18 Exploring misclassifications & 18-summary-class.R \\\\ \n19 Multivariate and multiple time series & 19-mv-time-series.R \\\\ 
\n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": { + "knitr": [ + "{\"type\":\"list\",\"attributes\":{\"knit_meta_id\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"unnamed-chunk-2\",\"unnamed-chunk-2\",\"unnamed-chunk-2\",\"unnamed-chunk-3\",\"unnamed-chunk-3\",\"unnamed-chunk-3\"]}},\"value\":[{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"caption\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"longtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"la
tex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"caption\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"longtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]}]}" + ] + }, + "preserve": null, + "postProcess": false + } +} \ No newline at end of file diff --git a/_freeze/A4-glossary/execute-results/html.json b/_freeze/A4-glossary/execute-results/html.json new file mode 100644 index 0000000..eb880d0 --- /dev/null +++ b/_freeze/A4-glossary/execute-results/html.json @@ -0,0 +1,15 @@ +{ + "hash": "d2b9b3163bbeb342cf1a5cafbdd0b17a", + "result": { + "engine": "knitr", + "markdown": "# Glossary {#glossary}\n\n\n\n::: {.cell}\n\n:::\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n::: {#tbl-glossary-html .cell tbl-cap='List of common terms used in the book and their synonyms used elsewhere.'}\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n \n \n
TermSynonymsDescription
variablefeature, attributea characteristic, number or quantity that can be measured
observationscases, items, experimental units, observational units, records, statistical units, instances, examplesindividuals on which the observations are made
data set-collection of observations made on one or more variables
responsetargetvariable that one wishes to predict
predictorindependent variable, featurevariables used to produce a model to predict the response
similaritycorrelationa measure ranging between 0 and 1, with 1 indicating that the cases are closer
dissimilaritydistancea measure where a smaller number means the cases are closer
PCA-principal component analysis
LDA-linear discriminant analysis
SOM-self-organising map
\n
\n```\n\n:::\n:::\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n::: {#tbl-glossary-pdf .cell tbl-cap='List of common terms used in the book and their synonyms used elsewhere.'}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{>{\\raggedright\\arraybackslash}p{2cm}>{\\raggedright\\arraybackslash}p{5cm}>{\\raggedright\\arraybackslash}p{6cm}}\n\\toprule\nTerm & Synonyms & Description\\\\\n\\midrule\nvariable & feature, attribute & a characteristic, number or quantity that can be measured\\\\\nobservations & cases, items, experimental units, observational units, records, statistical units, instances, examples & individuals on which the observations are made\\\\\ndata set & - & collection of observations made on one or more variables\\\\\nresponse & target & variable that one wishes to predict\\\\\npredictor & independent variable, feature & variables used to produce a model to predict the response\\\\\n\\addlinespace\nsimilarity & correlation & a measure ranging between 0 and 1, with 1 indicating that the cases are closer\\\\\ndissimilarity & distance & a measure where a smaller number means the cases are closer\\\\\nPCA & - & principal component analysis\\\\\nLDA & - & linear discriminant analysis\\\\\nSOM & - & self-organising map\\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n\n:::", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/A4-glossary/execute-results/tex.json b/_freeze/A4-glossary/execute-results/tex.json new file mode 100644 index 0000000..ace1fbc --- /dev/null +++ b/_freeze/A4-glossary/execute-results/tex.json @@ -0,0 +1,19 @@ +{ + "hash": "227ea83acf1988b0ec43f004f018fedb", + "result": { + "engine": "knitr", + "markdown": "# Glossary {#glossary}\n\n \n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-visible when-format=\"html\"}\n\n\n\n::: 
{#tbl-glossary-html .cell tbl-cap='List of common terms used in the book and their synonyms used elsewhere.'}\n::: {.cell-output-display}\n\\begin{longtable}{lll}\n\\toprule\nTerm & Synonyms & Description \\\\ \n\\midrule\nvariable & feature, attribute & a characteristic, number or quantity that can be measured \\\\ \nobservations & cases, items, experimental units, observational units, records, statistical units, instances, examples & individuals on which the observations are made \\\\ \ndata set & data, file & collection of observations made on one or more variables \\\\ \nresponse & target & variable that one wishes to predict \\\\ \npredictor & independent variable, feature & variables used to produce a model to predict the response \\\\ \nsimilarity & correlation & a measure ranging between 0 and 1, with 1 indicating that the cases are closer \\\\ \ndissimilarity & distance & a measure where a smaller number means the cases are closer \\\\ \nprincipal component analysis (PCA) & empirical orthogonal functions, eigenvalue decomposition & summarise a high-dimensional variance-covariance using an orthonormal matrix and set of variances. Related methods include factor analysis, multidimensional scaling. \\\\ \nlinear discriminant analysis (LDA) & Fisher's linear discriminant & reduce the dimension to the space where the classes are most separated relative to the class means and pooled variance-covariance. 
\\\\ \nself-organising map (SOM) & Kohonen map & use a grid-constrained set of means to cluster high-dimensional data, and also provide a 2D view of the clusters \\\\ \n\\bottomrule\n\\end{longtable}\n\n:::\n:::\n\n\n\n:::\n\n::: {.content-visible when-format=\"pdf\"}\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n\\begin{table}\n\\centering\n\\begin{tabular}{>{\\raggedright\\arraybackslash}p{2cm}>{\\raggedright\\arraybackslash}p{4cm}>{\\raggedright\\arraybackslash}p{6cm}}\n\\toprule\nTerm & Synonyms & Description\\\\\n\\midrule\nvariable & feature, attribute & a characteristic, number or quantity that can be measured\\\\\n\\midrule\nobservations & cases, items, experimental units, observational units, records, statistical units, instances, examples & individuals on which the observations are made\\\\\n\\midrule\ndata set & data, file & collection of observations made on one or more variables\\\\\n\\midrule\nresponse & target & variable that one wishes to predict\\\\\n\\midrule\npredictor & independent variable, feature & variables used to produce a mode to predict the response\\\\\n\\midrule\n\\addlinespace\nsimilarity & correlation & a measure ranging between 0 and 1, with 1 indicating that the cases are closer\\\\\n\\midrule\ndissimilarity & distance & a measure where a smaller number means the cases are closer\\\\\n\\midrule\nprincipal component analysis (PCA) & empirical orthogonal functions, eigenvalue decomposition & summarise a high-dimensional variance-covariance using an orthonormal matrix and set of variances. 
Related methods include factor analysis, multidimensional scaling.\\\\\n\\midrule\nlinear discriminant analysis (LDA) & Fisher's linear discriminant & reduce the dimension to the space where the classes are most separated relative to the class means and pooled variance-covariance.\\\\\n\\midrule\nself-organising map (SOM) & Kohonen map & use a grid-constrained set of means to cluster high-dimensional data, and also provide a 2D view of the clusters\\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n\n\n:::\n:::\n\n\n\n:::", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": { + "knitr": [ + "{\"type\":\"list\",\"attributes\":{\"knit_meta_id\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"tbl-glossary-html\",\"tbl-glossary-html\",\"tbl-glossary-html\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\"]}},\"value\":[{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"caption\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"longtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\"
:\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"longtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"array\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"multirow\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"wrapfig\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"float\"]},{\"type\":\"NULL\"},{\"type\":\"
NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"colortbl\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"pdflscape\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"tabu\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"threeparttable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"threeparttablex\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"characte
r\",\"attributes\":{},\"value\":[\"ulem\"]},{\"type\":\"character\",\"attributes\":{},\"value\":[\"normalem\"]},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"makecell\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"xcolor\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]}]}" + ] + }, + "preserve": null, + "postProcess": false + } +} \ No newline at end of file diff --git a/_freeze/index/execute-results/tex.json b/_freeze/index/execute-results/tex.json index 8fe9eaf..8ce9d02 100644 --- a/_freeze/index/execute-results/tex.json +++ b/_freeze/index/execute-results/tex.json @@ -1,8 +1,8 @@ { - "hash": "e7edae309dd2a28bec5f0352d2db911f", + "hash": "4c1c53e6418743cf0560e856a4615dde", "result": { "engine": "knitr", - "markdown": "# Preface {#preface .unnumbered .unlisted}\n\nIt is important to visualise your data because you might discover things that you could never have anticipated. Although there are many resources available for data visualisation, there are few comprehensive resources on high-dimensional data visualisation. High-dimensional (or multivariate) data arises when many different things are measured for each observation. While we can learn many things from plotting with 1D and 2D or 3D methods there are likely more structures hidden in the higher dimensions. This book provides guidance on visualising high-dimensional data and models using linear projections, with R.\n\nHigh-dimensional data spaces are fascinating places. 
You may think that there's a lot of ways to plot one or two variables, and a lot of types of patterns that can be found. You might use a density plot and see skewness or a dot plot to find outliers. A scatterplot of two variables might reveal a non-linear relationship or a barrier beyond which no observations exist. We don't as yet have so many different choices of plot types for high-dimensions, but these types of patterns are also what we seek in scatterplots of high-dimensional data. The additional dimensions can clarify these patterns, that clusters are likely to be more distinct. Observations that did not appear to be very different can be seen to be lonely anomalies in high-dimensions, that no other observations have quite the same combination of values. \n\n## What's in this book? {-}\n\nThe book is divided into these parts:\n\n- **Introduction**: Here we introduce you to high-dimensional spaces, how they can be visualised, and notation that is useful for describing methods in later chapters. \n- **Dimension reduction**: This part covers linear and non-linear dimension reduction. It includes ways to help decide on the number of dimensions needed to summarise the high dimensional data, whether linear dimension reduction is appropriate, detecting problems that might affect the dimension reduction, and examining how well or badly a non-linear dimension reduction is representing the data. \n- **Cluster analysis**: This part described methods for finding groups in data. Although it includes an explanation of a purely graphical approach, it is mostly on using graphics in association with numerical clustering algorithms. There are explanations of assessing the suitability of different numerical techniques for extracting clusters, based on the data shapes, evaluating the clustering result, and showing the solutions in high dimensions.\n- **Classification**: This part describes methods for exploring known groups in the data. 
You'll learn how to check model assumptions, to help decide if a method is suited to the data, examine classification boundaries and explore where errors arise. \n- **Miscellaneous**: The material in this part focuses on examining data from different contexts. This includes multiple time series, longitudinal data. A key pre-processing step is to convert the data into Euclidean space. \n\nIn each of these parts an emphasis is also showing your model with your data in the high dimensional space. \n\nOur hopes are that you will come away with understanding the importance of plotting your high dimensional data as a regular step in your statistical or machine learning analyses. There are many examples of what you might miss if you don't plot the data. Effective use of graphics goes hand-in-hand with analytical techniques. With high dimensions visualisation is a challenge but it is fascinating, and leads to many surprising moments.\n\n## Audience {-}\n\nHigh-dimensional data arises in many fields such as biology, social sciences, finance, and more. Anyone who is doing exploratory data analysis and model fitting for more than two variables will benefit from learning how to effectively visualise high-dimensions. This book will be useful for students and teachers of multivariate data analysis and machine learning, and researchers, data analysts, and industry professionals who work in these areas. \n\n## How to use the book? {.unnumbered}\n\nThe book is written with explanations accompanied by examples with R code. The chapters are organised by types of analysis and focus on how to use the high-dimensional visualisation to complement the commonly used analytical methods. The toolbox chapter in the Appendix provides an overview of the primary high-dimensional visualisation methods discussed in the book and how to get started. \n\n## What should I know before reading this book? 
{.unnumbered}\n\nThe examples assume that you already use R, and have a working knowledge of base R and tidyverse way of thinking about data analysis. It also assumes that you have some knowledge of statistical methods, and some experience with machine learning methods. \n\nIf you feel like you need build up your skills in these areas in preparation for working through this book, these are our recommended resources:\n\n- [R for Data Science](https://r4ds.had.co.nz) by Wickham and Grolemund for learning about data wrangling and visualisation. \n- [Introduction to Modern Statistics](https://openintro-ims.netlify.app) by \nÇetinkaya-Rundel and Hardin to learn about introductory statistics.\n- [Hands-On Machine Learning with R](https://bradleyboehmke.github.io/HOML/) by Boehmke and Greenwell to learn about machine learning.\n\nWe will assume you know how to plot your data and models in 2D. Our material starts from 2D and beyond.\n\n## Setting up your workflow {.unnumbered}\n\nTo get started set up your computer with the current versions of [R](https://cran.r-project.org) and ideally also with [Rstudio Desktop](https://posit.co/download/rstudio-desktop/).\n\nIn addition, we have made an R package to share the data and functions used in this book, called [`mulgar`](http://dicook.github.io/mulgar).[^mulga][^photo]\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ninstall.packages(\"mulgar\", dependencies=TRUE)\n# or the development version\ndevtools::install_github(\"dicook/mulgar\")\n```\n:::\n\n\n\nTo get a copy of the code and data used and an RStudio project to get started, you can download with this code: \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nbook_url <- \"https://dicook.github.io/mulgar_book/code_and_data.zip\"\nusethis::use_zip(url=book_url)\n```\n:::\n\n\n\nYou will be able to click on the `mulgar_book.Rproj` to get started with the code.\n\n[^mulga]: Mulga is a type of Australian habitat composed of woodland or open forest 
dominated by the mulga tree. Massive clearing of mulga led to the vast wheat fields of Western Australia. Here **mulgar** is an acronym for **MUL**tivariate **G**raphical **A**nalysis with **R**.\n\n[^photo]: Photo of mulga tree taken by L. G. Cook.\n\n\n\n\n\n\n\n\n## Suggestion, feedback or error?\n\nWe welcome suggestions, feedback or details of errors. You can report them as an issue at the [Github repo for this book](https://github.com/dicook/mulgar_book). \n\nPlease make a small [reproducible example](https://reprex.tidyverse.org) and report the error encountered. Reproducible examples have these components:\n\n- a small amount of data \n- small amount of code that generates the error\n- copy of the error message that was generated\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-hidden when-format=\"pdf\"}\n\n## License\n\n\"Creative
The online version of this book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.\n:::\n", + "markdown": "# Preface {#preface .unnumbered .unlisted}\n\nIt is important to visualise your data because you might discover things that you could never have anticipated. Although there are many resources available for data visualisation, there are few comprehensive resources on high-dimensional data visualisation. High-dimensional (or multivariate) data arises when many different things are measured for each observation. While we can learn many things from plotting with 1D and 2D or 3D methods there are likely more structures hidden in the higher dimensions. This book provides guidance on visualising high-dimensional data and models using linear projections, with R.\n\nHigh-dimensional data spaces are fascinating places. You may think that there's a lot of ways to plot one or two variables, and a lot of types of patterns that can be found. You might use a density plot and see skewness or a dot plot to find outliers. A scatterplot of two variables might reveal a non-linear relationship or a barrier beyond which no observations exist. We don't as yet have so many different choices of plot types for high-dimensions, but these types of patterns are also what we seek in scatterplots of high-dimensional data. The additional dimensions can clarify these patterns, that clusters are likely to be more distinct. Observations that did not appear to be very different can be seen to be lonely anomalies in high-dimensions, that no other observations have quite the same combination of values. \n\n## What's in this book? {-}\n\nThe book can be divided into these parts:\n\n- **Introduction**: Here we introduce you to high-dimensional spaces, how they can be visualised, and notation that is useful for describing methods in later chapters. \n- **Dimension reduction**: This part covers linear and non-linear dimension reduction. 
It includes ways to help decide on the number of dimensions needed to summarise the high dimensional data, whether linear dimension reduction is appropriate, detecting problems that might affect the dimension reduction, and examining how well or badly a non-linear dimension reduction is representing the data. \n- **Cluster analysis**: This part describes methods for finding groups in data. Although it includes an explanation of a purely graphical approach, it mostly focuses on using graphics in association with numerical clustering algorithms. There are explanations of assessing the suitability of different numerical techniques for extracting clusters, based on the data shapes, evaluating the clustering result, and showing the solutions in high dimensions.\n- **Classification**: This part describes methods for exploring known groups in the data. You'll learn how to check model assumptions, to help decide if a method is suited to the data, examine classification boundaries and explore where errors arise. \n\n\nIn each of these parts an emphasis is also on showing your model with your data in the high dimensional space. \n\nOur hope is that you will come away understanding the importance of plotting your high dimensional data as a regular step in your statistical or machine learning analyses. There are many examples of what you might miss if you don't plot the data. Effective use of graphics goes hand-in-hand with analytical techniques. With high dimensions, visualisation is a challenge but it is fascinating, and leads to many surprising moments.\n\n\n## Audience {-}\n\nHigh-dimensional data arises in many fields such as biology, social sciences, finance, and more. Anyone who is doing exploratory data analysis and model fitting for more than two variables will benefit from learning how to effectively visualise high-dimensions. 
This book will be useful for students and teachers of multivariate data analysis and machine learning, and researchers, data analysts, and industry professionals who work in these areas. \n\n## How to use the book? {.unnumbered}\n\nThe book provides explanations and plots accompanied by R code. We hope that you will run the code to explore the examples as you read the explanations. The chapters are organised by types of analysis and focus on how to use the high-dimensional visualisation to complement the commonly used analytical methods. An overview of the primary high-dimensional visualisation methods discussed in the book and how to get started is provided in the toolbox chapter in the Appendix. \n\n::: {.content-visible when-format=\"pdf\"}\n\nThe PDF version of the book has many static plots replacing the animated gifs and interactive plots available in the HTML version. This is indicated in the figure caption by the {{< fa play-circle >}} symbol. \n\n:::\n\n## What should I know before reading this book? {.unnumbered}\n\nThe examples assume that you already use R, and have a working knowledge of base R and the tidyverse way of thinking about data analysis. It also assumes that you have some knowledge of statistical methods, and some experience with machine learning methods. 
\n\nIf you feel like you need build up your skills in these areas in preparation for working through this book, these are our recommended resources:\n\n- [R for Data Science](https://r4ds.had.co.nz) by Wickham and Grolemund for learning about data wrangling and visualisation.\n- [Introduction to Modern Statistics](https://openintro-ims.netlify.app) by \nÇetinkaya-Rundel and Hardin to learn about introductory statistics.\n- [Hands-On Machine Learning with R](https://bradleyboehmke.github.io/HOML/) by Boehmke and Greenwell to learn about machine learning.\n- [Tidy Modeling with R](https://www.tmwr.org) by Kuhn and Silge to learn how to tidily do machine learning.\n\nWe will assume you know how to plot your data and models in 2D. Our material starts from 2D and beyond.\n\n## Setting up your workflow {.unnumbered}\n\nTo get started set up your computer with the current versions of [R](https://cran.r-project.org) and ideally also with [Rstudio Desktop](https://posit.co/download/rstudio-desktop/).\n\nIn addition, we have made an R package to share the data and functions used in this book, called [`mulgar`](http://dicook.github.io/mulgar).[^mulga][^photo]\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\ninstall.packages(\"mulgar\", dependencies=TRUE)\n# or the development version\ndevtools::install_github(\"dicook/mulgar\")\n```\n:::\n\n\n\nTo get a copy of the code and data used and an RStudio project to get started, you can download with this code: \n\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"false\"}\nbook_url <- \"https://dicook.github.io/mulgar_book/code_and_data.zip\"\nusethis::use_zip(url=book_url)\n```\n:::\n\n\n\nYou will be able to click on the `mulgar_book.Rproj` to get started with the code.\n\n[^mulga]: Mulga is a type of Australian habitat composed of woodland or open forest dominated by the mulga tree. Massive clearing of mulga led to the vast wheat fields of Western Australia. 
Here **mulgar** is an acronym for **MUL**tivariate **G**raphical **A**nalysis with **R**.\n\n[^photo]: Photo of mulga tree taken by L. G. Cook.\n\n\n\n\n\n\n\n\n## Suggestion, feedback or error?\n\nWe welcome suggestions, feedback or details of errors. You can report them as an issue at the [GitHub repo for this book](https://github.com/dicook/mulgar_book). \n\nPlease make a small [reproducible example](https://reprex.tidyverse.org) and report the error encountered. Reproducible examples have these components:\n\n- a small amount of data \n- a small amount of code that generates the error\n- a copy of the error message that was generated\n\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.content-hidden when-format=\"pdf\"}\n\n## Citing\n\nPlease use this text and bibtex for citing the book:\n\n```\nCook D., Laa, U. (2024) Interactively exploring high-dimensional data and models in R, https://dicook.github.io/mulgar_book/, accessed on YYYY/MM/DD. \n\n@misc{cook-laa,\n title = {Interactively exploring high-dimensional data and models in R},\n author = {Dianne Cook and Ursula Laa},\n year = 2024,\n url = {https://dicook.github.io/mulgar_book/},\n note = {accessed on YYYY/MM/DD}\n}\n```\n\n## License\n\n\"Creative
The online version of this book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/site_libs/clipboard/clipboard.min.js b/_freeze/site_libs/clipboard/clipboard.min.js new file mode 100644 index 0000000..1103f81 --- /dev/null +++ b/_freeze/site_libs/clipboard/clipboard.min.js @@ -0,0 +1,7 @@ +/*! + * clipboard.js v2.0.11 + * https://clipboardjs.com/ + * + * Licensed MIT © Zeno Rocha + */ +!function(t,e){"object"==typeof exports&&"object"==typeof module?module.exports=e():"function"==typeof define&&define.amd?define([],e):"object"==typeof exports?exports.ClipboardJS=e():t.ClipboardJS=e()}(this,function(){return n={686:function(t,e,n){"use strict";n.d(e,{default:function(){return b}});var e=n(279),i=n.n(e),e=n(370),u=n.n(e),e=n(817),r=n.n(e);function c(t){try{return document.execCommand(t)}catch(t){return}}var a=function(t){t=r()(t);return c("cut"),t};function o(t,e){var n,o,t=(n=t,o="rtl"===document.documentElement.getAttribute("dir"),(t=document.createElement("textarea")).style.fontSize="12pt",t.style.border="0",t.style.padding="0",t.style.margin="0",t.style.position="absolute",t.style[o?"right":"left"]="-9999px",o=window.pageYOffset||document.documentElement.scrollTop,t.style.top="".concat(o,"px"),t.setAttribute("readonly",""),t.value=n,t);return e.container.appendChild(t),e=r()(t),c("copy"),t.remove(),e}var f=function(t){var e=1.container-fluid.crosstalk-bscols{margin-left:auto;margin-right:auto}.crosstalk-input-checkboxgroup .crosstalk-options-group .crosstalk-options-column{display:inline-block;padding-right:12px;vertical-align:top}@media only screen and (max-width: 480px){.crosstalk-input-checkboxgroup .crosstalk-options-group .crosstalk-options-column{display:block;padding-right:inherit}}.crosstalk-input{margin-bottom:15px}.crosstalk-input .control-label{margin-bottom:0;vertical-align:middle}.crosstalk-input 
input[type="checkbox"]{margin:4px 0 0;margin-top:1px;line-height:normal}.crosstalk-input .checkbox{position:relative;display:block;margin-top:10px;margin-bottom:10px}.crosstalk-input .checkbox>label{padding-left:20px;margin-bottom:0;font-weight:400;cursor:pointer}.crosstalk-input .checkbox input[type="checkbox"],.crosstalk-input .checkbox-inline input[type="checkbox"]{position:absolute;margin-top:2px;margin-left:-20px}.crosstalk-input .checkbox+.checkbox{margin-top:-5px}.crosstalk-input .checkbox-inline{position:relative;display:inline-block;padding-left:20px;margin-bottom:0;font-weight:400;vertical-align:middle;cursor:pointer}.crosstalk-input .checkbox-inline+.checkbox-inline{margin-top:0;margin-left:10px} diff --git a/_freeze/site_libs/crosstalk-1.2.0/js/crosstalk.js b/_freeze/site_libs/crosstalk-1.2.0/js/crosstalk.js new file mode 100644 index 0000000..fd9eb53 --- /dev/null +++ b/_freeze/site_libs/crosstalk-1.2.0/js/crosstalk.js @@ -0,0 +1,1474 @@ +(function(){function e(t,n,r){function s(o,u){if(!n[o]){if(!t[o]){var a=typeof require=="function"&&require;if(!u&&a)return a(o,!0);if(i)return i(o,!0);var f=new Error("Cannot find module '"+o+"'");throw f.code="MODULE_NOT_FOUND",f}var l=n[o]={exports:{}};t[o][0].call(l.exports,function(e){var n=t[o][1][e];return s(n?n:e)},l,l.exports,e,t,n,r)}return n[o].exports}var i=typeof require=="function"&&require;for(var o=0;o b) { + return 1; + } +} + +/** + * @private + */ + +var FilterSet = function () { + function FilterSet() { + _classCallCheck(this, FilterSet); + + this.reset(); + } + + _createClass(FilterSet, [{ + key: "reset", + value: function reset() { + // Key: handle ID, Value: array of selected keys, or null + this._handles = {}; + // Key: key string, Value: count of handles that include it + this._keys = {}; + this._value = null; + this._activeHandles = 0; + } + }, { + key: "update", + value: function update(handleId, keys) { + if (keys !== null) { + keys = keys.slice(0); // clone before sorting + 
keys.sort(naturalComparator); + } + + var _diffSortedLists = (0, _util.diffSortedLists)(this._handles[handleId], keys), + added = _diffSortedLists.added, + removed = _diffSortedLists.removed; + + this._handles[handleId] = keys; + + for (var i = 0; i < added.length; i++) { + this._keys[added[i]] = (this._keys[added[i]] || 0) + 1; + } + for (var _i = 0; _i < removed.length; _i++) { + this._keys[removed[_i]]--; + } + + this._updateValue(keys); + } + + /** + * @param {string[]} keys Sorted array of strings that indicate + * a superset of possible keys. + * @private + */ + + }, { + key: "_updateValue", + value: function _updateValue() { + var keys = arguments.length > 0 && arguments[0] !== undefined ? arguments[0] : this._allKeys; + + var handleCount = Object.keys(this._handles).length; + if (handleCount === 0) { + this._value = null; + } else { + this._value = []; + for (var i = 0; i < keys.length; i++) { + var count = this._keys[keys[i]]; + if (count === handleCount) { + this._value.push(keys[i]); + } + } + } + } + }, { + key: "clear", + value: function clear(handleId) { + if (typeof this._handles[handleId] === "undefined") { + return; + } + + var keys = this._handles[handleId]; + if (!keys) { + keys = []; + } + + for (var i = 0; i < keys.length; i++) { + this._keys[keys[i]]--; + } + delete this._handles[handleId]; + + this._updateValue(); + } + }, { + key: "value", + get: function get() { + return this._value; + } + }, { + key: "_allKeys", + get: function get() { + var allKeys = Object.keys(this._keys); + allKeys.sort(naturalComparator); + return allKeys; + } + }]); + + return FilterSet; +}(); + +exports.default = FilterSet; + +},{"./util":11}],4:[function(require,module,exports){ +(function (global){ +"use strict"; + +Object.defineProperty(exports, "__esModule", { + value: true +}); + +var _createClass = function () { function defineProperties(target, props) { for (var i = 0; i < props.length; i++) { var descriptor = props[i]; descriptor.enumerable = 
descriptor.enumerable || false; descriptor.configurable = true; if ("value" in descriptor) descriptor.writable = true; Object.defineProperty(target, descriptor.key, descriptor); } } return function (Constructor, protoProps, staticProps) { if (protoProps) defineProperties(Constructor.prototype, protoProps); if (staticProps) defineProperties(Constructor, staticProps); return Constructor; }; }(); + +var _typeof = typeof Symbol === "function" && typeof Symbol.iterator === "symbol" ? function (obj) { return typeof obj; } : function (obj) { return obj && typeof Symbol === "function" && obj.constructor === Symbol && obj !== Symbol.prototype ? "symbol" : typeof obj; }; + +exports.default = group; + +var _var2 = require("./var"); + +var _var3 = _interopRequireDefault(_var2); + +function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } + +function _classCallCheck(instance, Constructor) { if (!(instance instanceof Constructor)) { throw new TypeError("Cannot call a class as a function"); } } + +// Use a global so that multiple copies of crosstalk.js can be loaded and still +// have groups behave as singletons across all copies. +global.__crosstalk_groups = global.__crosstalk_groups || {}; +var groups = global.__crosstalk_groups; + +function group(groupName) { + if (groupName && typeof groupName === "string") { + if (!groups.hasOwnProperty(groupName)) { + groups[groupName] = new Group(groupName); + } + return groups[groupName]; + } else if ((typeof groupName === "undefined" ? 
"undefined" : _typeof(groupName)) === "object" && groupName._vars && groupName.var) { + // Appears to already be a group object + return groupName; + } else if (Array.isArray(groupName) && groupName.length == 1 && typeof groupName[0] === "string") { + return group(groupName[0]); + } else { + throw new Error("Invalid groupName argument"); + } +} + +var Group = function () { + function Group(name) { + _classCallCheck(this, Group); + + this.name = name; + this._vars = {}; + } + + _createClass(Group, [{ + key: "var", + value: function _var(name) { + if (!name || typeof name !== "string") { + throw new Error("Invalid var name"); + } + + if (!this._vars.hasOwnProperty(name)) this._vars[name] = new _var3.default(this, name); + return this._vars[name]; + } + }, { + key: "has", + value: function has(name) { + if (!name || typeof name !== "string") { + throw new Error("Invalid var name"); + } + + return this._vars.hasOwnProperty(name); + } + }]); + + return Group; +}(); + +}).call(this,typeof global !== "undefined" ? global : typeof self !== "undefined" ? self : typeof window !== "undefined" ? window : {}) + +},{"./var":12}],5:[function(require,module,exports){ +(function (global){ +"use strict"; + +Object.defineProperty(exports, "__esModule", { + value: true +}); + +var _group = require("./group"); + +var _group2 = _interopRequireDefault(_group); + +var _selection = require("./selection"); + +var _filter = require("./filter"); + +var _input = require("./input"); + +require("./input_selectize"); + +require("./input_checkboxgroup"); + +require("./input_slider"); + +function _interopRequireDefault(obj) { return obj && obj.__esModule ? 
obj : { default: obj }; } + +var defaultGroup = (0, _group2.default)("default"); + +function var_(name) { + return defaultGroup.var(name); +} + +function has(name) { + return defaultGroup.has(name); +} + +if (global.Shiny) { + global.Shiny.addCustomMessageHandler("update-client-value", function (message) { + if (typeof message.group === "string") { + (0, _group2.default)(message.group).var(message.name).set(message.value); + } else { + var_(message.name).set(message.value); + } + }); +} + +var crosstalk = { + group: _group2.default, + var: var_, + has: has, + SelectionHandle: _selection.SelectionHandle, + FilterHandle: _filter.FilterHandle, + bind: _input.bind +}; + +/** + * @namespace crosstalk + */ +exports.default = crosstalk; + +global.crosstalk = crosstalk; + +}).call(this,typeof global !== "undefined" ? global : typeof self !== "undefined" ? self : typeof window !== "undefined" ? window : {}) + +},{"./filter":2,"./group":4,"./input":6,"./input_checkboxgroup":7,"./input_selectize":8,"./input_slider":9,"./selection":10}],6:[function(require,module,exports){ +(function (global){ +"use strict"; + +Object.defineProperty(exports, "__esModule", { + value: true +}); +exports.register = register; +exports.bind = bind; +var $ = global.jQuery; + +var bindings = {}; + +function register(reg) { + bindings[reg.className] = reg; + if (global.document && global.document.readyState !== "complete") { + $(function () { + bind(); + }); + } else if (global.document) { + setTimeout(bind, 100); + } +} + +function bind() { + Object.keys(bindings).forEach(function (className) { + var binding = bindings[className]; + $("." 
+ binding.className).not(".crosstalk-input-bound").each(function (i, el) { + bindInstance(binding, el); + }); + }); +} + +// Escape jQuery identifier +function $escape(val) { + return val.replace(/([!"#$%&'()*+,./:;<=>?@[\\\]^`{|}~])/g, "\\$1"); +} + +function bindEl(el) { + var $el = $(el); + Object.keys(bindings).forEach(function (className) { + if ($el.hasClass(className) && !$el.hasClass("crosstalk-input-bound")) { + var binding = bindings[className]; + bindInstance(binding, el); + } + }); +} + +function bindInstance(binding, el) { + var jsonEl = $(el).find("script[type='application/json'][data-for='" + $escape(el.id) + "']"); + var data = JSON.parse(jsonEl[0].innerText); + + var instance = binding.factory(el, data); + $(el).data("crosstalk-instance", instance); + $(el).addClass("crosstalk-input-bound"); +} + +if (global.Shiny) { + var inputBinding = new global.Shiny.InputBinding(); + var _$ = global.jQuery; + _$.extend(inputBinding, { + find: function find(scope) { + return _$(scope).find(".crosstalk-input"); + }, + initialize: function initialize(el) { + if (!_$(el).hasClass("crosstalk-input-bound")) { + bindEl(el); + } + }, + getId: function getId(el) { + return el.id; + }, + getValue: function getValue(el) {}, + setValue: function setValue(el, value) {}, + receiveMessage: function receiveMessage(el, data) {}, + subscribe: function subscribe(el, callback) { + _$(el).data("crosstalk-instance").resume(); + }, + unsubscribe: function unsubscribe(el) { + _$(el).data("crosstalk-instance").suspend(); + } + }); + global.Shiny.inputBindings.register(inputBinding, "crosstalk.inputBinding"); +} + +}).call(this,typeof global !== "undefined" ? global : typeof self !== "undefined" ? self : typeof window !== "undefined" ? 
window : {}) + +},{}],7:[function(require,module,exports){ +(function (global){ +"use strict"; + +var _input = require("./input"); + +var input = _interopRequireWildcard(_input); + +var _filter = require("./filter"); + +function _interopRequireWildcard(obj) { if (obj && obj.__esModule) { return obj; } else { var newObj = {}; if (obj != null) { for (var key in obj) { if (Object.prototype.hasOwnProperty.call(obj, key)) newObj[key] = obj[key]; } } newObj.default = obj; return newObj; } } + +var $ = global.jQuery; + +input.register({ + className: "crosstalk-input-checkboxgroup", + + factory: function factory(el, data) { + /* + * map: {"groupA": ["keyA", "keyB", ...], ...} + * group: "ct-groupname" + */ + var ctHandle = new _filter.FilterHandle(data.group); + + var lastKnownKeys = void 0; + var $el = $(el); + $el.on("change", "input[type='checkbox']", function () { + var checked = $el.find("input[type='checkbox']:checked"); + if (checked.length === 0) { + lastKnownKeys = null; + ctHandle.clear(); + } else { + var keys = {}; + checked.each(function () { + data.map[this.value].forEach(function (key) { + keys[key] = true; + }); + }); + var keyArray = Object.keys(keys); + keyArray.sort(); + lastKnownKeys = keyArray; + ctHandle.set(keyArray); + } + }); + + return { + suspend: function suspend() { + ctHandle.clear(); + }, + resume: function resume() { + if (lastKnownKeys) ctHandle.set(lastKnownKeys); + } + }; + } +}); + +}).call(this,typeof global !== "undefined" ? global : typeof self !== "undefined" ? self : typeof window !== "undefined" ? 
window : {}) + +},{"./filter":2,"./input":6}],8:[function(require,module,exports){ +(function (global){ +"use strict"; + +var _input = require("./input"); + +var input = _interopRequireWildcard(_input); + +var _util = require("./util"); + +var util = _interopRequireWildcard(_util); + +var _filter = require("./filter"); + +function _interopRequireWildcard(obj) { if (obj && obj.__esModule) { return obj; } else { var newObj = {}; if (obj != null) { for (var key in obj) { if (Object.prototype.hasOwnProperty.call(obj, key)) newObj[key] = obj[key]; } } newObj.default = obj; return newObj; } } + +var $ = global.jQuery; + +input.register({ + className: "crosstalk-input-select", + + factory: function factory(el, data) { + /* + * items: {value: [...], label: [...]} + * map: {"groupA": ["keyA", "keyB", ...], ...} + * group: "ct-groupname" + */ + + var first = [{ value: "", label: "(All)" }]; + var items = util.dataframeToD3(data.items); + var opts = { + options: first.concat(items), + valueField: "value", + labelField: "label", + searchField: "label" + }; + + var select = $(el).find("select")[0]; + + var selectize = $(select).selectize(opts)[0].selectize; + + var ctHandle = new _filter.FilterHandle(data.group); + + var lastKnownKeys = void 0; + selectize.on("change", function () { + if (selectize.items.length === 0) { + lastKnownKeys = null; + ctHandle.clear(); + } else { + var keys = {}; + selectize.items.forEach(function (group) { + data.map[group].forEach(function (key) { + keys[key] = true; + }); + }); + var keyArray = Object.keys(keys); + keyArray.sort(); + lastKnownKeys = keyArray; + ctHandle.set(keyArray); + } + }); + + return { + suspend: function suspend() { + ctHandle.clear(); + }, + resume: function resume() { + if (lastKnownKeys) ctHandle.set(lastKnownKeys); + } + }; + } +}); + +}).call(this,typeof global !== "undefined" ? global : typeof self !== "undefined" ? self : typeof window !== "undefined" ? 
window : {}) + +},{"./filter":2,"./input":6,"./util":11}],9:[function(require,module,exports){ +(function (global){ +"use strict"; + +var _slicedToArray = function () { function sliceIterator(arr, i) { var _arr = []; var _n = true; var _d = false; var _e = undefined; try { for (var _i = arr[Symbol.iterator](), _s; !(_n = (_s = _i.next()).done); _n = true) { _arr.push(_s.value); if (i && _arr.length === i) break; } } catch (err) { _d = true; _e = err; } finally { try { if (!_n && _i["return"]) _i["return"](); } finally { if (_d) throw _e; } } return _arr; } return function (arr, i) { if (Array.isArray(arr)) { return arr; } else if (Symbol.iterator in Object(arr)) { return sliceIterator(arr, i); } else { throw new TypeError("Invalid attempt to destructure non-iterable instance"); } }; }(); + +var _input = require("./input"); + +var input = _interopRequireWildcard(_input); + +var _filter = require("./filter"); + +function _interopRequireWildcard(obj) { if (obj && obj.__esModule) { return obj; } else { var newObj = {}; if (obj != null) { for (var key in obj) { if (Object.prototype.hasOwnProperty.call(obj, key)) newObj[key] = obj[key]; } } newObj.default = obj; return newObj; } } + +var $ = global.jQuery; +var strftime = global.strftime; + +input.register({ + className: "crosstalk-input-slider", + + factory: function factory(el, data) { + /* + * map: {"groupA": ["keyA", "keyB", ...], ...} + * group: "ct-groupname" + */ + var ctHandle = new _filter.FilterHandle(data.group); + + var opts = {}; + var $el = $(el).find("input"); + var dataType = $el.data("data-type"); + var timeFormat = $el.data("time-format"); + var round = $el.data("round"); + var timeFormatter = void 0; + + // Set up formatting functions + if (dataType === "date") { + timeFormatter = strftime.utc(); + opts.prettify = function (num) { + return timeFormatter(timeFormat, new Date(num)); + }; + } else if (dataType === "datetime") { + var timezone = $el.data("timezone"); + if (timezone) timeFormatter = 
strftime.timezone(timezone);else timeFormatter = strftime; + + opts.prettify = function (num) { + return timeFormatter(timeFormat, new Date(num)); + }; + } else if (dataType === "number") { + if (typeof round !== "undefined") opts.prettify = function (num) { + var factor = Math.pow(10, round); + return Math.round(num * factor) / factor; + }; + } + + $el.ionRangeSlider(opts); + + function getValue() { + var result = $el.data("ionRangeSlider").result; + + // Function for converting numeric value from slider to appropriate type. + var convert = void 0; + var dataType = $el.data("data-type"); + if (dataType === "date") { + convert = function convert(val) { + return formatDateUTC(new Date(+val)); + }; + } else if (dataType === "datetime") { + convert = function convert(val) { + // Convert ms to s + return +val / 1000; + }; + } else { + convert = function convert(val) { + return +val; + }; + } + + if ($el.data("ionRangeSlider").options.type === "double") { + return [convert(result.from), convert(result.to)]; + } else { + return convert(result.from); + } + } + + var lastKnownKeys = null; + + $el.on("change.crosstalkSliderInput", function (event) { + if (!$el.data("updating") && !$el.data("animating")) { + var _getValue = getValue(), + _getValue2 = _slicedToArray(_getValue, 2), + from = _getValue2[0], + to = _getValue2[1]; + + var keys = []; + for (var i = 0; i < data.values.length; i++) { + var val = data.values[i]; + if (val >= from && val <= to) { + keys.push(data.keys[i]); + } + } + keys.sort(); + ctHandle.set(keys); + lastKnownKeys = keys; + } + }); + + // let $el = $(el); + // $el.on("change", "input[type="checkbox"]", function() { + // let checked = $el.find("input[type="checkbox"]:checked"); + // if (checked.length === 0) { + // ctHandle.clear(); + // } else { + // let keys = {}; + // checked.each(function() { + // data.map[this.value].forEach(function(key) { + // keys[key] = true; + // }); + // }); + // let keyArray = Object.keys(keys); + // keyArray.sort(); + // 
ctHandle.set(keyArray); + // } + // }); + + return { + suspend: function suspend() { + ctHandle.clear(); + }, + resume: function resume() { + if (lastKnownKeys) ctHandle.set(lastKnownKeys); + } + }; + } +}); + +// Convert a number to a string with leading zeros +function padZeros(n, digits) { + var str = n.toString(); + while (str.length < digits) { + str = "0" + str; + }return str; +} + +// Given a Date object, return a string in yyyy-mm-dd format, using the +// UTC date. This may be a day off from the date in the local time zone. +function formatDateUTC(date) { + if (date instanceof Date) { + return date.getUTCFullYear() + "-" + padZeros(date.getUTCMonth() + 1, 2) + "-" + padZeros(date.getUTCDate(), 2); + } else { + return null; + } +} + +}).call(this,typeof global !== "undefined" ? global : typeof self !== "undefined" ? self : typeof window !== "undefined" ? window : {}) + +},{"./filter":2,"./input":6}],10:[function(require,module,exports){ +"use strict"; + +Object.defineProperty(exports, "__esModule", { + value: true +}); +exports.SelectionHandle = undefined; + +var _createClass = function () { function defineProperties(target, props) { for (var i = 0; i < props.length; i++) { var descriptor = props[i]; descriptor.enumerable = descriptor.enumerable || false; descriptor.configurable = true; if ("value" in descriptor) descriptor.writable = true; Object.defineProperty(target, descriptor.key, descriptor); } } return function (Constructor, protoProps, staticProps) { if (protoProps) defineProperties(Constructor.prototype, protoProps); if (staticProps) defineProperties(Constructor, staticProps); return Constructor; }; }(); + +var _events = require("./events"); + +var _events2 = _interopRequireDefault(_events); + +var _group = require("./group"); + +var _group2 = _interopRequireDefault(_group); + +var _util = require("./util"); + +var util = _interopRequireWildcard(_util); + +function _interopRequireWildcard(obj) { if (obj && obj.__esModule) { return obj; } else { var 
newObj = {}; if (obj != null) { for (var key in obj) { if (Object.prototype.hasOwnProperty.call(obj, key)) newObj[key] = obj[key]; } } newObj.default = obj; return newObj; } } + +function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } + +function _classCallCheck(instance, Constructor) { if (!(instance instanceof Constructor)) { throw new TypeError("Cannot call a class as a function"); } } + +/** + * Use this class to read and write (and listen for changes to) the selection + * for a Crosstalk group. This is intended to be used for linked brushing. + * + * If two (or more) `SelectionHandle` instances in the same webpage share the + * same group name, they will share the same state. Setting the selection using + * one `SelectionHandle` instance will result in the `value` property instantly + * changing across the others, and `"change"` event listeners on all instances + * (including the one that initiated the sending) will fire. + * + * @param {string} [group] - The name of the Crosstalk group, or if none, + * null or undefined (or any other falsy value). This can be changed later + * via the [SelectionHandle#setGroup](#setGroup) method. + * @param {Object} [extraInfo] - An object whose properties will be copied to + * the event object whenever an event is emitted. + */ +var SelectionHandle = exports.SelectionHandle = function () { + function SelectionHandle() { + var group = arguments.length > 0 && arguments[0] !== undefined ? arguments[0] : null; + var extraInfo = arguments.length > 1 && arguments[1] !== undefined ? arguments[1] : null; + + _classCallCheck(this, SelectionHandle); + + this._eventRelay = new _events2.default(); + this._emitter = new util.SubscriptionTracker(this._eventRelay); + + // Name of the group we're currently tracking, if any. Can change over time. + this._group = null; + // The Var we're currently tracking, if any. Can change over time. 
+ this._var = null; + // The event handler subscription we currently have on var.on("change"). + this._varOnChangeSub = null; + + this._extraInfo = util.extend({ sender: this }, extraInfo); + + this.setGroup(group); + } + + /** + * Changes the Crosstalk group membership of this SelectionHandle. The group + * being switched away from (if any) will not have its selection value + * modified as a result of calling `setGroup`, even if this handle was the + * most recent handle to set the selection of the group. + * + * The group being switched to (if any) will also not have its selection value + * modified as a result of calling `setGroup`. If you want to set the + * selection value of the new group, call `set` explicitly. + * + * @param {string} group - The name of the Crosstalk group, or null (or + * undefined) to clear the group. + */ + + + _createClass(SelectionHandle, [{ + key: "setGroup", + value: function setGroup(group) { + var _this = this; + + // If group is unchanged, do nothing + if (this._group === group) return; + // Treat null, undefined, and other falsy values the same + if (!this._group && !group) return; + + if (this._var) { + this._var.off("change", this._varOnChangeSub); + this._var = null; + this._varOnChangeSub = null; + } + + this._group = group; + + if (group) { + this._var = (0, _group2.default)(group).var("selection"); + var sub = this._var.on("change", function (e) { + _this._eventRelay.trigger("change", e, _this); + }); + this._varOnChangeSub = sub; + } + } + + /** + * Retrieves the current selection for the group represented by this + * `SelectionHandle`. + * + * - If no selection is active, then this value will be falsy. + * - If a selection is active, but no data points are selected, then this + * value will be an empty array. + * - If a selection is active, and data points are selected, then the keys + * of the selected data points will be present in the array. 
+ */ + + }, { + key: "_mergeExtraInfo", + + + /** + * Combines the given `extraInfo` (if any) with the handle's default + * `_extraInfo` (if any). + * @private + */ + value: function _mergeExtraInfo(extraInfo) { + // Important incidental effect: shallow clone is returned + return util.extend({}, this._extraInfo ? this._extraInfo : null, extraInfo ? extraInfo : null); + } + + /** + * Overwrites the current selection for the group, and raises the `"change"` + * event among all of the group's '`SelectionHandle` instances (including + * this one). + * + * @fires SelectionHandle#change + * @param {string[]} selectedKeys - Falsy, empty array, or array of keys (see + * {@link SelectionHandle#value}). + * @param {Object} [extraInfo] - Extra properties to be included on the event + * object that's passed to listeners (in addition to any options that were + * passed into the `SelectionHandle` constructor). + */ + + }, { + key: "set", + value: function set(selectedKeys, extraInfo) { + if (this._var) this._var.set(selectedKeys, this._mergeExtraInfo(extraInfo)); + } + + /** + * Overwrites the current selection for the group, and raises the `"change"` + * event among all of the group's '`SelectionHandle` instances (including + * this one). + * + * @fires SelectionHandle#change + * @param {Object} [extraInfo] - Extra properties to be included on the event + * object that's passed to listeners (in addition to any that were passed + * into the `SelectionHandle` constructor). + */ + + }, { + key: "clear", + value: function clear(extraInfo) { + if (this._var) this.set(void 0, this._mergeExtraInfo(extraInfo)); + } + + /** + * Subscribes to events on this `SelectionHandle`. + * + * @param {string} eventType - Indicates the type of events to listen to. + * Currently, only `"change"` is supported. + * @param {SelectionHandle~listener} listener - The callback function that + * will be invoked when the event occurs. 
+ * @return {string} - A token to pass to {@link SelectionHandle#off} to cancel + * this subscription. + */ + + }, { + key: "on", + value: function on(eventType, listener) { + return this._emitter.on(eventType, listener); + } + + /** + * Cancels event subscriptions created by {@link SelectionHandle#on}. + * + * @param {string} eventType - The type of event to unsubscribe. + * @param {string|SelectionHandle~listener} listener - Either the callback + * function previously passed into {@link SelectionHandle#on}, or the + * string that was returned from {@link SelectionHandle#on}. + */ + + }, { + key: "off", + value: function off(eventType, listener) { + return this._emitter.off(eventType, listener); + } + + /** + * Shuts down the `SelectionHandle` object. + * + * Removes all event listeners that were added through this handle. + */ + + }, { + key: "close", + value: function close() { + this._emitter.removeAllListeners(); + this.setGroup(null); + } + }, { + key: "value", + get: function get() { + return this._var ? this._var.get() : null; + } + }]); + + return SelectionHandle; +}(); + +/** + * @callback SelectionHandle~listener + * @param {Object} event - An object containing details of the event. For + * `"change"` events, this includes the properties `value` (the new + * value of the selection, or `undefined` if no selection is active), + * `oldValue` (the previous value of the selection), and `sender` (the + * `SelectionHandle` instance that made the change). + */ + +/** + * @event SelectionHandle#change + * @type {object} + * @property {object} value - The new value of the selection, or `undefined` + * if no selection is active. + * @property {object} oldValue - The previous value of the selection. + * @property {SelectionHandle} sender - The `SelectionHandle` instance that + * changed the value. 
+ */ + +},{"./events":1,"./group":4,"./util":11}],11:[function(require,module,exports){ +"use strict"; + +Object.defineProperty(exports, "__esModule", { + value: true +}); + +var _createClass = function () { function defineProperties(target, props) { for (var i = 0; i < props.length; i++) { var descriptor = props[i]; descriptor.enumerable = descriptor.enumerable || false; descriptor.configurable = true; if ("value" in descriptor) descriptor.writable = true; Object.defineProperty(target, descriptor.key, descriptor); } } return function (Constructor, protoProps, staticProps) { if (protoProps) defineProperties(Constructor.prototype, protoProps); if (staticProps) defineProperties(Constructor, staticProps); return Constructor; }; }(); + +var _typeof = typeof Symbol === "function" && typeof Symbol.iterator === "symbol" ? function (obj) { return typeof obj; } : function (obj) { return obj && typeof Symbol === "function" && obj.constructor === Symbol && obj !== Symbol.prototype ? "symbol" : typeof obj; }; + +exports.extend = extend; +exports.checkSorted = checkSorted; +exports.diffSortedLists = diffSortedLists; +exports.dataframeToD3 = dataframeToD3; + +function _classCallCheck(instance, Constructor) { if (!(instance instanceof Constructor)) { throw new TypeError("Cannot call a class as a function"); } } + +function extend(target) { + for (var _len = arguments.length, sources = Array(_len > 1 ? 
_len - 1 : 0), _key = 1; _key < _len; _key++) { + sources[_key - 1] = arguments[_key]; + } + + for (var i = 0; i < sources.length; i++) { + var src = sources[i]; + if (typeof src === "undefined" || src === null) continue; + + for (var key in src) { + if (src.hasOwnProperty(key)) { + target[key] = src[key]; + } + } + } + return target; +} + +function checkSorted(list) { + for (var i = 1; i < list.length; i++) { + if (list[i] <= list[i - 1]) { + throw new Error("List is not sorted or contains duplicate"); + } + } +} + +function diffSortedLists(a, b) { + var i_a = 0; + var i_b = 0; + + if (!a) a = []; + if (!b) b = []; + + var a_only = []; + var b_only = []; + + checkSorted(a); + checkSorted(b); + + while (i_a < a.length && i_b < b.length) { + if (a[i_a] === b[i_b]) { + i_a++; + i_b++; + } else if (a[i_a] < b[i_b]) { + a_only.push(a[i_a++]); + } else { + b_only.push(b[i_b++]); + } + } + + if (i_a < a.length) a_only = a_only.concat(a.slice(i_a)); + if (i_b < b.length) b_only = b_only.concat(b.slice(i_b)); + return { + removed: a_only, + added: b_only + }; +} + +// Convert from wide: { colA: [1,2,3], colB: [4,5,6], ... } +// to long: [ {colA: 1, colB: 4}, {colA: 2, colB: 5}, ... 
] +function dataframeToD3(df) { + var names = []; + var length = void 0; + for (var name in df) { + if (df.hasOwnProperty(name)) names.push(name); + if (_typeof(df[name]) !== "object" || typeof df[name].length === "undefined") { + throw new Error("All fields must be arrays"); + } else if (typeof length !== "undefined" && length !== df[name].length) { + throw new Error("All fields must be arrays of the same length"); + } + length = df[name].length; + } + var results = []; + var item = void 0; + for (var row = 0; row < length; row++) { + item = {}; + for (var col = 0; col < names.length; col++) { + item[names[col]] = df[names[col]][row]; + } + results.push(item); + } + return results; +} + +/** + * Keeps track of all event listener additions/removals and lets all active + * listeners be removed with a single operation. + * + * @private + */ + +var SubscriptionTracker = exports.SubscriptionTracker = function () { + function SubscriptionTracker(emitter) { + _classCallCheck(this, SubscriptionTracker); + + this._emitter = emitter; + this._subs = {}; + } + + _createClass(SubscriptionTracker, [{ + key: "on", + value: function on(eventType, listener) { + var sub = this._emitter.on(eventType, listener); + this._subs[sub] = eventType; + return sub; + } + }, { + key: "off", + value: function off(eventType, listener) { + var sub = this._emitter.off(eventType, listener); + if (sub) { + delete this._subs[sub]; + } + return sub; + } + }, { + key: "removeAllListeners", + value: function removeAllListeners() { + var _this = this; + + var current_subs = this._subs; + this._subs = {}; + Object.keys(current_subs).forEach(function (sub) { + _this._emitter.off(current_subs[sub], sub); + }); + } + }]); + + return SubscriptionTracker; +}(); + +},{}],12:[function(require,module,exports){ +(function (global){ +"use strict"; + +Object.defineProperty(exports, "__esModule", { + value: true +}); + +var _typeof = typeof Symbol === "function" && typeof Symbol.iterator === "symbol" ? 
function (obj) { return typeof obj; } : function (obj) { return obj && typeof Symbol === "function" && obj.constructor === Symbol && obj !== Symbol.prototype ? "symbol" : typeof obj; }; + +var _createClass = function () { function defineProperties(target, props) { for (var i = 0; i < props.length; i++) { var descriptor = props[i]; descriptor.enumerable = descriptor.enumerable || false; descriptor.configurable = true; if ("value" in descriptor) descriptor.writable = true; Object.defineProperty(target, descriptor.key, descriptor); } } return function (Constructor, protoProps, staticProps) { if (protoProps) defineProperties(Constructor.prototype, protoProps); if (staticProps) defineProperties(Constructor, staticProps); return Constructor; }; }(); + +var _events = require("./events"); + +var _events2 = _interopRequireDefault(_events); + +function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } + +function _classCallCheck(instance, Constructor) { if (!(instance instanceof Constructor)) { throw new TypeError("Cannot call a class as a function"); } } + +var Var = function () { + function Var(group, name, /*optional*/value) { + _classCallCheck(this, Var); + + this._group = group; + this._name = name; + this._value = value; + this._events = new _events2.default(); + } + + _createClass(Var, [{ + key: "get", + value: function get() { + return this._value; + } + }, { + key: "set", + value: function set(value, /*optional*/event) { + if (this._value === value) { + // Do nothing; the value hasn't changed + return; + } + var oldValue = this._value; + this._value = value; + // Alert JavaScript listeners that the value has changed + var evt = {}; + if (event && (typeof event === "undefined" ? 
"undefined" : _typeof(event)) === "object") { + for (var k in event) { + if (event.hasOwnProperty(k)) evt[k] = event[k]; + } + } + evt.oldValue = oldValue; + evt.value = value; + this._events.trigger("change", evt, this); + + // TODO: Make this extensible, to let arbitrary back-ends know that + // something has changed + if (global.Shiny && global.Shiny.onInputChange) { + global.Shiny.onInputChange(".clientValue-" + (this._group.name !== null ? this._group.name + "-" : "") + this._name, typeof value === "undefined" ? null : value); + } + } + }, { + key: "on", + value: function on(eventType, listener) { + return this._events.on(eventType, listener); + } + }, { + key: "off", + value: function off(eventType, listener) { + return this._events.off(eventType, listener); + } + }]); + + return Var; +}(); + +exports.default = Var; + +}).call(this,typeof global !== "undefined" ? global : typeof self !== "undefined" ? self : typeof window !== "undefined" ? window : {}) + +},{"./events":1}]},{},[5]) +//# sourceMappingURL=crosstalk.js.map diff --git a/_freeze/site_libs/crosstalk-1.2.0/js/crosstalk.js.map b/_freeze/site_libs/crosstalk-1.2.0/js/crosstalk.js.map new file mode 100644 index 0000000..cff94f0 --- /dev/null +++ b/_freeze/site_libs/crosstalk-1.2.0/js/crosstalk.js.map @@ -0,0 +1,37 @@ +{ + "version": 3, + "sources": [ + "node_modules/browser-pack/_prelude.js", + "javascript/src/events.js", + "javascript/src/filter.js", + "javascript/src/filterset.js", + "javascript/src/group.js", + "javascript/src/index.js", + "javascript/src/input.js", + "javascript/src/input_checkboxgroup.js", + "javascript/src/input_selectize.js", + "javascript/src/input_slider.js", + "javascript/src/selection.js", + "javascript/src/util.js", + "javascript/src/var.js" + ], + "names": [], + "mappings": 
"AAAA;;;;;;;;;;;ICAqB,M;AACnB,oBAAc;AAAA;;AACZ,SAAK,MAAL,GAAc,EAAd;AACA,SAAK,IAAL,GAAY,CAAZ;AACD;;;;uBAEE,S,EAAW,Q,EAAU;AACtB,UAAI,OAAO,KAAK,MAAL,CAAY,SAAZ,CAAX;AACA,UAAI,CAAC,IAAL,EAAW;AACT,eAAO,KAAK,MAAL,CAAY,SAAZ,IAAyB,EAAhC;AACD;AACD,UAAI,MAAM,QAAS,KAAK,IAAL,EAAnB;AACA,WAAK,GAAL,IAAY,QAAZ;AACA,aAAO,GAAP;AACD;;AAED;;;;wBACI,S,EAAW,Q,EAAU;AACvB,UAAI,OAAO,KAAK,MAAL,CAAY,SAAZ,CAAX;AACA,UAAI,OAAO,QAAP,KAAqB,UAAzB,EAAqC;AACnC,aAAK,IAAI,GAAT,IAAgB,IAAhB,EAAsB;AACpB,cAAI,KAAK,cAAL,CAAoB,GAApB,CAAJ,EAA8B;AAC5B,gBAAI,KAAK,GAAL,MAAc,QAAlB,EAA4B;AAC1B,qBAAO,KAAK,GAAL,CAAP;AACA,qBAAO,GAAP;AACD;AACF;AACF;AACD,eAAO,KAAP;AACD,OAVD,MAUO,IAAI,OAAO,QAAP,KAAqB,QAAzB,EAAmC;AACxC,YAAI,QAAQ,KAAK,QAAL,CAAZ,EAA4B;AAC1B,iBAAO,KAAK,QAAL,CAAP;AACA,iBAAO,QAAP;AACD;AACD,eAAO,KAAP;AACD,OANM,MAMA;AACL,cAAM,IAAI,KAAJ,CAAU,8BAAV,CAAN;AACD;AACF;;;4BAEO,S,EAAW,G,EAAK,O,EAAS;AAC/B,UAAI,OAAO,KAAK,MAAL,CAAY,SAAZ,CAAX;AACA,WAAK,IAAI,GAAT,IAAgB,IAAhB,EAAsB;AACpB,YAAI,KAAK,cAAL,CAAoB,GAApB,CAAJ,EAA8B;AAC5B,eAAK,GAAL,EAAU,IAAV,CAAe,OAAf,EAAwB,GAAxB;AACD;AACF;AACF;;;;;;kBA/CkB,M;;;;;;;;;;;;ACArB;;;;AACA;;;;AACA;;;;AACA;;IAAY,I;;;;;;;;AAEZ,SAAS,YAAT,CAAsB,KAAtB,EAA6B;AAC3B,MAAI,QAAQ,MAAM,GAAN,CAAU,WAAV,CAAZ;AACA,MAAI,SAAS,MAAM,GAAN,EAAb;AACA,MAAI,CAAC,MAAL,EAAa;AACX,aAAS,yBAAT;AACA,UAAM,GAAN,CAAU,MAAV;AACD;AACD,SAAO,MAAP;AACD;;AAED,IAAI,KAAK,CAAT;AACA,SAAS,MAAT,GAAkB;AAChB,SAAO,IAAP;AACD;;AAED;;;;;;;;;;;;;;;;;;;;;;;;;IAwBa,Y,WAAA,Y;AACX,wBAAY,KAAZ,EAAmB,SAAnB,EAA8B;AAAA;;AAC5B,SAAK,WAAL,GAAmB,sBAAnB;AACA,SAAK,QAAL,GAAgB,IAAI,KAAK,mBAAT,CAA6B,KAAK,WAAlC,CAAhB;;AAEA;AACA,SAAK,MAAL,GAAc,IAAd;AACA;AACA,SAAK,UAAL,GAAkB,IAAlB;AACA;AACA,SAAK,UAAL,GAAkB,IAAlB;AACA;AACA,SAAK,eAAL,GAAuB,IAAvB;;AAEA,SAAK,UAAL,GAAkB,KAAK,MAAL,CAAY,EAAE,QAAQ,IAAV,EAAZ,EAA8B,SAA9B,CAAlB;;AAEA,SAAK,GAAL,GAAW,WAAW,QAAtB;;AAEA,SAAK,QAAL,CAAc,KAAd;AACD;;AAED;;;;;;;;;;;;;;6BAUS,K,EAAO;AAAA;;AACd;AACA,UAAI,KAAK,MAAL,KAAgB,KAApB,EACE;AACF;AACA,UAAI,CAAC,KAAK,MAAN,IAAgB,CAAC,KAArB,EACE;;AAEF,UAAI,KAAK,UAAT,EAAqB;AACnB,aAAK,UAAL,CAAgB,GAAhB,CAAo
B,QAApB,EAA8B,KAAK,eAAnC;AACA,aAAK,KAAL;AACA,aAAK,eAAL,GAAuB,IAAvB;AACA,aAAK,UAAL,GAAkB,IAAlB;AACA,aAAK,UAAL,GAAkB,IAAlB;AACD;;AAED,WAAK,MAAL,GAAc,KAAd;;AAEA,UAAI,KAAJ,EAAW;AACT,gBAAQ,qBAAI,KAAJ,CAAR;AACA,aAAK,UAAL,GAAkB,aAAa,KAAb,CAAlB;AACA,aAAK,UAAL,GAAkB,qBAAI,KAAJ,EAAW,GAAX,CAAe,QAAf,CAAlB;AACA,YAAI,MAAM,KAAK,UAAL,CAAgB,EAAhB,CAAmB,QAAnB,EAA6B,UAAC,CAAD,EAAO;AAC5C,gBAAK,WAAL,CAAiB,OAAjB,CAAyB,QAAzB,EAAmC,CAAnC;AACD,SAFS,CAAV;AAGA,aAAK,eAAL,GAAuB,GAAvB;AACD;AACF;;AAED;;;;;;;;oCAKgB,S,EAAW;AACzB,aAAO,KAAK,MAAL,CAAY,EAAZ,EACL,KAAK,UAAL,GAAkB,KAAK,UAAvB,GAAoC,IAD/B,EAEL,YAAY,SAAZ,GAAwB,IAFnB,CAAP;AAGD;;AAED;;;;;;;4BAIQ;AACN,WAAK,QAAL,CAAc,kBAAd;AACA,WAAK,KAAL;AACA,WAAK,QAAL,CAAc,IAAd;AACD;;AAED;;;;;;;;;;;;0BASM,S,EAAW;AACf,UAAI,CAAC,KAAK,UAAV,EACE;AACF,WAAK,UAAL,CAAgB,KAAhB,CAAsB,KAAK,GAA3B;AACA,WAAK,SAAL,CAAe,SAAf;AACD;;AAED;;;;;;;;;;;;;;;;;;;;wBAiBI,I,EAAM,S,EAAW;AACnB,UAAI,CAAC,KAAK,UAAV,EACE;AACF,WAAK,UAAL,CAAgB,MAAhB,CAAuB,KAAK,GAA5B,EAAiC,IAAjC;AACA,WAAK,SAAL,CAAe,SAAf;AACD;;AAED;;;;;;;;;;AASA;;;;;;;;;;uBAUG,S,EAAW,Q,EAAU;AACtB,aAAO,KAAK,QAAL,CAAc,EAAd,CAAiB,SAAjB,EAA4B,QAA5B,CAAP;AACD;;AAED;;;;;;;;;;;wBAQI,S,EAAW,Q,EAAU;AACvB,aAAO,KAAK,QAAL,CAAc,GAAd,CAAkB,SAAlB,EAA6B,QAA7B,CAAP;AACD;;;8BAES,S,EAAW;AACnB,UAAI,CAAC,KAAK,UAAV,EACE;AACF,WAAK,UAAL,CAAgB,GAAhB,CAAoB,KAAK,UAAL,CAAgB,KAApC,EAA2C,KAAK,eAAL,CAAqB,SAArB,CAA3C;AACD;;AAED;;;;;;;;;;;wBApCmB;AACjB,aAAO,KAAK,UAAL,GAAkB,KAAK,UAAL,CAAgB,KAAlC,GAA0C,IAAjD;AACD;;;;;;AA6CH;;;;;;;;;;;;;;;;;;;ACzNA;;;;AAEA,SAAS,iBAAT,CAA2B,CAA3B,EAA8B,CAA9B,EAAiC;AAC/B,MAAI,MAAM,CAAV,EAAa;AACX,WAAO,CAAP;AACD,GAFD,MAEO,IAAI,IAAI,CAAR,EAAW;AAChB,WAAO,CAAC,CAAR;AACD,GAFM,MAEA,IAAI,IAAI,CAAR,EAAW;AAChB,WAAO,CAAP;AACD;AACF;;AAED;;;;IAGqB,S;AACnB,uBAAc;AAAA;;AACZ,SAAK,KAAL;AACD;;;;4BAEO;AACN;AACA,WAAK,QAAL,GAAgB,EAAhB;AACA;AACA,WAAK,KAAL,GAAa,EAAb;AACA,WAAK,MAAL,GAAc,IAAd;AACA,WAAK,cAAL,GAAsB,CAAtB;AACD;;;2BAMM,Q,EAAU,I,EAAM;AACrB,UAAI,SAAS,IAAb,EAAmB;AACjB,eAAO,KAAK,KAAL,CAAW,CAAX,CAAP,CADiB,CACK;AACtB,aAAK,IAAL,CAAU,iBAAV;AACD;;AAJoB,
6BAME,2BAAgB,KAAK,QAAL,CAAc,QAAd,CAAhB,EAAyC,IAAzC,CANF;AAAA,UAMhB,KANgB,oBAMhB,KANgB;AAAA,UAMT,OANS,oBAMT,OANS;;AAOrB,WAAK,QAAL,CAAc,QAAd,IAA0B,IAA1B;;AAEA,WAAK,IAAI,IAAI,CAAb,EAAgB,IAAI,MAAM,MAA1B,EAAkC,GAAlC,EAAuC;AACrC,aAAK,KAAL,CAAW,MAAM,CAAN,CAAX,IAAuB,CAAC,KAAK,KAAL,CAAW,MAAM,CAAN,CAAX,KAAwB,CAAzB,IAA8B,CAArD;AACD;AACD,WAAK,IAAI,KAAI,CAAb,EAAgB,KAAI,QAAQ,MAA5B,EAAoC,IAApC,EAAyC;AACvC,aAAK,KAAL,CAAW,QAAQ,EAAR,CAAX;AACD;;AAED,WAAK,YAAL,CAAkB,IAAlB;AACD;;AAED;;;;;;;;mCAKmC;AAAA,UAAtB,IAAsB,uEAAf,KAAK,QAAU;;AACjC,UAAI,cAAc,OAAO,IAAP,CAAY,KAAK,QAAjB,EAA2B,MAA7C;AACA,UAAI,gBAAgB,CAApB,EAAuB;AACrB,aAAK,MAAL,GAAc,IAAd;AACD,OAFD,MAEO;AACL,aAAK,MAAL,GAAc,EAAd;AACA,aAAK,IAAI,IAAI,CAAb,EAAgB,IAAI,KAAK,MAAzB,EAAiC,GAAjC,EAAsC;AACpC,cAAI,QAAQ,KAAK,KAAL,CAAW,KAAK,CAAL,CAAX,CAAZ;AACA,cAAI,UAAU,WAAd,EAA2B;AACzB,iBAAK,MAAL,CAAY,IAAZ,CAAiB,KAAK,CAAL,CAAjB;AACD;AACF;AACF;AACF;;;0BAEK,Q,EAAU;AACd,UAAI,OAAO,KAAK,QAAL,CAAc,QAAd,CAAP,KAAoC,WAAxC,EAAqD;AACnD;AACD;;AAED,UAAI,OAAO,KAAK,QAAL,CAAc,QAAd,CAAX;AACA,UAAI,CAAC,IAAL,EAAW;AACT,eAAO,EAAP;AACD;;AAED,WAAK,IAAI,IAAI,CAAb,EAAgB,IAAI,KAAK,MAAzB,EAAiC,GAAjC,EAAsC;AACpC,aAAK,KAAL,CAAW,KAAK,CAAL,CAAX;AACD;AACD,aAAO,KAAK,QAAL,CAAc,QAAd,CAAP;;AAEA,WAAK,YAAL;AACD;;;wBA3DW;AACV,aAAO,KAAK,MAAZ;AACD;;;wBA2Dc;AACb,UAAI,UAAU,OAAO,IAAP,CAAY,KAAK,KAAjB,CAAd;AACA,cAAQ,IAAR,CAAa,iBAAb;AACA,aAAO,OAAP;AACD;;;;;;kBA/EkB,S;;;;;;;;;;;;;;kBCRG,K;;AAPxB;;;;;;;;AAEA;AACA;AACA,OAAO,kBAAP,GAA4B,OAAO,kBAAP,IAA6B,EAAzD;AACA,IAAI,SAAS,OAAO,kBAApB;;AAEe,SAAS,KAAT,CAAe,SAAf,EAA0B;AACvC,MAAI,aAAa,OAAO,SAAP,KAAsB,QAAvC,EAAiD;AAC/C,QAAI,CAAC,OAAO,cAAP,CAAsB,SAAtB,CAAL,EAAuC;AACrC,aAAO,SAAP,IAAoB,IAAI,KAAJ,CAAU,SAAV,CAApB;AACD;AACD,WAAO,OAAO,SAAP,CAAP;AACD,GALD,MAKO,IAAI,QAAO,SAAP,yCAAO,SAAP,OAAsB,QAAtB,IAAkC,UAAU,KAA5C,IAAqD,UAAU,GAAnE,EAAwE;AAC7E;AACA,WAAO,SAAP;AACD,GAHM,MAGA,IAAI,MAAM,OAAN,CAAc,SAAd,KACP,UAAU,MAAV,IAAoB,CADb,IAEP,OAAO,UAAU,CAAV,CAAP,KAAyB,QAFtB,EAEgC;AACrC,WAAO,MAAM,UAAU,CAAV,CAAN,CAAP;AACD,GAJM,MAIA;AACL,UAAM,IAAI,KAAJ,CAAU,4BAAV,CAAN;AACD;AACF;;
IAEK,K;AACJ,iBAAY,IAAZ,EAAkB;AAAA;;AAChB,SAAK,IAAL,GAAY,IAAZ;AACA,SAAK,KAAL,GAAa,EAAb;AACD;;;;yBAEG,I,EAAM;AACR,UAAI,CAAC,IAAD,IAAS,OAAO,IAAP,KAAiB,QAA9B,EAAwC;AACtC,cAAM,IAAI,KAAJ,CAAU,kBAAV,CAAN;AACD;;AAED,UAAI,CAAC,KAAK,KAAL,CAAW,cAAX,CAA0B,IAA1B,CAAL,EACE,KAAK,KAAL,CAAW,IAAX,IAAmB,kBAAQ,IAAR,EAAc,IAAd,CAAnB;AACF,aAAO,KAAK,KAAL,CAAW,IAAX,CAAP;AACD;;;wBAEG,I,EAAM;AACR,UAAI,CAAC,IAAD,IAAS,OAAO,IAAP,KAAiB,QAA9B,EAAwC;AACtC,cAAM,IAAI,KAAJ,CAAU,kBAAV,CAAN;AACD;;AAED,aAAO,KAAK,KAAL,CAAW,cAAX,CAA0B,IAA1B,CAAP;AACD;;;;;;;;;;;;;;;;AC/CH;;;;AACA;;AACA;;AACA;;AACA;;AACA;;AACA;;;;AAEA,IAAM,eAAe,qBAAM,SAAN,CAArB;;AAEA,SAAS,IAAT,CAAc,IAAd,EAAoB;AAClB,SAAO,aAAa,GAAb,CAAiB,IAAjB,CAAP;AACD;;AAED,SAAS,GAAT,CAAa,IAAb,EAAmB;AACjB,SAAO,aAAa,GAAb,CAAiB,IAAjB,CAAP;AACD;;AAED,IAAI,OAAO,KAAX,EAAkB;AAChB,SAAO,KAAP,CAAa,uBAAb,CAAqC,qBAArC,EAA4D,UAAS,OAAT,EAAkB;AAC5E,QAAI,OAAO,QAAQ,KAAf,KAA0B,QAA9B,EAAwC;AACtC,2BAAM,QAAQ,KAAd,EAAqB,GAArB,CAAyB,QAAQ,IAAjC,EAAuC,GAAvC,CAA2C,QAAQ,KAAnD;AACD,KAFD,MAEO;AACL,WAAK,QAAQ,IAAb,EAAmB,GAAnB,CAAuB,QAAQ,KAA/B;AACD;AACF,GAND;AAOD;;AAED,IAAM,YAAY;AAChB,wBADgB;AAEhB,OAAK,IAFW;AAGhB,OAAK,GAHW;AAIhB,6CAJgB;AAKhB,oCALgB;AAMhB;AANgB,CAAlB;;AASA;;;kBAGe,S;;AACf,OAAO,SAAP,GAAmB,SAAnB;;;;;;;;;;;QCrCgB,Q,GAAA,Q;QAWA,I,GAAA,I;AAfhB,IAAI,IAAI,OAAO,MAAf;;AAEA,IAAI,WAAW,EAAf;;AAEO,SAAS,QAAT,CAAkB,GAAlB,EAAuB;AAC5B,WAAS,IAAI,SAAb,IAA0B,GAA1B;AACA,MAAI,OAAO,QAAP,IAAmB,OAAO,QAAP,CAAgB,UAAhB,KAA+B,UAAtD,EAAkE;AAChE,MAAE,YAAM;AACN;AACD,KAFD;AAGD,GAJD,MAIO,IAAI,OAAO,QAAX,EAAqB;AAC1B,eAAW,IAAX,EAAiB,GAAjB;AACD;AACF;;AAEM,SAAS,IAAT,GAAgB;AACrB,SAAO,IAAP,CAAY,QAAZ,EAAsB,OAAtB,CAA8B,UAAS,SAAT,EAAoB;AAChD,QAAI,UAAU,SAAS,SAAT,CAAd;AACA,MAAE,MAAM,QAAQ,SAAhB,EAA2B,GAA3B,CAA+B,wBAA/B,EAAyD,IAAzD,CAA8D,UAAS,CAAT,EAAY,EAAZ,EAAgB;AAC5E,mBAAa,OAAb,EAAsB,EAAtB;AACD,KAFD;AAGD,GALD;AAMD;;AAED;AACA,SAAS,OAAT,CAAiB,GAAjB,EAAsB;AACpB,SAAO,IAAI,OAAJ,CAAY,uCAAZ,EAAqD,MAArD,CAAP;AACD;;AAED,SAAS,MAAT,CAAgB,EAAhB,EAAoB;AAClB,MAAI,MAAM,EAAE,EAAF,CAAV;AACA,SAAO,IAAP,CAAY,QAAZ,EAAsB,OAAtB,CAA8B,UAAS,S
AAT,EAAoB;AAChD,QAAI,IAAI,QAAJ,CAAa,SAAb,KAA2B,CAAC,IAAI,QAAJ,CAAa,uBAAb,CAAhC,EAAuE;AACrE,UAAI,UAAU,SAAS,SAAT,CAAd;AACA,mBAAa,OAAb,EAAsB,EAAtB;AACD;AACF,GALD;AAMD;;AAED,SAAS,YAAT,CAAsB,OAAtB,EAA+B,EAA/B,EAAmC;AACjC,MAAI,SAAS,EAAE,EAAF,EAAM,IAAN,CAAW,+CAA+C,QAAQ,GAAG,EAAX,CAA/C,GAAgE,IAA3E,CAAb;AACA,MAAI,OAAO,KAAK,KAAL,CAAW,OAAO,CAAP,EAAU,SAArB,CAAX;;AAEA,MAAI,WAAW,QAAQ,OAAR,CAAgB,EAAhB,EAAoB,IAApB,CAAf;AACA,IAAE,EAAF,EAAM,IAAN,CAAW,oBAAX,EAAiC,QAAjC;AACA,IAAE,EAAF,EAAM,QAAN,CAAe,uBAAf;AACD;;AAED,IAAI,OAAO,KAAX,EAAkB;AAChB,MAAI,eAAe,IAAI,OAAO,KAAP,CAAa,YAAjB,EAAnB;AACA,MAAI,KAAI,OAAO,MAAf;AACA,KAAE,MAAF,CAAS,YAAT,EAAuB;AACrB,UAAM,cAAS,KAAT,EAAgB;AACpB,aAAO,GAAE,KAAF,EAAS,IAAT,CAAc,kBAAd,CAAP;AACD,KAHoB;AAIrB,gBAAY,oBAAS,EAAT,EAAa;AACvB,UAAI,CAAC,GAAE,EAAF,EAAM,QAAN,CAAe,uBAAf,CAAL,EAA8C;AAC5C,eAAO,EAAP;AACD;AACF,KARoB;AASrB,WAAO,eAAS,EAAT,EAAa;AAClB,aAAO,GAAG,EAAV;AACD,KAXoB;AAYrB,cAAU,kBAAS,EAAT,EAAa,CAEtB,CAdoB;AAerB,cAAU,kBAAS,EAAT,EAAa,KAAb,EAAoB,CAE7B,CAjBoB;AAkBrB,oBAAgB,wBAAS,EAAT,EAAa,IAAb,EAAmB,CAElC,CApBoB;AAqBrB,eAAW,mBAAS,EAAT,EAAa,QAAb,EAAuB;AAChC,SAAE,EAAF,EAAM,IAAN,CAAW,oBAAX,EAAiC,MAAjC;AACD,KAvBoB;AAwBrB,iBAAa,qBAAS,EAAT,EAAa;AACxB,SAAE,EAAF,EAAM,IAAN,CAAW,oBAAX,EAAiC,OAAjC;AACD;AA1BoB,GAAvB;AA4BA,SAAO,KAAP,CAAa,aAAb,CAA2B,QAA3B,CAAoC,YAApC,EAAkD,wBAAlD;AACD;;;;;;;;AChFD;;IAAY,K;;AACZ;;;;AAEA,IAAI,IAAI,OAAO,MAAf;;AAEA,MAAM,QAAN,CAAe;AACb,aAAW,+BADE;;AAGb,WAAS,iBAAS,EAAT,EAAa,IAAb,EAAmB;AAC1B;;;;AAIA,QAAI,WAAW,yBAAiB,KAAK,KAAtB,CAAf;;AAEA,QAAI,sBAAJ;AACA,QAAI,MAAM,EAAE,EAAF,CAAV;AACA,QAAI,EAAJ,CAAO,QAAP,EAAiB,wBAAjB,EAA2C,YAAW;AACpD,UAAI,UAAU,IAAI,IAAJ,CAAS,gCAAT,CAAd;AACA,UAAI,QAAQ,MAAR,KAAmB,CAAvB,EAA0B;AACxB,wBAAgB,IAAhB;AACA,iBAAS,KAAT;AACD,OAHD,MAGO;AACL,YAAI,OAAO,EAAX;AACA,gBAAQ,IAAR,CAAa,YAAW;AACtB,eAAK,GAAL,CAAS,KAAK,KAAd,EAAqB,OAArB,CAA6B,UAAS,GAAT,EAAc;AACzC,iBAAK,GAAL,IAAY,IAAZ;AACD,WAFD;AAGD,SAJD;AAKA,YAAI,WAAW,OAAO,IAAP,CAAY,IAAZ,CAAf;AACA,iBAAS,IAAT;AACA,wBAAgB,QAAhB;AACA,iBAAS,GAAT,CAAa,QAAb;AACD;AACF,KAjBD;;AAmBA,WAAO;AACL,eAAS,mBAAW;
AAClB,iBAAS,KAAT;AACD,OAHI;AAIL,cAAQ,kBAAW;AACjB,YAAI,aAAJ,EACE,SAAS,GAAT,CAAa,aAAb;AACH;AAPI,KAAP;AASD;AAxCY,CAAf;;;;;;;;ACLA;;IAAY,K;;AACZ;;IAAY,I;;AACZ;;;;AAEA,IAAI,IAAI,OAAO,MAAf;;AAEA,MAAM,QAAN,CAAe;AACb,aAAW,wBADE;;AAGb,WAAS,iBAAS,EAAT,EAAa,IAAb,EAAmB;AAC1B;;;;;;AAMA,QAAI,QAAQ,CAAC,EAAC,OAAO,EAAR,EAAY,OAAO,OAAnB,EAAD,CAAZ;AACA,QAAI,QAAQ,KAAK,aAAL,CAAmB,KAAK,KAAxB,CAAZ;AACA,QAAI,OAAO;AACT,eAAS,MAAM,MAAN,CAAa,KAAb,CADA;AAET,kBAAY,OAFH;AAGT,kBAAY,OAHH;AAIT,mBAAa;AAJJ,KAAX;;AAOA,QAAI,SAAS,EAAE,EAAF,EAAM,IAAN,CAAW,QAAX,EAAqB,CAArB,CAAb;;AAEA,QAAI,YAAY,EAAE,MAAF,EAAU,SAAV,CAAoB,IAApB,EAA0B,CAA1B,EAA6B,SAA7C;;AAEA,QAAI,WAAW,yBAAiB,KAAK,KAAtB,CAAf;;AAEA,QAAI,sBAAJ;AACA,cAAU,EAAV,CAAa,QAAb,EAAuB,YAAW;AAChC,UAAI,UAAU,KAAV,CAAgB,MAAhB,KAA2B,CAA/B,EAAkC;AAChC,wBAAgB,IAAhB;AACA,iBAAS,KAAT;AACD,OAHD,MAGO;AACL,YAAI,OAAO,EAAX;AACA,kBAAU,KAAV,CAAgB,OAAhB,CAAwB,UAAS,KAAT,EAAgB;AACtC,eAAK,GAAL,CAAS,KAAT,EAAgB,OAAhB,CAAwB,UAAS,GAAT,EAAc;AACpC,iBAAK,GAAL,IAAY,IAAZ;AACD,WAFD;AAGD,SAJD;AAKA,YAAI,WAAW,OAAO,IAAP,CAAY,IAAZ,CAAf;AACA,iBAAS,IAAT;AACA,wBAAgB,QAAhB;AACA,iBAAS,GAAT,CAAa,QAAb;AACD;AACF,KAhBD;;AAkBA,WAAO;AACL,eAAS,mBAAW;AAClB,iBAAS,KAAT;AACD,OAHI;AAIL,cAAQ,kBAAW;AACjB,YAAI,aAAJ,EACE,SAAS,GAAT,CAAa,aAAb;AACH;AAPI,KAAP;AASD;AArDY,CAAf;;;;;;;;;;ACNA;;IAAY,K;;AACZ;;;;AAEA,IAAI,IAAI,OAAO,MAAf;AACA,IAAI,WAAW,OAAO,QAAtB;;AAEA,MAAM,QAAN,CAAe;AACb,aAAW,wBADE;;AAGb,WAAS,iBAAS,EAAT,EAAa,IAAb,EAAmB;AAC1B;;;;AAIA,QAAI,WAAW,yBAAiB,KAAK,KAAtB,CAAf;;AAEA,QAAI,OAAO,EAAX;AACA,QAAI,MAAM,EAAE,EAAF,EAAM,IAAN,CAAW,OAAX,CAAV;AACA,QAAI,WAAW,IAAI,IAAJ,CAAS,WAAT,CAAf;AACA,QAAI,aAAa,IAAI,IAAJ,CAAS,aAAT,CAAjB;AACA,QAAI,QAAQ,IAAI,IAAJ,CAAS,OAAT,CAAZ;AACA,QAAI,sBAAJ;;AAEA;AACA,QAAI,aAAa,MAAjB,EAAyB;AACvB,sBAAgB,SAAS,GAAT,EAAhB;AACA,WAAK,QAAL,GAAgB,UAAS,GAAT,EAAc;AAC5B,eAAO,cAAc,UAAd,EAA0B,IAAI,IAAJ,CAAS,GAAT,CAA1B,CAAP;AACD,OAFD;AAID,KAND,MAMO,IAAI,aAAa,UAAjB,EAA6B;AAClC,UAAI,WAAW,IAAI,IAAJ,CAAS,UAAT,CAAf;AACA,UAAI,QAAJ,EACE,gBAAgB,SAAS,QAAT,CAAkB,QAAlB,CAAhB,CADF,KAGE,gBAAgB,QAAhB;;AAEF,WAAK,QAAL,GAAgB,U
AAS,GAAT,EAAc;AAC5B,eAAO,cAAc,UAAd,EAA0B,IAAI,IAAJ,CAAS,GAAT,CAA1B,CAAP;AACD,OAFD;AAGD,KAVM,MAUA,IAAI,aAAa,QAAjB,EAA2B;AAChC,UAAI,OAAO,KAAP,KAAiB,WAArB,EACE,KAAK,QAAL,GAAgB,UAAS,GAAT,EAAc;AAC5B,YAAI,SAAS,KAAK,GAAL,CAAS,EAAT,EAAa,KAAb,CAAb;AACA,eAAO,KAAK,KAAL,CAAW,MAAM,MAAjB,IAA2B,MAAlC;AACD,OAHD;AAIH;;AAED,QAAI,cAAJ,CAAmB,IAAnB;;AAEA,aAAS,QAAT,GAAoB;AAClB,UAAI,SAAS,IAAI,IAAJ,CAAS,gBAAT,EAA2B,MAAxC;;AAEA;AACA,UAAI,gBAAJ;AACA,UAAI,WAAW,IAAI,IAAJ,CAAS,WAAT,CAAf;AACA,UAAI,aAAa,MAAjB,EAAyB;AACvB,kBAAU,iBAAS,GAAT,EAAc;AACtB,iBAAO,cAAc,IAAI,IAAJ,CAAS,CAAC,GAAV,CAAd,CAAP;AACD,SAFD;AAGD,OAJD,MAIO,IAAI,aAAa,UAAjB,EAA6B;AAClC,kBAAU,iBAAS,GAAT,EAAc;AACtB;AACA,iBAAO,CAAC,GAAD,GAAO,IAAd;AACD,SAHD;AAID,OALM,MAKA;AACL,kBAAU,iBAAS,GAAT,EAAc;AAAE,iBAAO,CAAC,GAAR;AAAc,SAAxC;AACD;;AAED,UAAI,IAAI,IAAJ,CAAS,gBAAT,EAA2B,OAA3B,CAAmC,IAAnC,KAA4C,QAAhD,EAA0D;AACxD,eAAO,CAAC,QAAQ,OAAO,IAAf,CAAD,EAAuB,QAAQ,OAAO,EAAf,CAAvB,CAAP;AACD,OAFD,MAEO;AACL,eAAO,QAAQ,OAAO,IAAf,CAAP;AACD;AACF;;AAED,QAAI,gBAAgB,IAApB;;AAEA,QAAI,EAAJ,CAAO,6BAAP,EAAsC,UAAS,KAAT,EAAgB;AACpD,UAAI,CAAC,IAAI,IAAJ,CAAS,UAAT,CAAD,IAAyB,CAAC,IAAI,IAAJ,CAAS,WAAT,CAA9B,EAAqD;AAAA,wBAClC,UADkC;AAAA;AAAA,YAC9C,IAD8C;AAAA,YACxC,EADwC;;AAEnD,YAAI,OAAO,EAAX;AACA,aAAK,IAAI,IAAI,CAAb,EAAgB,IAAI,KAAK,MAAL,CAAY,MAAhC,EAAwC,GAAxC,EAA6C;AAC3C,cAAI,MAAM,KAAK,MAAL,CAAY,CAAZ,CAAV;AACA,cAAI,OAAO,IAAP,IAAe,OAAO,EAA1B,EAA8B;AAC5B,iBAAK,IAAL,CAAU,KAAK,IAAL,CAAU,CAAV,CAAV;AACD;AACF;AACD,aAAK,IAAL;AACA,iBAAS,GAAT,CAAa,IAAb;AACA,wBAAgB,IAAhB;AACD;AACF,KAdD;;AAiBA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;AACA;;AAEA,WAAO;AACL,eAAS,mBAAW;AAClB,iBAAS,KAAT;AACD,OAHI;AAIL,cAAQ,kBAAW;AACjB,YAAI,aAAJ,EACE,SAAS,GAAT,CAAa,aAAb;AACH;AAPI,KAAP;AASD;AApHY,CAAf;;AAwHA;AACA,SAAS,QAAT,CAAkB,CAAlB,EAAqB,MAArB,EAA6B;AAC3B,MAAI,MAAM,EAAE,QAAF,EAAV;AACA,SAAO,IAAI,MAAJ,GAAa,MAApB;AACE,UAAM,MAAM,GAAZ;AADF,GAEA,OAAO,GAAP;AACD;;AAED;AACA;AACA,SAAS,aAAT,CAAuB,IAAvB,EAA6B;AAC3B,MAAI,gBAAgB,IAApB,EAA0B;AACxB,WAAO,KAAK,cAAL,KAAwB,GAAxB,GACA,SAAS,KA
AK,WAAL,KAAmB,CAA5B,EAA+B,CAA/B,CADA,GACoC,GADpC,GAEA,SAAS,KAAK,UAAL,EAAT,EAA4B,CAA5B,CAFP;AAID,GALD,MAKO;AACL,WAAO,IAAP;AACD;AACF;;;;;;;;;;;;;;ACjJD;;;;AACA;;;;AACA;;IAAY,I;;;;;;;;AAEZ;;;;;;;;;;;;;;;;IAgBa,e,WAAA,e;AAEX,6BAA4C;AAAA,QAAhC,KAAgC,uEAAxB,IAAwB;AAAA,QAAlB,SAAkB,uEAAN,IAAM;;AAAA;;AAC1C,SAAK,WAAL,GAAmB,sBAAnB;AACA,SAAK,QAAL,GAAgB,IAAI,KAAK,mBAAT,CAA6B,KAAK,WAAlC,CAAhB;;AAEA;AACA,SAAK,MAAL,GAAc,IAAd;AACA;AACA,SAAK,IAAL,GAAY,IAAZ;AACA;AACA,SAAK,eAAL,GAAuB,IAAvB;;AAEA,SAAK,UAAL,GAAkB,KAAK,MAAL,CAAY,EAAE,QAAQ,IAAV,EAAZ,EAA8B,SAA9B,CAAlB;;AAEA,SAAK,QAAL,CAAc,KAAd;AACD;;AAED;;;;;;;;;;;;;;;;;6BAaS,K,EAAO;AAAA;;AACd;AACA,UAAI,KAAK,MAAL,KAAgB,KAApB,EACE;AACF;AACA,UAAI,CAAC,KAAK,MAAN,IAAgB,CAAC,KAArB,EACE;;AAEF,UAAI,KAAK,IAAT,EAAe;AACb,aAAK,IAAL,CAAU,GAAV,CAAc,QAAd,EAAwB,KAAK,eAA7B;AACA,aAAK,IAAL,GAAY,IAAZ;AACA,aAAK,eAAL,GAAuB,IAAvB;AACD;;AAED,WAAK,MAAL,GAAc,KAAd;;AAEA,UAAI,KAAJ,EAAW;AACT,aAAK,IAAL,GAAY,qBAAI,KAAJ,EAAW,GAAX,CAAe,WAAf,CAAZ;AACA,YAAI,MAAM,KAAK,IAAL,CAAU,EAAV,CAAa,QAAb,EAAuB,UAAC,CAAD,EAAO;AACtC,gBAAK,WAAL,CAAiB,OAAjB,CAAyB,QAAzB,EAAmC,CAAnC;AACD,SAFS,CAAV;AAGA,aAAK,eAAL,GAAuB,GAAvB;AACD;AACF;;AAED;;;;;;;;;;;;;;;AAcA;;;;;oCAKgB,S,EAAW;AACzB;AACA,aAAO,KAAK,MAAL,CAAY,EAAZ,EACL,KAAK,UAAL,GAAkB,KAAK,UAAvB,GAAoC,IAD/B,EAEL,YAAY,SAAZ,GAAwB,IAFnB,CAAP;AAGD;;AAED;;;;;;;;;;;;;;;wBAYI,Y,EAAc,S,EAAW;AAC3B,UAAI,KAAK,IAAT,EACE,KAAK,IAAL,CAAU,GAAV,CAAc,YAAd,EAA4B,KAAK,eAAL,CAAqB,SAArB,CAA5B;AACH;;AAED;;;;;;;;;;;;;0BAUM,S,EAAW;AACf,UAAI,KAAK,IAAT,EACE,KAAK,GAAL,CAAS,KAAK,CAAd,EAAiB,KAAK,eAAL,CAAqB,SAArB,CAAjB;AACH;;AAED;;;;;;;;;;;;;uBAUG,S,EAAW,Q,EAAU;AACtB,aAAO,KAAK,QAAL,CAAc,EAAd,CAAiB,SAAjB,EAA4B,QAA5B,CAAP;AACD;;AAED;;;;;;;;;;;wBAQI,S,EAAW,Q,EAAU;AACvB,aAAO,KAAK,QAAL,CAAc,GAAd,CAAkB,SAAlB,EAA6B,QAA7B,CAAP;AACD;;AAED;;;;;;;;4BAKQ;AACN,WAAK,QAAL,CAAc,kBAAd;AACA,WAAK,QAAL,CAAc,IAAd;AACD;;;wBAlFW;AACV,aAAO,KAAK,IAAL,GAAY,KAAK,IAAL,CAAU,GAAV,EAAZ,GAA8B,IAArC;AACD;;;;;;AAmFH;;;;;;;;;AASA;;;;;;;;;;;;;;;;;;;;;QCpLgB,M,GAAA,M;QAeA,W,GAAA,W;QAQA,e,GAAA,e;QAoCA,a,GAAA,a;;;
;AA3DT,SAAS,MAAT,CAAgB,MAAhB,EAAoC;AAAA,oCAAT,OAAS;AAAT,WAAS;AAAA;;AACzC,OAAK,IAAI,IAAI,CAAb,EAAgB,IAAI,QAAQ,MAA5B,EAAoC,GAApC,EAAyC;AACvC,QAAI,MAAM,QAAQ,CAAR,CAAV;AACA,QAAI,OAAO,GAAP,KAAgB,WAAhB,IAA+B,QAAQ,IAA3C,EACE;;AAEF,SAAK,IAAI,GAAT,IAAgB,GAAhB,EAAqB;AACnB,UAAI,IAAI,cAAJ,CAAmB,GAAnB,CAAJ,EAA6B;AAC3B,eAAO,GAAP,IAAc,IAAI,GAAJ,CAAd;AACD;AACF;AACF;AACD,SAAO,MAAP;AACD;;AAEM,SAAS,WAAT,CAAqB,IAArB,EAA2B;AAChC,OAAK,IAAI,IAAI,CAAb,EAAgB,IAAI,KAAK,MAAzB,EAAiC,GAAjC,EAAsC;AACpC,QAAI,KAAK,CAAL,KAAW,KAAK,IAAE,CAAP,CAAf,EAA0B;AACxB,YAAM,IAAI,KAAJ,CAAU,0CAAV,CAAN;AACD;AACF;AACF;;AAEM,SAAS,eAAT,CAAyB,CAAzB,EAA4B,CAA5B,EAA+B;AACpC,MAAI,MAAM,CAAV;AACA,MAAI,MAAM,CAAV;;AAEA,MAAI,CAAC,CAAL,EAAQ,IAAI,EAAJ;AACR,MAAI,CAAC,CAAL,EAAQ,IAAI,EAAJ;;AAER,MAAI,SAAS,EAAb;AACA,MAAI,SAAS,EAAb;;AAEA,cAAY,CAAZ;AACA,cAAY,CAAZ;;AAEA,SAAO,MAAM,EAAE,MAAR,IAAkB,MAAM,EAAE,MAAjC,EAAyC;AACvC,QAAI,EAAE,GAAF,MAAW,EAAE,GAAF,CAAf,EAAuB;AACrB;AACA;AACD,KAHD,MAGO,IAAI,EAAE,GAAF,IAAS,EAAE,GAAF,CAAb,EAAqB;AAC1B,aAAO,IAAP,CAAY,EAAE,KAAF,CAAZ;AACD,KAFM,MAEA;AACL,aAAO,IAAP,CAAY,EAAE,KAAF,CAAZ;AACD;AACF;;AAED,MAAI,MAAM,EAAE,MAAZ,EACE,SAAS,OAAO,MAAP,CAAc,EAAE,KAAF,CAAQ,GAAR,CAAd,CAAT;AACF,MAAI,MAAM,EAAE,MAAZ,EACE,SAAS,OAAO,MAAP,CAAc,EAAE,KAAF,CAAQ,GAAR,CAAd,CAAT;AACF,SAAO;AACL,aAAS,MADJ;AAEL,WAAO;AAFF,GAAP;AAID;;AAED;AACA;AACO,SAAS,aAAT,CAAuB,EAAvB,EAA2B;AAChC,MAAI,QAAQ,EAAZ;AACA,MAAI,eAAJ;AACA,OAAK,IAAI,IAAT,IAAiB,EAAjB,EAAqB;AACnB,QAAI,GAAG,cAAH,CAAkB,IAAlB,CAAJ,EACE,MAAM,IAAN,CAAW,IAAX;AACF,QAAI,QAAO,GAAG,IAAH,CAAP,MAAqB,QAArB,IAAiC,OAAO,GAAG,IAAH,EAAS,MAAhB,KAA4B,WAAjE,EAA8E;AAC5E,YAAM,IAAI,KAAJ,CAAU,2BAAV,CAAN;AACD,KAFD,MAEO,IAAI,OAAO,MAAP,KAAmB,WAAnB,IAAkC,WAAW,GAAG,IAAH,EAAS,MAA1D,EAAkE;AACvE,YAAM,IAAI,KAAJ,CAAU,8CAAV,CAAN;AACD;AACD,aAAS,GAAG,IAAH,EAAS,MAAlB;AACD;AACD,MAAI,UAAU,EAAd;AACA,MAAI,aAAJ;AACA,OAAK,IAAI,MAAM,CAAf,EAAkB,MAAM,MAAxB,EAAgC,KAAhC,EAAuC;AACrC,WAAO,EAAP;AACA,SAAK,IAAI,MAAM,CAAf,EAAkB,MAAM,MAAM,MAA9B,EAAsC,KAAtC,EAA6C;AAC3C,WAAK,MAAM,GAAN,CAAL,IAAmB,GAAG,MAAM,GAAN,CAAH,EAAe,GAAf,CAAnB;AACD;AA
CD,YAAQ,IAAR,CAAa,IAAb;AACD;AACD,SAAO,OAAP;AACD;;AAED;;;;;;;IAMa,mB,WAAA,mB;AACX,+BAAY,OAAZ,EAAqB;AAAA;;AACnB,SAAK,QAAL,GAAgB,OAAhB;AACA,SAAK,KAAL,GAAa,EAAb;AACD;;;;uBAEE,S,EAAW,Q,EAAU;AACtB,UAAI,MAAM,KAAK,QAAL,CAAc,EAAd,CAAiB,SAAjB,EAA4B,QAA5B,CAAV;AACA,WAAK,KAAL,CAAW,GAAX,IAAkB,SAAlB;AACA,aAAO,GAAP;AACD;;;wBAEG,S,EAAW,Q,EAAU;AACvB,UAAI,MAAM,KAAK,QAAL,CAAc,GAAd,CAAkB,SAAlB,EAA6B,QAA7B,CAAV;AACA,UAAI,GAAJ,EAAS;AACP,eAAO,KAAK,KAAL,CAAW,GAAX,CAAP;AACD;AACD,aAAO,GAAP;AACD;;;yCAEoB;AAAA;;AACnB,UAAI,eAAe,KAAK,KAAxB;AACA,WAAK,KAAL,GAAa,EAAb;AACA,aAAO,IAAP,CAAY,YAAZ,EAA0B,OAA1B,CAAkC,UAAC,GAAD,EAAS;AACzC,cAAK,QAAL,CAAc,GAAd,CAAkB,aAAa,GAAb,CAAlB,EAAqC,GAArC;AACD,OAFD;AAGD;;;;;;;;;;;;;;;;;;ACpHH;;;;;;;;IAEqB,G;AACnB,eAAY,KAAZ,EAAmB,IAAnB,EAAyB,YAAa,KAAtC,EAA6C;AAAA;;AAC3C,SAAK,MAAL,GAAc,KAAd;AACA,SAAK,KAAL,GAAa,IAAb;AACA,SAAK,MAAL,GAAc,KAAd;AACA,SAAK,OAAL,GAAe,sBAAf;AACD;;;;0BAEK;AACJ,aAAO,KAAK,MAAZ;AACD;;;wBAEG,K,EAAO,YAAa,K,EAAO;AAC7B,UAAI,KAAK,MAAL,KAAgB,KAApB,EAA2B;AACzB;AACA;AACD;AACD,UAAI,WAAW,KAAK,MAApB;AACA,WAAK,MAAL,GAAc,KAAd;AACA;AACA,UAAI,MAAM,EAAV;AACA,UAAI,SAAS,QAAO,KAAP,yCAAO,KAAP,OAAkB,QAA/B,EAAyC;AACvC,aAAK,IAAI,CAAT,IAAc,KAAd,EAAqB;AACnB,cAAI,MAAM,cAAN,CAAqB,CAArB,CAAJ,EACE,IAAI,CAAJ,IAAS,MAAM,CAAN,CAAT;AACH;AACF;AACD,UAAI,QAAJ,GAAe,QAAf;AACA,UAAI,KAAJ,GAAY,KAAZ;AACA,WAAK,OAAL,CAAa,OAAb,CAAqB,QAArB,EAA+B,GAA/B,EAAoC,IAApC;;AAEA;AACA;AACA,UAAI,OAAO,KAAP,IAAgB,OAAO,KAAP,CAAa,aAAjC,EAAgD;AAC9C,eAAO,KAAP,CAAa,aAAb,CACE,mBACG,KAAK,MAAL,CAAY,IAAZ,KAAqB,IAArB,GAA4B,KAAK,MAAL,CAAY,IAAZ,GAAmB,GAA/C,GAAqD,EADxD,IAEE,KAAK,KAHT,EAIE,OAAO,KAAP,KAAkB,WAAlB,GAAgC,IAAhC,GAAuC,KAJzC;AAMD;AACF;;;uBAEE,S,EAAW,Q,EAAU;AACtB,aAAO,KAAK,OAAL,CAAa,EAAb,CAAgB,SAAhB,EAA2B,QAA3B,CAAP;AACD;;;wBAEG,S,EAAW,Q,EAAU;AACvB,aAAO,KAAK,OAAL,CAAa,GAAb,CAAiB,SAAjB,EAA4B,QAA5B,CAAP;AACD;;;;;;kBAjDkB,G", + "file": "generated.js", + "sourceRoot": "", + "sourcesContent": [ + "(function(){function e(t,n,r){function s(o,u){if(!n[o]){if(!t[o]){var a=typeof require==\"function\"&&require;if(!u&&a)return 
a(o,!0);if(i)return i(o,!0);var f=new Error(\"Cannot find module '\"+o+\"'\");throw f.code=\"MODULE_NOT_FOUND\",f}var l=n[o]={exports:{}};t[o][0].call(l.exports,function(e){var n=t[o][1][e];return s(n?n:e)},l,l.exports,e,t,n,r)}return n[o].exports}var i=typeof require==\"function\"&&require;for(var o=0;o {\n this._eventRelay.trigger(\"change\", e, this);\n });\n this._varOnChangeSub = sub;\n }\n }\n\n /**\n * Combine the given `extraInfo` (if any) with the handle's default\n * `_extraInfo` (if any).\n * @private\n */\n _mergeExtraInfo(extraInfo) {\n return util.extend({},\n this._extraInfo ? this._extraInfo : null,\n extraInfo ? extraInfo : null);\n }\n\n /**\n * Close the handle. This clears this handle's contribution to the filter set,\n * and unsubscribes all event listeners.\n */\n close() {\n this._emitter.removeAllListeners();\n this.clear();\n this.setGroup(null);\n }\n\n /**\n * Clear this handle's contribution to the filter set.\n *\n * @param {Object} [extraInfo] - Extra properties to be included on the event\n * object that's passed to listeners (in addition to any options that were\n * passed into the `FilterHandle` constructor).\n * \n * @fires FilterHandle#change\n */\n clear(extraInfo) {\n if (!this._filterSet)\n return;\n this._filterSet.clear(this._id);\n this._onChange(extraInfo);\n }\n\n /**\n * Set this handle's contribution to the filter set. This array should consist\n * of the keys of the rows that _should_ be displayed; any keys that are not\n * present in the array will be considered _filtered out_. Note that multiple\n * `FilterHandle` instances in the group may each contribute an array of keys,\n * and only those keys that appear in _all_ of the arrays make it through the\n * filter.\n *\n * @param {string[]} keys - Empty array, or array of keys. 
To clear the\n * filter, don't pass an empty array; instead, use the\n * {@link FilterHandle#clear} method.\n * @param {Object} [extraInfo] - Extra properties to be included on the event\n * object that's passed to listeners (in addition to any options that were\n * passed into the `FilterHandle` constructor).\n * \n * @fires FilterHandle#change\n */\n set(keys, extraInfo) {\n if (!this._filterSet)\n return;\n this._filterSet.update(this._id, keys);\n this._onChange(extraInfo);\n }\n\n /**\n * @return {string[]|null} - Either: 1) an array of keys that made it through\n * all of the `FilterHandle` instances, or, 2) `null`, which means no filter\n * is being applied (all data should be displayed).\n */\n get filteredKeys() {\n return this._filterSet ? this._filterSet.value : null;\n }\n\n /**\n * Subscribe to events on this `FilterHandle`.\n *\n * @param {string} eventType - Indicates the type of events to listen to.\n * Currently, only `\"change\"` is supported.\n * @param {FilterHandle~listener} listener - The callback function that\n * will be invoked when the event occurs.\n * @return {string} - A token to pass to {@link FilterHandle#off} to cancel\n * this subscription.\n */\n on(eventType, listener) {\n return this._emitter.on(eventType, listener);\n }\n\n /**\n * Cancel event subscriptions created by {@link FilterHandle#on}.\n *\n * @param {string} eventType - The type of event to unsubscribe.\n * @param {string|FilterHandle~listener} listener - Either the callback\n * function previously passed into {@link FilterHandle#on}, or the\n * string that was returned from {@link FilterHandle#on}.\n */\n off(eventType, listener) {\n return this._emitter.off(eventType, listener);\n }\n\n _onChange(extraInfo) {\n if (!this._filterSet)\n return;\n this._filterVar.set(this._filterSet.value, this._mergeExtraInfo(extraInfo));\n }\n\n /**\n * @callback FilterHandle~listener\n * @param {Object} event - An object containing details of the event. 
For\n * `\"change\"` events, this includes the properties `value` (the new\n * value of the filter set, or `null` if no filter set is active),\n * `oldValue` (the previous value of the filter set), and `sender` (the\n * `FilterHandle` instance that made the change).\n */\n\n}\n\n/**\n * @event FilterHandle#change\n * @type {object}\n * @property {object} value - The new value of the filter set, or `null`\n * if no filter set is active.\n * @property {object} oldValue - The previous value of the filter set.\n * @property {FilterHandle} sender - The `FilterHandle` instance that\n * changed the value.\n */\n", + "import { diffSortedLists } from \"./util\";\n\nfunction naturalComparator(a, b) {\n if (a === b) {\n return 0;\n } else if (a < b) {\n return -1;\n } else if (a > b) {\n return 1;\n }\n}\n\n/**\n * @private\n */\nexport default class FilterSet {\n constructor() {\n this.reset();\n }\n\n reset() {\n // Key: handle ID, Value: array of selected keys, or null\n this._handles = {};\n // Key: key string, Value: count of handles that include it\n this._keys = {};\n this._value = null;\n this._activeHandles = 0;\n }\n\n get value() {\n return this._value;\n }\n\n update(handleId, keys) {\n if (keys !== null) {\n keys = keys.slice(0); // clone before sorting\n keys.sort(naturalComparator);\n }\n\n let {added, removed} = diffSortedLists(this._handles[handleId], keys);\n this._handles[handleId] = keys;\n\n for (let i = 0; i < added.length; i++) {\n this._keys[added[i]] = (this._keys[added[i]] || 0) + 1;\n }\n for (let i = 0; i < removed.length; i++) {\n this._keys[removed[i]]--;\n }\n\n this._updateValue(keys);\n }\n\n /**\n * @param {string[]} keys Sorted array of strings that indicate\n * a superset of possible keys.\n * @private\n */\n _updateValue(keys = this._allKeys) {\n let handleCount = Object.keys(this._handles).length;\n if (handleCount === 0) {\n this._value = null;\n } else {\n this._value = [];\n for (let i = 0; i < keys.length; i++) {\n let count = 
this._keys[keys[i]];\n if (count === handleCount) {\n this._value.push(keys[i]);\n }\n }\n }\n }\n\n clear(handleId) {\n if (typeof(this._handles[handleId]) === \"undefined\") {\n return;\n }\n\n let keys = this._handles[handleId];\n if (!keys) {\n keys = [];\n }\n\n for (let i = 0; i < keys.length; i++) {\n this._keys[keys[i]]--;\n }\n delete this._handles[handleId];\n\n this._updateValue();\n }\n\n get _allKeys() {\n let allKeys = Object.keys(this._keys);\n allKeys.sort(naturalComparator);\n return allKeys;\n }\n}\n", + "import Var from \"./var\";\n\n// Use a global so that multiple copies of crosstalk.js can be loaded and still\n// have groups behave as singletons across all copies.\nglobal.__crosstalk_groups = global.__crosstalk_groups || {};\nlet groups = global.__crosstalk_groups;\n\nexport default function group(groupName) {\n if (groupName && typeof(groupName) === \"string\") {\n if (!groups.hasOwnProperty(groupName)) {\n groups[groupName] = new Group(groupName);\n }\n return groups[groupName];\n } else if (typeof(groupName) === \"object\" && groupName._vars && groupName.var) {\n // Appears to already be a group object\n return groupName;\n } else if (Array.isArray(groupName) &&\n groupName.length == 1 &&\n typeof(groupName[0]) === \"string\") {\n return group(groupName[0]);\n } else {\n throw new Error(\"Invalid groupName argument\");\n }\n}\n\nclass Group {\n constructor(name) {\n this.name = name;\n this._vars = {};\n }\n\n var(name) {\n if (!name || typeof(name) !== \"string\") {\n throw new Error(\"Invalid var name\");\n }\n\n if (!this._vars.hasOwnProperty(name))\n this._vars[name] = new Var(this, name);\n return this._vars[name];\n }\n\n has(name) {\n if (!name || typeof(name) !== \"string\") {\n throw new Error(\"Invalid var name\");\n }\n\n return this._vars.hasOwnProperty(name);\n }\n}\n", + "import group from \"./group\";\nimport { SelectionHandle } from \"./selection\";\nimport { FilterHandle } from \"./filter\";\nimport { bind } from 
\"./input\";\nimport \"./input_selectize\";\nimport \"./input_checkboxgroup\";\nimport \"./input_slider\";\n\nconst defaultGroup = group(\"default\");\n\nfunction var_(name) {\n return defaultGroup.var(name);\n}\n\nfunction has(name) {\n return defaultGroup.has(name);\n}\n\nif (global.Shiny) {\n global.Shiny.addCustomMessageHandler(\"update-client-value\", function(message) {\n if (typeof(message.group) === \"string\") {\n group(message.group).var(message.name).set(message.value);\n } else {\n var_(message.name).set(message.value);\n }\n });\n}\n\nconst crosstalk = {\n group: group,\n var: var_,\n has: has,\n SelectionHandle: SelectionHandle,\n FilterHandle: FilterHandle,\n bind: bind\n};\n\n/**\n * @namespace crosstalk\n */\nexport default crosstalk;\nglobal.crosstalk = crosstalk;\n", + "let $ = global.jQuery;\n\nlet bindings = {};\n\nexport function register(reg) {\n bindings[reg.className] = reg;\n if (global.document && global.document.readyState !== \"complete\") {\n $(() => {\n bind();\n });\n } else if (global.document) {\n setTimeout(bind, 100);\n }\n}\n\nexport function bind() {\n Object.keys(bindings).forEach(function(className) {\n let binding = bindings[className];\n $(\".\" + binding.className).not(\".crosstalk-input-bound\").each(function(i, el) {\n bindInstance(binding, el);\n });\n });\n}\n\n// Escape jQuery identifier\nfunction $escape(val) {\n return val.replace(/([!\"#$%&'()*+,./:;<=>?@[\\\\\\]^`{|}~])/g, \"\\\\$1\");\n}\n\nfunction bindEl(el) {\n let $el = $(el);\n Object.keys(bindings).forEach(function(className) {\n if ($el.hasClass(className) && !$el.hasClass(\"crosstalk-input-bound\")) {\n let binding = bindings[className];\n bindInstance(binding, el);\n }\n });\n}\n\nfunction bindInstance(binding, el) {\n let jsonEl = $(el).find(\"script[type='application/json'][data-for='\" + $escape(el.id) + \"']\");\n let data = JSON.parse(jsonEl[0].innerText);\n\n let instance = binding.factory(el, data);\n $(el).data(\"crosstalk-instance\", 
instance);\n $(el).addClass(\"crosstalk-input-bound\");\n}\n\nif (global.Shiny) {\n let inputBinding = new global.Shiny.InputBinding();\n let $ = global.jQuery;\n $.extend(inputBinding, {\n find: function(scope) {\n return $(scope).find(\".crosstalk-input\");\n },\n initialize: function(el) {\n if (!$(el).hasClass(\"crosstalk-input-bound\")) {\n bindEl(el);\n }\n },\n getId: function(el) {\n return el.id;\n },\n getValue: function(el) {\n\n },\n setValue: function(el, value) {\n\n },\n receiveMessage: function(el, data) {\n\n },\n subscribe: function(el, callback) {\n $(el).data(\"crosstalk-instance\").resume();\n },\n unsubscribe: function(el) {\n $(el).data(\"crosstalk-instance\").suspend();\n }\n });\n global.Shiny.inputBindings.register(inputBinding, \"crosstalk.inputBinding\");\n}\n", + "import * as input from \"./input\";\nimport { FilterHandle } from \"./filter\";\n\nlet $ = global.jQuery;\n\ninput.register({\n className: \"crosstalk-input-checkboxgroup\",\n\n factory: function(el, data) {\n /*\n * map: {\"groupA\": [\"keyA\", \"keyB\", ...], ...}\n * group: \"ct-groupname\"\n */\n let ctHandle = new FilterHandle(data.group);\n\n let lastKnownKeys;\n let $el = $(el);\n $el.on(\"change\", \"input[type='checkbox']\", function() {\n let checked = $el.find(\"input[type='checkbox']:checked\");\n if (checked.length === 0) {\n lastKnownKeys = null;\n ctHandle.clear();\n } else {\n let keys = {};\n checked.each(function() {\n data.map[this.value].forEach(function(key) {\n keys[key] = true;\n });\n });\n let keyArray = Object.keys(keys);\n keyArray.sort();\n lastKnownKeys = keyArray;\n ctHandle.set(keyArray);\n }\n });\n\n return {\n suspend: function() {\n ctHandle.clear();\n },\n resume: function() {\n if (lastKnownKeys)\n ctHandle.set(lastKnownKeys);\n }\n };\n }\n});\n", + "import * as input from \"./input\";\nimport * as util from \"./util\";\nimport { FilterHandle } from \"./filter\";\n\nlet $ = global.jQuery;\n\ninput.register({\n className: 
\"crosstalk-input-select\",\n\n factory: function(el, data) {\n /*\n * items: {value: [...], label: [...]}\n * map: {\"groupA\": [\"keyA\", \"keyB\", ...], ...}\n * group: \"ct-groupname\"\n */\n\n let first = [{value: \"\", label: \"(All)\"}];\n let items = util.dataframeToD3(data.items);\n let opts = {\n options: first.concat(items),\n valueField: \"value\",\n labelField: \"label\",\n searchField: \"label\"\n };\n\n let select = $(el).find(\"select\")[0];\n\n let selectize = $(select).selectize(opts)[0].selectize;\n\n let ctHandle = new FilterHandle(data.group);\n\n let lastKnownKeys;\n selectize.on(\"change\", function() {\n if (selectize.items.length === 0) {\n lastKnownKeys = null;\n ctHandle.clear();\n } else {\n let keys = {};\n selectize.items.forEach(function(group) {\n data.map[group].forEach(function(key) {\n keys[key] = true;\n });\n });\n let keyArray = Object.keys(keys);\n keyArray.sort();\n lastKnownKeys = keyArray;\n ctHandle.set(keyArray);\n }\n });\n\n return {\n suspend: function() {\n ctHandle.clear();\n },\n resume: function() {\n if (lastKnownKeys)\n ctHandle.set(lastKnownKeys);\n }\n };\n }\n});\n", + "import * as input from \"./input\";\nimport { FilterHandle } from \"./filter\";\n\nlet $ = global.jQuery;\nlet strftime = global.strftime;\n\ninput.register({\n className: \"crosstalk-input-slider\",\n\n factory: function(el, data) {\n /*\n * map: {\"groupA\": [\"keyA\", \"keyB\", ...], ...}\n * group: \"ct-groupname\"\n */\n let ctHandle = new FilterHandle(data.group);\n\n let opts = {};\n let $el = $(el).find(\"input\");\n let dataType = $el.data(\"data-type\");\n let timeFormat = $el.data(\"time-format\");\n let round = $el.data(\"round\");\n let timeFormatter;\n\n // Set up formatting functions\n if (dataType === \"date\") {\n timeFormatter = strftime.utc();\n opts.prettify = function(num) {\n return timeFormatter(timeFormat, new Date(num));\n };\n\n } else if (dataType === \"datetime\") {\n let timezone = $el.data(\"timezone\");\n if 
(timezone)\n timeFormatter = strftime.timezone(timezone);\n else\n timeFormatter = strftime;\n\n opts.prettify = function(num) {\n return timeFormatter(timeFormat, new Date(num));\n };\n } else if (dataType === \"number\") {\n if (typeof round !== \"undefined\")\n opts.prettify = function(num) {\n let factor = Math.pow(10, round);\n return Math.round(num * factor) / factor;\n };\n }\n\n $el.ionRangeSlider(opts);\n\n function getValue() {\n let result = $el.data(\"ionRangeSlider\").result;\n\n // Function for converting numeric value from slider to appropriate type.\n let convert;\n let dataType = $el.data(\"data-type\");\n if (dataType === \"date\") {\n convert = function(val) {\n return formatDateUTC(new Date(+val));\n };\n } else if (dataType === \"datetime\") {\n convert = function(val) {\n // Convert ms to s\n return +val / 1000;\n };\n } else {\n convert = function(val) { return +val; };\n }\n\n if ($el.data(\"ionRangeSlider\").options.type === \"double\") {\n return [convert(result.from), convert(result.to)];\n } else {\n return convert(result.from);\n }\n }\n\n let lastKnownKeys = null;\n\n $el.on(\"change.crosstalkSliderInput\", function(event) {\n if (!$el.data(\"updating\") && !$el.data(\"animating\")) {\n let [from, to] = getValue();\n let keys = [];\n for (let i = 0; i < data.values.length; i++) {\n let val = data.values[i];\n if (val >= from && val <= to) {\n keys.push(data.keys[i]);\n }\n }\n keys.sort();\n ctHandle.set(keys);\n lastKnownKeys = keys;\n }\n });\n\n\n // let $el = $(el);\n // $el.on(\"change\", \"input[type=\"checkbox\"]\", function() {\n // let checked = $el.find(\"input[type=\"checkbox\"]:checked\");\n // if (checked.length === 0) {\n // ctHandle.clear();\n // } else {\n // let keys = {};\n // checked.each(function() {\n // data.map[this.value].forEach(function(key) {\n // keys[key] = true;\n // });\n // });\n // let keyArray = Object.keys(keys);\n // keyArray.sort();\n // ctHandle.set(keyArray);\n // }\n // });\n\n return {\n 
suspend: function() {\n ctHandle.clear();\n },\n resume: function() {\n if (lastKnownKeys)\n ctHandle.set(lastKnownKeys);\n }\n };\n }\n});\n\n\n// Convert a number to a string with leading zeros\nfunction padZeros(n, digits) {\n let str = n.toString();\n while (str.length < digits)\n str = \"0\" + str;\n return str;\n}\n\n// Given a Date object, return a string in yyyy-mm-dd format, using the\n// UTC date. This may be a day off from the date in the local time zone.\nfunction formatDateUTC(date) {\n if (date instanceof Date) {\n return date.getUTCFullYear() + \"-\" +\n padZeros(date.getUTCMonth()+1, 2) + \"-\" +\n padZeros(date.getUTCDate(), 2);\n\n } else {\n return null;\n }\n}\n", + "import Events from \"./events\";\nimport grp from \"./group\";\nimport * as util from \"./util\";\n\n/**\n * Use this class to read and write (and listen for changes to) the selection\n * for a Crosstalk group. This is intended to be used for linked brushing.\n *\n * If two (or more) `SelectionHandle` instances in the same webpage share the\n * same group name, they will share the same state. Setting the selection using\n * one `SelectionHandle` instance will result in the `value` property instantly\n * changing across the others, and `\"change\"` event listeners on all instances\n * (including the one that initiated the sending) will fire.\n *\n * @param {string} [group] - The name of the Crosstalk group, or if none,\n * null or undefined (or any other falsy value). This can be changed later\n * via the [SelectionHandle#setGroup](#setGroup) method.\n * @param {Object} [extraInfo] - An object whose properties will be copied to\n * the event object whenever an event is emitted.\n */\nexport class SelectionHandle {\n\n constructor(group = null, extraInfo = null) {\n this._eventRelay = new Events();\n this._emitter = new util.SubscriptionTracker(this._eventRelay);\n\n // Name of the group we're currently tracking, if any. 
Can change over time.\n this._group = null;\n // The Var we're currently tracking, if any. Can change over time.\n this._var = null;\n // The event handler subscription we currently have on var.on(\"change\").\n this._varOnChangeSub = null;\n\n this._extraInfo = util.extend({ sender: this }, extraInfo);\n\n this.setGroup(group);\n }\n\n /**\n * Changes the Crosstalk group membership of this SelectionHandle. The group\n * being switched away from (if any) will not have its selection value\n * modified as a result of calling `setGroup`, even if this handle was the\n * most recent handle to set the selection of the group.\n *\n * The group being switched to (if any) will also not have its selection value\n * modified as a result of calling `setGroup`. If you want to set the\n * selection value of the new group, call `set` explicitly.\n *\n * @param {string} group - The name of the Crosstalk group, or null (or\n * undefined) to clear the group.\n */\n setGroup(group) {\n // If group is unchanged, do nothing\n if (this._group === group)\n return;\n // Treat null, undefined, and other falsy values the same\n if (!this._group && !group)\n return;\n\n if (this._var) {\n this._var.off(\"change\", this._varOnChangeSub);\n this._var = null;\n this._varOnChangeSub = null;\n }\n\n this._group = group;\n\n if (group) {\n this._var = grp(group).var(\"selection\");\n let sub = this._var.on(\"change\", (e) => {\n this._eventRelay.trigger(\"change\", e, this);\n });\n this._varOnChangeSub = sub;\n }\n }\n\n /**\n * Retrieves the current selection for the group represented by this\n * `SelectionHandle`.\n *\n * - If no selection is active, then this value will be falsy.\n * - If a selection is active, but no data points are selected, then this\n * value will be an empty array.\n * - If a selection is active, and data points are selected, then the keys\n * of the selected data points will be present in the array.\n */\n get value() {\n return this._var ? 
this._var.get() : null;\n }\n\n /**\n * Combines the given `extraInfo` (if any) with the handle's default\n * `_extraInfo` (if any).\n * @private\n */\n _mergeExtraInfo(extraInfo) {\n // Important incidental effect: shallow clone is returned\n return util.extend({},\n this._extraInfo ? this._extraInfo : null,\n extraInfo ? extraInfo : null);\n }\n\n /**\n * Overwrites the current selection for the group, and raises the `\"change\"`\n * event among all of the group's '`SelectionHandle` instances (including\n * this one).\n *\n * @fires SelectionHandle#change\n * @param {string[]} selectedKeys - Falsy, empty array, or array of keys (see\n * {@link SelectionHandle#value}).\n * @param {Object} [extraInfo] - Extra properties to be included on the event\n * object that's passed to listeners (in addition to any options that were\n * passed into the `SelectionHandle` constructor).\n */\n set(selectedKeys, extraInfo) {\n if (this._var)\n this._var.set(selectedKeys, this._mergeExtraInfo(extraInfo));\n }\n\n /**\n * Overwrites the current selection for the group, and raises the `\"change\"`\n * event among all of the group's '`SelectionHandle` instances (including\n * this one).\n *\n * @fires SelectionHandle#change\n * @param {Object} [extraInfo] - Extra properties to be included on the event\n * object that's passed to listeners (in addition to any that were passed\n * into the `SelectionHandle` constructor).\n */\n clear(extraInfo) {\n if (this._var)\n this.set(void 0, this._mergeExtraInfo(extraInfo));\n }\n\n /**\n * Subscribes to events on this `SelectionHandle`.\n *\n * @param {string} eventType - Indicates the type of events to listen to.\n * Currently, only `\"change\"` is supported.\n * @param {SelectionHandle~listener} listener - The callback function that\n * will be invoked when the event occurs.\n * @return {string} - A token to pass to {@link SelectionHandle#off} to cancel\n * this subscription.\n */\n on(eventType, listener) {\n return 
this._emitter.on(eventType, listener);\n }\n\n /**\n * Cancels event subscriptions created by {@link SelectionHandle#on}.\n *\n * @param {string} eventType - The type of event to unsubscribe.\n * @param {string|SelectionHandle~listener} listener - Either the callback\n * function previously passed into {@link SelectionHandle#on}, or the\n * string that was returned from {@link SelectionHandle#on}.\n */\n off(eventType, listener) {\n return this._emitter.off(eventType, listener);\n }\n\n /**\n * Shuts down the `SelectionHandle` object.\n *\n * Removes all event listeners that were added through this handle.\n */\n close() {\n this._emitter.removeAllListeners();\n this.setGroup(null);\n }\n}\n\n/**\n * @callback SelectionHandle~listener\n * @param {Object} event - An object containing details of the event. For\n * `\"change\"` events, this includes the properties `value` (the new\n * value of the selection, or `undefined` if no selection is active),\n * `oldValue` (the previous value of the selection), and `sender` (the\n * `SelectionHandle` instance that made the change).\n */\n\n/**\n * @event SelectionHandle#change\n * @type {object}\n * @property {object} value - The new value of the selection, or `undefined`\n * if no selection is active.\n * @property {object} oldValue - The previous value of the selection.\n * @property {SelectionHandle} sender - The `SelectionHandle` instance that\n * changed the value.\n */\n", + "export function extend(target, ...sources) {\n for (let i = 0; i < sources.length; i++) {\n let src = sources[i];\n if (typeof(src) === \"undefined\" || src === null)\n continue;\n\n for (let key in src) {\n if (src.hasOwnProperty(key)) {\n target[key] = src[key];\n }\n }\n }\n return target;\n}\n\nexport function checkSorted(list) {\n for (let i = 1; i < list.length; i++) {\n if (list[i] <= list[i-1]) {\n throw new Error(\"List is not sorted or contains duplicate\");\n }\n }\n}\n\nexport function diffSortedLists(a, b) {\n let i_a = 0;\n let i_b = 
0;\n\n if (!a) a = [];\n if (!b) b = [];\n\n let a_only = [];\n let b_only = [];\n\n checkSorted(a);\n checkSorted(b);\n\n while (i_a < a.length && i_b < b.length) {\n if (a[i_a] === b[i_b]) {\n i_a++;\n i_b++;\n } else if (a[i_a] < b[i_b]) {\n a_only.push(a[i_a++]);\n } else {\n b_only.push(b[i_b++]);\n }\n }\n\n if (i_a < a.length)\n a_only = a_only.concat(a.slice(i_a));\n if (i_b < b.length)\n b_only = b_only.concat(b.slice(i_b));\n return {\n removed: a_only,\n added: b_only\n };\n}\n\n// Convert from wide: { colA: [1,2,3], colB: [4,5,6], ... }\n// to long: [ {colA: 1, colB: 4}, {colA: 2, colB: 5}, ... ]\nexport function dataframeToD3(df) {\n let names = [];\n let length;\n for (let name in df) {\n if (df.hasOwnProperty(name))\n names.push(name);\n if (typeof(df[name]) !== \"object\" || typeof(df[name].length) === \"undefined\") {\n throw new Error(\"All fields must be arrays\");\n } else if (typeof(length) !== \"undefined\" && length !== df[name].length) {\n throw new Error(\"All fields must be arrays of the same length\");\n }\n length = df[name].length;\n }\n let results = [];\n let item;\n for (let row = 0; row < length; row++) {\n item = {};\n for (let col = 0; col < names.length; col++) {\n item[names[col]] = df[names[col]][row];\n }\n results.push(item);\n }\n return results;\n}\n\n/**\n * Keeps track of all event listener additions/removals and lets all active\n * listeners be removed with a single operation.\n *\n * @private\n */\nexport class SubscriptionTracker {\n constructor(emitter) {\n this._emitter = emitter;\n this._subs = {};\n }\n\n on(eventType, listener) {\n let sub = this._emitter.on(eventType, listener);\n this._subs[sub] = eventType;\n return sub;\n }\n\n off(eventType, listener) {\n let sub = this._emitter.off(eventType, listener);\n if (sub) {\n delete this._subs[sub];\n }\n return sub;\n }\n\n removeAllListeners() {\n let current_subs = this._subs;\n this._subs = {};\n Object.keys(current_subs).forEach((sub) => {\n 
this._emitter.off(current_subs[sub], sub);\n });\n }\n}\n", + "import Events from \"./events\";\n\nexport default class Var {\n constructor(group, name, /*optional*/ value) {\n this._group = group;\n this._name = name;\n this._value = value;\n this._events = new Events();\n }\n\n get() {\n return this._value;\n }\n\n set(value, /*optional*/ event) {\n if (this._value === value) {\n // Do nothing; the value hasn't changed\n return;\n }\n let oldValue = this._value;\n this._value = value;\n // Alert JavaScript listeners that the value has changed\n let evt = {};\n if (event && typeof(event) === \"object\") {\n for (let k in event) {\n if (event.hasOwnProperty(k))\n evt[k] = event[k];\n }\n }\n evt.oldValue = oldValue;\n evt.value = value;\n this._events.trigger(\"change\", evt, this);\n\n // TODO: Make this extensible, to let arbitrary back-ends know that\n // something has changed\n if (global.Shiny && global.Shiny.onInputChange) {\n global.Shiny.onInputChange(\n \".clientValue-\" +\n (this._group.name !== null ? this._group.name + \"-\" : \"\") +\n this._name,\n typeof(value) === \"undefined\" ? 
null : value\n );\n }\n }\n\n on(eventType, listener) {\n return this._events.on(eventType, listener);\n }\n\n off(eventType, listener) {\n return this._events.off(eventType, listener);\n }\n}\n" + ] +} \ No newline at end of file diff --git a/_freeze/site_libs/crosstalk-1.2.0/js/crosstalk.min.js b/_freeze/site_libs/crosstalk-1.2.0/js/crosstalk.min.js new file mode 100644 index 0000000..b7ec0ac --- /dev/null +++ b/_freeze/site_libs/crosstalk-1.2.0/js/crosstalk.min.js @@ -0,0 +1,2 @@ +!function o(u,a,l){function s(n,e){if(!a[n]){if(!u[n]){var t="function"==typeof require&&require;if(!e&&t)return t(n,!0);if(f)return f(n,!0);var r=new Error("Cannot find module '"+n+"'");throw r.code="MODULE_NOT_FOUND",r}var i=a[n]={exports:{}};u[n][0].call(i.exports,function(e){var t=u[n][1][e];return s(t||e)},i,i.exports,o,u,a,l)}return a[n].exports}for(var f="function"==typeof require&&require,e=0;e?@[\\\]^`{|}~])/g,"\\$1")+"']"),r=JSON.parse(n[0].innerText),i=e.factory(t,r);o(t).data("crosstalk-instance",i),o(t).addClass("crosstalk-input-bound")}if(t.Shiny){var e=new t.Shiny.InputBinding,u=t.jQuery;u.extend(e,{find:function(e){return u(e).find(".crosstalk-input")},initialize:function(e){var t,n;u(e).hasClass("crosstalk-input-bound")||(n=o(t=e),Object.keys(r).forEach(function(e){n.hasClass(e)&&!n.hasClass("crosstalk-input-bound")&&i(r[e],t)}))},getId:function(e){return e.id},getValue:function(e){},setValue:function(e,t){},receiveMessage:function(e,t){},subscribe:function(e,t){u(e).data("crosstalk-instance").resume()},unsubscribe:function(e){u(e).data("crosstalk-instance").suspend()}}),t.Shiny.inputBindings.register(e,"crosstalk.inputBinding")}}).call(this,"undefined"!=typeof global?global:"undefined"!=typeof self?self:"undefined"!=typeof window?window:{})},{}],7:[function(r,e,t){(function(e){"use strict";var t=function(e){{if(e&&e.__esModule)return e;var t={};if(null!=e)for(var n in e)Object.prototype.hasOwnProperty.call(e,n)&&(t[n]=e[n]);return 
t.default=e,t}}(r("./input")),n=r("./filter");var a=e.jQuery;t.register({className:"crosstalk-input-checkboxgroup",factory:function(e,r){var i=new n.FilterHandle(r.group),o=void 0,u=a(e);return u.on("change","input[type='checkbox']",function(){var e=u.find("input[type='checkbox']:checked");if(0===e.length)o=null,i.clear();else{var t={};e.each(function(){r.map[this.value].forEach(function(e){t[e]=!0})});var n=Object.keys(t);n.sort(),o=n,i.set(n)}}),{suspend:function(){i.clear()},resume:function(){o&&i.set(o)}}}})}).call(this,"undefined"!=typeof global?global:"undefined"!=typeof self?self:"undefined"!=typeof window?window:{})},{"./filter":2,"./input":6}],8:[function(r,e,t){(function(e){"use strict";var t=n(r("./input")),l=n(r("./util")),s=r("./filter");function n(e){if(e&&e.__esModule)return e;var t={};if(null!=e)for(var n in e)Object.prototype.hasOwnProperty.call(e,n)&&(t[n]=e[n]);return t.default=e,t}var f=e.jQuery;t.register({className:"crosstalk-input-select",factory:function(e,n){var t=l.dataframeToD3(n.items),r={options:[{value:"",label:"(All)"}].concat(t),valueField:"value",labelField:"label",searchField:"label"},i=f(e).find("select")[0],o=f(i).selectize(r)[0].selectize,u=new s.FilterHandle(n.group),a=void 0;return o.on("change",function(){if(0===o.items.length)a=null,u.clear();else{var t={};o.items.forEach(function(e){n.map[e].forEach(function(e){t[e]=!0})});var e=Object.keys(t);e.sort(),a=e,u.set(e)}}),{suspend:function(){u.clear()},resume:function(){a&&u.set(a)}}}})}).call(this,"undefined"!=typeof global?global:"undefined"!=typeof self?self:"undefined"!=typeof window?window:{})},{"./filter":2,"./input":6,"./util":11}],9:[function(n,e,t){(function(e){"use strict";var d=function(e,t){if(Array.isArray(e))return e;if(Symbol.iterator in Object(e))return function(e,t){var n=[],r=!0,i=!1,o=void 0;try{for(var 
u,a=e[Symbol.iterator]();!(r=(u=a.next()).done)&&(n.push(u.value),!t||n.length!==t);r=!0);}catch(e){i=!0,o=e}finally{try{!r&&a.return&&a.return()}finally{if(i)throw o}}return n}(e,t);throw new TypeError("Invalid attempt to destructure non-iterable instance")},t=function(e){{if(e&&e.__esModule)return e;var t={};if(null!=e)for(var n in e)Object.prototype.hasOwnProperty.call(e,n)&&(t[n]=e[n]);return t.default=e,t}}(n("./input")),a=n("./filter");var v=e.jQuery,p=e.strftime;function y(e,t){for(var n=e.toString();n.length {\n this._eventRelay.trigger(\"change\", e, this);\n });\n this._varOnChangeSub = sub;\n }\n }\n\n /**\n * Combine the given `extraInfo` (if any) with the handle's default\n * `_extraInfo` (if any).\n * @private\n */\n _mergeExtraInfo(extraInfo) {\n return util.extend({},\n this._extraInfo ? this._extraInfo : null,\n extraInfo ? extraInfo : null);\n }\n\n /**\n * Close the handle. This clears this handle's contribution to the filter set,\n * and unsubscribes all event listeners.\n */\n close() {\n this._emitter.removeAllListeners();\n this.clear();\n this.setGroup(null);\n }\n\n /**\n * Clear this handle's contribution to the filter set.\n *\n * @param {Object} [extraInfo] - Extra properties to be included on the event\n * object that's passed to listeners (in addition to any options that were\n * passed into the `FilterHandle` constructor).\n * \n * @fires FilterHandle#change\n */\n clear(extraInfo) {\n if (!this._filterSet)\n return;\n this._filterSet.clear(this._id);\n this._onChange(extraInfo);\n }\n\n /**\n * Set this handle's contribution to the filter set. This array should consist\n * of the keys of the rows that _should_ be displayed; any keys that are not\n * present in the array will be considered _filtered out_. 
Note that multiple\n * `FilterHandle` instances in the group may each contribute an array of keys,\n * and only those keys that appear in _all_ of the arrays make it through the\n * filter.\n *\n * @param {string[]} keys - Empty array, or array of keys. To clear the\n * filter, don't pass an empty array; instead, use the\n * {@link FilterHandle#clear} method.\n * @param {Object} [extraInfo] - Extra properties to be included on the event\n * object that's passed to listeners (in addition to any options that were\n * passed into the `FilterHandle` constructor).\n * \n * @fires FilterHandle#change\n */\n set(keys, extraInfo) {\n if (!this._filterSet)\n return;\n this._filterSet.update(this._id, keys);\n this._onChange(extraInfo);\n }\n\n /**\n * @return {string[]|null} - Either: 1) an array of keys that made it through\n * all of the `FilterHandle` instances, or, 2) `null`, which means no filter\n * is being applied (all data should be displayed).\n */\n get filteredKeys() {\n return this._filterSet ? 
this._filterSet.value : null;\n }\n\n /**\n * Subscribe to events on this `FilterHandle`.\n *\n * @param {string} eventType - Indicates the type of events to listen to.\n * Currently, only `\"change\"` is supported.\n * @param {FilterHandle~listener} listener - The callback function that\n * will be invoked when the event occurs.\n * @return {string} - A token to pass to {@link FilterHandle#off} to cancel\n * this subscription.\n */\n on(eventType, listener) {\n return this._emitter.on(eventType, listener);\n }\n\n /**\n * Cancel event subscriptions created by {@link FilterHandle#on}.\n *\n * @param {string} eventType - The type of event to unsubscribe.\n * @param {string|FilterHandle~listener} listener - Either the callback\n * function previously passed into {@link FilterHandle#on}, or the\n * string that was returned from {@link FilterHandle#on}.\n */\n off(eventType, listener) {\n return this._emitter.off(eventType, listener);\n }\n\n _onChange(extraInfo) {\n if (!this._filterSet)\n return;\n this._filterVar.set(this._filterSet.value, this._mergeExtraInfo(extraInfo));\n }\n\n /**\n * @callback FilterHandle~listener\n * @param {Object} event - An object containing details of the event. 
For\n * `\"change\"` events, this includes the properties `value` (the new\n * value of the filter set, or `null` if no filter set is active),\n * `oldValue` (the previous value of the filter set), and `sender` (the\n * `FilterHandle` instance that made the change).\n */\n\n}\n\n/**\n * @event FilterHandle#change\n * @type {object}\n * @property {object} value - The new value of the filter set, or `null`\n * if no filter set is active.\n * @property {object} oldValue - The previous value of the filter set.\n * @property {FilterHandle} sender - The `FilterHandle` instance that\n * changed the value.\n */\n","import { diffSortedLists } from \"./util\";\n\nfunction naturalComparator(a, b) {\n if (a === b) {\n return 0;\n } else if (a < b) {\n return -1;\n } else if (a > b) {\n return 1;\n }\n}\n\n/**\n * @private\n */\nexport default class FilterSet {\n constructor() {\n this.reset();\n }\n\n reset() {\n // Key: handle ID, Value: array of selected keys, or null\n this._handles = {};\n // Key: key string, Value: count of handles that include it\n this._keys = {};\n this._value = null;\n this._activeHandles = 0;\n }\n\n get value() {\n return this._value;\n }\n\n update(handleId, keys) {\n if (keys !== null) {\n keys = keys.slice(0); // clone before sorting\n keys.sort(naturalComparator);\n }\n\n let {added, removed} = diffSortedLists(this._handles[handleId], keys);\n this._handles[handleId] = keys;\n\n for (let i = 0; i < added.length; i++) {\n this._keys[added[i]] = (this._keys[added[i]] || 0) + 1;\n }\n for (let i = 0; i < removed.length; i++) {\n this._keys[removed[i]]--;\n }\n\n this._updateValue(keys);\n }\n\n /**\n * @param {string[]} keys Sorted array of strings that indicate\n * a superset of possible keys.\n * @private\n */\n _updateValue(keys = this._allKeys) {\n let handleCount = Object.keys(this._handles).length;\n if (handleCount === 0) {\n this._value = null;\n } else {\n this._value = [];\n for (let i = 0; i < keys.length; i++) {\n let count = 
this._keys[keys[i]];\n if (count === handleCount) {\n this._value.push(keys[i]);\n }\n }\n }\n }\n\n clear(handleId) {\n if (typeof(this._handles[handleId]) === \"undefined\") {\n return;\n }\n\n let keys = this._handles[handleId];\n if (!keys) {\n keys = [];\n }\n\n for (let i = 0; i < keys.length; i++) {\n this._keys[keys[i]]--;\n }\n delete this._handles[handleId];\n\n this._updateValue();\n }\n\n get _allKeys() {\n let allKeys = Object.keys(this._keys);\n allKeys.sort(naturalComparator);\n return allKeys;\n }\n}\n","import Var from \"./var\";\n\n// Use a global so that multiple copies of crosstalk.js can be loaded and still\n// have groups behave as singletons across all copies.\nglobal.__crosstalk_groups = global.__crosstalk_groups || {};\nlet groups = global.__crosstalk_groups;\n\nexport default function group(groupName) {\n if (groupName && typeof(groupName) === \"string\") {\n if (!groups.hasOwnProperty(groupName)) {\n groups[groupName] = new Group(groupName);\n }\n return groups[groupName];\n } else if (typeof(groupName) === \"object\" && groupName._vars && groupName.var) {\n // Appears to already be a group object\n return groupName;\n } else if (Array.isArray(groupName) &&\n groupName.length == 1 &&\n typeof(groupName[0]) === \"string\") {\n return group(groupName[0]);\n } else {\n throw new Error(\"Invalid groupName argument\");\n }\n}\n\nclass Group {\n constructor(name) {\n this.name = name;\n this._vars = {};\n }\n\n var(name) {\n if (!name || typeof(name) !== \"string\") {\n throw new Error(\"Invalid var name\");\n }\n\n if (!this._vars.hasOwnProperty(name))\n this._vars[name] = new Var(this, name);\n return this._vars[name];\n }\n\n has(name) {\n if (!name || typeof(name) !== \"string\") {\n throw new Error(\"Invalid var name\");\n }\n\n return this._vars.hasOwnProperty(name);\n }\n}\n","import group from \"./group\";\nimport { SelectionHandle } from \"./selection\";\nimport { FilterHandle } from \"./filter\";\nimport { bind } from 
\"./input\";\nimport \"./input_selectize\";\nimport \"./input_checkboxgroup\";\nimport \"./input_slider\";\n\nconst defaultGroup = group(\"default\");\n\nfunction var_(name) {\n return defaultGroup.var(name);\n}\n\nfunction has(name) {\n return defaultGroup.has(name);\n}\n\nif (global.Shiny) {\n global.Shiny.addCustomMessageHandler(\"update-client-value\", function(message) {\n if (typeof(message.group) === \"string\") {\n group(message.group).var(message.name).set(message.value);\n } else {\n var_(message.name).set(message.value);\n }\n });\n}\n\nconst crosstalk = {\n group: group,\n var: var_,\n has: has,\n SelectionHandle: SelectionHandle,\n FilterHandle: FilterHandle,\n bind: bind\n};\n\n/**\n * @namespace crosstalk\n */\nexport default crosstalk;\nglobal.crosstalk = crosstalk;\n","let $ = global.jQuery;\n\nlet bindings = {};\n\nexport function register(reg) {\n bindings[reg.className] = reg;\n if (global.document && global.document.readyState !== \"complete\") {\n $(() => {\n bind();\n });\n } else if (global.document) {\n setTimeout(bind, 100);\n }\n}\n\nexport function bind() {\n Object.keys(bindings).forEach(function(className) {\n let binding = bindings[className];\n $(\".\" + binding.className).not(\".crosstalk-input-bound\").each(function(i, el) {\n bindInstance(binding, el);\n });\n });\n}\n\n// Escape jQuery identifier\nfunction $escape(val) {\n return val.replace(/([!\"#$%&'()*+,./:;<=>?@[\\\\\\]^`{|}~])/g, \"\\\\$1\");\n}\n\nfunction bindEl(el) {\n let $el = $(el);\n Object.keys(bindings).forEach(function(className) {\n if ($el.hasClass(className) && !$el.hasClass(\"crosstalk-input-bound\")) {\n let binding = bindings[className];\n bindInstance(binding, el);\n }\n });\n}\n\nfunction bindInstance(binding, el) {\n let jsonEl = $(el).find(\"script[type='application/json'][data-for='\" + $escape(el.id) + \"']\");\n let data = JSON.parse(jsonEl[0].innerText);\n\n let instance = binding.factory(el, data);\n $(el).data(\"crosstalk-instance\", instance);\n 
$(el).addClass(\"crosstalk-input-bound\");\n}\n\nif (global.Shiny) {\n let inputBinding = new global.Shiny.InputBinding();\n let $ = global.jQuery;\n $.extend(inputBinding, {\n find: function(scope) {\n return $(scope).find(\".crosstalk-input\");\n },\n initialize: function(el) {\n if (!$(el).hasClass(\"crosstalk-input-bound\")) {\n bindEl(el);\n }\n },\n getId: function(el) {\n return el.id;\n },\n getValue: function(el) {\n\n },\n setValue: function(el, value) {\n\n },\n receiveMessage: function(el, data) {\n\n },\n subscribe: function(el, callback) {\n $(el).data(\"crosstalk-instance\").resume();\n },\n unsubscribe: function(el) {\n $(el).data(\"crosstalk-instance\").suspend();\n }\n });\n global.Shiny.inputBindings.register(inputBinding, \"crosstalk.inputBinding\");\n}\n","import * as input from \"./input\";\nimport { FilterHandle } from \"./filter\";\n\nlet $ = global.jQuery;\n\ninput.register({\n className: \"crosstalk-input-checkboxgroup\",\n\n factory: function(el, data) {\n /*\n * map: {\"groupA\": [\"keyA\", \"keyB\", ...], ...}\n * group: \"ct-groupname\"\n */\n let ctHandle = new FilterHandle(data.group);\n\n let lastKnownKeys;\n let $el = $(el);\n $el.on(\"change\", \"input[type='checkbox']\", function() {\n let checked = $el.find(\"input[type='checkbox']:checked\");\n if (checked.length === 0) {\n lastKnownKeys = null;\n ctHandle.clear();\n } else {\n let keys = {};\n checked.each(function() {\n data.map[this.value].forEach(function(key) {\n keys[key] = true;\n });\n });\n let keyArray = Object.keys(keys);\n keyArray.sort();\n lastKnownKeys = keyArray;\n ctHandle.set(keyArray);\n }\n });\n\n return {\n suspend: function() {\n ctHandle.clear();\n },\n resume: function() {\n if (lastKnownKeys)\n ctHandle.set(lastKnownKeys);\n }\n };\n }\n});\n","import * as input from \"./input\";\nimport * as util from \"./util\";\nimport { FilterHandle } from \"./filter\";\n\nlet $ = global.jQuery;\n\ninput.register({\n className: \"crosstalk-input-select\",\n\n 
factory: function(el, data) {\n /*\n * items: {value: [...], label: [...]}\n * map: {\"groupA\": [\"keyA\", \"keyB\", ...], ...}\n * group: \"ct-groupname\"\n */\n\n let first = [{value: \"\", label: \"(All)\"}];\n let items = util.dataframeToD3(data.items);\n let opts = {\n options: first.concat(items),\n valueField: \"value\",\n labelField: \"label\",\n searchField: \"label\"\n };\n\n let select = $(el).find(\"select\")[0];\n\n let selectize = $(select).selectize(opts)[0].selectize;\n\n let ctHandle = new FilterHandle(data.group);\n\n let lastKnownKeys;\n selectize.on(\"change\", function() {\n if (selectize.items.length === 0) {\n lastKnownKeys = null;\n ctHandle.clear();\n } else {\n let keys = {};\n selectize.items.forEach(function(group) {\n data.map[group].forEach(function(key) {\n keys[key] = true;\n });\n });\n let keyArray = Object.keys(keys);\n keyArray.sort();\n lastKnownKeys = keyArray;\n ctHandle.set(keyArray);\n }\n });\n\n return {\n suspend: function() {\n ctHandle.clear();\n },\n resume: function() {\n if (lastKnownKeys)\n ctHandle.set(lastKnownKeys);\n }\n };\n }\n});\n","import * as input from \"./input\";\nimport { FilterHandle } from \"./filter\";\n\nlet $ = global.jQuery;\nlet strftime = global.strftime;\n\ninput.register({\n className: \"crosstalk-input-slider\",\n\n factory: function(el, data) {\n /*\n * map: {\"groupA\": [\"keyA\", \"keyB\", ...], ...}\n * group: \"ct-groupname\"\n */\n let ctHandle = new FilterHandle(data.group);\n\n let opts = {};\n let $el = $(el).find(\"input\");\n let dataType = $el.data(\"data-type\");\n let timeFormat = $el.data(\"time-format\");\n let round = $el.data(\"round\");\n let timeFormatter;\n\n // Set up formatting functions\n if (dataType === \"date\") {\n timeFormatter = strftime.utc();\n opts.prettify = function(num) {\n return timeFormatter(timeFormat, new Date(num));\n };\n\n } else if (dataType === \"datetime\") {\n let timezone = $el.data(\"timezone\");\n if (timezone)\n timeFormatter = 
strftime.timezone(timezone);\n else\n timeFormatter = strftime;\n\n opts.prettify = function(num) {\n return timeFormatter(timeFormat, new Date(num));\n };\n } else if (dataType === \"number\") {\n if (typeof round !== \"undefined\")\n opts.prettify = function(num) {\n let factor = Math.pow(10, round);\n return Math.round(num * factor) / factor;\n };\n }\n\n $el.ionRangeSlider(opts);\n\n function getValue() {\n let result = $el.data(\"ionRangeSlider\").result;\n\n // Function for converting numeric value from slider to appropriate type.\n let convert;\n let dataType = $el.data(\"data-type\");\n if (dataType === \"date\") {\n convert = function(val) {\n return formatDateUTC(new Date(+val));\n };\n } else if (dataType === \"datetime\") {\n convert = function(val) {\n // Convert ms to s\n return +val / 1000;\n };\n } else {\n convert = function(val) { return +val; };\n }\n\n if ($el.data(\"ionRangeSlider\").options.type === \"double\") {\n return [convert(result.from), convert(result.to)];\n } else {\n return convert(result.from);\n }\n }\n\n let lastKnownKeys = null;\n\n $el.on(\"change.crosstalkSliderInput\", function(event) {\n if (!$el.data(\"updating\") && !$el.data(\"animating\")) {\n let [from, to] = getValue();\n let keys = [];\n for (let i = 0; i < data.values.length; i++) {\n let val = data.values[i];\n if (val >= from && val <= to) {\n keys.push(data.keys[i]);\n }\n }\n keys.sort();\n ctHandle.set(keys);\n lastKnownKeys = keys;\n }\n });\n\n\n // let $el = $(el);\n // $el.on(\"change\", \"input[type=\"checkbox\"]\", function() {\n // let checked = $el.find(\"input[type=\"checkbox\"]:checked\");\n // if (checked.length === 0) {\n // ctHandle.clear();\n // } else {\n // let keys = {};\n // checked.each(function() {\n // data.map[this.value].forEach(function(key) {\n // keys[key] = true;\n // });\n // });\n // let keyArray = Object.keys(keys);\n // keyArray.sort();\n // ctHandle.set(keyArray);\n // }\n // });\n\n return {\n suspend: function() {\n 
ctHandle.clear();\n },\n resume: function() {\n if (lastKnownKeys)\n ctHandle.set(lastKnownKeys);\n }\n };\n }\n});\n\n\n// Convert a number to a string with leading zeros\nfunction padZeros(n, digits) {\n let str = n.toString();\n while (str.length < digits)\n str = \"0\" + str;\n return str;\n}\n\n// Given a Date object, return a string in yyyy-mm-dd format, using the\n// UTC date. This may be a day off from the date in the local time zone.\nfunction formatDateUTC(date) {\n if (date instanceof Date) {\n return date.getUTCFullYear() + \"-\" +\n padZeros(date.getUTCMonth()+1, 2) + \"-\" +\n padZeros(date.getUTCDate(), 2);\n\n } else {\n return null;\n }\n}\n","import Events from \"./events\";\nimport grp from \"./group\";\nimport * as util from \"./util\";\n\n/**\n * Use this class to read and write (and listen for changes to) the selection\n * for a Crosstalk group. This is intended to be used for linked brushing.\n *\n * If two (or more) `SelectionHandle` instances in the same webpage share the\n * same group name, they will share the same state. Setting the selection using\n * one `SelectionHandle` instance will result in the `value` property instantly\n * changing across the others, and `\"change\"` event listeners on all instances\n * (including the one that initiated the sending) will fire.\n *\n * @param {string} [group] - The name of the Crosstalk group, or if none,\n * null or undefined (or any other falsy value). This can be changed later\n * via the [SelectionHandle#setGroup](#setGroup) method.\n * @param {Object} [extraInfo] - An object whose properties will be copied to\n * the event object whenever an event is emitted.\n */\nexport class SelectionHandle {\n\n constructor(group = null, extraInfo = null) {\n this._eventRelay = new Events();\n this._emitter = new util.SubscriptionTracker(this._eventRelay);\n\n // Name of the group we're currently tracking, if any. Can change over time.\n this._group = null;\n // The Var we're currently tracking, if any. 
Can change over time.\n this._var = null;\n // The event handler subscription we currently have on var.on(\"change\").\n this._varOnChangeSub = null;\n\n this._extraInfo = util.extend({ sender: this }, extraInfo);\n\n this.setGroup(group);\n }\n\n /**\n * Changes the Crosstalk group membership of this SelectionHandle. The group\n * being switched away from (if any) will not have its selection value\n * modified as a result of calling `setGroup`, even if this handle was the\n * most recent handle to set the selection of the group.\n *\n * The group being switched to (if any) will also not have its selection value\n * modified as a result of calling `setGroup`. If you want to set the\n * selection value of the new group, call `set` explicitly.\n *\n * @param {string} group - The name of the Crosstalk group, or null (or\n * undefined) to clear the group.\n */\n setGroup(group) {\n // If group is unchanged, do nothing\n if (this._group === group)\n return;\n // Treat null, undefined, and other falsy values the same\n if (!this._group && !group)\n return;\n\n if (this._var) {\n this._var.off(\"change\", this._varOnChangeSub);\n this._var = null;\n this._varOnChangeSub = null;\n }\n\n this._group = group;\n\n if (group) {\n this._var = grp(group).var(\"selection\");\n let sub = this._var.on(\"change\", (e) => {\n this._eventRelay.trigger(\"change\", e, this);\n });\n this._varOnChangeSub = sub;\n }\n }\n\n /**\n * Retrieves the current selection for the group represented by this\n * `SelectionHandle`.\n *\n * - If no selection is active, then this value will be falsy.\n * - If a selection is active, but no data points are selected, then this\n * value will be an empty array.\n * - If a selection is active, and data points are selected, then the keys\n * of the selected data points will be present in the array.\n */\n get value() {\n return this._var ? 
this._var.get() : null;\n }\n\n /**\n * Combines the given `extraInfo` (if any) with the handle's default\n * `_extraInfo` (if any).\n * @private\n */\n _mergeExtraInfo(extraInfo) {\n // Important incidental effect: shallow clone is returned\n return util.extend({},\n this._extraInfo ? this._extraInfo : null,\n extraInfo ? extraInfo : null);\n }\n\n /**\n * Overwrites the current selection for the group, and raises the `\"change\"`\n * event among all of the group's '`SelectionHandle` instances (including\n * this one).\n *\n * @fires SelectionHandle#change\n * @param {string[]} selectedKeys - Falsy, empty array, or array of keys (see\n * {@link SelectionHandle#value}).\n * @param {Object} [extraInfo] - Extra properties to be included on the event\n * object that's passed to listeners (in addition to any options that were\n * passed into the `SelectionHandle` constructor).\n */\n set(selectedKeys, extraInfo) {\n if (this._var)\n this._var.set(selectedKeys, this._mergeExtraInfo(extraInfo));\n }\n\n /**\n * Overwrites the current selection for the group, and raises the `\"change\"`\n * event among all of the group's '`SelectionHandle` instances (including\n * this one).\n *\n * @fires SelectionHandle#change\n * @param {Object} [extraInfo] - Extra properties to be included on the event\n * object that's passed to listeners (in addition to any that were passed\n * into the `SelectionHandle` constructor).\n */\n clear(extraInfo) {\n if (this._var)\n this.set(void 0, this._mergeExtraInfo(extraInfo));\n }\n\n /**\n * Subscribes to events on this `SelectionHandle`.\n *\n * @param {string} eventType - Indicates the type of events to listen to.\n * Currently, only `\"change\"` is supported.\n * @param {SelectionHandle~listener} listener - The callback function that\n * will be invoked when the event occurs.\n * @return {string} - A token to pass to {@link SelectionHandle#off} to cancel\n * this subscription.\n */\n on(eventType, listener) {\n return 
this._emitter.on(eventType, listener);\n }\n\n /**\n * Cancels event subscriptions created by {@link SelectionHandle#on}.\n *\n * @param {string} eventType - The type of event to unsubscribe.\n * @param {string|SelectionHandle~listener} listener - Either the callback\n * function previously passed into {@link SelectionHandle#on}, or the\n * string that was returned from {@link SelectionHandle#on}.\n */\n off(eventType, listener) {\n return this._emitter.off(eventType, listener);\n }\n\n /**\n * Shuts down the `SelectionHandle` object.\n *\n * Removes all event listeners that were added through this handle.\n */\n close() {\n this._emitter.removeAllListeners();\n this.setGroup(null);\n }\n}\n\n/**\n * @callback SelectionHandle~listener\n * @param {Object} event - An object containing details of the event. For\n * `\"change\"` events, this includes the properties `value` (the new\n * value of the selection, or `undefined` if no selection is active),\n * `oldValue` (the previous value of the selection), and `sender` (the\n * `SelectionHandle` instance that made the change).\n */\n\n/**\n * @event SelectionHandle#change\n * @type {object}\n * @property {object} value - The new value of the selection, or `undefined`\n * if no selection is active.\n * @property {object} oldValue - The previous value of the selection.\n * @property {SelectionHandle} sender - The `SelectionHandle` instance that\n * changed the value.\n */\n","export function extend(target, ...sources) {\n for (let i = 0; i < sources.length; i++) {\n let src = sources[i];\n if (typeof(src) === \"undefined\" || src === null)\n continue;\n\n for (let key in src) {\n if (src.hasOwnProperty(key)) {\n target[key] = src[key];\n }\n }\n }\n return target;\n}\n\nexport function checkSorted(list) {\n for (let i = 1; i < list.length; i++) {\n if (list[i] <= list[i-1]) {\n throw new Error(\"List is not sorted or contains duplicate\");\n }\n }\n}\n\nexport function diffSortedLists(a, b) {\n let i_a = 0;\n let i_b = 
0;\n\n if (!a) a = [];\n if (!b) b = [];\n\n let a_only = [];\n let b_only = [];\n\n checkSorted(a);\n checkSorted(b);\n\n while (i_a < a.length && i_b < b.length) {\n if (a[i_a] === b[i_b]) {\n i_a++;\n i_b++;\n } else if (a[i_a] < b[i_b]) {\n a_only.push(a[i_a++]);\n } else {\n b_only.push(b[i_b++]);\n }\n }\n\n if (i_a < a.length)\n a_only = a_only.concat(a.slice(i_a));\n if (i_b < b.length)\n b_only = b_only.concat(b.slice(i_b));\n return {\n removed: a_only,\n added: b_only\n };\n}\n\n// Convert from wide: { colA: [1,2,3], colB: [4,5,6], ... }\n// to long: [ {colA: 1, colB: 4}, {colA: 2, colB: 5}, ... ]\nexport function dataframeToD3(df) {\n let names = [];\n let length;\n for (let name in df) {\n if (df.hasOwnProperty(name))\n names.push(name);\n if (typeof(df[name]) !== \"object\" || typeof(df[name].length) === \"undefined\") {\n throw new Error(\"All fields must be arrays\");\n } else if (typeof(length) !== \"undefined\" && length !== df[name].length) {\n throw new Error(\"All fields must be arrays of the same length\");\n }\n length = df[name].length;\n }\n let results = [];\n let item;\n for (let row = 0; row < length; row++) {\n item = {};\n for (let col = 0; col < names.length; col++) {\n item[names[col]] = df[names[col]][row];\n }\n results.push(item);\n }\n return results;\n}\n\n/**\n * Keeps track of all event listener additions/removals and lets all active\n * listeners be removed with a single operation.\n *\n * @private\n */\nexport class SubscriptionTracker {\n constructor(emitter) {\n this._emitter = emitter;\n this._subs = {};\n }\n\n on(eventType, listener) {\n let sub = this._emitter.on(eventType, listener);\n this._subs[sub] = eventType;\n return sub;\n }\n\n off(eventType, listener) {\n let sub = this._emitter.off(eventType, listener);\n if (sub) {\n delete this._subs[sub];\n }\n return sub;\n }\n\n removeAllListeners() {\n let current_subs = this._subs;\n this._subs = {};\n Object.keys(current_subs).forEach((sub) => {\n 
this._emitter.off(current_subs[sub], sub);\n });\n }\n}\n","import Events from \"./events\";\n\nexport default class Var {\n constructor(group, name, /*optional*/ value) {\n this._group = group;\n this._name = name;\n this._value = value;\n this._events = new Events();\n }\n\n get() {\n return this._value;\n }\n\n set(value, /*optional*/ event) {\n if (this._value === value) {\n // Do nothing; the value hasn't changed\n return;\n }\n let oldValue = this._value;\n this._value = value;\n // Alert JavaScript listeners that the value has changed\n let evt = {};\n if (event && typeof(event) === \"object\") {\n for (let k in event) {\n if (event.hasOwnProperty(k))\n evt[k] = event[k];\n }\n }\n evt.oldValue = oldValue;\n evt.value = value;\n this._events.trigger(\"change\", evt, this);\n\n // TODO: Make this extensible, to let arbitrary back-ends know that\n // something has changed\n if (global.Shiny && global.Shiny.onInputChange) {\n global.Shiny.onInputChange(\n \".clientValue-\" +\n (this._group.name !== null ? this._group.name + \"-\" : \"\") +\n this._name,\n typeof(value) === \"undefined\" ? null : value\n );\n }\n }\n\n on(eventType, listener) {\n return this._events.on(eventType, listener);\n }\n\n off(eventType, listener) {\n return this._events.off(eventType, listener);\n }\n}\n"]} \ No newline at end of file diff --git a/_freeze/site_libs/crosstalk-1.2.0/scss/crosstalk.scss b/_freeze/site_libs/crosstalk-1.2.0/scss/crosstalk.scss new file mode 100644 index 0000000..3566561 --- /dev/null +++ b/_freeze/site_libs/crosstalk-1.2.0/scss/crosstalk.scss @@ -0,0 +1,75 @@ +/* Adjust margins outwards, so column contents line up with the edges of the + parent of container-fluid. */ +.container-fluid.crosstalk-bscols { + margin-left: -30px; + margin-right: -30px; + white-space: normal; +} + +/* But don't adjust the margins outwards if we're directly under the body, + i.e. we were the top-level of something at the console. 
*/ +body > .container-fluid.crosstalk-bscols { + margin-left: auto; + margin-right: auto; +} + +.crosstalk-input-checkboxgroup .crosstalk-options-group .crosstalk-options-column { + display: inline-block; + padding-right: 12px; + vertical-align: top; +} + +@media only screen and (max-width:480px) { + .crosstalk-input-checkboxgroup .crosstalk-options-group .crosstalk-options-column { + display: block; + padding-right: inherit; + } +} + +/* Relevant BS3 styles to make filter_checkbox() look reasonable without Bootstrap */ +.crosstalk-input { + margin-bottom: 15px; /* a la .form-group */ + .control-label { + margin-bottom: 0; + vertical-align: middle; + } + input[type="checkbox"] { + margin: 4px 0 0; + margin-top: 1px; + line-height: normal; + } + .checkbox { + position: relative; + display: block; + margin-top: 10px; + margin-bottom: 10px; + } + .checkbox > label{ + padding-left: 20px; + margin-bottom: 0; + font-weight: 400; + cursor: pointer; + } + .checkbox input[type="checkbox"], + .checkbox-inline input[type="checkbox"] { + position: absolute; + margin-top: 2px; + margin-left: -20px; + } + .checkbox + .checkbox { + margin-top: -5px; + } + .checkbox-inline { + position: relative; + display: inline-block; + padding-left: 20px; + margin-bottom: 0; + font-weight: 400; + vertical-align: middle; + cursor: pointer; + } + .checkbox-inline + .checkbox-inline { + margin-top: 0; + margin-left: 10px; + } +} diff --git a/_freeze/site_libs/htmlwidgets-1.6.2/htmlwidgets.js b/_freeze/site_libs/htmlwidgets-1.6.2/htmlwidgets.js new file mode 100644 index 0000000..1067d02 --- /dev/null +++ b/_freeze/site_libs/htmlwidgets-1.6.2/htmlwidgets.js @@ -0,0 +1,901 @@ +(function() { + // If window.HTMLWidgets is already defined, then use it; otherwise create a + // new object. This allows preceding code to set options that affect the + // initialization process (though none currently exist). + window.HTMLWidgets = window.HTMLWidgets || {}; + + // See if we're running in a viewer pane. 
If not, we're in a web browser. + var viewerMode = window.HTMLWidgets.viewerMode = + /\bviewer_pane=1\b/.test(window.location); + + // See if we're running in Shiny mode. If not, it's a static document. + // Note that static widgets can appear in both Shiny and static modes, but + // obviously, Shiny widgets can only appear in Shiny apps/documents. + var shinyMode = window.HTMLWidgets.shinyMode = + typeof(window.Shiny) !== "undefined" && !!window.Shiny.outputBindings; + + // We can't count on jQuery being available, so we implement our own + // version if necessary. + function querySelectorAll(scope, selector) { + if (typeof(jQuery) !== "undefined" && scope instanceof jQuery) { + return scope.find(selector); + } + if (scope.querySelectorAll) { + return scope.querySelectorAll(selector); + } + } + + function asArray(value) { + if (value === null) + return []; + if ($.isArray(value)) + return value; + return [value]; + } + + // Implement jQuery's extend + function extend(target /*, ... */) { + if (arguments.length == 1) { + return target; + } + for (var i = 1; i < arguments.length; i++) { + var source = arguments[i]; + for (var prop in source) { + if (source.hasOwnProperty(prop)) { + target[prop] = source[prop]; + } + } + } + return target; + } + + // IE8 doesn't support Array.forEach. + function forEach(values, callback, thisArg) { + if (values.forEach) { + values.forEach(callback, thisArg); + } else { + for (var i = 0; i < values.length; i++) { + callback.call(thisArg, values[i], i, values); + } + } + } + + // Replaces the specified method with the return value of funcSource. + // + // Note that funcSource should not BE the new method, it should be a function + // that RETURNS the new method. funcSource receives a single argument that is + // the overridden method, it can be called from the new method. The overridden + // method can be called like a regular function, it has the target permanently + // bound to it so "this" will work correctly. 
+ function overrideMethod(target, methodName, funcSource) { + var superFunc = target[methodName] || function() {}; + var superFuncBound = function() { + return superFunc.apply(target, arguments); + }; + target[methodName] = funcSource(superFuncBound); + } + + // Add a method to delegator that, when invoked, calls + // delegatee.methodName. If there is no such method on + // the delegatee, but there was one on delegator before + // delegateMethod was called, then the original version + // is invoked instead. + // For example: + // + // var a = { + // method1: function() { console.log('a1'); } + // method2: function() { console.log('a2'); } + // }; + // var b = { + // method1: function() { console.log('b1'); } + // }; + // delegateMethod(a, b, "method1"); + // delegateMethod(a, b, "method2"); + // a.method1(); + // a.method2(); + // + // The output would be "b1", "a2". + function delegateMethod(delegator, delegatee, methodName) { + var inherited = delegator[methodName]; + delegator[methodName] = function() { + var target = delegatee; + var method = delegatee[methodName]; + + // The method doesn't exist on the delegatee. Instead, + // call the method on the delegator, if it exists. 
+ if (!method) { + target = delegator; + method = inherited; + } + + if (method) { + return method.apply(target, arguments); + } + }; + } + + // Implement a vague facsimilie of jQuery's data method + function elementData(el, name, value) { + if (arguments.length == 2) { + return el["htmlwidget_data_" + name]; + } else if (arguments.length == 3) { + el["htmlwidget_data_" + name] = value; + return el; + } else { + throw new Error("Wrong number of arguments for elementData: " + + arguments.length); + } + } + + // http://stackoverflow.com/questions/3446170/escape-string-for-use-in-javascript-regex + function escapeRegExp(str) { + return str.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&"); + } + + function hasClass(el, className) { + var re = new RegExp("\\b" + escapeRegExp(className) + "\\b"); + return re.test(el.className); + } + + // elements - array (or array-like object) of HTML elements + // className - class name to test for + // include - if true, only return elements with given className; + // if false, only return elements *without* given className + function filterByClass(elements, className, include) { + var results = []; + for (var i = 0; i < elements.length; i++) { + if (hasClass(elements[i], className) == include) + results.push(elements[i]); + } + return results; + } + + function on(obj, eventName, func) { + if (obj.addEventListener) { + obj.addEventListener(eventName, func, false); + } else if (obj.attachEvent) { + obj.attachEvent(eventName, func); + } + } + + function off(obj, eventName, func) { + if (obj.removeEventListener) + obj.removeEventListener(eventName, func, false); + else if (obj.detachEvent) { + obj.detachEvent(eventName, func); + } + } + + // Translate array of values to top/right/bottom/left, as usual with + // the "padding" CSS property + // https://developer.mozilla.org/en-US/docs/Web/CSS/padding + function unpackPadding(value) { + if (typeof(value) === "number") + value = [value]; + if (value.length === 1) { + return {top: 
value[0], right: value[0], bottom: value[0], left: value[0]}; + } + if (value.length === 2) { + return {top: value[0], right: value[1], bottom: value[0], left: value[1]}; + } + if (value.length === 3) { + return {top: value[0], right: value[1], bottom: value[2], left: value[1]}; + } + if (value.length === 4) { + return {top: value[0], right: value[1], bottom: value[2], left: value[3]}; + } + } + + // Convert an unpacked padding object to a CSS value + function paddingToCss(paddingObj) { + return paddingObj.top + "px " + paddingObj.right + "px " + paddingObj.bottom + "px " + paddingObj.left + "px"; + } + + // Makes a number suitable for CSS + function px(x) { + if (typeof(x) === "number") + return x + "px"; + else + return x; + } + + // Retrieves runtime widget sizing information for an element. + // The return value is either null, or an object with fill, padding, + // defaultWidth, defaultHeight fields. + function sizingPolicy(el) { + var sizingEl = document.querySelector("script[data-for='" + el.id + "'][type='application/htmlwidget-sizing']"); + if (!sizingEl) + return null; + var sp = JSON.parse(sizingEl.textContent || sizingEl.text || "{}"); + if (viewerMode) { + return sp.viewer; + } else { + return sp.browser; + } + } + + // @param tasks Array of strings (or falsy value, in which case no-op). + // Each element must be a valid JavaScript expression that yields a + // function. Or, can be an array of objects with "code" and "data" + // properties; in this case, the "code" property should be a string + // of JS that's an expr that yields a function, and "data" should be + // an object that will be added as an additional argument when that + // function is called. + // @param target The object that will be "this" for each function + // execution. + // @param args Array of arguments to be passed to the functions. (The + // same arguments will be passed to all functions.) 
+ function evalAndRun(tasks, target, args) { + if (tasks) { + forEach(tasks, function(task) { + var theseArgs = args; + if (typeof(task) === "object") { + theseArgs = theseArgs.concat([task.data]); + task = task.code; + } + var taskFunc = tryEval(task); + if (typeof(taskFunc) !== "function") { + throw new Error("Task must be a function! Source:\n" + task); + } + taskFunc.apply(target, theseArgs); + }); + } + } + + // Attempt eval() both with and without enclosing in parentheses. + // Note that enclosing coerces a function declaration into + // an expression that eval() can parse + // (otherwise, a SyntaxError is thrown) + function tryEval(code) { + var result = null; + try { + result = eval("(" + code + ")"); + } catch(error) { + if (!(error instanceof SyntaxError)) { + throw error; + } + try { + result = eval(code); + } catch(e) { + if (e instanceof SyntaxError) { + throw error; + } else { + throw e; + } + } + } + return result; + } + + function initSizing(el) { + var sizing = sizingPolicy(el); + if (!sizing) + return; + + var cel = document.getElementById("htmlwidget_container"); + if (!cel) + return; + + if (typeof(sizing.padding) !== "undefined") { + document.body.style.margin = "0"; + document.body.style.padding = paddingToCss(unpackPadding(sizing.padding)); + } + + if (sizing.fill) { + document.body.style.overflow = "hidden"; + document.body.style.width = "100%"; + document.body.style.height = "100%"; + document.documentElement.style.width = "100%"; + document.documentElement.style.height = "100%"; + cel.style.position = "absolute"; + var pad = unpackPadding(sizing.padding); + cel.style.top = pad.top + "px"; + cel.style.right = pad.right + "px"; + cel.style.bottom = pad.bottom + "px"; + cel.style.left = pad.left + "px"; + el.style.width = "100%"; + el.style.height = "100%"; + + return { + getWidth: function() { return cel.getBoundingClientRect().width; }, + getHeight: function() { return cel.getBoundingClientRect().height; } + }; + + } else { + el.style.width 
= px(sizing.width); + el.style.height = px(sizing.height); + + return { + getWidth: function() { return cel.getBoundingClientRect().width; }, + getHeight: function() { return cel.getBoundingClientRect().height; } + }; + } + } + + // Default implementations for methods + var defaults = { + find: function(scope) { + return querySelectorAll(scope, "." + this.name); + }, + renderError: function(el, err) { + var $el = $(el); + + this.clearError(el); + + // Add all these error classes, as Shiny does + var errClass = "shiny-output-error"; + if (err.type !== null) { + // use the classes of the error condition as CSS class names + errClass = errClass + " " + $.map(asArray(err.type), function(type) { + return errClass + "-" + type; + }).join(" "); + } + errClass = errClass + " htmlwidgets-error"; + + // Is el inline or block? If inline or inline-block, just display:none it + // and add an inline error. + var display = $el.css("display"); + $el.data("restore-display-mode", display); + + if (display === "inline" || display === "inline-block") { + $el.hide(); + if (err.message !== "") { + var errorSpan = $("").addClass(errClass); + errorSpan.text(err.message); + $el.after(errorSpan); + } + } else if (display === "block") { + // If block, add an error just after the el, set visibility:none on the + // el, and position the error to be on top of the el. + // Mark it with a unique ID and CSS class so we can remove it later. + $el.css("visibility", "hidden"); + if (err.message !== "") { + var errorDiv = $("
").addClass(errClass).css("position", "absolute") + .css("top", el.offsetTop) + .css("left", el.offsetLeft) + // setting width can push out the page size, forcing otherwise + // unnecessary scrollbars to appear and making it impossible for + // the element to shrink; so use max-width instead + .css("maxWidth", el.offsetWidth) + .css("height", el.offsetHeight); + errorDiv.text(err.message); + $el.after(errorDiv); + + // Really dumb way to keep the size/position of the error in sync with + // the parent element as the window is resized or whatever. + var intId = setInterval(function() { + if (!errorDiv[0].parentElement) { + clearInterval(intId); + return; + } + errorDiv + .css("top", el.offsetTop) + .css("left", el.offsetLeft) + .css("maxWidth", el.offsetWidth) + .css("height", el.offsetHeight); + }, 500); + } + } + }, + clearError: function(el) { + var $el = $(el); + var display = $el.data("restore-display-mode"); + $el.data("restore-display-mode", null); + + if (display === "inline" || display === "inline-block") { + if (display) + $el.css("display", display); + $(el.nextSibling).filter(".htmlwidgets-error").remove(); + } else if (display === "block"){ + $el.css("visibility", "inherit"); + $(el.nextSibling).filter(".htmlwidgets-error").remove(); + } + }, + sizing: {} + }; + + // Called by widget bindings to register a new type of widget. The definition + // object can contain the following properties: + // - name (required) - A string indicating the binding name, which will be + // used by default as the CSS classname to look for. + // - initialize (optional) - A function(el) that will be called once per + // widget element; if a value is returned, it will be passed as the third + // value to renderValue. + // - renderValue (required) - A function(el, data, initValue) that will be + // called with data. Static contexts will cause this to be called once per + // element; Shiny apps will cause this to be called multiple times per + // element, as the data changes. 
+ window.HTMLWidgets.widget = function(definition) { + if (!definition.name) { + throw new Error("Widget must have a name"); + } + if (!definition.type) { + throw new Error("Widget must have a type"); + } + // Currently we only support output widgets + if (definition.type !== "output") { + throw new Error("Unrecognized widget type '" + definition.type + "'"); + } + // TODO: Verify that .name is a valid CSS classname + + // Support new-style instance-bound definitions. Old-style class-bound + // definitions have one widget "object" per widget per type/class of + // widget; the renderValue and resize methods on such widget objects + // take el and instance arguments, because the widget object can't + // store them. New-style instance-bound definitions have one widget + // object per widget instance; the definition that's passed in doesn't + // provide renderValue or resize methods at all, just the single method + // factory(el, width, height) + // which returns an object that has renderValue(x) and resize(w, h). + // This enables a far more natural programming style for the widget + // author, who can store per-instance state using either OO-style + // instance fields or functional-style closure variables (I guess this + // is in contrast to what can only be called C-style pseudo-OO which is + // what we required before). + if (definition.factory) { + definition = createLegacyDefinitionAdapter(definition); + } + + if (!definition.renderValue) { + throw new Error("Widget must have a renderValue function"); + } + + // For static rendering (non-Shiny), use a simple widget registration + // scheme. We also use this scheme for Shiny apps/documents that also + // contain static widgets. + window.HTMLWidgets.widgets = window.HTMLWidgets.widgets || []; + // Merge defaults into the definition; don't mutate the original definition. 
+ var staticBinding = extend({}, defaults, definition); + overrideMethod(staticBinding, "find", function(superfunc) { + return function(scope) { + var results = superfunc(scope); + // Filter out Shiny outputs, we only want the static kind + return filterByClass(results, "html-widget-output", false); + }; + }); + window.HTMLWidgets.widgets.push(staticBinding); + + if (shinyMode) { + // Shiny is running. Register the definition with an output binding. + // The definition itself will not be the output binding, instead + // we will make an output binding object that delegates to the + // definition. This is because we foolishly used the same method + // name (renderValue) for htmlwidgets definition and Shiny bindings + // but they actually have quite different semantics (the Shiny + // bindings receive data that includes lots of metadata that it + // strips off before calling htmlwidgets renderValue). We can't + // just ignore the difference because in some widgets it's helpful + // to call this.renderValue() from inside of resize(), and if + // we're not delegating, then that call will go to the Shiny + // version instead of the htmlwidgets version. + + // Merge defaults with definition, without mutating either. + var bindingDef = extend({}, defaults, definition); + + // This object will be our actual Shiny binding. + var shinyBinding = new Shiny.OutputBinding(); + + // With a few exceptions, we'll want to simply use the bindingDef's + // version of methods if they are available, otherwise fall back to + // Shiny's defaults. NOTE: If Shiny's output bindings gain additional + // methods in the future, and we want them to be overrideable by + // HTMLWidget binding definitions, then we'll need to add them to this + // list. 
+ delegateMethod(shinyBinding, bindingDef, "getId"); + delegateMethod(shinyBinding, bindingDef, "onValueChange"); + delegateMethod(shinyBinding, bindingDef, "onValueError"); + delegateMethod(shinyBinding, bindingDef, "renderError"); + delegateMethod(shinyBinding, bindingDef, "clearError"); + delegateMethod(shinyBinding, bindingDef, "showProgress"); + + // The find, renderValue, and resize are handled differently, because we + // want to actually decorate the behavior of the bindingDef methods. + + shinyBinding.find = function(scope) { + var results = bindingDef.find(scope); + + // Only return elements that are Shiny outputs, not static ones + var dynamicResults = results.filter(".html-widget-output"); + + // It's possible that whatever caused Shiny to think there might be + // new dynamic outputs, also caused there to be new static outputs. + // Since there might be lots of different htmlwidgets bindings, we + // schedule execution for later--no need to staticRender multiple + // times. + if (results.length !== dynamicResults.length) + scheduleStaticRender(); + + return dynamicResults; + }; + + // Wrap renderValue to handle initialization, which unfortunately isn't + // supported natively by Shiny at the time of this writing. 
+
+      shinyBinding.renderValue = function(el, data) {
+        Shiny.renderDependencies(data.deps);
+        // Resolve strings marked as javascript literals to objects
+        if (!(data.evals instanceof Array)) data.evals = [data.evals];
+        for (var i = 0; data.evals && i < data.evals.length; i++) {
+          window.HTMLWidgets.evaluateStringMember(data.x, data.evals[i]);
+        }
+        if (!bindingDef.renderOnNullValue) {
+          if (data.x === null) {
+            el.style.visibility = "hidden";
+            return;
+          } else {
+            el.style.visibility = "inherit";
+          }
+        }
+        if (!elementData(el, "initialized")) {
+          initSizing(el);
+
+          elementData(el, "initialized", true);
+          if (bindingDef.initialize) {
+            var rect = el.getBoundingClientRect();
+            var result = bindingDef.initialize(el, rect.width, rect.height);
+            elementData(el, "init_result", result);
+          }
+        }
+        bindingDef.renderValue(el, data.x, elementData(el, "init_result"));
+        evalAndRun(data.jsHooks.render, elementData(el, "init_result"), [el, data.x]);
+      };
+
+      // Only override resize if bindingDef implements it
+      if (bindingDef.resize) {
+        shinyBinding.resize = function(el, width, height) {
+          // Shiny can call resize before initialize/renderValue have been
+          // called, which doesn't make sense for widgets.
+          if (elementData(el, "initialized")) {
+            bindingDef.resize(el, width, height, elementData(el, "init_result"));
+          }
+        };
+      }
+
+      Shiny.outputBindings.register(shinyBinding, bindingDef.name);
+    }
+  };
+
+  var scheduleStaticRenderTimerId = null;
+  function scheduleStaticRender() {
+    if (!scheduleStaticRenderTimerId) {
+      scheduleStaticRenderTimerId = setTimeout(function() {
+        scheduleStaticRenderTimerId = null;
+        window.HTMLWidgets.staticRender();
+      }, 1);
+    }
+  }
+
+  // Render static widgets after the document finishes loading
+  // Statically render all elements that are of this widget's class
+  window.HTMLWidgets.staticRender = function() {
+    var bindings = window.HTMLWidgets.widgets || [];
+    forEach(bindings, function(binding) {
+      var matches = binding.find(document.documentElement);
+      forEach(matches, function(el) {
+        var sizeObj = initSizing(el, binding);
+
+        var getSize = function(el) {
+          if (sizeObj) {
+            return {w: sizeObj.getWidth(), h: sizeObj.getHeight()}
+          } else {
+            var rect = el.getBoundingClientRect();
+            return {w: rect.width, h: rect.height}
+          }
+        };
+
+        if (hasClass(el, "html-widget-static-bound"))
+          return;
+        el.className = el.className + " html-widget-static-bound";
+
+        var initResult;
+        if (binding.initialize) {
+          var size = getSize(el);
+          initResult = binding.initialize(el, size.w, size.h);
+          elementData(el, "init_result", initResult);
+        }
+
+        if (binding.resize) {
+          var lastSize = getSize(el);
+          var resizeHandler = function(e) {
+            var size = getSize(el);
+            if (size.w === 0 && size.h === 0)
+              return;
+            if (size.w === lastSize.w && size.h === lastSize.h)
+              return;
+            lastSize = size;
+            binding.resize(el, size.w, size.h, initResult);
+          };
+
+          on(window, "resize", resizeHandler);
+
+          // This is needed for cases where we're running in a Shiny
+          // app, but the widget itself is not a Shiny output, but
+          // rather a simple static widget. One example of this is
+          // an rmarkdown document that has runtime:shiny and widget
+          // that isn't in a render function. Shiny only knows to
+          // call resize handlers for Shiny outputs, not for static
+          // widgets, so we do it ourselves.
+          if (window.jQuery) {
+            window.jQuery(document).on(
+              "shown.htmlwidgets shown.bs.tab.htmlwidgets shown.bs.collapse.htmlwidgets",
+              resizeHandler
+            );
+            window.jQuery(document).on(
+              "hidden.htmlwidgets hidden.bs.tab.htmlwidgets hidden.bs.collapse.htmlwidgets",
+              resizeHandler
+            );
+          }
+
+          // This is needed for the specific case of ioslides, which
+          // flips slides between display:none and display:block.
+          // Ideally we would not have to have ioslide-specific code
+          // here, but rather have ioslides raise a generic event,
+          // but the rmarkdown package just went to CRAN so the
+          // window to getting that fixed may be long.
+          if (window.addEventListener) {
+            // It's OK to limit this to window.addEventListener
+            // browsers because ioslides itself only supports
+            // such browsers.
+            on(document, "slideenter", resizeHandler);
+            on(document, "slideleave", resizeHandler);
+          }
+        }
+
+        var scriptData = document.querySelector("script[data-for='" + el.id + "'][type='application/json']");
+        if (scriptData) {
+          var data = JSON.parse(scriptData.textContent || scriptData.text);
+          // Resolve strings marked as javascript literals to objects
+          if (!(data.evals instanceof Array)) data.evals = [data.evals];
+          for (var k = 0; data.evals && k < data.evals.length; k++) {
+            window.HTMLWidgets.evaluateStringMember(data.x, data.evals[k]);
+          }
+          binding.renderValue(el, data.x, initResult);
+          evalAndRun(data.jsHooks.render, initResult, [el, data.x]);
+        }
+      });
+    });
+
+    invokePostRenderHandlers();
+  }
+
+
+  function has_jQuery3() {
+    if (!window.jQuery) {
+      return false;
+    }
+    var $version = window.jQuery.fn.jquery;
+    var $major_version = parseInt($version.split(".")[0]);
+    return $major_version >= 3;
+  }
+
+  /*
+  / Shiny 1.4 bumped jQuery from 1.x to 3.x which means jQuery's
+  / on-ready handler (i.e., $(fn)) is now asyncronous (i.e., it now
+  / really means $(setTimeout(fn)).
+  / https://jquery.com/upgrade-guide/3.0/#breaking-change-document-ready-handlers-are-now-asynchronous
+  /
+  / Since Shiny uses $() to schedule initShiny, shiny>=1.4 calls initShiny
+  / one tick later than it did before, which means staticRender() is
+  / called renderValue() earlier than (advanced) widget authors might be expecting.
+  / https://github.com/rstudio/shiny/issues/2630
+  /
+  / For a concrete example, leaflet has some methods (e.g., updateBounds)
+  / which reference Shiny methods registered in initShiny (e.g., setInputValue).
+  / Since leaflet is privy to this life-cycle, it knows to use setTimeout() to
+  / delay execution of those methods (until Shiny methods are ready)
+  / https://github.com/rstudio/leaflet/blob/18ec981/javascript/src/index.js#L266-L268
+  /
+  / Ideally widget authors wouldn't need to use this setTimeout() hack that
+  / leaflet uses to call Shiny methods on a staticRender(). In the long run,
+  / the logic initShiny should be broken up so that method registration happens
+  / right away, but binding happens later.
+  */
+  function maybeStaticRenderLater() {
+    if (shinyMode && has_jQuery3()) {
+      window.jQuery(window.HTMLWidgets.staticRender);
+    } else {
+      window.HTMLWidgets.staticRender();
+    }
+  }
+
+  if (document.addEventListener) {
+    document.addEventListener("DOMContentLoaded", function() {
+      document.removeEventListener("DOMContentLoaded", arguments.callee, false);
+      maybeStaticRenderLater();
+    }, false);
+  } else if (document.attachEvent) {
+    document.attachEvent("onreadystatechange", function() {
+      if (document.readyState === "complete") {
+        document.detachEvent("onreadystatechange", arguments.callee);
+        maybeStaticRenderLater();
+      }
+    });
+  }
+
+
+  window.HTMLWidgets.getAttachmentUrl = function(depname, key) {
+    // If no key, default to the first item
+    if (typeof(key) === "undefined")
+      key = 1;
+
+    var link = document.getElementById(depname + "-" + key + "-attachment");
+    if (!link) {
+      throw new Error("Attachment " + depname + "/" + key + " not found in document");
+    }
+    return link.getAttribute("href");
+  };
+
+  window.HTMLWidgets.dataframeToD3 = function(df) {
+    var names = [];
+    var length;
+    for (var name in df) {
+      if (df.hasOwnProperty(name))
+        names.push(name);
+      if (typeof(df[name]) !== "object" || typeof(df[name].length) === "undefined") {
+        throw new Error("All fields must be arrays");
+      } else if (typeof(length) !== "undefined" && length !== df[name].length) {
+        throw new Error("All fields must be arrays of the same length");
+      }
+      length = df[name].length;
+    }
+    var results = [];
+    var item;
+    for (var row = 0; row < length; row++) {
+      item = {};
+      for (var col = 0; col < names.length; col++) {
+        item[names[col]] = df[names[col]][row];
+      }
+      results.push(item);
+    }
+    return results;
+  };
+
+  window.HTMLWidgets.transposeArray2D = function(array) {
+    if (array.length === 0) return array;
+    var newArray = array[0].map(function(col, i) {
+      return array.map(function(row) {
+        return row[i]
+      })
+    });
+    return newArray;
+  };
+  // Split value at splitChar, but allow splitChar to be escaped
+  // using escapeChar. Any other characters escaped by escapeChar
+  // will be included as usual (including escapeChar itself).
+  function splitWithEscape(value, splitChar, escapeChar) {
+    var results = [];
+    var escapeMode = false;
+    var currentResult = "";
+    for (var pos = 0; pos < value.length; pos++) {
+      if (!escapeMode) {
+        if (value[pos] === splitChar) {
+          results.push(currentResult);
+          currentResult = "";
+        } else if (value[pos] === escapeChar) {
+          escapeMode = true;
+        } else {
+          currentResult += value[pos];
+        }
+      } else {
+        currentResult += value[pos];
+        escapeMode = false;
+      }
+    }
+    if (currentResult !== "") {
+      results.push(currentResult);
+    }
+    return results;
+  }
+  // Function authored by Yihui/JJ Allaire
+  window.HTMLWidgets.evaluateStringMember = function(o, member) {
+    var parts = splitWithEscape(member, '.', '\\');
+    for (var i = 0, l = parts.length; i < l; i++) {
+      var part = parts[i];
+      // part may be a character or 'numeric' member name
+      if (o !== null && typeof o === "object" && part in o) {
+        if (i == (l - 1)) { // if we are at the end of the line then evalulate
+          if (typeof o[part] === "string")
+            o[part] = tryEval(o[part]);
+        } else { // otherwise continue to next embedded object
+          o = o[part];
+        }
+      }
+    }
+  };
+
+  // Retrieve the HTMLWidget instance (i.e. the return value of an
+  // HTMLWidget binding's initialize() or factory() function)
+  // associated with an element, or null if none.
+  window.HTMLWidgets.getInstance = function(el) {
+    return elementData(el, "init_result");
+  };
+
+  // Finds the first element in the scope that matches the selector,
+  // and returns the HTMLWidget instance (i.e. the return value of
+  // an HTMLWidget binding's initialize() or factory() function)
+  // associated with that element, if any. If no element matches the
+  // selector, or the first matching element has no HTMLWidget
+  // instance associated with it, then null is returned.
+  //
+  // The scope argument is optional, and defaults to window.document.
+  window.HTMLWidgets.find = function(scope, selector) {
+    if (arguments.length == 1) {
+      selector = scope;
+      scope = document;
+    }
+
+    var el = scope.querySelector(selector);
+    if (el === null) {
+      return null;
+    } else {
+      return window.HTMLWidgets.getInstance(el);
+    }
+  };
+
+  // Finds all elements in the scope that match the selector, and
+  // returns the HTMLWidget instances (i.e. the return values of
+  // an HTMLWidget binding's initialize() or factory() function)
+  // associated with the elements, in an array. If elements that
+  // match the selector don't have an associated HTMLWidget
+  // instance, the returned array will contain nulls.
+  //
+  // The scope argument is optional, and defaults to window.document.
+  window.HTMLWidgets.findAll = function(scope, selector) {
+    if (arguments.length == 1) {
+      selector = scope;
+      scope = document;
+    }
+
+    var nodes = scope.querySelectorAll(selector);
+    var results = [];
+    for (var i = 0; i < nodes.length; i++) {
+      results.push(window.HTMLWidgets.getInstance(nodes[i]));
+    }
+    return results;
+  };
+
+  var postRenderHandlers = [];
+  function invokePostRenderHandlers() {
+    while (postRenderHandlers.length) {
+      var handler = postRenderHandlers.shift();
+      if (handler) {
+        handler();
+      }
+    }
+  }
+
+  // Register the given callback function to be invoked after the
+  // next time static widgets are rendered.
+  window.HTMLWidgets.addPostRenderHandler = function(callback) {
+    postRenderHandlers.push(callback);
+  };
+
+  // Takes a new-style instance-bound definition, and returns an
+  // old-style class-bound definition. This saves us from having
+  // to rewrite all the logic in this file to accomodate both
+  // types of definitions.
+  function createLegacyDefinitionAdapter(defn) {
+    var result = {
+      name: defn.name,
+      type: defn.type,
+      initialize: function(el, width, height) {
+        return defn.factory(el, width, height);
+      },
+      renderValue: function(el, x, instance) {
+        return instance.renderValue(x);
+      },
+      resize: function(el, width, height, instance) {
+        return instance.resize(width, height);
+      }
+    };
+
+    if (defn.find)
+      result.find = defn.find;
+    if (defn.renderError)
+      result.renderError = defn.renderError;
+    if (defn.clearError)
+      result.clearError = defn.clearError;
+
+    return result;
+  }
+})();
diff --git a/_freeze/site_libs/jquery-3.5.1/jquery-AUTHORS.txt b/_freeze/site_libs/jquery-3.5.1/jquery-AUTHORS.txt
new file mode 100644
index 0000000..06df1a5
--- /dev/null
+++ b/_freeze/site_libs/jquery-3.5.1/jquery-AUTHORS.txt
@@ -0,0 +1,357 @@
+Authors ordered by first contribution.
+
+John Resig
+Gilles van den Hoven
+Michael Geary
+Stefan Petre
+Yehuda Katz
+Corey Jewett
+Klaus Hartl
+Franck Marcia
+Jörn Zaefferer
+Paul Bakaus
+Brandon Aaron
+Mike Alsup
+Dave Methvin
+Ed Engelhardt
+Sean Catchpole
+Paul Mclanahan
+David Serduke
+Richard D. Worth
+Scott González
+Ariel Flesler
+Cheah Chu Yeow
+Andrew Chalkley
+Fabio Buffoni
+Stefan Bauckmeier
+Jon Evans
+TJ Holowaychuk
+Riccardo De Agostini
+Michael Bensoussan
+Louis-Rémi Babé
+Robert Katić
+Damian Janowski
+Anton Kovalyov
+Dušan B. Jovanovic
+Earle Castledine
+Rich Dougherty
+Kim Dalsgaard
+Andrea Giammarchi
+Fabian Jakobs
+Mark Gibson
+Karl Swedberg
+Justin Meyer
+Ben Alman
+James Padolsey
+David Petersen
+Batiste Bieler
+Jake Archibald
+Alexander Farkas
+Filipe Fortes
+Rick Waldron
+Neeraj Singh
+Paul Irish
+Iraê Carvalho
+Matt Curry
+Michael Monteleone
+Noah Sloan
+Tom Viner
+J. Ryan Stinnett
+Douglas Neiner
+Adam J. Sontag
+Heungsub Lee
+Dave Reed
+Carl Fürstenberg
+Jacob Wright
+Ralph Whitbeck
+unknown
+temp01
+Colin Snover
+Jared Grippe
+Ryan W Tenney
+Alex Sexton
+Pinhook
+Ron Otten
+Jephte Clain
+Anton Matzneller
+Dan Heberden
+Henri Wiechers
+Russell Holbrook
+Julian Aubourg
+Gianni Alessandro Chiappetta
+Scott Jehl
+James Burke
+Jonas Pfenniger
+Xavi Ramirez
+Sylvester Keil
+Brandon Sterne
+Mathias Bynens
+Lee Carpenter
+Timmy Willison <4timmywil@gmail.com>
+Corey Frang
+Digitalxero
+David Murdoch
+Josh Varner
+Charles McNulty
+Jordan Boesch
+Jess Thrysoee
+Michael Murray
+Alexis Abril
+Rob Morgan
+John Firebaugh
+Sam Bisbee
+Gilmore Davidson
+Brian Brennan
+Xavier Montillet
+Daniel Pihlstrom
+Sahab Yazdani
+avaly
+Scott Hughes
+Mike Sherov
+Greg Hazel
+Schalk Neethling
+Denis Knauf
+Timo Tijhof
+Steen Nielsen
+Anton Ryzhov
+Shi Chuan
+Matt Mueller
+Berker Peksag
+Toby Brain
+Justin
+Daniel Herman
+Oleg Gaidarenko
+Rock Hymas
+Richard Gibson
+Rafaël Blais Masson
+cmc3cn <59194618@qq.com>
+Joe Presbrey
+Sindre Sorhus
+Arne de Bree
+Vladislav Zarakovsky
+Andrew E Monat
+Oskari
+Joao Henrique de Andrade Bruni
+tsinha
+Dominik D. Geyer
+Matt Farmer
+Trey Hunner
+Jason Moon
+Jeffery To
+Kris Borchers
+Vladimir Zhuravlev
+Jacob Thornton
+Chad Killingsworth
+Vitya Muhachev
+Nowres Rafid
+David Benjamin
+Alan Plum
+Uri Gilad
+Chris Faulkner
+Marcel Greter
+Elijah Manor
+Daniel Chatfield
+Daniel Gálvez
+Nikita Govorov
+Wesley Walser
+Mike Pennisi
+Matthias Jäggli
+Devin Cooper
+Markus Staab
+Dave Riddle
+Callum Macrae
+Jonathan Sampson
+Benjamin Truyman
+Jay Merrifield
+James Huston
+Sai Lung Wong
+Erick Ruiz de Chávez
+David Bonner
+Allen J Schmidt Jr
+Akintayo Akinwunmi
+MORGAN
+Ismail Khair
+Carl Danley
+Mike Petrovich
+Greg Lavallee
+Tom H Fuertes
+Roland Eckl
+Yiming He
+David Fox
+Bennett Sorbo
+Paul Ramos
+Rod Vagg
+Sebastian Burkhard
+Zachary Adam Kaplan
+Adam Coulombe
+nanto_vi
+nanto
+Danil Somsikov
+Ryunosuke SATO
+Diego Tres
+Jean Boussier
+Andrew Plummer
+Mark Raddatz
+Pascal Borreli
+Isaac Z. Schlueter
+Karl Sieburg
+Nguyen Phuc Lam
+Dmitry Gusev
+Steven Benner
+Li Xudong
+Michał Gołębiowski-Owczarek
+Renato Oliveira dos Santos
+Frederic Junod
+Tom H Fuertes
+Mitch Foley
+ros3cin
+Kyle Robinson Young
+John Paul
+Jason Bedard
+Chris Talkington
+Eddie Monge
+Terry Jones
+Jason Merino
+Dan Burzo
+Jeremy Dunck
+Chris Price
+Guy Bedford
+njhamann
+Goare Mao
+Amey Sakhadeo
+Mike Sidorov
+Anthony Ryan
+Lihan Li
+George Kats
+Dongseok Paeng
+Ronny Springer
+Ilya Kantor
+Marian Sollmann
+Chris Antaki
+David Hong
+Jakob Stoeck
+Christopher Jones
+Forbes Lindesay
+S. Andrew Sheppard
+Leonardo Balter
+Rodrigo Rosenfeld Rosas
+Daniel Husar
+Philip Jägenstedt
+John Hoven
+Roman Reiß
+Benjy Cui
+Christian Kosmowski
+David Corbacho
+Liang Peng
+TJ VanToll
+Aurelio De Rosa
+Senya Pugach
+Dan Hart
+Nazar Mokrynskyi
+Benjamin Tan
+Amit Merchant
+Jason Bedard
+Veaceslav Grimalschi
+Richard McDaniel
+Arthur Verschaeve
+Shivaji Varma
+Ben Toews
+Bin Xin
+Neftaly Hernandez
+T.J. Crowder
+Nicolas HENRY
+Frederic Hemberger
+Victor Homyakov
+Aditya Raghavan
+Anne-Gaelle Colom
+Leonardo Braga
+George Mauer
+Stephen Edgar
+Thomas Tortorini
+Jörn Wagner
+Jon Hester
+Colin Frick
+Winston Howes
+Alexander O'Mara
+Chris Rebert
+Bastian Buchholz
+Mu Haibao
+Calvin Metcalf
+Arthur Stolyar
+Gabriel Schulhof
+Gilad Peleg
+Julian Alexander Murillo
+Kevin Kirsche
+Martin Naumann
+Yongwoo Jeon
+John-David Dalton
+Marek Lewandowski
+Bruno Pérel
+Daniel Nill
+Reed Loden
+Sean Henderson
+Gary Ye
+Richard Kraaijenhagen
+Connor Atherton
+Christian Grete
+Tom von Clef
+Liza Ramo
+Joelle Fleurantin
+Steve Mao
+Jon Dufresne
+Jae Sung Park
+Josh Soref
+Saptak Sengupta
+Henry Wong
+Jun Sun
+Martijn W. van der Lee
+Devin Wilson
+Damian Senn
+Zack Hall
+Vitaliy Terziev
+Todor Prikumov
+Bernhard M. Wiedemann
+Jha Naman
+Alexander Lisianoi
+William Robinet
+Joe Trumbull
+Alexander K
+Ralin Chimev
+Felipe Sateler
+Christophe Tafani-Dereeper
+Manoj Kumar
+David Broder-Rodgers
+Alex Louden
+Alex Padilla
+karan-96
+南漂一卒
+Erik Lax
+Boom Lee
+Andreas Solleder
+Pierre Spring
+Shashanka Nataraj
+CDAGaming
+Matan Kotler-Berkowitz <205matan@gmail.com>
+Jordan Beland
+Henry Zhu
+Nilton Cesar
+basil.belokon
+Andrey Meshkov
+tmybr11
+Luis Emilio Velasco Sanchez
+Ed S
+Bert Zhang
+Sébastien Règne
+wartmanm <3869625+wartmanm@users.noreply.github.com>
+Siddharth Dungarwal
+abnud1
+Andrei Fangli
+Marja Hölttä
+buddh4
+Hoang
+Wonseop Kim
+Pat O'Callaghan
+JuanMa Ruiz
+Ahmed.S.ElAfifi
+Sean Robinson
+Christian Oliff
diff --git a/_freeze/site_libs/jquery-3.5.1/jquery.js b/_freeze/site_libs/jquery-3.5.1/jquery.js
new file mode 100644
index 0000000..5093733
--- /dev/null
+++ b/_freeze/site_libs/jquery-3.5.1/jquery.js
@@ -0,0 +1,10872 @@
+/*!
+ * jQuery JavaScript Library v3.5.1
+ * https://jquery.com/
+ *
+ * Includes Sizzle.js
+ * https://sizzlejs.com/
+ *
+ * Copyright JS Foundation and other contributors
+ * Released under the MIT license
+ * https://jquery.org/license
+ *
+ * Date: 2020-05-04T22:49Z
+ */
+( function( global, factory ) {
+
+    "use strict";
+
+    if ( typeof module === "object" && typeof module.exports === "object" ) {
+
+        // For CommonJS and CommonJS-like environments where a proper `window`
+        // is present, execute the factory and get jQuery.
+        // For environments that do not have a `window` with a `document`
+        // (such as Node.js), expose a factory as module.exports.
+        // This accentuates the need for the creation of a real `window`.
+        // e.g. var jQuery = require("jquery")(window);
+        // See ticket #14549 for more info.
+        module.exports = global.document ?
+            factory( global, true ) :
+            function( w ) {
+                if ( !w.document ) {
+                    throw new Error( "jQuery requires a window with a document" );
+                }
+                return factory( w );
+            };
+    } else {
+        factory( global );
+    }
+
+// Pass this if window is not defined yet
+} )( typeof window !== "undefined" ? window : this, function( window, noGlobal ) {
+
+// Edge <= 12 - 13+, Firefox <=18 - 45+, IE 10 - 11, Safari 5.1 - 9+, iOS 6 - 9.1
+// throw exceptions when non-strict code (e.g., ASP.NET 4.5) accesses strict mode
+// arguments.callee.caller (trac-13335). But as of jQuery 3.0 (2016), strict mode should be common
+// enough that all such attempts are guarded in a try block.
+"use strict";
+
+var arr = [];
+
+var getProto = Object.getPrototypeOf;
+
+var slice = arr.slice;
+
+var flat = arr.flat ? function( array ) {
+    return arr.flat.call( array );
+} : function( array ) {
+    return arr.concat.apply( [], array );
+};
+
+
+var push = arr.push;
+
+var indexOf = arr.indexOf;
+
+var class2type = {};
+
+var toString = class2type.toString;
+
+var hasOwn = class2type.hasOwnProperty;
+
+var fnToString = hasOwn.toString;
+
+var ObjectFunctionString = fnToString.call( Object );
+
+var support = {};
+
+var isFunction = function isFunction( obj ) {
+
+        // Support: Chrome <=57, Firefox <=52
+        // In some browsers, typeof returns "function" for HTML elements
+        // (i.e., `typeof document.createElement( "object" ) === "function"`).
+        // We don't want to classify *any* DOM node as a function.
+        return typeof obj === "function" && typeof obj.nodeType !== "number";
+    };
+
+
+var isWindow = function isWindow( obj ) {
+        return obj != null && obj === obj.window;
+    };
+
+
+var document = window.document;
+
+
+
+    var preservedScriptAttributes = {
+        type: true,
+        src: true,
+        nonce: true,
+        noModule: true
+    };
+
+    function DOMEval( code, node, doc ) {
+        doc = doc || document;
+
+        var i, val,
+            script = doc.createElement( "script" );
+
+        script.text = code;
+        if ( node ) {
+            for ( i in preservedScriptAttributes ) {
+
+                // Support: Firefox 64+, Edge 18+
+                // Some browsers don't support the "nonce" property on scripts.
+                // On the other hand, just using `getAttribute` is not enough as
+                // the `nonce` attribute is reset to an empty string whenever it
+                // becomes browsing-context connected.
+                // See https://github.com/whatwg/html/issues/2369
+                // See https://html.spec.whatwg.org/#nonce-attributes
+                // The `node.getAttribute` check was added for the sake of
+                // `jQuery.globalEval` so that it can fake a nonce-containing node
+                // via an object.
+ val = node[ i ] || node.getAttribute && node.getAttribute( i ); + if ( val ) { + script.setAttribute( i, val ); + } + } + } + doc.head.appendChild( script ).parentNode.removeChild( script ); + } + + +function toType( obj ) { + if ( obj == null ) { + return obj + ""; + } + + // Support: Android <=2.3 only (functionish RegExp) + return typeof obj === "object" || typeof obj === "function" ? + class2type[ toString.call( obj ) ] || "object" : + typeof obj; +} +/* global Symbol */ +// Defining this global in .eslintrc.json would create a danger of using the global +// unguarded in another place, it seems safer to define global only for this module + + + +var + version = "3.5.1", + + // Define a local copy of jQuery + jQuery = function( selector, context ) { + + // The jQuery object is actually just the init constructor 'enhanced' + // Need init if jQuery is called (just allow error to be thrown if not included) + return new jQuery.fn.init( selector, context ); + }; + +jQuery.fn = jQuery.prototype = { + + // The current version of jQuery being used + jquery: version, + + constructor: jQuery, + + // The default length of a jQuery object is 0 + length: 0, + + toArray: function() { + return slice.call( this ); + }, + + // Get the Nth element in the matched element set OR + // Get the whole matched element set as a clean array + get: function( num ) { + + // Return all the elements in a clean array + if ( num == null ) { + return slice.call( this ); + } + + // Return just the one element from the set + return num < 0 ? 
this[ num + this.length ] : this[ num ]; + }, + + // Take an array of elements and push it onto the stack + // (returning the new matched element set) + pushStack: function( elems ) { + + // Build a new jQuery matched element set + var ret = jQuery.merge( this.constructor(), elems ); + + // Add the old object onto the stack (as a reference) + ret.prevObject = this; + + // Return the newly-formed element set + return ret; + }, + + // Execute a callback for every element in the matched set. + each: function( callback ) { + return jQuery.each( this, callback ); + }, + + map: function( callback ) { + return this.pushStack( jQuery.map( this, function( elem, i ) { + return callback.call( elem, i, elem ); + } ) ); + }, + + slice: function() { + return this.pushStack( slice.apply( this, arguments ) ); + }, + + first: function() { + return this.eq( 0 ); + }, + + last: function() { + return this.eq( -1 ); + }, + + even: function() { + return this.pushStack( jQuery.grep( this, function( _elem, i ) { + return ( i + 1 ) % 2; + } ) ); + }, + + odd: function() { + return this.pushStack( jQuery.grep( this, function( _elem, i ) { + return i % 2; + } ) ); + }, + + eq: function( i ) { + var len = this.length, + j = +i + ( i < 0 ? len : 0 ); + return this.pushStack( j >= 0 && j < len ? [ this[ j ] ] : [] ); + }, + + end: function() { + return this.prevObject || this.constructor(); + }, + + // For internal use only. + // Behaves like an Array's method, not like a jQuery method. 
+ push: push, + sort: arr.sort, + splice: arr.splice +}; + +jQuery.extend = jQuery.fn.extend = function() { + var options, name, src, copy, copyIsArray, clone, + target = arguments[ 0 ] || {}, + i = 1, + length = arguments.length, + deep = false; + + // Handle a deep copy situation + if ( typeof target === "boolean" ) { + deep = target; + + // Skip the boolean and the target + target = arguments[ i ] || {}; + i++; + } + + // Handle case when target is a string or something (possible in deep copy) + if ( typeof target !== "object" && !isFunction( target ) ) { + target = {}; + } + + // Extend jQuery itself if only one argument is passed + if ( i === length ) { + target = this; + i--; + } + + for ( ; i < length; i++ ) { + + // Only deal with non-null/undefined values + if ( ( options = arguments[ i ] ) != null ) { + + // Extend the base object + for ( name in options ) { + copy = options[ name ]; + + // Prevent Object.prototype pollution + // Prevent never-ending loop + if ( name === "__proto__" || target === copy ) { + continue; + } + + // Recurse if we're merging plain objects or arrays + if ( deep && copy && ( jQuery.isPlainObject( copy ) || + ( copyIsArray = Array.isArray( copy ) ) ) ) { + src = target[ name ]; + + // Ensure proper type for the source value + if ( copyIsArray && !Array.isArray( src ) ) { + clone = []; + } else if ( !copyIsArray && !jQuery.isPlainObject( src ) ) { + clone = {}; + } else { + clone = src; + } + copyIsArray = false; + + // Never move original objects, clone them + target[ name ] = jQuery.extend( deep, clone, copy ); + + // Don't bring in undefined values + } else if ( copy !== undefined ) { + target[ name ] = copy; + } + } + } + } + + // Return the modified object + return target; +}; + +jQuery.extend( { + + // Unique for each copy of jQuery on the page + expando: "jQuery" + ( version + Math.random() ).replace( /\D/g, "" ), + + // Assume jQuery is ready without the ready module + isReady: true, + + error: function( msg ) { + throw new 
Error( msg ); + }, + + noop: function() {}, + + isPlainObject: function( obj ) { + var proto, Ctor; + + // Detect obvious negatives + // Use toString instead of jQuery.type to catch host objects + if ( !obj || toString.call( obj ) !== "[object Object]" ) { + return false; + } + + proto = getProto( obj ); + + // Objects with no prototype (e.g., `Object.create( null )`) are plain + if ( !proto ) { + return true; + } + + // Objects with prototype are plain iff they were constructed by a global Object function + Ctor = hasOwn.call( proto, "constructor" ) && proto.constructor; + return typeof Ctor === "function" && fnToString.call( Ctor ) === ObjectFunctionString; + }, + + isEmptyObject: function( obj ) { + var name; + + for ( name in obj ) { + return false; + } + return true; + }, + + // Evaluates a script in a provided context; falls back to the global one + // if not specified. + globalEval: function( code, options, doc ) { + DOMEval( code, { nonce: options && options.nonce }, doc ); + }, + + each: function( obj, callback ) { + var length, i = 0; + + if ( isArrayLike( obj ) ) { + length = obj.length; + for ( ; i < length; i++ ) { + if ( callback.call( obj[ i ], i, obj[ i ] ) === false ) { + break; + } + } + } else { + for ( i in obj ) { + if ( callback.call( obj[ i ], i, obj[ i ] ) === false ) { + break; + } + } + } + + return obj; + }, + + // results is for internal usage only + makeArray: function( arr, results ) { + var ret = results || []; + + if ( arr != null ) { + if ( isArrayLike( Object( arr ) ) ) { + jQuery.merge( ret, + typeof arr === "string" ? + [ arr ] : arr + ); + } else { + push.call( ret, arr ); + } + } + + return ret; + }, + + inArray: function( elem, arr, i ) { + return arr == null ? 
-1 : indexOf.call( arr, elem, i ); + }, + + // Support: Android <=4.0 only, PhantomJS 1 only + // push.apply(_, arraylike) throws on ancient WebKit + merge: function( first, second ) { + var len = +second.length, + j = 0, + i = first.length; + + for ( ; j < len; j++ ) { + first[ i++ ] = second[ j ]; + } + + first.length = i; + + return first; + }, + + grep: function( elems, callback, invert ) { + var callbackInverse, + matches = [], + i = 0, + length = elems.length, + callbackExpect = !invert; + + // Go through the array, only saving the items + // that pass the validator function + for ( ; i < length; i++ ) { + callbackInverse = !callback( elems[ i ], i ); + if ( callbackInverse !== callbackExpect ) { + matches.push( elems[ i ] ); + } + } + + return matches; + }, + + // arg is for internal usage only + map: function( elems, callback, arg ) { + var length, value, + i = 0, + ret = []; + + // Go through the array, translating each of the items to their new values + if ( isArrayLike( elems ) ) { + length = elems.length; + for ( ; i < length; i++ ) { + value = callback( elems[ i ], i, arg ); + + if ( value != null ) { + ret.push( value ); + } + } + + // Go through every key on the object, + } else { + for ( i in elems ) { + value = callback( elems[ i ], i, arg ); + + if ( value != null ) { + ret.push( value ); + } + } + } + + // Flatten any nested arrays + return flat( ret ); + }, + + // A global GUID counter for objects + guid: 1, + + // jQuery.support is not used in Core but other projects attach their + // properties to it so it needs to exist. 
+ support: support +} ); + +if ( typeof Symbol === "function" ) { + jQuery.fn[ Symbol.iterator ] = arr[ Symbol.iterator ]; +} + +// Populate the class2type map +jQuery.each( "Boolean Number String Function Array Date RegExp Object Error Symbol".split( " " ), +function( _i, name ) { + class2type[ "[object " + name + "]" ] = name.toLowerCase(); +} ); + +function isArrayLike( obj ) { + + // Support: real iOS 8.2 only (not reproducible in simulator) + // `in` check used to prevent JIT error (gh-2145) + // hasOwn isn't used here due to false negatives + // regarding Nodelist length in IE + var length = !!obj && "length" in obj && obj.length, + type = toType( obj ); + + if ( isFunction( obj ) || isWindow( obj ) ) { + return false; + } + + return type === "array" || length === 0 || + typeof length === "number" && length > 0 && ( length - 1 ) in obj; +} +var Sizzle = +/*! + * Sizzle CSS Selector Engine v2.3.5 + * https://sizzlejs.com/ + * + * Copyright JS Foundation and other contributors + * Released under the MIT license + * https://js.foundation/ + * + * Date: 2020-03-14 + */ +( function( window ) { +var i, + support, + Expr, + getText, + isXML, + tokenize, + compile, + select, + outermostContext, + sortInput, + hasDuplicate, + + // Local document vars + setDocument, + document, + docElem, + documentIsHTML, + rbuggyQSA, + rbuggyMatches, + matches, + contains, + + // Instance-specific data + expando = "sizzle" + 1 * new Date(), + preferredDoc = window.document, + dirruns = 0, + done = 0, + classCache = createCache(), + tokenCache = createCache(), + compilerCache = createCache(), + nonnativeSelectorCache = createCache(), + sortOrder = function( a, b ) { + if ( a === b ) { + hasDuplicate = true; + } + return 0; + }, + + // Instance methods + hasOwn = ( {} ).hasOwnProperty, + arr = [], + pop = arr.pop, + pushNative = arr.push, + push = arr.push, + slice = arr.slice, + + // Use a stripped-down indexOf as it's faster than native + // https://jsperf.com/thor-indexof-vs-for/5 + 
indexOf = function( list, elem ) { + var i = 0, + len = list.length; + for ( ; i < len; i++ ) { + if ( list[ i ] === elem ) { + return i; + } + } + return -1; + }, + + booleans = "checked|selected|async|autofocus|autoplay|controls|defer|disabled|hidden|" + + "ismap|loop|multiple|open|readonly|required|scoped", + + // Regular expressions + + // http://www.w3.org/TR/css3-selectors/#whitespace + whitespace = "[\\x20\\t\\r\\n\\f]", + + // https://www.w3.org/TR/css-syntax-3/#ident-token-diagram + identifier = "(?:\\\\[\\da-fA-F]{1,6}" + whitespace + + "?|\\\\[^\\r\\n\\f]|[\\w-]|[^\0-\\x7f])+", + + // Attribute selectors: http://www.w3.org/TR/selectors/#attribute-selectors + attributes = "\\[" + whitespace + "*(" + identifier + ")(?:" + whitespace + + + // Operator (capture 2) + "*([*^$|!~]?=)" + whitespace + + + // "Attribute values must be CSS identifiers [capture 5] + // or strings [capture 3 or capture 4]" + "*(?:'((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\"|(" + identifier + "))|)" + + whitespace + "*\\]", + + pseudos = ":(" + identifier + ")(?:\\((" + + + // To reduce the number of selectors needing tokenize in the preFilter, prefer arguments: + // 1. quoted (capture 3; capture 4 or capture 5) + "('((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\")|" + + + // 2. simple (capture 6) + "((?:\\\\.|[^\\\\()[\\]]|" + attributes + ")*)|" + + + // 3. 
anything else (capture 2) + ".*" + + ")\\)|)", + + // Leading and non-escaped trailing whitespace, capturing some non-whitespace characters preceding the latter + rwhitespace = new RegExp( whitespace + "+", "g" ), + rtrim = new RegExp( "^" + whitespace + "+|((?:^|[^\\\\])(?:\\\\.)*)" + + whitespace + "+$", "g" ), + + rcomma = new RegExp( "^" + whitespace + "*," + whitespace + "*" ), + rcombinators = new RegExp( "^" + whitespace + "*([>+~]|" + whitespace + ")" + whitespace + + "*" ), + rdescend = new RegExp( whitespace + "|>" ), + + rpseudo = new RegExp( pseudos ), + ridentifier = new RegExp( "^" + identifier + "$" ), + + matchExpr = { + "ID": new RegExp( "^#(" + identifier + ")" ), + "CLASS": new RegExp( "^\\.(" + identifier + ")" ), + "TAG": new RegExp( "^(" + identifier + "|[*])" ), + "ATTR": new RegExp( "^" + attributes ), + "PSEUDO": new RegExp( "^" + pseudos ), + "CHILD": new RegExp( "^:(only|first|last|nth|nth-last)-(child|of-type)(?:\\(" + + whitespace + "*(even|odd|(([+-]|)(\\d*)n|)" + whitespace + "*(?:([+-]|)" + + whitespace + "*(\\d+)|))" + whitespace + "*\\)|)", "i" ), + "bool": new RegExp( "^(?:" + booleans + ")$", "i" ), + + // For use in libraries implementing .is() + // We use this for POS matching in `select` + "needsContext": new RegExp( "^" + whitespace + + "*[>+~]|:(even|odd|eq|gt|lt|nth|first|last)(?:\\(" + whitespace + + "*((?:-\\d)?\\d*)" + whitespace + "*\\)|)(?=[^-]|$)", "i" ) + }, + + rhtml = /HTML$/i, + rinputs = /^(?:input|select|textarea|button)$/i, + rheader = /^h\d$/i, + + rnative = /^[^{]+\{\s*\[native \w/, + + // Easily-parseable/retrievable ID or TAG or CLASS selectors + rquickExpr = /^(?:#([\w-]+)|(\w+)|\.([\w-]+))$/, + + rsibling = /[+~]/, + + // CSS escapes + // http://www.w3.org/TR/CSS21/syndata.html#escaped-characters + runescape = new RegExp( "\\\\[\\da-fA-F]{1,6}" + whitespace + "?|\\\\([^\\r\\n\\f])", "g" ), + funescape = function( escape, nonHex ) { + var high = "0x" + escape.slice( 1 ) - 0x10000; + + return nonHex ? 
+
+			// Strip the backslash prefix from a non-hex escape sequence
+			nonHex :
+
+			// Replace a hexadecimal escape sequence with the encoded Unicode code point
+			// Support: IE <=11+
+			// For values outside the Basic Multilingual Plane (BMP), manually construct a
+			// surrogate pair
+			high < 0 ?
+				String.fromCharCode( high + 0x10000 ) :
+				String.fromCharCode( high >> 10 | 0xD800, high & 0x3FF | 0xDC00 );
+	},
+
+	// CSS string/identifier serialization
+	// https://drafts.csswg.org/cssom/#common-serializing-idioms
+	rcssescape = /([\0-\x1f\x7f]|^-?\d)|^-$|[^\0-\x1f\x7f-\uFFFF\w-]/g,
+	fcssescape = function( ch, asCodePoint ) {
+		if ( asCodePoint ) {
+
+			// U+0000 NULL becomes U+FFFD REPLACEMENT CHARACTER
+			if ( ch === "\0" ) {
+				return "\uFFFD";
+			}
+
+			// Control characters and (dependent upon position) numbers get escaped as code points
+			return ch.slice( 0, -1 ) + "\\" +
+				ch.charCodeAt( ch.length - 1 ).toString( 16 ) + " ";
+		}
+
+		// Other potentially-special ASCII characters get backslash-escaped
+		return "\\" + ch;
+	},
+
+	// Used for iframes
+	// See setDocument()
+	// Removing the function wrapper causes a "Permission Denied"
+	// error in IE
+	unloadHandler = function() {
+		setDocument();
+	},
+
+	inDisabledFieldset = addCombinator(
+		function( elem ) {
+			return elem.disabled === true && elem.nodeName.toLowerCase() === "fieldset";
+		},
+		{ dir: "parentNode", next: "legend" }
+	);
+
+// Optimize for push.apply( _, NodeList )
+try {
+	push.apply(
+		( arr = slice.call( preferredDoc.childNodes ) ),
+		preferredDoc.childNodes
+	);
+
+	// Support: Android<4.0
+	// Detect silently failing push.apply
+	// eslint-disable-next-line no-unused-expressions
+	arr[ preferredDoc.childNodes.length ].nodeType;
+} catch ( e ) {
+	push = { apply: arr.length ?
+
+		// Leverage slice if possible
+		function( target, els ) {
+			pushNative.apply( target, slice.call( els ) );
+		} :
+
+		// Support: IE<9
+		// Otherwise append directly
+		function( target, els ) {
+			var j = target.length,
+				i = 0;
+
+			// Can't trust NodeList.length
+			while ( ( target[ j++ ] = els[ i++ ] ) ) {}
+			target.length = j - 1;
+		}
+	};
+}
+
+function Sizzle( selector, context, results, seed ) {
+	var m, i, elem, nid, match, groups, newSelector,
+		newContext = context && context.ownerDocument,
+
+		// nodeType defaults to 9, since context defaults to document
+		nodeType = context ? context.nodeType : 9;
+
+	results = results || [];
+
+	// Return early from calls with invalid selector or context
+	if ( typeof selector !== "string" || !selector ||
+		nodeType !== 1 && nodeType !== 9 && nodeType !== 11 ) {
+
+		return results;
+	}
+
+	// Try to shortcut find operations (as opposed to filters) in HTML documents
+	if ( !seed ) {
+		setDocument( context );
+		context = context || document;
+
+		if ( documentIsHTML ) {
+
+			// If the selector is sufficiently simple, try using a "get*By*" DOM method
+			// (excepting DocumentFragment context, where the methods don't exist)
+			if ( nodeType !== 11 && ( match = rquickExpr.exec( selector ) ) ) {
+
+				// ID selector
+				if ( ( m = match[ 1 ] ) ) {
+
+					// Document context
+					if ( nodeType === 9 ) {
+						if ( ( elem = context.getElementById( m ) ) ) {
+
+							// Support: IE, Opera, Webkit
+							// TODO: identify versions
+							// getElementById can match elements by name instead of ID
+							if ( elem.id === m ) {
+								results.push( elem );
+								return results;
+							}
+						} else {
+							return results;
+						}
+
+					// Element context
+					} else {
+
+						// Support: IE, Opera, Webkit
+						// TODO: identify versions
+						// getElementById can match elements by name instead of ID
+						if ( newContext && ( elem = newContext.getElementById( m ) ) &&
+							contains( context, elem ) &&
+							elem.id === m ) {
+
+							results.push( elem );
+							return results;
+						}
+					}
+
+				// Type selector
+				} else if ( match[ 2 ] ) {
+					push.apply( results, context.getElementsByTagName( selector ) );
+					return results;
+
+				// Class selector
+				} else if ( ( m = match[ 3 ] ) && support.getElementsByClassName &&
+					context.getElementsByClassName ) {
+
+					push.apply( results, context.getElementsByClassName( m ) );
+					return results;
+				}
+			}
+
+			// Take advantage of querySelectorAll
+			if ( support.qsa &&
+				!nonnativeSelectorCache[ selector + " " ] &&
+				( !rbuggyQSA || !rbuggyQSA.test( selector ) ) &&
+
+				// Support: IE 8 only
+				// Exclude object elements
+				( nodeType !== 1 || context.nodeName.toLowerCase() !== "object" ) ) {
+
+				newSelector = selector;
+				newContext = context;
+
+				// qSA considers elements outside a scoping root when evaluating child or
+				// descendant combinators, which is not what we want.
+				// In such cases, we work around the behavior by prefixing every selector in the
+				// list with an ID selector referencing the scope context.
+				// The technique has to be used as well when a leading combinator is used
+				// as such selectors are not recognized by querySelectorAll.
+				// Thanks to Andrew Dupont for this technique.
+				if ( nodeType === 1 &&
+					( rdescend.test( selector ) || rcombinators.test( selector ) ) ) {
+
+					// Expand context for sibling selectors
+					newContext = rsibling.test( selector ) && testContext( context.parentNode ) ||
+						context;
+
+					// We can use :scope instead of the ID hack if the browser
+					// supports it & if we're not changing the context.
+					if ( newContext !== context || !support.scope ) {
+
+						// Capture the context ID, setting it first if necessary
+						if ( ( nid = context.getAttribute( "id" ) ) ) {
+							nid = nid.replace( rcssescape, fcssescape );
+						} else {
+							context.setAttribute( "id", ( nid = expando ) );
+						}
+					}
+
+					// Prefix every selector in the list
+					groups = tokenize( selector );
+					i = groups.length;
+					while ( i-- ) {
+						groups[ i ] = ( nid ? "#" + nid : ":scope" ) + " " +
+							toSelector( groups[ i ] );
+					}
+					newSelector = groups.join( "," );
+				}
+
+				try {
+					push.apply( results,
+						newContext.querySelectorAll( newSelector )
+					);
+					return results;
+				} catch ( qsaError ) {
+					nonnativeSelectorCache( selector, true );
+				} finally {
+					if ( nid === expando ) {
+						context.removeAttribute( "id" );
+					}
+				}
+			}
+		}
+	}
+
+	// All others
+	return select( selector.replace( rtrim, "$1" ), context, results, seed );
+}
+
+/**
+ * Create key-value caches of limited size
+ * @returns {function(string, object)} Returns the Object data after storing it on itself with
+ *	property name the (space-suffixed) string and (if the cache is larger than Expr.cacheLength)
+ *	deleting the oldest entry
+ */
+function createCache() {
+	var keys = [];
+
+	function cache( key, value ) {
+
+		// Use (key + " ") to avoid collision with native prototype properties (see Issue #157)
+		if ( keys.push( key + " " ) > Expr.cacheLength ) {
+
+			// Only keep the most recent entries
+			delete cache[ keys.shift() ];
+		}
+		return ( cache[ key + " " ] = value );
+	}
+	return cache;
+}
+
+/**
+ * Mark a function for special use by Sizzle
+ * @param {Function} fn The function to mark
+ */
+function markFunction( fn ) {
+	fn[ expando ] = true;
+	return fn;
+}
+
+/**
+ * Support testing using an element
+ * @param {Function} fn Passed the created element and returns a boolean result
+ */
+function assert( fn ) {
+	var el = document.createElement( "fieldset" );
+
+	try {
+		return !!fn( el );
+	} catch ( e ) {
+		return false;
+	} finally {
+
+		// Remove from its parent by default
+		if ( el.parentNode ) {
+			el.parentNode.removeChild( el );
+		}
+
+		// release memory in IE
+		el = null;
+	}
+}
+
+/**
+ * Adds the same handler for all of the specified attrs
+ * @param {String} attrs Pipe-separated list of attributes
+ * @param {Function} handler The method that will be applied
+ */
+function addHandle( attrs, handler ) {
+	var arr = attrs.split( "|" ),
+		i = arr.length;
+
+	while ( i-- ) {
+		Expr.attrHandle[ arr[ i ] ] = handler;
+	}
+}
+
+/**
+ * Checks document order of two siblings
+ * @param {Element} a
+ * @param {Element} b
+ * @returns {Number} Returns less than 0 if a precedes b, greater than 0 if a follows b
+ */
+function siblingCheck( a, b ) {
+	var cur = b && a,
+		diff = cur && a.nodeType === 1 && b.nodeType === 1 &&
+			a.sourceIndex - b.sourceIndex;
+
+	// Use IE sourceIndex if available on both nodes
+	if ( diff ) {
+		return diff;
+	}
+
+	// Check if b follows a
+	if ( cur ) {
+		while ( ( cur = cur.nextSibling ) ) {
+			if ( cur === b ) {
+				return -1;
+			}
+		}
+	}
+
+	return a ? 1 : -1;
+}
+
+/**
+ * Returns a function to use in pseudos for input types
+ * @param {String} type
+ */
+function createInputPseudo( type ) {
+	return function( elem ) {
+		var name = elem.nodeName.toLowerCase();
+		return name === "input" && elem.type === type;
+	};
+}
+
+/**
+ * Returns a function to use in pseudos for buttons
+ * @param {String} type
+ */
+function createButtonPseudo( type ) {
+	return function( elem ) {
+		var name = elem.nodeName.toLowerCase();
+		return ( name === "input" || name === "button" ) && elem.type === type;
+	};
+}
+
+/**
+ * Returns a function to use in pseudos for :enabled/:disabled
+ * @param {Boolean} disabled true for :disabled; false for :enabled
+ */
+function createDisabledPseudo( disabled ) {
+
+	// Known :disabled false positives: fieldset[disabled] > legend:nth-of-type(n+2) :can-disable
+	return function( elem ) {
+
+		// Only certain elements can match :enabled or :disabled
+		// https://html.spec.whatwg.org/multipage/scripting.html#selector-enabled
+		// https://html.spec.whatwg.org/multipage/scripting.html#selector-disabled
+		if ( "form" in elem ) {
+
+			// Check for inherited disabledness on relevant non-disabled elements:
+			// * listed form-associated elements in a disabled fieldset
+			//   https://html.spec.whatwg.org/multipage/forms.html#category-listed
+			//   https://html.spec.whatwg.org/multipage/forms.html#concept-fe-disabled
+			// * option elements in a disabled optgroup
+			//   https://html.spec.whatwg.org/multipage/forms.html#concept-option-disabled
+			// All such elements have a "form" property.
+			if ( elem.parentNode && elem.disabled === false ) {
+
+				// Option elements defer to a parent optgroup if present
+				if ( "label" in elem ) {
+					if ( "label" in elem.parentNode ) {
+						return elem.parentNode.disabled === disabled;
+					} else {
+						return elem.disabled === disabled;
+					}
+				}
+
+				// Support: IE 6 - 11
+				// Use the isDisabled shortcut property to check for disabled fieldset ancestors
+				return elem.isDisabled === disabled ||
+
+					// Where there is no isDisabled, check manually
+					/* jshint -W018 */
+					elem.isDisabled !== !disabled &&
+					inDisabledFieldset( elem ) === disabled;
+			}
+
+			return elem.disabled === disabled;
+
+		// Try to winnow out elements that can't be disabled before trusting the disabled property.
+		// Some victims get caught in our net (label, legend, menu, track), but it shouldn't
+		// even exist on them, let alone have a boolean value.
+		} else if ( "label" in elem ) {
+			return elem.disabled === disabled;
+		}
+
+		// Remaining elements are neither :enabled nor :disabled
+		return false;
+	};
+}
+
+/**
+ * Returns a function to use in pseudos for positionals
+ * @param {Function} fn
+ */
+function createPositionalPseudo( fn ) {
+	return markFunction( function( argument ) {
+		argument = +argument;
+		return markFunction( function( seed, matches ) {
+			var j,
+				matchIndexes = fn( [], seed.length, argument ),
+				i = matchIndexes.length;
+
+			// Match elements found at the specified indexes
+			while ( i-- ) {
+				if ( seed[ ( j = matchIndexes[ i ] ) ] ) {
+					seed[ j ] = !( matches[ j ] = seed[ j ] );
+				}
+			}
+		} );
+	} );
+}
+
+/**
+ * Checks a node for validity as a Sizzle context
+ * @param {Element|Object=} context
+ * @returns {Element|Object|Boolean} The input node if acceptable, otherwise a falsy value
+ */
+function testContext( context ) {
+	return context && typeof context.getElementsByTagName !== "undefined" && context;
+}
+
+// Expose support vars for convenience
+support = Sizzle.support = {};
+
+/**
+ * Detects XML nodes
+ * @param {Element|Object} elem An element or a document
+ * @returns {Boolean} True iff elem is a non-HTML XML node
+ */
+isXML = Sizzle.isXML = function( elem ) {
+	var namespace = elem.namespaceURI,
+		docElem = ( elem.ownerDocument || elem ).documentElement;
+
+	// Support: IE <=8
+	// Assume HTML when documentElement doesn't yet exist, such as inside loading iframes
+	// https://bugs.jquery.com/ticket/4833
+	return !rhtml.test( namespace || docElem && docElem.nodeName || "HTML" );
+};
+
+/**
+ * Sets document-related variables once based on the current document
+ * @param {Element|Object} [doc] An element or document object to use to set the document
+ * @returns {Object} Returns the current document
+ */
+setDocument = Sizzle.setDocument = function( node ) {
+	var hasCompare, subWindow,
+		doc = node ? node.ownerDocument || node : preferredDoc;
+
+	// Return early if doc is invalid or already selected
+	// Support: IE 11+, Edge 17 - 18+
+	// IE/Edge sometimes throw a "Permission denied" error when strict-comparing
+	// two documents; shallow comparisons work.
+	// eslint-disable-next-line eqeqeq
+	if ( doc == document || doc.nodeType !== 9 || !doc.documentElement ) {
+		return document;
+	}
+
+	// Update global variables
+	document = doc;
+	docElem = document.documentElement;
+	documentIsHTML = !isXML( document );
+
+	// Support: IE 9 - 11+, Edge 12 - 18+
+	// Accessing iframe documents after unload throws "permission denied" errors (jQuery #13936)
+	// Support: IE 11+, Edge 17 - 18+
+	// IE/Edge sometimes throw a "Permission denied" error when strict-comparing
+	// two documents; shallow comparisons work.
+	// eslint-disable-next-line eqeqeq
+	if ( preferredDoc != document &&
+		( subWindow = document.defaultView ) && subWindow.top !== subWindow ) {
+
+		// Support: IE 11, Edge
+		if ( subWindow.addEventListener ) {
+			subWindow.addEventListener( "unload", unloadHandler, false );
+
+		// Support: IE 9 - 10 only
+		} else if ( subWindow.attachEvent ) {
+			subWindow.attachEvent( "onunload", unloadHandler );
+		}
+	}
+
+	// Support: IE 8 - 11+, Edge 12 - 18+, Chrome <=16 - 25 only, Firefox <=3.6 - 31 only,
+	// Safari 4 - 5 only, Opera <=11.6 - 12.x only
+	// IE/Edge & older browsers don't support the :scope pseudo-class.
+	// Support: Safari 6.0 only
+	// Safari 6.0 supports :scope but it's an alias of :root there.
+	support.scope = assert( function( el ) {
+		docElem.appendChild( el ).appendChild( document.createElement( "div" ) );
+		return typeof el.querySelectorAll !== "undefined" &&
+			!el.querySelectorAll( ":scope fieldset div" ).length;
+	} );
+
+	/* Attributes
+	---------------------------------------------------------------------- */
+
+	// Support: IE<8
+	// Verify that getAttribute really returns attributes and not properties
+	// (excepting IE8 booleans)
+	support.attributes = assert( function( el ) {
+		el.className = "i";
+		return !el.getAttribute( "className" );
+	} );
+
+	/* getElement(s)By*
+	---------------------------------------------------------------------- */
+
+	// Check if getElementsByTagName("*") returns only elements
+	support.getElementsByTagName = assert( function( el ) {
+		el.appendChild( document.createComment( "" ) );
+		return !el.getElementsByTagName( "*" ).length;
+	} );
+
+	// Support: IE<9
+	support.getElementsByClassName = rnative.test( document.getElementsByClassName );
+
+	// Support: IE<10
+	// Check if getElementById returns elements by name
+	// The broken getElementById methods don't pick up programmatically-set names,
+	// so use a roundabout getElementsByName test
+	support.getById = assert( function( el ) {
+		docElem.appendChild( el ).id = expando;
+		return !document.getElementsByName || !document.getElementsByName( expando ).length;
+	} );
+
+	// ID filter and find
+	if ( support.getById ) {
+		Expr.filter[ "ID" ] = function( id ) {
+			var attrId = id.replace( runescape, funescape );
+			return function( elem ) {
+				return elem.getAttribute( "id" ) === attrId;
+			};
+		};
+		Expr.find[ "ID" ] = function( id, context ) {
+			if ( typeof context.getElementById !== "undefined" && documentIsHTML ) {
+				var elem = context.getElementById( id );
+				return elem ? [ elem ] : [];
+			}
+		};
+	} else {
+		Expr.filter[ "ID" ] = function( id ) {
+			var attrId = id.replace( runescape, funescape );
+			return function( elem ) {
+				var node = typeof elem.getAttributeNode !== "undefined" &&
+					elem.getAttributeNode( "id" );
+				return node && node.value === attrId;
+			};
+		};
+
+		// Support: IE 6 - 7 only
+		// getElementById is not reliable as a find shortcut
+		Expr.find[ "ID" ] = function( id, context ) {
+			if ( typeof context.getElementById !== "undefined" && documentIsHTML ) {
+				var node, i, elems,
+					elem = context.getElementById( id );
+
+				if ( elem ) {
+
+					// Verify the id attribute
+					node = elem.getAttributeNode( "id" );
+					if ( node && node.value === id ) {
+						return [ elem ];
+					}
+
+					// Fall back on getElementsByName
+					elems = context.getElementsByName( id );
+					i = 0;
+					while ( ( elem = elems[ i++ ] ) ) {
+						node = elem.getAttributeNode( "id" );
+						if ( node && node.value === id ) {
+							return [ elem ];
+						}
+					}
+				}
+
+				return [];
+			}
+		};
+	}
+
+	// Tag
+	Expr.find[ "TAG" ] = support.getElementsByTagName ?
+		function( tag, context ) {
+			if ( typeof context.getElementsByTagName !== "undefined" ) {
+				return context.getElementsByTagName( tag );
+
+			// DocumentFragment nodes don't have gEBTN
+			} else if ( support.qsa ) {
+				return context.querySelectorAll( tag );
+			}
+		} :
+
+		function( tag, context ) {
+			var elem,
+				tmp = [],
+				i = 0,
+
+				// By happy coincidence, a (broken) gEBTN appears on DocumentFragment nodes too
+				results = context.getElementsByTagName( tag );
+
+			// Filter out possible comments
+			if ( tag === "*" ) {
+				while ( ( elem = results[ i++ ] ) ) {
+					if ( elem.nodeType === 1 ) {
+						tmp.push( elem );
+					}
+				}
+
+				return tmp;
+			}
+			return results;
+		};
+
+	// Class
+	Expr.find[ "CLASS" ] = support.getElementsByClassName && function( className, context ) {
+		if ( typeof context.getElementsByClassName !== "undefined" && documentIsHTML ) {
+			return context.getElementsByClassName( className );
+		}
+	};
+
+	/* QSA/matchesSelector
+	---------------------------------------------------------------------- */
+
+	// QSA and matchesSelector support
+
+	// matchesSelector(:active) reports false when true (IE9/Opera 11.5)
+	rbuggyMatches = [];
+
+	// qSa(:focus) reports false when true (Chrome 21)
+	// We allow this because of a bug in IE8/9 that throws an error
+	// whenever `document.activeElement` is accessed on an iframe
+	// So, we allow :focus to pass through QSA all the time to avoid the IE error
+	// See https://bugs.jquery.com/ticket/13378
+	rbuggyQSA = [];
+
+	if ( ( support.qsa = rnative.test( document.querySelectorAll ) ) ) {
+
+		// Build QSA regex
+		// Regex strategy adopted from Diego Perini
+		assert( function( el ) {
+
+			var input;
+
+			// Select is set to empty string on purpose
+			// This is to test IE's treatment of not explicitly
+			// setting a boolean content attribute,
+			// since its presence should be enough
+			// https://bugs.jquery.com/ticket/12359
+			docElem.appendChild( el ).innerHTML = "" +
+				"";
+
+			// Support: IE8, Opera 11-12.16
+			// Nothing should be selected when empty strings follow ^= or $= or *=
+			// The test attribute must be unknown in Opera but "safe" for WinRT
+			// https://msdn.microsoft.com/en-us/library/ie/hh465388.aspx#attribute_section
+			if ( el.querySelectorAll( "[msallowcapture^='']" ).length ) {
+				rbuggyQSA.push( "[*^$]=" + whitespace + "*(?:''|\"\")" );
+			}
+
+			// Support: IE8
+			// Boolean attributes and "value" are not treated correctly
+			if ( !el.querySelectorAll( "[selected]" ).length ) {
+				rbuggyQSA.push( "\\[" + whitespace + "*(?:value|" + booleans + ")" );
+			}
+
+			// Support: Chrome<29, Android<4.4, Safari<7.0+, iOS<7.0+, PhantomJS<1.9.8+
+			if ( !el.querySelectorAll( "[id~=" + expando + "-]" ).length ) {
+				rbuggyQSA.push( "~=" );
+			}
+
+			// Support: IE 11+, Edge 15 - 18+
+			// IE 11/Edge don't find elements on a `[name='']` query in some cases.
+			// Adding a temporary attribute to the document before the selection works
+			// around the issue.
+			// Interestingly, IE 10 & older don't seem to have the issue.
+			input = document.createElement( "input" );
+			input.setAttribute( "name", "" );
+			el.appendChild( input );
+			if ( !el.querySelectorAll( "[name='']" ).length ) {
+				rbuggyQSA.push( "\\[" + whitespace + "*name" + whitespace + "*=" +
+					whitespace + "*(?:''|\"\")" );
+			}
+
+			// Webkit/Opera - :checked should return selected option elements
+			// http://www.w3.org/TR/2011/REC-css3-selectors-20110929/#checked
+			// IE8 throws error here and will not see later tests
+			if ( !el.querySelectorAll( ":checked" ).length ) {
+				rbuggyQSA.push( ":checked" );
+			}
+
+			// Support: Safari 8+, iOS 8+
+			// https://bugs.webkit.org/show_bug.cgi?id=136851
+			// In-page `selector#id sibling-combinator selector` fails
+			if ( !el.querySelectorAll( "a#" + expando + "+*" ).length ) {
+				rbuggyQSA.push( ".#.+[+~]" );
+			}
+
+			// Support: Firefox <=3.6 - 5 only
+			// Old Firefox doesn't throw on a badly-escaped identifier.
+			el.querySelectorAll( "\\\f" );
+			rbuggyQSA.push( "[\\r\\n\\f]" );
+		} );
+
+		assert( function( el ) {
+			el.innerHTML = "" +
+				"";
+
+			// Support: Windows 8 Native Apps
+			// The type and name attributes are restricted during .innerHTML assignment
+			var input = document.createElement( "input" );
+			input.setAttribute( "type", "hidden" );
+			el.appendChild( input ).setAttribute( "name", "D" );
+
+			// Support: IE8
+			// Enforce case-sensitivity of name attribute
+			if ( el.querySelectorAll( "[name=d]" ).length ) {
+				rbuggyQSA.push( "name" + whitespace + "*[*^$|!~]?=" );
+			}
+
+			// FF 3.5 - :enabled/:disabled and hidden elements (hidden elements are still enabled)
+			// IE8 throws error here and will not see later tests
+			if ( el.querySelectorAll( ":enabled" ).length !== 2 ) {
+				rbuggyQSA.push( ":enabled", ":disabled" );
+			}
+
+			// Support: IE9-11+
+			// IE's :disabled selector does not pick up the children of disabled fieldsets
+			docElem.appendChild( el ).disabled = true;
+			if ( el.querySelectorAll( ":disabled" ).length !== 2 ) {
+				rbuggyQSA.push( ":enabled", ":disabled" );
+			}
+
+			// Support: Opera 10 - 11 only
+			// Opera 10-11 does not throw on post-comma invalid pseudos
+			el.querySelectorAll( "*,:x" );
+			rbuggyQSA.push( ",.*:" );
+		} );
+	}
+
+	if ( ( support.matchesSelector = rnative.test( ( matches = docElem.matches ||
+		docElem.webkitMatchesSelector ||
+		docElem.mozMatchesSelector ||
+		docElem.oMatchesSelector ||
+		docElem.msMatchesSelector ) ) ) ) {
+
+		assert( function( el ) {
+
+			// Check to see if it's possible to do matchesSelector
+			// on a disconnected node (IE 9)
+			support.disconnectedMatch = matches.call( el, "*" );
+
+			// This should fail with an exception
+			// Gecko does not error, returns false instead
+			matches.call( el, "[s!='']:x" );
+			rbuggyMatches.push( "!=", pseudos );
+		} );
+	}
+
+	rbuggyQSA = rbuggyQSA.length && new RegExp( rbuggyQSA.join( "|" ) );
+	rbuggyMatches = rbuggyMatches.length && new RegExp( rbuggyMatches.join( "|" ) );
+
+	/* Contains
+	---------------------------------------------------------------------- */
+	hasCompare = rnative.test( docElem.compareDocumentPosition );
+
+	// Element contains another
+	// Purposefully self-exclusive
+	// As in, an element does not contain itself
+	contains = hasCompare || rnative.test( docElem.contains ) ?
+		function( a, b ) {
+			var adown = a.nodeType === 9 ? a.documentElement : a,
+				bup = b && b.parentNode;
+			return a === bup || !!( bup && bup.nodeType === 1 && (
+				adown.contains ?
+					adown.contains( bup ) :
+					a.compareDocumentPosition && a.compareDocumentPosition( bup ) & 16
+			) );
+		} :
+		function( a, b ) {
+			if ( b ) {
+				while ( ( b = b.parentNode ) ) {
+					if ( b === a ) {
+						return true;
+					}
+				}
+			}
+			return false;
+		};
+
+	/* Sorting
+	---------------------------------------------------------------------- */
+
+	// Document order sorting
+	sortOrder = hasCompare ?
+	function( a, b ) {
+
+		// Flag for duplicate removal
+		if ( a === b ) {
+			hasDuplicate = true;
+			return 0;
+		}
+
+		// Sort on method existence if only one input has compareDocumentPosition
+		var compare = !a.compareDocumentPosition - !b.compareDocumentPosition;
+		if ( compare ) {
+			return compare;
+		}
+
+		// Calculate position if both inputs belong to the same document
+		// Support: IE 11+, Edge 17 - 18+
+		// IE/Edge sometimes throw a "Permission denied" error when strict-comparing
+		// two documents; shallow comparisons work.
+		// eslint-disable-next-line eqeqeq
+		compare = ( a.ownerDocument || a ) == ( b.ownerDocument || b ) ?
+			a.compareDocumentPosition( b ) :
+
+			// Otherwise we know they are disconnected
+			1;
+
+		// Disconnected nodes
+		if ( compare & 1 ||
+			( !support.sortDetached && b.compareDocumentPosition( a ) === compare ) ) {
+
+			// Choose the first element that is related to our preferred document
+			// Support: IE 11+, Edge 17 - 18+
+			// IE/Edge sometimes throw a "Permission denied" error when strict-comparing
+			// two documents; shallow comparisons work.
+			// eslint-disable-next-line eqeqeq
+			if ( a == document || a.ownerDocument == preferredDoc &&
+				contains( preferredDoc, a ) ) {
+				return -1;
+			}
+
+			// Support: IE 11+, Edge 17 - 18+
+			// IE/Edge sometimes throw a "Permission denied" error when strict-comparing
+			// two documents; shallow comparisons work.
+			// eslint-disable-next-line eqeqeq
+			if ( b == document || b.ownerDocument == preferredDoc &&
+				contains( preferredDoc, b ) ) {
+				return 1;
+			}
+
+			// Maintain original order
+			return sortInput ?
+				( indexOf( sortInput, a ) - indexOf( sortInput, b ) ) :
+				0;
+		}
+
+		return compare & 4 ? -1 : 1;
+	} :
+	function( a, b ) {
+
+		// Exit early if the nodes are identical
+		if ( a === b ) {
+			hasDuplicate = true;
+			return 0;
+		}
+
+		var cur,
+			i = 0,
+			aup = a.parentNode,
+			bup = b.parentNode,
+			ap = [ a ],
+			bp = [ b ];
+
+		// Parentless nodes are either documents or disconnected
+		if ( !aup || !bup ) {
+
+			// Support: IE 11+, Edge 17 - 18+
+			// IE/Edge sometimes throw a "Permission denied" error when strict-comparing
+			// two documents; shallow comparisons work.
+			/* eslint-disable eqeqeq */
+			return a == document ? -1 :
+				b == document ? 1 :
+				/* eslint-enable eqeqeq */
+				aup ? -1 :
+				bup ? 1 :
+				sortInput ?
+					( indexOf( sortInput, a ) - indexOf( sortInput, b ) ) :
+					0;
+
+		// If the nodes are siblings, we can do a quick check
+		} else if ( aup === bup ) {
+			return siblingCheck( a, b );
+		}
+
+		// Otherwise we need full lists of their ancestors for comparison
+		cur = a;
+		while ( ( cur = cur.parentNode ) ) {
+			ap.unshift( cur );
+		}
+		cur = b;
+		while ( ( cur = cur.parentNode ) ) {
+			bp.unshift( cur );
+		}
+
+		// Walk down the tree looking for a discrepancy
+		while ( ap[ i ] === bp[ i ] ) {
+			i++;
+		}
+
+		return i ?
+
+			// Do a sibling check if the nodes have a common ancestor
+			siblingCheck( ap[ i ], bp[ i ] ) :
+
+			// Otherwise nodes in our document sort first
+			// Support: IE 11+, Edge 17 - 18+
+			// IE/Edge sometimes throw a "Permission denied" error when strict-comparing
+			// two documents; shallow comparisons work.
+			/* eslint-disable eqeqeq */
+			ap[ i ] == preferredDoc ? -1 :
+			bp[ i ] == preferredDoc ? 1 :
+			/* eslint-enable eqeqeq */
+			0;
+	};
+
+	return document;
+};
+
+Sizzle.matches = function( expr, elements ) {
+	return Sizzle( expr, null, null, elements );
+};
+
+Sizzle.matchesSelector = function( elem, expr ) {
+	setDocument( elem );
+
+	if ( support.matchesSelector && documentIsHTML &&
+		!nonnativeSelectorCache[ expr + " " ] &&
+		( !rbuggyMatches || !rbuggyMatches.test( expr ) ) &&
+		( !rbuggyQSA || !rbuggyQSA.test( expr ) ) ) {
+
+		try {
+			var ret = matches.call( elem, expr );
+
+			// IE 9's matchesSelector returns false on disconnected nodes
+			if ( ret || support.disconnectedMatch ||
+
+				// As well, disconnected nodes are said to be in a document
+				// fragment in IE 9
+				elem.document && elem.document.nodeType !== 11 ) {
+				return ret;
+			}
+		} catch ( e ) {
+			nonnativeSelectorCache( expr, true );
+		}
+	}
+
+	return Sizzle( expr, document, null, [ elem ] ).length > 0;
+};
+
+Sizzle.contains = function( context, elem ) {
+
+	// Set document vars if needed
+	// Support: IE 11+, Edge 17 - 18+
+	// IE/Edge sometimes throw a "Permission denied" error when strict-comparing
+	// two documents; shallow comparisons work.
+	// eslint-disable-next-line eqeqeq
+	if ( ( context.ownerDocument || context ) != document ) {
+		setDocument( context );
+	}
+	return contains( context, elem );
+};
+
+Sizzle.attr = function( elem, name ) {
+
+	// Set document vars if needed
+	// Support: IE 11+, Edge 17 - 18+
+	// IE/Edge sometimes throw a "Permission denied" error when strict-comparing
+	// two documents; shallow comparisons work.
+	// eslint-disable-next-line eqeqeq
+	if ( ( elem.ownerDocument || elem ) != document ) {
+		setDocument( elem );
+	}
+
+	var fn = Expr.attrHandle[ name.toLowerCase() ],
+
+		// Don't get fooled by Object.prototype properties (jQuery #13807)
+		val = fn && hasOwn.call( Expr.attrHandle, name.toLowerCase() ) ?
+			fn( elem, name, !documentIsHTML ) :
+			undefined;
+
+	return val !== undefined ?
+		val :
+		support.attributes || !documentIsHTML ?
+			elem.getAttribute( name ) :
+			( val = elem.getAttributeNode( name ) ) && val.specified ?
+				val.value :
+				null;
+};
+
+Sizzle.escape = function( sel ) {
+	return ( sel + "" ).replace( rcssescape, fcssescape );
+};
+
+Sizzle.error = function( msg ) {
+	throw new Error( "Syntax error, unrecognized expression: " + msg );
+};
+
+/**
+ * Document sorting and removing duplicates
+ * @param {ArrayLike} results
+ */
+Sizzle.uniqueSort = function( results ) {
+	var elem,
+		duplicates = [],
+		j = 0,
+		i = 0;
+
+	// Unless we *know* we can detect duplicates, assume their presence
+	hasDuplicate = !support.detectDuplicates;
+	sortInput = !support.sortStable && results.slice( 0 );
+	results.sort( sortOrder );
+
+	if ( hasDuplicate ) {
+		while ( ( elem = results[ i++ ] ) ) {
+			if ( elem === results[ i ] ) {
+				j = duplicates.push( i );
+			}
+		}
+		while ( j-- ) {
+			results.splice( duplicates[ j ], 1 );
+		}
+	}
+
+	// Clear input after sorting to release objects
+	// See https://github.com/jquery/sizzle/pull/225
+	sortInput = null;
+
+	return results;
+};
+
+/**
+ * Utility function for retrieving the text value of an array of DOM nodes
+ * @param {Array|Element} elem
+ */
+getText = Sizzle.getText = function( elem ) {
+	var node,
+		ret = "",
+		i = 0,
+		nodeType = elem.nodeType;
+
+	if ( !nodeType ) {
+
+		// If no nodeType, this is expected to be an array
+		while ( ( node = elem[ i++ ] ) ) {
+
+			// Do not traverse comment nodes
+			ret += getText( node );
+		}
+	} else if ( nodeType === 1 || nodeType === 9 || nodeType === 11 ) {
+
+		// Use textContent for elements
+		// innerText usage removed for consistency of new lines (jQuery #11153)
+		if ( typeof elem.textContent === "string" ) {
+			return elem.textContent;
+		} else {
+
+			// Traverse its children
+			for ( elem = elem.firstChild; elem; elem = elem.nextSibling ) {
+				ret += getText( elem );
+			}
+		}
+	} else if ( nodeType === 3 || nodeType === 4 ) {
+		return elem.nodeValue;
+	}
+
+	// Do not include comment or processing instruction nodes
+
+	return ret;
+};
+
+Expr = Sizzle.selectors = {
+
+	// Can be adjusted by the user
+	cacheLength: 50,
+
+	createPseudo: markFunction,
+
+	match: matchExpr,
+
+	attrHandle: {},
+
+	find: {},
+
+	relative: {
+		">": { dir: "parentNode", first: true },
+		" ": { dir: "parentNode" },
+		"+": { dir: "previousSibling", first: true },
+		"~": { dir: "previousSibling" }
+	},
+
+	preFilter: {
+		"ATTR": function( match ) {
+			match[ 1 ] = match[ 1 ].replace( runescape, funescape );
+
+			// Move the given value to match[3] whether quoted or unquoted
+			match[ 3 ] = ( match[ 3 ] || match[ 4 ] ||
+				match[ 5 ] || "" ).replace( runescape, funescape );
+
+			if ( match[ 2 ] === "~=" ) {
+				match[ 3 ] = " " + match[ 3 ] + " ";
+			}
+
+			return match.slice( 0, 4 );
+		},
+
+		"CHILD": function( match ) {
+
+			/* matches from matchExpr["CHILD"]
+				1 type (only|nth|...)
+				2 what (child|of-type)
+				3 argument (even|odd|\d*|\d*n([+-]\d+)?|...)
+				4 xn-component of xn+y argument ([+-]?\d*n|)
+				5 sign of xn-component
+				6 x of xn-component
+				7 sign of y-component
+				8 y of y-component
+			*/
+			match[ 1 ] = match[ 1 ].toLowerCase();
+
+			if ( match[ 1 ].slice( 0, 3 ) === "nth" ) {
+
+				// nth-* requires argument
+				if ( !match[ 3 ] ) {
+					Sizzle.error( match[ 0 ] );
+				}
+
+				// numeric x and y parameters for Expr.filter.CHILD
+				// remember that false/true cast respectively to 0/1
+				match[ 4 ] = +( match[ 4 ] ?
+ match[ 5 ] + ( match[ 6 ] || 1 ) : + 2 * ( match[ 3 ] === "even" || match[ 3 ] === "odd" ) ); + match[ 5 ] = +( ( match[ 7 ] + match[ 8 ] ) || match[ 3 ] === "odd" ); + + // other types prohibit arguments + } else if ( match[ 3 ] ) { + Sizzle.error( match[ 0 ] ); + } + + return match; + }, + + "PSEUDO": function( match ) { + var excess, + unquoted = !match[ 6 ] && match[ 2 ]; + + if ( matchExpr[ "CHILD" ].test( match[ 0 ] ) ) { + return null; + } + + // Accept quoted arguments as-is + if ( match[ 3 ] ) { + match[ 2 ] = match[ 4 ] || match[ 5 ] || ""; + + // Strip excess characters from unquoted arguments + } else if ( unquoted && rpseudo.test( unquoted ) && + + // Get excess from tokenize (recursively) + ( excess = tokenize( unquoted, true ) ) && + + // advance to the next closing parenthesis + ( excess = unquoted.indexOf( ")", unquoted.length - excess ) - unquoted.length ) ) { + + // excess is a negative index + match[ 0 ] = match[ 0 ].slice( 0, excess ); + match[ 2 ] = unquoted.slice( 0, excess ); + } + + // Return only captures needed by the pseudo filter method (type and argument) + return match.slice( 0, 3 ); + } + }, + + filter: { + + "TAG": function( nodeNameSelector ) { + var nodeName = nodeNameSelector.replace( runescape, funescape ).toLowerCase(); + return nodeNameSelector === "*" ? 
+ function() { + return true; + } : + function( elem ) { + return elem.nodeName && elem.nodeName.toLowerCase() === nodeName; + }; + }, + + "CLASS": function( className ) { + var pattern = classCache[ className + " " ]; + + return pattern || + ( pattern = new RegExp( "(^|" + whitespace + + ")" + className + "(" + whitespace + "|$)" ) ) && classCache( + className, function( elem ) { + return pattern.test( + typeof elem.className === "string" && elem.className || + typeof elem.getAttribute !== "undefined" && + elem.getAttribute( "class" ) || + "" + ); + } ); + }, + + "ATTR": function( name, operator, check ) { + return function( elem ) { + var result = Sizzle.attr( elem, name ); + + if ( result == null ) { + return operator === "!="; + } + if ( !operator ) { + return true; + } + + result += ""; + + /* eslint-disable max-len */ + + return operator === "=" ? result === check : + operator === "!=" ? result !== check : + operator === "^=" ? check && result.indexOf( check ) === 0 : + operator === "*=" ? check && result.indexOf( check ) > -1 : + operator === "$=" ? check && result.slice( -check.length ) === check : + operator === "~=" ? ( " " + result.replace( rwhitespace, " " ) + " " ).indexOf( check ) > -1 : + operator === "|=" ? result === check || result.slice( 0, check.length + 1 ) === check + "-" : + false; + /* eslint-enable max-len */ + + }; + }, + + "CHILD": function( type, what, _argument, first, last ) { + var simple = type.slice( 0, 3 ) !== "nth", + forward = type.slice( -4 ) !== "last", + ofType = what === "of-type"; + + return first === 1 && last === 0 ? + + // Shortcut for :nth-*(n) + function( elem ) { + return !!elem.parentNode; + } : + + function( elem, _context, xml ) { + var cache, uniqueCache, outerCache, node, nodeIndex, start, + dir = simple !== forward ? 
"nextSibling" : "previousSibling", + parent = elem.parentNode, + name = ofType && elem.nodeName.toLowerCase(), + useCache = !xml && !ofType, + diff = false; + + if ( parent ) { + + // :(first|last|only)-(child|of-type) + if ( simple ) { + while ( dir ) { + node = elem; + while ( ( node = node[ dir ] ) ) { + if ( ofType ? + node.nodeName.toLowerCase() === name : + node.nodeType === 1 ) { + + return false; + } + } + + // Reverse direction for :only-* (if we haven't yet done so) + start = dir = type === "only" && !start && "nextSibling"; + } + return true; + } + + start = [ forward ? parent.firstChild : parent.lastChild ]; + + // non-xml :nth-child(...) stores cache data on `parent` + if ( forward && useCache ) { + + // Seek `elem` from a previously-cached index + + // ...in a gzip-friendly way + node = parent; + outerCache = node[ expando ] || ( node[ expando ] = {} ); + + // Support: IE <9 only + // Defend against cloned attroperties (jQuery gh-1709) + uniqueCache = outerCache[ node.uniqueID ] || + ( outerCache[ node.uniqueID ] = {} ); + + cache = uniqueCache[ type ] || []; + nodeIndex = cache[ 0 ] === dirruns && cache[ 1 ]; + diff = nodeIndex && cache[ 2 ]; + node = nodeIndex && parent.childNodes[ nodeIndex ]; + + while ( ( node = ++nodeIndex && node && node[ dir ] || + + // Fallback to seeking `elem` from the start + ( diff = nodeIndex = 0 ) || start.pop() ) ) { + + // When found, cache indexes on `parent` and break + if ( node.nodeType === 1 && ++diff && node === elem ) { + uniqueCache[ type ] = [ dirruns, nodeIndex, diff ]; + break; + } + } + + } else { + + // Use previously-cached element index if available + if ( useCache ) { + + // ...in a gzip-friendly way + node = elem; + outerCache = node[ expando ] || ( node[ expando ] = {} ); + + // Support: IE <9 only + // Defend against cloned attroperties (jQuery gh-1709) + uniqueCache = outerCache[ node.uniqueID ] || + ( outerCache[ node.uniqueID ] = {} ); + + cache = uniqueCache[ type ] || []; + nodeIndex = cache[ 0 
] === dirruns && cache[ 1 ]; + diff = nodeIndex; + } + + // xml :nth-child(...) + // or :nth-last-child(...) or :nth(-last)?-of-type(...) + if ( diff === false ) { + + // Use the same loop as above to seek `elem` from the start + while ( ( node = ++nodeIndex && node && node[ dir ] || + ( diff = nodeIndex = 0 ) || start.pop() ) ) { + + if ( ( ofType ? + node.nodeName.toLowerCase() === name : + node.nodeType === 1 ) && + ++diff ) { + + // Cache the index of each encountered element + if ( useCache ) { + outerCache = node[ expando ] || + ( node[ expando ] = {} ); + + // Support: IE <9 only + // Defend against cloned attroperties (jQuery gh-1709) + uniqueCache = outerCache[ node.uniqueID ] || + ( outerCache[ node.uniqueID ] = {} ); + + uniqueCache[ type ] = [ dirruns, diff ]; + } + + if ( node === elem ) { + break; + } + } + } + } + } + + // Incorporate the offset, then check against cycle size + diff -= last; + return diff === first || ( diff % first === 0 && diff / first >= 0 ); + } + }; + }, + + "PSEUDO": function( pseudo, argument ) { + + // pseudo-class names are case-insensitive + // http://www.w3.org/TR/selectors/#pseudo-classes + // Prioritize by case sensitivity in case custom pseudos are added with uppercase letters + // Remember that setFilters inherits from pseudos + var args, + fn = Expr.pseudos[ pseudo ] || Expr.setFilters[ pseudo.toLowerCase() ] || + Sizzle.error( "unsupported pseudo: " + pseudo ); + + // The user may use createPseudo to indicate that + // arguments are needed to create the filter function + // just as Sizzle does + if ( fn[ expando ] ) { + return fn( argument ); + } + + // But maintain support for old signatures + if ( fn.length > 1 ) { + args = [ pseudo, pseudo, "", argument ]; + return Expr.setFilters.hasOwnProperty( pseudo.toLowerCase() ) ? 
+ markFunction( function( seed, matches ) { + var idx, + matched = fn( seed, argument ), + i = matched.length; + while ( i-- ) { + idx = indexOf( seed, matched[ i ] ); + seed[ idx ] = !( matches[ idx ] = matched[ i ] ); + } + } ) : + function( elem ) { + return fn( elem, 0, args ); + }; + } + + return fn; + } + }, + + pseudos: { + + // Potentially complex pseudos + "not": markFunction( function( selector ) { + + // Trim the selector passed to compile + // to avoid treating leading and trailing + // spaces as combinators + var input = [], + results = [], + matcher = compile( selector.replace( rtrim, "$1" ) ); + + return matcher[ expando ] ? + markFunction( function( seed, matches, _context, xml ) { + var elem, + unmatched = matcher( seed, null, xml, [] ), + i = seed.length; + + // Match elements unmatched by `matcher` + while ( i-- ) { + if ( ( elem = unmatched[ i ] ) ) { + seed[ i ] = !( matches[ i ] = elem ); + } + } + } ) : + function( elem, _context, xml ) { + input[ 0 ] = elem; + matcher( input, null, xml, results ); + + // Don't keep the element (issue #299) + input[ 0 ] = null; + return !results.pop(); + }; + } ), + + "has": markFunction( function( selector ) { + return function( elem ) { + return Sizzle( selector, elem ).length > 0; + }; + } ), + + "contains": markFunction( function( text ) { + text = text.replace( runescape, funescape ); + return function( elem ) { + return ( elem.textContent || getText( elem ) ).indexOf( text ) > -1; + }; + } ), + + // "Whether an element is represented by a :lang() selector + // is based solely on the element's language value + // being equal to the identifier C, + // or beginning with the identifier C immediately followed by "-". + // The matching of C against the element's language value is performed case-insensitively. + // The identifier C does not have to be a valid language name." 
+ // http://www.w3.org/TR/selectors/#lang-pseudo + "lang": markFunction( function( lang ) { + + // lang value must be a valid identifier + if ( !ridentifier.test( lang || "" ) ) { + Sizzle.error( "unsupported lang: " + lang ); + } + lang = lang.replace( runescape, funescape ).toLowerCase(); + return function( elem ) { + var elemLang; + do { + if ( ( elemLang = documentIsHTML ? + elem.lang : + elem.getAttribute( "xml:lang" ) || elem.getAttribute( "lang" ) ) ) { + + elemLang = elemLang.toLowerCase(); + return elemLang === lang || elemLang.indexOf( lang + "-" ) === 0; + } + } while ( ( elem = elem.parentNode ) && elem.nodeType === 1 ); + return false; + }; + } ), + + // Miscellaneous + "target": function( elem ) { + var hash = window.location && window.location.hash; + return hash && hash.slice( 1 ) === elem.id; + }, + + "root": function( elem ) { + return elem === docElem; + }, + + "focus": function( elem ) { + return elem === document.activeElement && + ( !document.hasFocus || document.hasFocus() ) && + !!( elem.type || elem.href || ~elem.tabIndex ); + }, + + // Boolean properties + "enabled": createDisabledPseudo( false ), + "disabled": createDisabledPseudo( true ), + + "checked": function( elem ) { + + // In CSS3, :checked should return both checked and selected elements + // http://www.w3.org/TR/2011/REC-css3-selectors-20110929/#checked + var nodeName = elem.nodeName.toLowerCase(); + return ( nodeName === "input" && !!elem.checked ) || + ( nodeName === "option" && !!elem.selected ); + }, + + "selected": function( elem ) { + + // Accessing this property makes selected-by-default + // options in Safari work properly + if ( elem.parentNode ) { + // eslint-disable-next-line no-unused-expressions + elem.parentNode.selectedIndex; + } + + return elem.selected === true; + }, + + // Contents + "empty": function( elem ) { + + // http://www.w3.org/TR/selectors/#empty-pseudo + // :empty is negated by element (1) or content nodes (text: 3; cdata: 4; entity ref: 5), + // but 
not by others (comment: 8; processing instruction: 7; etc.) + // nodeType < 6 works because attributes (2) do not appear as children + for ( elem = elem.firstChild; elem; elem = elem.nextSibling ) { + if ( elem.nodeType < 6 ) { + return false; + } + } + return true; + }, + + "parent": function( elem ) { + return !Expr.pseudos[ "empty" ]( elem ); + }, + + // Element/input types + "header": function( elem ) { + return rheader.test( elem.nodeName ); + }, + + "input": function( elem ) { + return rinputs.test( elem.nodeName ); + }, + + "button": function( elem ) { + var name = elem.nodeName.toLowerCase(); + return name === "input" && elem.type === "button" || name === "button"; + }, + + "text": function( elem ) { + var attr; + return elem.nodeName.toLowerCase() === "input" && + elem.type === "text" && + + // Support: IE<8 + // New HTML5 attribute values (e.g., "search") appear with elem.type === "text" + ( ( attr = elem.getAttribute( "type" ) ) == null || + attr.toLowerCase() === "text" ); + }, + + // Position-in-collection + "first": createPositionalPseudo( function() { + return [ 0 ]; + } ), + + "last": createPositionalPseudo( function( _matchIndexes, length ) { + return [ length - 1 ]; + } ), + + "eq": createPositionalPseudo( function( _matchIndexes, length, argument ) { + return [ argument < 0 ? argument + length : argument ]; + } ), + + "even": createPositionalPseudo( function( matchIndexes, length ) { + var i = 0; + for ( ; i < length; i += 2 ) { + matchIndexes.push( i ); + } + return matchIndexes; + } ), + + "odd": createPositionalPseudo( function( matchIndexes, length ) { + var i = 1; + for ( ; i < length; i += 2 ) { + matchIndexes.push( i ); + } + return matchIndexes; + } ), + + "lt": createPositionalPseudo( function( matchIndexes, length, argument ) { + var i = argument < 0 ? + argument + length : + argument > length ? 
+ length : + argument; + for ( ; --i >= 0; ) { + matchIndexes.push( i ); + } + return matchIndexes; + } ), + + "gt": createPositionalPseudo( function( matchIndexes, length, argument ) { + var i = argument < 0 ? argument + length : argument; + for ( ; ++i < length; ) { + matchIndexes.push( i ); + } + return matchIndexes; + } ) + } +}; + +Expr.pseudos[ "nth" ] = Expr.pseudos[ "eq" ]; + +// Add button/input type pseudos +for ( i in { radio: true, checkbox: true, file: true, password: true, image: true } ) { + Expr.pseudos[ i ] = createInputPseudo( i ); +} +for ( i in { submit: true, reset: true } ) { + Expr.pseudos[ i ] = createButtonPseudo( i ); +} + +// Easy API for creating new setFilters +function setFilters() {} +setFilters.prototype = Expr.filters = Expr.pseudos; +Expr.setFilters = new setFilters(); + +tokenize = Sizzle.tokenize = function( selector, parseOnly ) { + var matched, match, tokens, type, + soFar, groups, preFilters, + cached = tokenCache[ selector + " " ]; + + if ( cached ) { + return parseOnly ? 
0 : cached.slice( 0 ); + } + + soFar = selector; + groups = []; + preFilters = Expr.preFilter; + + while ( soFar ) { + + // Comma and first run + if ( !matched || ( match = rcomma.exec( soFar ) ) ) { + if ( match ) { + + // Don't consume trailing commas as valid + soFar = soFar.slice( match[ 0 ].length ) || soFar; + } + groups.push( ( tokens = [] ) ); + } + + matched = false; + + // Combinators + if ( ( match = rcombinators.exec( soFar ) ) ) { + matched = match.shift(); + tokens.push( { + value: matched, + + // Cast descendant combinators to space + type: match[ 0 ].replace( rtrim, " " ) + } ); + soFar = soFar.slice( matched.length ); + } + + // Filters + for ( type in Expr.filter ) { + if ( ( match = matchExpr[ type ].exec( soFar ) ) && ( !preFilters[ type ] || + ( match = preFilters[ type ]( match ) ) ) ) { + matched = match.shift(); + tokens.push( { + value: matched, + type: type, + matches: match + } ); + soFar = soFar.slice( matched.length ); + } + } + + if ( !matched ) { + break; + } + } + + // Return the length of the invalid excess + // if we're just parsing + // Otherwise, throw an error or return tokens + return parseOnly ? + soFar.length : + soFar ? + Sizzle.error( selector ) : + + // Cache the tokens + tokenCache( selector, groups ).slice( 0 ); +}; + +function toSelector( tokens ) { + var i = 0, + len = tokens.length, + selector = ""; + for ( ; i < len; i++ ) { + selector += tokens[ i ].value; + } + return selector; +} + +function addCombinator( matcher, combinator, base ) { + var dir = combinator.dir, + skip = combinator.next, + key = skip || dir, + checkNonElements = base && key === "parentNode", + doneName = done++; + + return combinator.first ? 
+ + // Check against closest ancestor/preceding element + function( elem, context, xml ) { + while ( ( elem = elem[ dir ] ) ) { + if ( elem.nodeType === 1 || checkNonElements ) { + return matcher( elem, context, xml ); + } + } + return false; + } : + + // Check against all ancestor/preceding elements + function( elem, context, xml ) { + var oldCache, uniqueCache, outerCache, + newCache = [ dirruns, doneName ]; + + // We can't set arbitrary data on XML nodes, so they don't benefit from combinator caching + if ( xml ) { + while ( ( elem = elem[ dir ] ) ) { + if ( elem.nodeType === 1 || checkNonElements ) { + if ( matcher( elem, context, xml ) ) { + return true; + } + } + } + } else { + while ( ( elem = elem[ dir ] ) ) { + if ( elem.nodeType === 1 || checkNonElements ) { + outerCache = elem[ expando ] || ( elem[ expando ] = {} ); + + // Support: IE <9 only + // Defend against cloned attroperties (jQuery gh-1709) + uniqueCache = outerCache[ elem.uniqueID ] || + ( outerCache[ elem.uniqueID ] = {} ); + + if ( skip && skip === elem.nodeName.toLowerCase() ) { + elem = elem[ dir ] || elem; + } else if ( ( oldCache = uniqueCache[ key ] ) && + oldCache[ 0 ] === dirruns && oldCache[ 1 ] === doneName ) { + + // Assign to newCache so results back-propagate to previous elements + return ( newCache[ 2 ] = oldCache[ 2 ] ); + } else { + + // Reuse newcache so results back-propagate to previous elements + uniqueCache[ key ] = newCache; + + // A match means we're done; a fail means we have to keep checking + if ( ( newCache[ 2 ] = matcher( elem, context, xml ) ) ) { + return true; + } + } + } + } + } + return false; + }; +} + +function elementMatcher( matchers ) { + return matchers.length > 1 ? 
+ function( elem, context, xml ) { + var i = matchers.length; + while ( i-- ) { + if ( !matchers[ i ]( elem, context, xml ) ) { + return false; + } + } + return true; + } : + matchers[ 0 ]; +} + +function multipleContexts( selector, contexts, results ) { + var i = 0, + len = contexts.length; + for ( ; i < len; i++ ) { + Sizzle( selector, contexts[ i ], results ); + } + return results; +} + +function condense( unmatched, map, filter, context, xml ) { + var elem, + newUnmatched = [], + i = 0, + len = unmatched.length, + mapped = map != null; + + for ( ; i < len; i++ ) { + if ( ( elem = unmatched[ i ] ) ) { + if ( !filter || filter( elem, context, xml ) ) { + newUnmatched.push( elem ); + if ( mapped ) { + map.push( i ); + } + } + } + } + + return newUnmatched; +} + +function setMatcher( preFilter, selector, matcher, postFilter, postFinder, postSelector ) { + if ( postFilter && !postFilter[ expando ] ) { + postFilter = setMatcher( postFilter ); + } + if ( postFinder && !postFinder[ expando ] ) { + postFinder = setMatcher( postFinder, postSelector ); + } + return markFunction( function( seed, results, context, xml ) { + var temp, i, elem, + preMap = [], + postMap = [], + preexisting = results.length, + + // Get initial elements from seed or context + elems = seed || multipleContexts( + selector || "*", + context.nodeType ? [ context ] : context, + [] + ), + + // Prefilter to get matcher input, preserving a map for seed-results synchronization + matcherIn = preFilter && ( seed || !selector ) ? + condense( elems, preMap, preFilter, context, xml ) : + elems, + + matcherOut = matcher ? + + // If we have a postFinder, or filtered seed, or non-seed postFilter or preexisting results, + postFinder || ( seed ? preFilter : preexisting || postFilter ) ? 
+ + // ...intermediate processing is necessary + [] : + + // ...otherwise use results directly + results : + matcherIn; + + // Find primary matches + if ( matcher ) { + matcher( matcherIn, matcherOut, context, xml ); + } + + // Apply postFilter + if ( postFilter ) { + temp = condense( matcherOut, postMap ); + postFilter( temp, [], context, xml ); + + // Un-match failing elements by moving them back to matcherIn + i = temp.length; + while ( i-- ) { + if ( ( elem = temp[ i ] ) ) { + matcherOut[ postMap[ i ] ] = !( matcherIn[ postMap[ i ] ] = elem ); + } + } + } + + if ( seed ) { + if ( postFinder || preFilter ) { + if ( postFinder ) { + + // Get the final matcherOut by condensing this intermediate into postFinder contexts + temp = []; + i = matcherOut.length; + while ( i-- ) { + if ( ( elem = matcherOut[ i ] ) ) { + + // Restore matcherIn since elem is not yet a final match + temp.push( ( matcherIn[ i ] = elem ) ); + } + } + postFinder( null, ( matcherOut = [] ), temp, xml ); + } + + // Move matched elements from seed to results to keep them synchronized + i = matcherOut.length; + while ( i-- ) { + if ( ( elem = matcherOut[ i ] ) && + ( temp = postFinder ? indexOf( seed, elem ) : preMap[ i ] ) > -1 ) { + + seed[ temp ] = !( results[ temp ] = elem ); + } + } + } + + // Add elements to results, through postFinder if defined + } else { + matcherOut = condense( + matcherOut === results ? + matcherOut.splice( preexisting, matcherOut.length ) : + matcherOut + ); + if ( postFinder ) { + postFinder( null, results, matcherOut, xml ); + } else { + push.apply( results, matcherOut ); + } + } + } ); +} + +function matcherFromTokens( tokens ) { + var checkContext, matcher, j, + len = tokens.length, + leadingRelative = Expr.relative[ tokens[ 0 ].type ], + implicitRelative = leadingRelative || Expr.relative[ " " ], + i = leadingRelative ? 
1 : 0, + + // The foundational matcher ensures that elements are reachable from top-level context(s) + matchContext = addCombinator( function( elem ) { + return elem === checkContext; + }, implicitRelative, true ), + matchAnyContext = addCombinator( function( elem ) { + return indexOf( checkContext, elem ) > -1; + }, implicitRelative, true ), + matchers = [ function( elem, context, xml ) { + var ret = ( !leadingRelative && ( xml || context !== outermostContext ) ) || ( + ( checkContext = context ).nodeType ? + matchContext( elem, context, xml ) : + matchAnyContext( elem, context, xml ) ); + + // Avoid hanging onto element (issue #299) + checkContext = null; + return ret; + } ]; + + for ( ; i < len; i++ ) { + if ( ( matcher = Expr.relative[ tokens[ i ].type ] ) ) { + matchers = [ addCombinator( elementMatcher( matchers ), matcher ) ]; + } else { + matcher = Expr.filter[ tokens[ i ].type ].apply( null, tokens[ i ].matches ); + + // Return special upon seeing a positional matcher + if ( matcher[ expando ] ) { + + // Find the next relative operator (if any) for proper handling + j = ++i; + for ( ; j < len; j++ ) { + if ( Expr.relative[ tokens[ j ].type ] ) { + break; + } + } + return setMatcher( + i > 1 && elementMatcher( matchers ), + i > 1 && toSelector( + + // If the preceding token was a descendant combinator, insert an implicit any-element `*` + tokens + .slice( 0, i - 1 ) + .concat( { value: tokens[ i - 2 ].type === " " ? 
"*" : "" } ) + ).replace( rtrim, "$1" ), + matcher, + i < j && matcherFromTokens( tokens.slice( i, j ) ), + j < len && matcherFromTokens( ( tokens = tokens.slice( j ) ) ), + j < len && toSelector( tokens ) + ); + } + matchers.push( matcher ); + } + } + + return elementMatcher( matchers ); +} + +function matcherFromGroupMatchers( elementMatchers, setMatchers ) { + var bySet = setMatchers.length > 0, + byElement = elementMatchers.length > 0, + superMatcher = function( seed, context, xml, results, outermost ) { + var elem, j, matcher, + matchedCount = 0, + i = "0", + unmatched = seed && [], + setMatched = [], + contextBackup = outermostContext, + + // We must always have either seed elements or outermost context + elems = seed || byElement && Expr.find[ "TAG" ]( "*", outermost ), + + // Use integer dirruns iff this is the outermost matcher + dirrunsUnique = ( dirruns += contextBackup == null ? 1 : Math.random() || 0.1 ), + len = elems.length; + + if ( outermost ) { + + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + outermostContext = context == document || context || outermost; + } + + // Add elements passing elementMatchers directly to results + // Support: IE<9, Safari + // Tolerate NodeList properties (IE: "length"; Safari: ) matching elements by id + for ( ; i !== len && ( elem = elems[ i ] ) != null; i++ ) { + if ( byElement && elem ) { + j = 0; + + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. 
+ // eslint-disable-next-line eqeqeq + if ( !context && elem.ownerDocument != document ) { + setDocument( elem ); + xml = !documentIsHTML; + } + while ( ( matcher = elementMatchers[ j++ ] ) ) { + if ( matcher( elem, context || document, xml ) ) { + results.push( elem ); + break; + } + } + if ( outermost ) { + dirruns = dirrunsUnique; + } + } + + // Track unmatched elements for set filters + if ( bySet ) { + + // They will have gone through all possible matchers + if ( ( elem = !matcher && elem ) ) { + matchedCount--; + } + + // Lengthen the array for every element, matched or not + if ( seed ) { + unmatched.push( elem ); + } + } + } + + // `i` is now the count of elements visited above, and adding it to `matchedCount` + // makes the latter nonnegative. + matchedCount += i; + + // Apply set filters to unmatched elements + // NOTE: This can be skipped if there are no unmatched elements (i.e., `matchedCount` + // equals `i`), unless we didn't visit _any_ elements in the above loop because we have + // no element matchers and no seed. + // Incrementing an initially-string "0" `i` allows `i` to remain a string only in that + // case, which will result in a "00" `matchedCount` that differs from `i` but is also + // numerically zero. 
+ if ( bySet && i !== matchedCount ) { + j = 0; + while ( ( matcher = setMatchers[ j++ ] ) ) { + matcher( unmatched, setMatched, context, xml ); + } + + if ( seed ) { + + // Reintegrate element matches to eliminate the need for sorting + if ( matchedCount > 0 ) { + while ( i-- ) { + if ( !( unmatched[ i ] || setMatched[ i ] ) ) { + setMatched[ i ] = pop.call( results ); + } + } + } + + // Discard index placeholder values to get only actual matches + setMatched = condense( setMatched ); + } + + // Add matches to results + push.apply( results, setMatched ); + + // Seedless set matches succeeding multiple successful matchers stipulate sorting + if ( outermost && !seed && setMatched.length > 0 && + ( matchedCount + setMatchers.length ) > 1 ) { + + Sizzle.uniqueSort( results ); + } + } + + // Override manipulation of globals by nested matchers + if ( outermost ) { + dirruns = dirrunsUnique; + outermostContext = contextBackup; + } + + return unmatched; + }; + + return bySet ? + markFunction( superMatcher ) : + superMatcher; +} + +compile = Sizzle.compile = function( selector, match /* Internal Use Only */ ) { + var i, + setMatchers = [], + elementMatchers = [], + cached = compilerCache[ selector + " " ]; + + if ( !cached ) { + + // Generate a function of recursive functions that can be used to check each element + if ( !match ) { + match = tokenize( selector ); + } + i = match.length; + while ( i-- ) { + cached = matcherFromTokens( match[ i ] ); + if ( cached[ expando ] ) { + setMatchers.push( cached ); + } else { + elementMatchers.push( cached ); + } + } + + // Cache the compiled function + cached = compilerCache( + selector, + matcherFromGroupMatchers( elementMatchers, setMatchers ) + ); + + // Save selector and tokenization + cached.selector = selector; + } + return cached; +}; + +/** + * A low-level selection function that works with Sizzle's compiled + * selector functions + * @param {String|Function} selector A selector or a pre-compiled + * selector function built 
with Sizzle.compile + * @param {Element} context + * @param {Array} [results] + * @param {Array} [seed] A set of elements to match against + */ +select = Sizzle.select = function( selector, context, results, seed ) { + var i, tokens, token, type, find, + compiled = typeof selector === "function" && selector, + match = !seed && tokenize( ( selector = compiled.selector || selector ) ); + + results = results || []; + + // Try to minimize operations if there is only one selector in the list and no seed + // (the latter of which guarantees us context) + if ( match.length === 1 ) { + + // Reduce context if the leading compound selector is an ID + tokens = match[ 0 ] = match[ 0 ].slice( 0 ); + if ( tokens.length > 2 && ( token = tokens[ 0 ] ).type === "ID" && + context.nodeType === 9 && documentIsHTML && Expr.relative[ tokens[ 1 ].type ] ) { + + context = ( Expr.find[ "ID" ]( token.matches[ 0 ] + .replace( runescape, funescape ), context ) || [] )[ 0 ]; + if ( !context ) { + return results; + + // Precompiled matchers will still verify ancestry, so step up a level + } else if ( compiled ) { + context = context.parentNode; + } + + selector = selector.slice( tokens.shift().value.length ); + } + + // Fetch a seed set for right-to-left matching + i = matchExpr[ "needsContext" ].test( selector ) ? 
0 : tokens.length; + while ( i-- ) { + token = tokens[ i ]; + + // Abort if we hit a combinator + if ( Expr.relative[ ( type = token.type ) ] ) { + break; + } + if ( ( find = Expr.find[ type ] ) ) { + + // Search, expanding context for leading sibling combinators + if ( ( seed = find( + token.matches[ 0 ].replace( runescape, funescape ), + rsibling.test( tokens[ 0 ].type ) && testContext( context.parentNode ) || + context + ) ) ) { + + // If seed is empty or no tokens remain, we can return early + tokens.splice( i, 1 ); + selector = seed.length && toSelector( tokens ); + if ( !selector ) { + push.apply( results, seed ); + return results; + } + + break; + } + } + } + } + + // Compile and execute a filtering function if one is not provided + // Provide `match` to avoid retokenization if we modified the selector above + ( compiled || compile( selector, match ) )( + seed, + context, + !documentIsHTML, + results, + !context || rsibling.test( selector ) && testContext( context.parentNode ) || context + ); + return results; +}; + +// One-time assignments + +// Sort stability +support.sortStable = expando.split( "" ).sort( sortOrder ).join( "" ) === expando; + +// Support: Chrome 14-35+ +// Always assume duplicates if they aren't passed to the comparison function +support.detectDuplicates = !!hasDuplicate; + +// Initialize against the default document +setDocument(); + +// Support: Webkit<537.32 - Safari 6.0.3/Chrome 25 (fixed in Chrome 27) +// Detached nodes confoundingly follow *each other* +support.sortDetached = assert( function( el ) { + + // Should return 1, but returns 4 (following) + return el.compareDocumentPosition( document.createElement( "fieldset" ) ) & 1; +} ); + +// Support: IE<8 +// Prevent attribute/property "interpolation" +// https://msdn.microsoft.com/en-us/library/ms536429%28VS.85%29.aspx +if ( !assert( function( el ) { + el.innerHTML = "<a href='#'></a>"; + return el.firstChild.getAttribute( "href" ) === "#"; +} ) ) { + addHandle( "type|href|height|width", function(
elem, name, isXML ) { + if ( !isXML ) { + return elem.getAttribute( name, name.toLowerCase() === "type" ? 1 : 2 ); + } + } ); +} + +// Support: IE<9 +// Use defaultValue in place of getAttribute("value") +if ( !support.attributes || !assert( function( el ) { + el.innerHTML = "<input/>"; + el.firstChild.setAttribute( "value", "" ); + return el.firstChild.getAttribute( "value" ) === ""; +} ) ) { + addHandle( "value", function( elem, _name, isXML ) { + if ( !isXML && elem.nodeName.toLowerCase() === "input" ) { + return elem.defaultValue; + } + } ); +} + +// Support: IE<9 +// Use getAttributeNode to fetch booleans when getAttribute lies +if ( !assert( function( el ) { + return el.getAttribute( "disabled" ) == null; +} ) ) { + addHandle( booleans, function( elem, name, isXML ) { + var val; + if ( !isXML ) { + return elem[ name ] === true ? name.toLowerCase() : + ( val = elem.getAttributeNode( name ) ) && val.specified ? + val.value : + null; + } + } ); +} + +return Sizzle; + +} )( window ); + + + +jQuery.find = Sizzle; +jQuery.expr = Sizzle.selectors; + +// Deprecated +jQuery.expr[ ":" ] = jQuery.expr.pseudos; +jQuery.uniqueSort = jQuery.unique = Sizzle.uniqueSort; +jQuery.text = Sizzle.getText; +jQuery.isXMLDoc = Sizzle.isXML; +jQuery.contains = Sizzle.contains; +jQuery.escapeSelector = Sizzle.escape; + + + + +var dir = function( elem, dir, until ) { + var matched = [], + truncate = until !== undefined; + + while ( ( elem = elem[ dir ] ) && elem.nodeType !== 9 ) { + if ( elem.nodeType === 1 ) { + if ( truncate && jQuery( elem ).is( until ) ) { + break; + } + matched.push( elem ); + } + } + return matched; +}; + + +var siblings = function( n, elem ) { + var matched = []; + + for ( ; n; n = n.nextSibling ) { + if ( n.nodeType === 1 && n !== elem ) { + matched.push( n ); + } + } + + return matched; +}; + + +var rneedsContext = jQuery.expr.match.needsContext; + + + +function nodeName( elem, name ) { + + return elem.nodeName && elem.nodeName.toLowerCase() === name.toLowerCase(); +
+}; +var rsingleTag = ( /^<([a-z][^\/\0>:\x20\t\r\n\f]*)[\x20\t\r\n\f]*\/?>(?:<\/\1>|)$/i ); + + + +// Implement the identical functionality for filter and not +function winnow( elements, qualifier, not ) { + if ( isFunction( qualifier ) ) { + return jQuery.grep( elements, function( elem, i ) { + return !!qualifier.call( elem, i, elem ) !== not; + } ); + } + + // Single element + if ( qualifier.nodeType ) { + return jQuery.grep( elements, function( elem ) { + return ( elem === qualifier ) !== not; + } ); + } + + // Arraylike of elements (jQuery, arguments, Array) + if ( typeof qualifier !== "string" ) { + return jQuery.grep( elements, function( elem ) { + return ( indexOf.call( qualifier, elem ) > -1 ) !== not; + } ); + } + + // Filtered directly for both simple and complex selectors + return jQuery.filter( qualifier, elements, not ); +} + +jQuery.filter = function( expr, elems, not ) { + var elem = elems[ 0 ]; + + if ( not ) { + expr = ":not(" + expr + ")"; + } + + if ( elems.length === 1 && elem.nodeType === 1 ) { + return jQuery.find.matchesSelector( elem, expr ) ? [ elem ] : []; + } + + return jQuery.find.matches( expr, jQuery.grep( elems, function( elem ) { + return elem.nodeType === 1; + } ) ); +}; + +jQuery.fn.extend( { + find: function( selector ) { + var i, ret, + len = this.length, + self = this; + + if ( typeof selector !== "string" ) { + return this.pushStack( jQuery( selector ).filter( function() { + for ( i = 0; i < len; i++ ) { + if ( jQuery.contains( self[ i ], this ) ) { + return true; + } + } + } ) ); + } + + ret = this.pushStack( [] ); + + for ( i = 0; i < len; i++ ) { + jQuery.find( selector, self[ i ], ret ); + } + + return len > 1 ? 
jQuery.uniqueSort( ret ) : ret; + }, + filter: function( selector ) { + return this.pushStack( winnow( this, selector || [], false ) ); + }, + not: function( selector ) { + return this.pushStack( winnow( this, selector || [], true ) ); + }, + is: function( selector ) { + return !!winnow( + this, + + // If this is a positional/relative selector, check membership in the returned set + // so $("p:first").is("p:last") won't return true for a doc with two "p". + typeof selector === "string" && rneedsContext.test( selector ) ? + jQuery( selector ) : + selector || [], + false + ).length; + } +} ); + + +// Initialize a jQuery object + + +// A central reference to the root jQuery(document) +var rootjQuery, + + // A simple way to check for HTML strings + // Prioritize #id over <tag> to avoid XSS via location.hash (#9521) + // Strict HTML recognition (#11290: must start with <) + // Shortcut simple #id case for speed + rquickExpr = /^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$/, + + init = jQuery.fn.init = function( selector, context, root ) { + var match, elem; + + // HANDLE: $(""), $(null), $(undefined), $(false) + if ( !selector ) { + return this; + } + + // Method init() accepts an alternate rootjQuery + // so migrate can support jQuery.sub (gh-2101) + root = root || rootjQuery; + + // Handle HTML strings + if ( typeof selector === "string" ) { + if ( selector[ 0 ] === "<" && + selector[ selector.length - 1 ] === ">" && + selector.length >= 3 ) { + + // Assume that strings that start and end with <> are HTML and skip the regex check + match = [ null, selector, null ]; + + } else { + match = rquickExpr.exec( selector ); + } + + // Match html or make sure no context is specified for #id + if ( match && ( match[ 1 ] || !context ) ) { + + // HANDLE: $(html) -> $(array) + if ( match[ 1 ] ) { + context = context instanceof jQuery ? 
context[ 0 ] : context; + + // Option to run scripts is true for back-compat + // Intentionally let the error be thrown if parseHTML is not present + jQuery.merge( this, jQuery.parseHTML( + match[ 1 ], + context && context.nodeType ? context.ownerDocument || context : document, + true + ) ); + + // HANDLE: $(html, props) + if ( rsingleTag.test( match[ 1 ] ) && jQuery.isPlainObject( context ) ) { + for ( match in context ) { + + // Properties of context are called as methods if possible + if ( isFunction( this[ match ] ) ) { + this[ match ]( context[ match ] ); + + // ...and otherwise set as attributes + } else { + this.attr( match, context[ match ] ); + } + } + } + + return this; + + // HANDLE: $(#id) + } else { + elem = document.getElementById( match[ 2 ] ); + + if ( elem ) { + + // Inject the element directly into the jQuery object + this[ 0 ] = elem; + this.length = 1; + } + return this; + } + + // HANDLE: $(expr, $(...)) + } else if ( !context || context.jquery ) { + return ( context || root ).find( selector ); + + // HANDLE: $(expr, context) + // (which is just equivalent to: $(context).find(expr) + } else { + return this.constructor( context ).find( selector ); + } + + // HANDLE: $(DOMElement) + } else if ( selector.nodeType ) { + this[ 0 ] = selector; + this.length = 1; + return this; + + // HANDLE: $(function) + // Shortcut for document ready + } else if ( isFunction( selector ) ) { + return root.ready !== undefined ? 
+ root.ready( selector ) : + + // Execute immediately if ready is not present + selector( jQuery ); + } + + return jQuery.makeArray( selector, this ); + }; + +// Give the init function the jQuery prototype for later instantiation +init.prototype = jQuery.fn; + +// Initialize central reference +rootjQuery = jQuery( document ); + + +var rparentsprev = /^(?:parents|prev(?:Until|All))/, + + // Methods guaranteed to produce a unique set when starting from a unique set + guaranteedUnique = { + children: true, + contents: true, + next: true, + prev: true + }; + +jQuery.fn.extend( { + has: function( target ) { + var targets = jQuery( target, this ), + l = targets.length; + + return this.filter( function() { + var i = 0; + for ( ; i < l; i++ ) { + if ( jQuery.contains( this, targets[ i ] ) ) { + return true; + } + } + } ); + }, + + closest: function( selectors, context ) { + var cur, + i = 0, + l = this.length, + matched = [], + targets = typeof selectors !== "string" && jQuery( selectors ); + + // Positional selectors never match, since there's no _selection_ context + if ( !rneedsContext.test( selectors ) ) { + for ( ; i < l; i++ ) { + for ( cur = this[ i ]; cur && cur !== context; cur = cur.parentNode ) { + + // Always skip document fragments + if ( cur.nodeType < 11 && ( targets ? + targets.index( cur ) > -1 : + + // Don't pass non-elements to Sizzle + cur.nodeType === 1 && + jQuery.find.matchesSelector( cur, selectors ) ) ) { + + matched.push( cur ); + break; + } + } + } + } + + return this.pushStack( matched.length > 1 ? jQuery.uniqueSort( matched ) : matched ); + }, + + // Determine the position of an element within the set + index: function( elem ) { + + // No argument, return index in parent + if ( !elem ) { + return ( this[ 0 ] && this[ 0 ].parentNode ) ? 
this.first().prevAll().length : -1; + } + + // Index in selector + if ( typeof elem === "string" ) { + return indexOf.call( jQuery( elem ), this[ 0 ] ); + } + + // Locate the position of the desired element + return indexOf.call( this, + + // If it receives a jQuery object, the first element is used + elem.jquery ? elem[ 0 ] : elem + ); + }, + + add: function( selector, context ) { + return this.pushStack( + jQuery.uniqueSort( + jQuery.merge( this.get(), jQuery( selector, context ) ) + ) + ); + }, + + addBack: function( selector ) { + return this.add( selector == null ? + this.prevObject : this.prevObject.filter( selector ) + ); + } +} ); + +function sibling( cur, dir ) { + while ( ( cur = cur[ dir ] ) && cur.nodeType !== 1 ) {} + return cur; +} + +jQuery.each( { + parent: function( elem ) { + var parent = elem.parentNode; + return parent && parent.nodeType !== 11 ? parent : null; + }, + parents: function( elem ) { + return dir( elem, "parentNode" ); + }, + parentsUntil: function( elem, _i, until ) { + return dir( elem, "parentNode", until ); + }, + next: function( elem ) { + return sibling( elem, "nextSibling" ); + }, + prev: function( elem ) { + return sibling( elem, "previousSibling" ); + }, + nextAll: function( elem ) { + return dir( elem, "nextSibling" ); + }, + prevAll: function( elem ) { + return dir( elem, "previousSibling" ); + }, + nextUntil: function( elem, _i, until ) { + return dir( elem, "nextSibling", until ); + }, + prevUntil: function( elem, _i, until ) { + return dir( elem, "previousSibling", until ); + }, + siblings: function( elem ) { + return siblings( ( elem.parentNode || {} ).firstChild, elem ); + }, + children: function( elem ) { + return siblings( elem.firstChild ); + }, + contents: function( elem ) { + if ( elem.contentDocument != null && + + // Support: IE 11+ + // <object> elements with no `data` attribute has an object + // `contentDocument` with a `null` prototype. 
+ getProto( elem.contentDocument ) ) { + + return elem.contentDocument; + } + + // Support: IE 9 - 11 only, iOS 7 only, Android Browser <=4.3 only + // Treat the template element as a regular one in browsers that + // don't support it. + if ( nodeName( elem, "template" ) ) { + elem = elem.content || elem; + } + + return jQuery.merge( [], elem.childNodes ); + } +}, function( name, fn ) { + jQuery.fn[ name ] = function( until, selector ) { + var matched = jQuery.map( this, fn, until ); + + if ( name.slice( -5 ) !== "Until" ) { + selector = until; + } + + if ( selector && typeof selector === "string" ) { + matched = jQuery.filter( selector, matched ); + } + + if ( this.length > 1 ) { + + // Remove duplicates + if ( !guaranteedUnique[ name ] ) { + jQuery.uniqueSort( matched ); + } + + // Reverse order for parents* and prev-derivatives + if ( rparentsprev.test( name ) ) { + matched.reverse(); + } + } + + return this.pushStack( matched ); + }; +} ); +var rnothtmlwhite = ( /[^\x20\t\r\n\f]+/g ); + + + +// Convert String-formatted options into Object-formatted ones +function createOptions( options ) { + var object = {}; + jQuery.each( options.match( rnothtmlwhite ) || [], function( _, flag ) { + object[ flag ] = true; + } ); + return object; +} + +/* + * Create a callback list using the following parameters: + * + * options: an optional list of space-separated options that will change how + * the callback list behaves or a more traditional option object + * + * By default a callback list will act like an event callback list and can be + * "fired" multiple times. 
+ * + * Possible options: + * + * once: will ensure the callback list can only be fired once (like a Deferred) + * + * memory: will keep track of previous values and will call any callback added + * after the list has been fired right away with the latest "memorized" + * values (like a Deferred) + * + * unique: will ensure a callback can only be added once (no duplicate in the list) + * + * stopOnFalse: interrupt callings when a callback returns false + * + */ +jQuery.Callbacks = function( options ) { + + // Convert options from String-formatted to Object-formatted if needed + // (we check in cache first) + options = typeof options === "string" ? + createOptions( options ) : + jQuery.extend( {}, options ); + + var // Flag to know if list is currently firing + firing, + + // Last fire value for non-forgettable lists + memory, + + // Flag to know if list was already fired + fired, + + // Flag to prevent firing + locked, + + // Actual callback list + list = [], + + // Queue of execution data for repeatable lists + queue = [], + + // Index of currently firing callback (modified by add/remove as needed) + firingIndex = -1, + + // Fire callbacks + fire = function() { + + // Enforce single-firing + locked = locked || options.once; + + // Execute callbacks for all pending executions, + // respecting firingIndex overrides and runtime changes + fired = firing = true; + for ( ; queue.length; firingIndex = -1 ) { + memory = queue.shift(); + while ( ++firingIndex < list.length ) { + + // Run callback and check for early termination + if ( list[ firingIndex ].apply( memory[ 0 ], memory[ 1 ] ) === false && + options.stopOnFalse ) { + + // Jump to end and forget the data so .add doesn't re-fire + firingIndex = list.length; + memory = false; + } + } + } + + // Forget the data if we're done with it + if ( !options.memory ) { + memory = false; + } + + firing = false; + + // Clean up if we're done firing for good + if ( locked ) { + + // Keep an empty list if we have data for future 
add calls + if ( memory ) { + list = []; + + // Otherwise, this object is spent + } else { + list = ""; + } + } + }, + + // Actual Callbacks object + self = { + + // Add a callback or a collection of callbacks to the list + add: function() { + if ( list ) { + + // If we have memory from a past run, we should fire after adding + if ( memory && !firing ) { + firingIndex = list.length - 1; + queue.push( memory ); + } + + ( function add( args ) { + jQuery.each( args, function( _, arg ) { + if ( isFunction( arg ) ) { + if ( !options.unique || !self.has( arg ) ) { + list.push( arg ); + } + } else if ( arg && arg.length && toType( arg ) !== "string" ) { + + // Inspect recursively + add( arg ); + } + } ); + } )( arguments ); + + if ( memory && !firing ) { + fire(); + } + } + return this; + }, + + // Remove a callback from the list + remove: function() { + jQuery.each( arguments, function( _, arg ) { + var index; + while ( ( index = jQuery.inArray( arg, list, index ) ) > -1 ) { + list.splice( index, 1 ); + + // Handle firing indexes + if ( index <= firingIndex ) { + firingIndex--; + } + } + } ); + return this; + }, + + // Check if a given callback is in the list. + // If no argument is given, return whether or not list has callbacks attached. + has: function( fn ) { + return fn ? 
+ jQuery.inArray( fn, list ) > -1 : + list.length > 0; + }, + + // Remove all callbacks from the list + empty: function() { + if ( list ) { + list = []; + } + return this; + }, + + // Disable .fire and .add + // Abort any current/pending executions + // Clear all callbacks and values + disable: function() { + locked = queue = []; + list = memory = ""; + return this; + }, + disabled: function() { + return !list; + }, + + // Disable .fire + // Also disable .add unless we have memory (since it would have no effect) + // Abort any pending executions + lock: function() { + locked = queue = []; + if ( !memory && !firing ) { + list = memory = ""; + } + return this; + }, + locked: function() { + return !!locked; + }, + + // Call all callbacks with the given context and arguments + fireWith: function( context, args ) { + if ( !locked ) { + args = args || []; + args = [ context, args.slice ? args.slice() : args ]; + queue.push( args ); + if ( !firing ) { + fire(); + } + } + return this; + }, + + // Call all the callbacks with the given arguments + fire: function() { + self.fireWith( this, arguments ); + return this; + }, + + // To know if the callbacks have already been called at least once + fired: function() { + return !!fired; + } + }; + + return self; +}; + + +function Identity( v ) { + return v; +} +function Thrower( ex ) { + throw ex; +} + +function adoptValue( value, resolve, reject, noValue ) { + var method; + + try { + + // Check for promise aspect first to privilege synchronous behavior + if ( value && isFunction( ( method = value.promise ) ) ) { + method.call( value ).done( resolve ).fail( reject ); + + // Other thenables + } else if ( value && isFunction( ( method = value.then ) ) ) { + method.call( value, resolve, reject ); + + // Other non-thenables + } else { + + // Control `resolve` arguments by letting Array#slice cast boolean `noValue` to integer: + // * false: [ value ].slice( 0 ) => resolve( value ) + // * true: [ value ].slice( 1 ) => resolve() + 
resolve.apply( undefined, [ value ].slice( noValue ) ); + } + + // For Promises/A+, convert exceptions into rejections + // Since jQuery.when doesn't unwrap thenables, we can skip the extra checks appearing in + // Deferred#then to conditionally suppress rejection. + } catch ( value ) { + + // Support: Android 4.0 only + // Strict mode functions invoked without .call/.apply get global-object context + reject.apply( undefined, [ value ] ); + } +} + +jQuery.extend( { + + Deferred: function( func ) { + var tuples = [ + + // action, add listener, callbacks, + // ... .then handlers, argument index, [final state] + [ "notify", "progress", jQuery.Callbacks( "memory" ), + jQuery.Callbacks( "memory" ), 2 ], + [ "resolve", "done", jQuery.Callbacks( "once memory" ), + jQuery.Callbacks( "once memory" ), 0, "resolved" ], + [ "reject", "fail", jQuery.Callbacks( "once memory" ), + jQuery.Callbacks( "once memory" ), 1, "rejected" ] + ], + state = "pending", + promise = { + state: function() { + return state; + }, + always: function() { + deferred.done( arguments ).fail( arguments ); + return this; + }, + "catch": function( fn ) { + return promise.then( null, fn ); + }, + + // Keep pipe for back-compat + pipe: function( /* fnDone, fnFail, fnProgress */ ) { + var fns = arguments; + + return jQuery.Deferred( function( newDefer ) { + jQuery.each( tuples, function( _i, tuple ) { + + // Map tuples (progress, done, fail) to arguments (done, fail, progress) + var fn = isFunction( fns[ tuple[ 4 ] ] ) && fns[ tuple[ 4 ] ]; + + // deferred.progress(function() { bind to newDefer or newDefer.notify }) + // deferred.done(function() { bind to newDefer or newDefer.resolve }) + // deferred.fail(function() { bind to newDefer or newDefer.reject }) + deferred[ tuple[ 1 ] ]( function() { + var returned = fn && fn.apply( this, arguments ); + if ( returned && isFunction( returned.promise ) ) { + returned.promise() + .progress( newDefer.notify ) + .done( newDefer.resolve ) + .fail( newDefer.reject ); + } 
else { + newDefer[ tuple[ 0 ] + "With" ]( + this, + fn ? [ returned ] : arguments + ); + } + } ); + } ); + fns = null; + } ).promise(); + }, + then: function( onFulfilled, onRejected, onProgress ) { + var maxDepth = 0; + function resolve( depth, deferred, handler, special ) { + return function() { + var that = this, + args = arguments, + mightThrow = function() { + var returned, then; + + // Support: Promises/A+ section 2.3.3.3.3 + // https://promisesaplus.com/#point-59 + // Ignore double-resolution attempts + if ( depth < maxDepth ) { + return; + } + + returned = handler.apply( that, args ); + + // Support: Promises/A+ section 2.3.1 + // https://promisesaplus.com/#point-48 + if ( returned === deferred.promise() ) { + throw new TypeError( "Thenable self-resolution" ); + } + + // Support: Promises/A+ sections 2.3.3.1, 3.5 + // https://promisesaplus.com/#point-54 + // https://promisesaplus.com/#point-75 + // Retrieve `then` only once + then = returned && + + // Support: Promises/A+ section 2.3.4 + // https://promisesaplus.com/#point-64 + // Only check objects and functions for thenability + ( typeof returned === "object" || + typeof returned === "function" ) && + returned.then; + + // Handle a returned thenable + if ( isFunction( then ) ) { + + // Special processors (notify) just wait for resolution + if ( special ) { + then.call( + returned, + resolve( maxDepth, deferred, Identity, special ), + resolve( maxDepth, deferred, Thrower, special ) + ); + + // Normal processors (resolve) also hook into progress + } else { + + // ...and disregard older resolution values + maxDepth++; + + then.call( + returned, + resolve( maxDepth, deferred, Identity, special ), + resolve( maxDepth, deferred, Thrower, special ), + resolve( maxDepth, deferred, Identity, + deferred.notifyWith ) + ); + } + + // Handle all other returned values + } else { + + // Only substitute handlers pass on context + // and multiple values (non-spec behavior) + if ( handler !== Identity ) { + that = 
undefined; + args = [ returned ]; + } + + // Process the value(s) + // Default process is resolve + ( special || deferred.resolveWith )( that, args ); + } + }, + + // Only normal processors (resolve) catch and reject exceptions + process = special ? + mightThrow : + function() { + try { + mightThrow(); + } catch ( e ) { + + if ( jQuery.Deferred.exceptionHook ) { + jQuery.Deferred.exceptionHook( e, + process.stackTrace ); + } + + // Support: Promises/A+ section 2.3.3.3.4.1 + // https://promisesaplus.com/#point-61 + // Ignore post-resolution exceptions + if ( depth + 1 >= maxDepth ) { + + // Only substitute handlers pass on context + // and multiple values (non-spec behavior) + if ( handler !== Thrower ) { + that = undefined; + args = [ e ]; + } + + deferred.rejectWith( that, args ); + } + } + }; + + // Support: Promises/A+ section 2.3.3.3.1 + // https://promisesaplus.com/#point-57 + // Re-resolve promises immediately to dodge false rejection from + // subsequent errors + if ( depth ) { + process(); + } else { + + // Call an optional hook to record the stack, in case of exception + // since it's otherwise lost when execution goes async + if ( jQuery.Deferred.getStackHook ) { + process.stackTrace = jQuery.Deferred.getStackHook(); + } + window.setTimeout( process ); + } + }; + } + + return jQuery.Deferred( function( newDefer ) { + + // progress_handlers.add( ... ) + tuples[ 0 ][ 3 ].add( + resolve( + 0, + newDefer, + isFunction( onProgress ) ? + onProgress : + Identity, + newDefer.notifyWith + ) + ); + + // fulfilled_handlers.add( ... ) + tuples[ 1 ][ 3 ].add( + resolve( + 0, + newDefer, + isFunction( onFulfilled ) ? + onFulfilled : + Identity + ) + ); + + // rejected_handlers.add( ... ) + tuples[ 2 ][ 3 ].add( + resolve( + 0, + newDefer, + isFunction( onRejected ) ? 
+ onRejected : + Thrower + ) + ); + } ).promise(); + }, + + // Get a promise for this deferred + // If obj is provided, the promise aspect is added to the object + promise: function( obj ) { + return obj != null ? jQuery.extend( obj, promise ) : promise; + } + }, + deferred = {}; + + // Add list-specific methods + jQuery.each( tuples, function( i, tuple ) { + var list = tuple[ 2 ], + stateString = tuple[ 5 ]; + + // promise.progress = list.add + // promise.done = list.add + // promise.fail = list.add + promise[ tuple[ 1 ] ] = list.add; + + // Handle state + if ( stateString ) { + list.add( + function() { + + // state = "resolved" (i.e., fulfilled) + // state = "rejected" + state = stateString; + }, + + // rejected_callbacks.disable + // fulfilled_callbacks.disable + tuples[ 3 - i ][ 2 ].disable, + + // rejected_handlers.disable + // fulfilled_handlers.disable + tuples[ 3 - i ][ 3 ].disable, + + // progress_callbacks.lock + tuples[ 0 ][ 2 ].lock, + + // progress_handlers.lock + tuples[ 0 ][ 3 ].lock + ); + } + + // progress_handlers.fire + // fulfilled_handlers.fire + // rejected_handlers.fire + list.add( tuple[ 3 ].fire ); + + // deferred.notify = function() { deferred.notifyWith(...) } + // deferred.resolve = function() { deferred.resolveWith(...) } + // deferred.reject = function() { deferred.rejectWith(...) } + deferred[ tuple[ 0 ] ] = function() { + deferred[ tuple[ 0 ] + "With" ]( this === deferred ? undefined : this, arguments ); + return this; + }; + + // deferred.notifyWith = list.fireWith + // deferred.resolveWith = list.fireWith + // deferred.rejectWith = list.fireWith + deferred[ tuple[ 0 ] + "With" ] = list.fireWith; + } ); + + // Make the deferred a promise + promise.promise( deferred ); + + // Call given func if any + if ( func ) { + func.call( deferred, deferred ); + } + + // All done! 
+ return deferred; + }, + + // Deferred helper + when: function( singleValue ) { + var + + // count of uncompleted subordinates + remaining = arguments.length, + + // count of unprocessed arguments + i = remaining, + + // subordinate fulfillment data + resolveContexts = Array( i ), + resolveValues = slice.call( arguments ), + + // the master Deferred + master = jQuery.Deferred(), + + // subordinate callback factory + updateFunc = function( i ) { + return function( value ) { + resolveContexts[ i ] = this; + resolveValues[ i ] = arguments.length > 1 ? slice.call( arguments ) : value; + if ( !( --remaining ) ) { + master.resolveWith( resolveContexts, resolveValues ); + } + }; + }; + + // Single- and empty arguments are adopted like Promise.resolve + if ( remaining <= 1 ) { + adoptValue( singleValue, master.done( updateFunc( i ) ).resolve, master.reject, + !remaining ); + + // Use .then() to unwrap secondary thenables (cf. gh-3000) + if ( master.state() === "pending" || + isFunction( resolveValues[ i ] && resolveValues[ i ].then ) ) { + + return master.then(); + } + } + + // Multiple arguments are aggregated like Promise.all array elements + while ( i-- ) { + adoptValue( resolveValues[ i ], updateFunc( i ), master.reject ); + } + + return master.promise(); + } +} ); + + +// These usually indicate a programmer mistake during development, +// warn about them ASAP rather than swallowing them by default. 
+var rerrorNames = /^(Eval|Internal|Range|Reference|Syntax|Type|URI)Error$/; + +jQuery.Deferred.exceptionHook = function( error, stack ) { + + // Support: IE 8 - 9 only + // Console exists when dev tools are open, which can happen at any time + if ( window.console && window.console.warn && error && rerrorNames.test( error.name ) ) { + window.console.warn( "jQuery.Deferred exception: " + error.message, error.stack, stack ); + } +}; + + + + +jQuery.readyException = function( error ) { + window.setTimeout( function() { + throw error; + } ); +}; + + + + +// The deferred used on DOM ready +var readyList = jQuery.Deferred(); + +jQuery.fn.ready = function( fn ) { + + readyList + .then( fn ) + + // Wrap jQuery.readyException in a function so that the lookup + // happens at the time of error handling instead of callback + // registration. + .catch( function( error ) { + jQuery.readyException( error ); + } ); + + return this; +}; + +jQuery.extend( { + + // Is the DOM ready to be used? Set to true once it occurs. + isReady: false, + + // A counter to track how many items to wait for before + // the ready event fires. See #6781 + readyWait: 1, + + // Handle when the DOM is ready + ready: function( wait ) { + + // Abort if there are pending holds or we're already ready + if ( wait === true ? 
--jQuery.readyWait : jQuery.isReady ) { + return; + } + + // Remember that the DOM is ready + jQuery.isReady = true; + + // If a normal DOM Ready event fired, decrement, and wait if need be + if ( wait !== true && --jQuery.readyWait > 0 ) { + return; + } + + // If there are functions bound, to execute + readyList.resolveWith( document, [ jQuery ] ); + } +} ); + +jQuery.ready.then = readyList.then; + +// The ready event handler and self cleanup method +function completed() { + document.removeEventListener( "DOMContentLoaded", completed ); + window.removeEventListener( "load", completed ); + jQuery.ready(); +} + +// Catch cases where $(document).ready() is called +// after the browser event has already occurred. +// Support: IE <=9 - 10 only +// Older IE sometimes signals "interactive" too soon +if ( document.readyState === "complete" || + ( document.readyState !== "loading" && !document.documentElement.doScroll ) ) { + + // Handle it asynchronously to allow scripts the opportunity to delay ready + window.setTimeout( jQuery.ready ); + +} else { + + // Use the handy event callback + document.addEventListener( "DOMContentLoaded", completed ); + + // A fallback to window.onload, that will always work + window.addEventListener( "load", completed ); +} + + + + +// Multifunctional method to get and set values of a collection +// The value/s can optionally be executed if it's a function +var access = function( elems, fn, key, value, chainable, emptyGet, raw ) { + var i = 0, + len = elems.length, + bulk = key == null; + + // Sets many values + if ( toType( key ) === "object" ) { + chainable = true; + for ( i in key ) { + access( elems, fn, i, key[ i ], true, emptyGet, raw ); + } + + // Sets one value + } else if ( value !== undefined ) { + chainable = true; + + if ( !isFunction( value ) ) { + raw = true; + } + + if ( bulk ) { + + // Bulk operations run against the entire set + if ( raw ) { + fn.call( elems, value ); + fn = null; + + // ...except when executing function 
values + } else { + bulk = fn; + fn = function( elem, _key, value ) { + return bulk.call( jQuery( elem ), value ); + }; + } + } + + if ( fn ) { + for ( ; i < len; i++ ) { + fn( + elems[ i ], key, raw ? + value : + value.call( elems[ i ], i, fn( elems[ i ], key ) ) + ); + } + } + } + + if ( chainable ) { + return elems; + } + + // Gets + if ( bulk ) { + return fn.call( elems ); + } + + return len ? fn( elems[ 0 ], key ) : emptyGet; +}; + + +// Matches dashed string for camelizing +var rmsPrefix = /^-ms-/, + rdashAlpha = /-([a-z])/g; + +// Used by camelCase as callback to replace() +function fcamelCase( _all, letter ) { + return letter.toUpperCase(); +} + +// Convert dashed to camelCase; used by the css and data modules +// Support: IE <=9 - 11, Edge 12 - 15 +// Microsoft forgot to hump their vendor prefix (#9572) +function camelCase( string ) { + return string.replace( rmsPrefix, "ms-" ).replace( rdashAlpha, fcamelCase ); +} +var acceptData = function( owner ) { + + // Accepts only: + // - Node + // - Node.ELEMENT_NODE + // - Node.DOCUMENT_NODE + // - Object + // - Any + return owner.nodeType === 1 || owner.nodeType === 9 || !( +owner.nodeType ); +}; + + + + +function Data() { + this.expando = jQuery.expando + Data.uid++; +} + +Data.uid = 1; + +Data.prototype = { + + cache: function( owner ) { + + // Check if the owner object already has a cache + var value = owner[ this.expando ]; + + // If not, create one + if ( !value ) { + value = {}; + + // We can accept data for non-element nodes in modern browsers, + // but we should not, see #8335. + // Always return an empty object. 
+ if ( acceptData( owner ) ) { + + // If it is a node unlikely to be stringify-ed or looped over + // use plain assignment + if ( owner.nodeType ) { + owner[ this.expando ] = value; + + // Otherwise secure it in a non-enumerable property + // configurable must be true to allow the property to be + // deleted when data is removed + } else { + Object.defineProperty( owner, this.expando, { + value: value, + configurable: true + } ); + } + } + } + + return value; + }, + set: function( owner, data, value ) { + var prop, + cache = this.cache( owner ); + + // Handle: [ owner, key, value ] args + // Always use camelCase key (gh-2257) + if ( typeof data === "string" ) { + cache[ camelCase( data ) ] = value; + + // Handle: [ owner, { properties } ] args + } else { + + // Copy the properties one-by-one to the cache object + for ( prop in data ) { + cache[ camelCase( prop ) ] = data[ prop ]; + } + } + return cache; + }, + get: function( owner, key ) { + return key === undefined ? + this.cache( owner ) : + + // Always use camelCase key (gh-2257) + owner[ this.expando ] && owner[ this.expando ][ camelCase( key ) ]; + }, + access: function( owner, key, value ) { + + // In cases where either: + // + // 1. No key was specified + // 2. A string key was specified, but no value provided + // + // Take the "read" path and allow the get method to determine + // which value to return, respectively either: + // + // 1. The entire cache object + // 2. The data stored at the key + // + if ( key === undefined || + ( ( key && typeof key === "string" ) && value === undefined ) ) { + + return this.get( owner, key ); + } + + // When the key is not a string, or both a key and value + // are specified, set or extend (existing objects) with either: + // + // 1. An object of properties + // 2. 
A key and value + // + this.set( owner, key, value ); + + // Since the "set" path can have two possible entry points + // return the expected data based on which path was taken[*] + return value !== undefined ? value : key; + }, + remove: function( owner, key ) { + var i, + cache = owner[ this.expando ]; + + if ( cache === undefined ) { + return; + } + + if ( key !== undefined ) { + + // Support array or space separated string of keys + if ( Array.isArray( key ) ) { + + // If key is an array of keys... + // We always set camelCase keys, so remove that. + key = key.map( camelCase ); + } else { + key = camelCase( key ); + + // If a key with the spaces exists, use it. + // Otherwise, create an array by matching non-whitespace + key = key in cache ? + [ key ] : + ( key.match( rnothtmlwhite ) || [] ); + } + + i = key.length; + + while ( i-- ) { + delete cache[ key[ i ] ]; + } + } + + // Remove the expando if there's no more data + if ( key === undefined || jQuery.isEmptyObject( cache ) ) { + + // Support: Chrome <=35 - 45 + // Webkit & Blink performance suffers when deleting properties + // from DOM nodes, so set to undefined instead + // https://bugs.chromium.org/p/chromium/issues/detail?id=378607 (bug restricted) + if ( owner.nodeType ) { + owner[ this.expando ] = undefined; + } else { + delete owner[ this.expando ]; + } + } + }, + hasData: function( owner ) { + var cache = owner[ this.expando ]; + return cache !== undefined && !jQuery.isEmptyObject( cache ); + } +}; +var dataPriv = new Data(); + +var dataUser = new Data(); + + + +// Implementation Summary +// +// 1. Enforce API surface and semantic compatibility with 1.9.x branch +// 2. Improve the module's maintainability by reducing the storage +// paths to a single mechanism. +// 3. Use the same single mechanism to support "private" and "user" data. +// 4. _Never_ expose "private" data to user code (TODO: Drop _data, _removeData) +// 5. Avoid exposing implementation details on user objects (eg. 
expando properties) +// 6. Provide a clear path for implementation upgrade to WeakMap in 2014 + +var rbrace = /^(?:\{[\w\W]*\}|\[[\w\W]*\])$/, + rmultiDash = /[A-Z]/g; + +function getData( data ) { + if ( data === "true" ) { + return true; + } + + if ( data === "false" ) { + return false; + } + + if ( data === "null" ) { + return null; + } + + // Only convert to a number if it doesn't change the string + if ( data === +data + "" ) { + return +data; + } + + if ( rbrace.test( data ) ) { + return JSON.parse( data ); + } + + return data; +} + +function dataAttr( elem, key, data ) { + var name; + + // If nothing was found internally, try to fetch any + // data from the HTML5 data-* attribute + if ( data === undefined && elem.nodeType === 1 ) { + name = "data-" + key.replace( rmultiDash, "-$&" ).toLowerCase(); + data = elem.getAttribute( name ); + + if ( typeof data === "string" ) { + try { + data = getData( data ); + } catch ( e ) {} + + // Make sure we set the data so it isn't changed later + dataUser.set( elem, key, data ); + } else { + data = undefined; + } + } + return data; +} + +jQuery.extend( { + hasData: function( elem ) { + return dataUser.hasData( elem ) || dataPriv.hasData( elem ); + }, + + data: function( elem, name, data ) { + return dataUser.access( elem, name, data ); + }, + + removeData: function( elem, name ) { + dataUser.remove( elem, name ); + }, + + // TODO: Now that all calls to _data and _removeData have been replaced + // with direct calls to dataPriv methods, these can be deprecated. 
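The `data-*` coercion rules implemented by `getData` above can be exercised on their own. This sketch copies the function (and its `rbrace` guard) out of the module so its behavior is checkable outside jQuery; it is a standalone copy for illustration, not a public jQuery API:

```javascript
// Standalone copy of the data-* attribute coercion logic from getData above.
var rbrace = /^(?:\{[\w\W]*\}|\[[\w\W]*\])$/;

function getData( data ) {
	if ( data === "true" ) {
		return true;
	}
	if ( data === "false" ) {
		return false;
	}
	if ( data === "null" ) {
		return null;
	}

	// Only convert to a number if it doesn't change the string,
	// so "007" stays a string while "42" becomes a number
	if ( data === +data + "" ) {
		return +data;
	}

	// Strings that look like JSON objects/arrays are parsed
	if ( rbrace.test( data ) ) {
		return JSON.parse( data );
	}

	return data;
}
```

This is why `$( el ).data( "count" )` on `data-count="42"` yields a number while `data-id="007"` stays a string: numeric conversion is applied only when it round-trips losslessly.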
+ _data: function( elem, name, data ) { + return dataPriv.access( elem, name, data ); + }, + + _removeData: function( elem, name ) { + dataPriv.remove( elem, name ); + } +} ); + +jQuery.fn.extend( { + data: function( key, value ) { + var i, name, data, + elem = this[ 0 ], + attrs = elem && elem.attributes; + + // Gets all values + if ( key === undefined ) { + if ( this.length ) { + data = dataUser.get( elem ); + + if ( elem.nodeType === 1 && !dataPriv.get( elem, "hasDataAttrs" ) ) { + i = attrs.length; + while ( i-- ) { + + // Support: IE 11 only + // The attrs elements can be null (#14894) + if ( attrs[ i ] ) { + name = attrs[ i ].name; + if ( name.indexOf( "data-" ) === 0 ) { + name = camelCase( name.slice( 5 ) ); + dataAttr( elem, name, data[ name ] ); + } + } + } + dataPriv.set( elem, "hasDataAttrs", true ); + } + } + + return data; + } + + // Sets multiple values + if ( typeof key === "object" ) { + return this.each( function() { + dataUser.set( this, key ); + } ); + } + + return access( this, function( value ) { + var data; + + // The calling jQuery object (element matches) is not empty + // (and therefore has an element appears at this[ 0 ]) and the + // `value` parameter was not undefined. An empty jQuery object + // will result in `undefined` for elem = this[ 0 ] which will + // throw an exception if an attempt to read a data cache is made. + if ( elem && value === undefined ) { + + // Attempt to get data from the cache + // The key will always be camelCased in Data + data = dataUser.get( elem, key ); + if ( data !== undefined ) { + return data; + } + + // Attempt to "discover" the data in + // HTML5 custom data-* attrs + data = dataAttr( elem, key ); + if ( data !== undefined ) { + return data; + } + + // We tried really hard, but the data doesn't exist. + return; + } + + // Set the data... 
+ this.each( function() { + + // We always store the camelCased key + dataUser.set( this, key, value ); + } ); + }, null, value, arguments.length > 1, null, true ); + }, + + removeData: function( key ) { + return this.each( function() { + dataUser.remove( this, key ); + } ); + } +} ); + + +jQuery.extend( { + queue: function( elem, type, data ) { + var queue; + + if ( elem ) { + type = ( type || "fx" ) + "queue"; + queue = dataPriv.get( elem, type ); + + // Speed up dequeue by getting out quickly if this is just a lookup + if ( data ) { + if ( !queue || Array.isArray( data ) ) { + queue = dataPriv.access( elem, type, jQuery.makeArray( data ) ); + } else { + queue.push( data ); + } + } + return queue || []; + } + }, + + dequeue: function( elem, type ) { + type = type || "fx"; + + var queue = jQuery.queue( elem, type ), + startLength = queue.length, + fn = queue.shift(), + hooks = jQuery._queueHooks( elem, type ), + next = function() { + jQuery.dequeue( elem, type ); + }; + + // If the fx queue is dequeued, always remove the progress sentinel + if ( fn === "inprogress" ) { + fn = queue.shift(); + startLength--; + } + + if ( fn ) { + + // Add a progress sentinel to prevent the fx queue from being + // automatically dequeued + if ( type === "fx" ) { + queue.unshift( "inprogress" ); + } + + // Clear up the last queue stop function + delete hooks.stop; + fn.call( elem, next, hooks ); + } + + if ( !startLength && hooks ) { + hooks.empty.fire(); + } + }, + + // Not public - generate a queueHooks object, or return the current one + _queueHooks: function( elem, type ) { + var key = type + "queueHooks"; + return dataPriv.get( elem, key ) || dataPriv.access( elem, key, { + empty: jQuery.Callbacks( "once memory" ).add( function() { + dataPriv.remove( elem, [ type + "queue", key ] ); + } ) + } ); + } +} ); + +jQuery.fn.extend( { + queue: function( type, data ) { + var setter = 2; + + if ( typeof type !== "string" ) { + data = type; + type = "fx"; + setter--; + } + + if ( 
arguments.length < setter ) { + return jQuery.queue( this[ 0 ], type ); + } + + return data === undefined ? + this : + this.each( function() { + var queue = jQuery.queue( this, type, data ); + + // Ensure a hooks for this queue + jQuery._queueHooks( this, type ); + + if ( type === "fx" && queue[ 0 ] !== "inprogress" ) { + jQuery.dequeue( this, type ); + } + } ); + }, + dequeue: function( type ) { + return this.each( function() { + jQuery.dequeue( this, type ); + } ); + }, + clearQueue: function( type ) { + return this.queue( type || "fx", [] ); + }, + + // Get a promise resolved when queues of a certain type + // are emptied (fx is the type by default) + promise: function( type, obj ) { + var tmp, + count = 1, + defer = jQuery.Deferred(), + elements = this, + i = this.length, + resolve = function() { + if ( !( --count ) ) { + defer.resolveWith( elements, [ elements ] ); + } + }; + + if ( typeof type !== "string" ) { + obj = type; + type = undefined; + } + type = type || "fx"; + + while ( i-- ) { + tmp = dataPriv.get( elements[ i ], type + "queueHooks" ); + if ( tmp && tmp.empty ) { + count++; + tmp.empty.add( resolve ); + } + } + resolve(); + return defer.promise( obj ); + } +} ); +var pnum = ( /[+-]?(?:\d*\.|)\d+(?:[eE][+-]?\d+|)/ ).source; + +var rcssNum = new RegExp( "^(?:([+-])=|)(" + pnum + ")([a-z%]*)$", "i" ); + + +var cssExpand = [ "Top", "Right", "Bottom", "Left" ]; + +var documentElement = document.documentElement; + + + + var isAttached = function( elem ) { + return jQuery.contains( elem.ownerDocument, elem ); + }, + composed = { composed: true }; + + // Support: IE 9 - 11+, Edge 12 - 18+, iOS 10.0 - 10.2 only + // Check attachment across shadow DOM boundaries when possible (gh-3504) + // Support: iOS 10.0-10.2 only + // Early iOS 10 versions support `attachShadow` but not `getRootNode`, + // leading to errors. We need to check for `getRootNode`. 
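The `"inprogress"` sentinel pattern used by `jQuery.queue`/`jQuery.dequeue` above can be illustrated with a stripped-down sketch. `makeQueue` is a hypothetical name and this version omits the data store, hooks, and type handling; it only shows how the sentinel keeps each queued function in control of when the next one runs:

```javascript
// Minimal sketch of the fx-queue sentinel idea (simplified, hypothetical API):
// each queued fn receives a `next` callback; "inprogress" at the head of the
// queue marks that a fn is running and blocks re-entrant dequeuing.
function makeQueue() {
	var queue = [];
	return {
		add: function( fn ) {
			queue.push( fn );
		},
		dequeue: function() {
			var fn = queue.shift();

			// Skip the progress sentinel left by a previous dequeue
			if ( fn === "inprogress" ) {
				fn = queue.shift();
			}

			if ( fn ) {

				// Add the sentinel so nothing else dequeues until `next` fires
				queue.unshift( "inprogress" );
				fn( this.dequeue.bind( this ) );
			}
		}
	};
}
```

As in jQuery's fx queue, an animation-like step that never calls `next` simply stalls the queue, which is the intended behavior.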
+ if ( documentElement.getRootNode ) { + isAttached = function( elem ) { + return jQuery.contains( elem.ownerDocument, elem ) || + elem.getRootNode( composed ) === elem.ownerDocument; + }; + } +var isHiddenWithinTree = function( elem, el ) { + + // isHiddenWithinTree might be called from jQuery#filter function; + // in that case, element will be second argument + elem = el || elem; + + // Inline style trumps all + return elem.style.display === "none" || + elem.style.display === "" && + + // Otherwise, check computed style + // Support: Firefox <=43 - 45 + // Disconnected elements can have computed display: none, so first confirm that elem is + // in the document. + isAttached( elem ) && + + jQuery.css( elem, "display" ) === "none"; + }; + + + +function adjustCSS( elem, prop, valueParts, tween ) { + var adjusted, scale, + maxIterations = 20, + currentValue = tween ? + function() { + return tween.cur(); + } : + function() { + return jQuery.css( elem, prop, "" ); + }, + initial = currentValue(), + unit = valueParts && valueParts[ 3 ] || ( jQuery.cssNumber[ prop ] ? "" : "px" ), + + // Starting value computation is required for potential unit mismatches + initialInUnit = elem.nodeType && + ( jQuery.cssNumber[ prop ] || unit !== "px" && +initial ) && + rcssNum.exec( jQuery.css( elem, prop ) ); + + if ( initialInUnit && initialInUnit[ 3 ] !== unit ) { + + // Support: Firefox <=54 + // Halve the iteration target value to prevent interference from CSS upper bounds (gh-2144) + initial = initial / 2; + + // Trust units reported by jQuery.css + unit = unit || initialInUnit[ 3 ]; + + // Iteratively approximate from a nonzero starting point + initialInUnit = +initial || 1; + + while ( maxIterations-- ) { + + // Evaluate and update our best guess (doubling guesses that zero out). + // Finish if the scale equals or crosses 1 (making the old*new product non-positive). 
+ jQuery.style( elem, prop, initialInUnit + unit ); + if ( ( 1 - scale ) * ( 1 - ( scale = currentValue() / initial || 0.5 ) ) <= 0 ) { + maxIterations = 0; + } + initialInUnit = initialInUnit / scale; + + } + + initialInUnit = initialInUnit * 2; + jQuery.style( elem, prop, initialInUnit + unit ); + + // Make sure we update the tween properties later on + valueParts = valueParts || []; + } + + if ( valueParts ) { + initialInUnit = +initialInUnit || +initial || 0; + + // Apply relative offset (+=/-=) if specified + adjusted = valueParts[ 1 ] ? + initialInUnit + ( valueParts[ 1 ] + 1 ) * valueParts[ 2 ] : + +valueParts[ 2 ]; + if ( tween ) { + tween.unit = unit; + tween.start = initialInUnit; + tween.end = adjusted; + } + } + return adjusted; +} + + +var defaultDisplayMap = {}; + +function getDefaultDisplay( elem ) { + var temp, + doc = elem.ownerDocument, + nodeName = elem.nodeName, + display = defaultDisplayMap[ nodeName ]; + + if ( display ) { + return display; + } + + temp = doc.body.appendChild( doc.createElement( nodeName ) ); + display = jQuery.css( temp, "display" ); + + temp.parentNode.removeChild( temp ); + + if ( display === "none" ) { + display = "block"; + } + defaultDisplayMap[ nodeName ] = display; + + return display; +} + +function showHide( elements, show ) { + var display, elem, + values = [], + index = 0, + length = elements.length; + + // Determine new display value for elements that need to change + for ( ; index < length; index++ ) { + elem = elements[ index ]; + if ( !elem.style ) { + continue; + } + + display = elem.style.display; + if ( show ) { + + // Since we force visibility upon cascade-hidden elements, an immediate (and slow) + // check is required in this first loop unless we have a nonempty display value (either + // inline or about-to-be-restored) + if ( display === "none" ) { + values[ index ] = dataPriv.get( elem, "display" ) || null; + if ( !values[ index ] ) { + elem.style.display = ""; + } + } + if ( elem.style.display === "" && 
isHiddenWithinTree( elem ) ) {
+				values[ index ] = getDefaultDisplay( elem );
+			}
+		} else {
+			if ( display !== "none" ) {
+				values[ index ] = "none";
+
+				// Remember what we're overwriting
+				dataPriv.set( elem, "display", display );
+			}
+		}
+	}
+
+	// Set the display of the elements in a second loop to avoid constant reflow
+	for ( index = 0; index < length; index++ ) {
+		if ( values[ index ] != null ) {
+			elements[ index ].style.display = values[ index ];
+		}
+	}
+
+	return elements;
+}
+
+jQuery.fn.extend( {
+	show: function() {
+		return showHide( this, true );
+	},
+	hide: function() {
+		return showHide( this );
+	},
+	toggle: function( state ) {
+		if ( typeof state === "boolean" ) {
+			return state ? this.show() : this.hide();
+		}
+
+		return this.each( function() {
+			if ( isHiddenWithinTree( this ) ) {
+				jQuery( this ).show();
+			} else {
+				jQuery( this ).hide();
+			}
+		} );
+	}
+} );
+var rcheckableType = ( /^(?:checkbox|radio)$/i );
+
+var rtagName = ( /<([a-z][^\/\0>\x20\t\r\n\f]*)/i );
+
+var rscriptType = ( /^$|^module$|\/(?:java|ecma)script/i );
+
+
+
+( function() {
+	var fragment = document.createDocumentFragment(),
+		div = fragment.appendChild( document.createElement( "div" ) ),
+		input = document.createElement( "input" );
+
+	// Support: Android 4.0 - 4.3 only
+	// Check state lost if the name is set (#11217)
+	// Support: Windows Web Apps (WWA)
+	// `name` and `type` must use .setAttribute for WWA (#14901)
+	input.setAttribute( "type", "radio" );
+	input.setAttribute( "checked", "checked" );
+	input.setAttribute( "name", "t" );
+
+	div.appendChild( input );
+
+	// Support: Android <=4.1 only
+	// Older WebKit doesn't clone checked state correctly in fragments
+	support.checkClone = div.cloneNode( true ).cloneNode( true ).lastChild.checked;
+
+	// Support: IE <=11 only
+	// Make sure textarea (and checkbox) defaultValue is properly cloned
+	div.innerHTML = "<textarea>x</textarea>";
+	support.noCloneChecked = !!div.cloneNode( true ).lastChild.defaultValue;
+
+	// Support: IE <=9 only
+	// IE <=9 replaces <option> tags with their contents when inserted outside
+	// the select element.
+	div.innerHTML = "<option></option>";
+	support.option = !!div.lastChild;
+} )();
+
+
+// We have to close these tags to support XHTML (#13200)
+var wrapMap = {
+
+	// XHTML parsers do not magically insert elements in the
+	// same way that tag soup parsers do. So we cannot shorten
+	// this by omitting <tbody> or other required elements.
+	thead: [ 1, "<table>", "</table>" ],
+	col: [ 2, "<table><colgroup>", "</colgroup></table>" ],
+	tr: [ 2, "<table><tbody>", "</tbody></table>" ],
+	td: [ 3, "<table><tbody><tr>", "</tr></tbody></table>" ],
+
+	_default: [ 0, "", "" ]
+};
+
+wrapMap.tbody = wrapMap.tfoot = wrapMap.colgroup = wrapMap.caption = wrapMap.thead;
+wrapMap.th = wrapMap.td;
+
+// Support: IE <=9 only
+if ( !support.option ) {
+	wrapMap.optgroup = wrapMap.option = [ 1, "<select multiple='multiple'>", "</select>" ];
+}
+
+
+function getAll( context, tag ) {
+
+	// Support: IE <=9 - 11 only
+	// Use typeof to avoid zero-argument method invocation on host objects (#15151)
+	var ret;
+
+	if ( typeof context.getElementsByTagName !== "undefined" ) {
+		ret = context.getElementsByTagName( tag || "*" );
+
+	} else if ( typeof context.querySelectorAll !== "undefined" ) {
+		ret = context.querySelectorAll( tag || "*" );
+
+	} else {
+		ret = [];
+	}
+
+	if ( tag === undefined || tag && nodeName( context, tag ) ) {
+		return jQuery.merge( [ context ], ret );
+	}
+
+	return ret;
+}
+
+
+// Mark scripts as having already been evaluated
+function setGlobalEval( elems, refElements ) {
+	var i = 0,
+		l = elems.length;
+
+	for ( ; i < l; i++ ) {
+		dataPriv.set(
+			elems[ i ],
+			"globalEval",
+			!refElements || dataPriv.get( refElements[ i ], "globalEval" )
+		);
+	}
+}
+
+
+var rhtml = /<|&#?\w+;/;
+
+function buildFragment( elems, context, scripts, selection, ignored ) {
+	var elem, tmp, tag, wrap, attached, j,
+		fragment = context.createDocumentFragment(),
+		nodes = [],
+		i = 0,
+		l = elems.length;
+
+	for ( ; i < l; i++ ) {
+		elem = elems[ i ];
+
+		if ( elem || elem === 0 ) {
+
+			// Add nodes directly
+			if ( toType( elem ) === "object" ) {
+
+				// Support: Android <=4.0 only, PhantomJS 1 only
+				// push.apply(_, arraylike) throws on ancient WebKit
+				jQuery.merge( nodes, elem.nodeType ?

[ elem ] : elem ); + + // Convert non-html into a text node + } else if ( !rhtml.test( elem ) ) { + nodes.push( context.createTextNode( elem ) ); + + // Convert html into DOM nodes + } else { + tmp = tmp || fragment.appendChild( context.createElement( "div" ) ); + + // Deserialize a standard representation + tag = ( rtagName.exec( elem ) || [ "", "" ] )[ 1 ].toLowerCase(); + wrap = wrapMap[ tag ] || wrapMap._default; + tmp.innerHTML = wrap[ 1 ] + jQuery.htmlPrefilter( elem ) + wrap[ 2 ]; + + // Descend through wrappers to the right content + j = wrap[ 0 ]; + while ( j-- ) { + tmp = tmp.lastChild; + } + + // Support: Android <=4.0 only, PhantomJS 1 only + // push.apply(_, arraylike) throws on ancient WebKit + jQuery.merge( nodes, tmp.childNodes ); + + // Remember the top-level container + tmp = fragment.firstChild; + + // Ensure the created nodes are orphaned (#12392) + tmp.textContent = ""; + } + } + } + + // Remove wrapper from fragment + fragment.textContent = ""; + + i = 0; + while ( ( elem = nodes[ i++ ] ) ) { + + // Skip elements already in the context collection (trac-4087) + if ( selection && jQuery.inArray( elem, selection ) > -1 ) { + if ( ignored ) { + ignored.push( elem ); + } + continue; + } + + attached = isAttached( elem ); + + // Append to fragment + tmp = getAll( fragment.appendChild( elem ), "script" ); + + // Preserve script evaluation history + if ( attached ) { + setGlobalEval( tmp ); + } + + // Capture executables + if ( scripts ) { + j = 0; + while ( ( elem = tmp[ j++ ] ) ) { + if ( rscriptType.test( elem.type || "" ) ) { + scripts.push( elem ); + } + } + } + } + + return fragment; +} + + +var + rkeyEvent = /^key/, + rmouseEvent = /^(?:mouse|pointer|contextmenu|drag|drop)|click/, + rtypenamespace = /^([^.]*)(?:\.(.+)|)/; + +function returnTrue() { + return true; +} + +function returnFalse() { + return false; +} + +// Support: IE <=9 - 11+ +// focus() and blur() are asynchronous, except when they are no-op. 
+// So expect focus to be synchronous when the element is already active, +// and blur to be synchronous when the element is not already active. +// (focus and blur are always synchronous in other supported browsers, +// this just defines when we can count on it). +function expectSync( elem, type ) { + return ( elem === safeActiveElement() ) === ( type === "focus" ); +} + +// Support: IE <=9 only +// Accessing document.activeElement can throw unexpectedly +// https://bugs.jquery.com/ticket/13393 +function safeActiveElement() { + try { + return document.activeElement; + } catch ( err ) { } +} + +function on( elem, types, selector, data, fn, one ) { + var origFn, type; + + // Types can be a map of types/handlers + if ( typeof types === "object" ) { + + // ( types-Object, selector, data ) + if ( typeof selector !== "string" ) { + + // ( types-Object, data ) + data = data || selector; + selector = undefined; + } + for ( type in types ) { + on( elem, type, selector, data, types[ type ], one ); + } + return elem; + } + + if ( data == null && fn == null ) { + + // ( types, fn ) + fn = selector; + data = selector = undefined; + } else if ( fn == null ) { + if ( typeof selector === "string" ) { + + // ( types, selector, fn ) + fn = data; + data = undefined; + } else { + + // ( types, data, fn ) + fn = data; + data = selector; + selector = undefined; + } + } + if ( fn === false ) { + fn = returnFalse; + } else if ( !fn ) { + return elem; + } + + if ( one === 1 ) { + origFn = fn; + fn = function( event ) { + + // Can use an empty set, since event contains the info + jQuery().off( event ); + return origFn.apply( this, arguments ); + }; + + // Use same guid so caller can remove using origFn + fn.guid = origFn.guid || ( origFn.guid = jQuery.guid++ ); + } + return elem.each( function() { + jQuery.event.add( this, types, fn, data, selector ); + } ); +} + +/* + * Helper functions for managing events -- not part of the public interface. 
+ * Props to Dean Edwards' addEvent library for many of the ideas. + */ +jQuery.event = { + + global: {}, + + add: function( elem, types, handler, data, selector ) { + + var handleObjIn, eventHandle, tmp, + events, t, handleObj, + special, handlers, type, namespaces, origType, + elemData = dataPriv.get( elem ); + + // Only attach events to objects that accept data + if ( !acceptData( elem ) ) { + return; + } + + // Caller can pass in an object of custom data in lieu of the handler + if ( handler.handler ) { + handleObjIn = handler; + handler = handleObjIn.handler; + selector = handleObjIn.selector; + } + + // Ensure that invalid selectors throw exceptions at attach time + // Evaluate against documentElement in case elem is a non-element node (e.g., document) + if ( selector ) { + jQuery.find.matchesSelector( documentElement, selector ); + } + + // Make sure that the handler has a unique ID, used to find/remove it later + if ( !handler.guid ) { + handler.guid = jQuery.guid++; + } + + // Init the element's event structure and main handler, if this is the first + if ( !( events = elemData.events ) ) { + events = elemData.events = Object.create( null ); + } + if ( !( eventHandle = elemData.handle ) ) { + eventHandle = elemData.handle = function( e ) { + + // Discard the second event of a jQuery.event.trigger() and + // when an event is called after a page has unloaded + return typeof jQuery !== "undefined" && jQuery.event.triggered !== e.type ? + jQuery.event.dispatch.apply( elem, arguments ) : undefined; + }; + } + + // Handle multiple events separated by a space + types = ( types || "" ).match( rnothtmlwhite ) || [ "" ]; + t = types.length; + while ( t-- ) { + tmp = rtypenamespace.exec( types[ t ] ) || []; + type = origType = tmp[ 1 ]; + namespaces = ( tmp[ 2 ] || "" ).split( "." 
).sort(); + + // There *must* be a type, no attaching namespace-only handlers + if ( !type ) { + continue; + } + + // If event changes its type, use the special event handlers for the changed type + special = jQuery.event.special[ type ] || {}; + + // If selector defined, determine special event api type, otherwise given type + type = ( selector ? special.delegateType : special.bindType ) || type; + + // Update special based on newly reset type + special = jQuery.event.special[ type ] || {}; + + // handleObj is passed to all event handlers + handleObj = jQuery.extend( { + type: type, + origType: origType, + data: data, + handler: handler, + guid: handler.guid, + selector: selector, + needsContext: selector && jQuery.expr.match.needsContext.test( selector ), + namespace: namespaces.join( "." ) + }, handleObjIn ); + + // Init the event handler queue if we're the first + if ( !( handlers = events[ type ] ) ) { + handlers = events[ type ] = []; + handlers.delegateCount = 0; + + // Only use addEventListener if the special events handler returns false + if ( !special.setup || + special.setup.call( elem, data, namespaces, eventHandle ) === false ) { + + if ( elem.addEventListener ) { + elem.addEventListener( type, eventHandle ); + } + } + } + + if ( special.add ) { + special.add.call( elem, handleObj ); + + if ( !handleObj.handler.guid ) { + handleObj.handler.guid = handler.guid; + } + } + + // Add to the element's handler list, delegates in front + if ( selector ) { + handlers.splice( handlers.delegateCount++, 0, handleObj ); + } else { + handlers.push( handleObj ); + } + + // Keep track of which events have ever been used, for event optimization + jQuery.event.global[ type ] = true; + } + + }, + + // Detach an event or set of events from an element + remove: function( elem, types, handler, selector, mappedTypes ) { + + var j, origCount, tmp, + events, t, handleObj, + special, handlers, type, namespaces, origType, + elemData = dataPriv.hasData( elem ) && dataPriv.get( 
elem ); + + if ( !elemData || !( events = elemData.events ) ) { + return; + } + + // Once for each type.namespace in types; type may be omitted + types = ( types || "" ).match( rnothtmlwhite ) || [ "" ]; + t = types.length; + while ( t-- ) { + tmp = rtypenamespace.exec( types[ t ] ) || []; + type = origType = tmp[ 1 ]; + namespaces = ( tmp[ 2 ] || "" ).split( "." ).sort(); + + // Unbind all events (on this namespace, if provided) for the element + if ( !type ) { + for ( type in events ) { + jQuery.event.remove( elem, type + types[ t ], handler, selector, true ); + } + continue; + } + + special = jQuery.event.special[ type ] || {}; + type = ( selector ? special.delegateType : special.bindType ) || type; + handlers = events[ type ] || []; + tmp = tmp[ 2 ] && + new RegExp( "(^|\\.)" + namespaces.join( "\\.(?:.*\\.|)" ) + "(\\.|$)" ); + + // Remove matching events + origCount = j = handlers.length; + while ( j-- ) { + handleObj = handlers[ j ]; + + if ( ( mappedTypes || origType === handleObj.origType ) && + ( !handler || handler.guid === handleObj.guid ) && + ( !tmp || tmp.test( handleObj.namespace ) ) && + ( !selector || selector === handleObj.selector || + selector === "**" && handleObj.selector ) ) { + handlers.splice( j, 1 ); + + if ( handleObj.selector ) { + handlers.delegateCount--; + } + if ( special.remove ) { + special.remove.call( elem, handleObj ); + } + } + } + + // Remove generic event handler if we removed something and no more handlers exist + // (avoids potential for endless recursion during removal of special event handlers) + if ( origCount && !handlers.length ) { + if ( !special.teardown || + special.teardown.call( elem, namespaces, elemData.handle ) === false ) { + + jQuery.removeEvent( elem, type, elemData.handle ); + } + + delete events[ type ]; + } + } + + // Remove data and the expando if it's no longer used + if ( jQuery.isEmptyObject( events ) ) { + dataPriv.remove( elem, "handle events" ); + } + }, + + dispatch: function( nativeEvent ) { + + 
var i, j, ret, matched, handleObj, handlerQueue, + args = new Array( arguments.length ), + + // Make a writable jQuery.Event from the native event object + event = jQuery.event.fix( nativeEvent ), + + handlers = ( + dataPriv.get( this, "events" ) || Object.create( null ) + )[ event.type ] || [], + special = jQuery.event.special[ event.type ] || {}; + + // Use the fix-ed jQuery.Event rather than the (read-only) native event + args[ 0 ] = event; + + for ( i = 1; i < arguments.length; i++ ) { + args[ i ] = arguments[ i ]; + } + + event.delegateTarget = this; + + // Call the preDispatch hook for the mapped type, and let it bail if desired + if ( special.preDispatch && special.preDispatch.call( this, event ) === false ) { + return; + } + + // Determine handlers + handlerQueue = jQuery.event.handlers.call( this, event, handlers ); + + // Run delegates first; they may want to stop propagation beneath us + i = 0; + while ( ( matched = handlerQueue[ i++ ] ) && !event.isPropagationStopped() ) { + event.currentTarget = matched.elem; + + j = 0; + while ( ( handleObj = matched.handlers[ j++ ] ) && + !event.isImmediatePropagationStopped() ) { + + // If the event is namespaced, then each handler is only invoked if it is + // specially universal or its namespaces are a superset of the event's. 
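The namespace-superset check referenced in the comment above relies on the regexp built in `jQuery.event.remove` (`"(^|\\.)" + namespaces.join( "\\.(?:.*\\.|)" ) + "(\\.|$)"`). A standalone sketch of that construction (hypothetical helper name; jQuery stores handler namespaces pre-sorted, which this reproduces by sorting):

```javascript
// Build a matcher that tests whether a handler's sorted, dot-joined
// namespace string contains all of the requested namespaces, in any order.
function namespaceMatcher( namespaces ) {
	return new RegExp(
		"(^|\\.)" +
		namespaces.slice().sort().join( "\\.(?:.*\\.|)" ) +
		"(\\.|$)"
	);
}
```

So removing `.click.a` matches handlers namespaced `a.b` or `a.c.b`, because `(?:.*\.|)` lets unrelated namespaces sit between the required ones.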
+ if ( !event.rnamespace || handleObj.namespace === false || + event.rnamespace.test( handleObj.namespace ) ) { + + event.handleObj = handleObj; + event.data = handleObj.data; + + ret = ( ( jQuery.event.special[ handleObj.origType ] || {} ).handle || + handleObj.handler ).apply( matched.elem, args ); + + if ( ret !== undefined ) { + if ( ( event.result = ret ) === false ) { + event.preventDefault(); + event.stopPropagation(); + } + } + } + } + } + + // Call the postDispatch hook for the mapped type + if ( special.postDispatch ) { + special.postDispatch.call( this, event ); + } + + return event.result; + }, + + handlers: function( event, handlers ) { + var i, handleObj, sel, matchedHandlers, matchedSelectors, + handlerQueue = [], + delegateCount = handlers.delegateCount, + cur = event.target; + + // Find delegate handlers + if ( delegateCount && + + // Support: IE <=9 + // Black-hole SVG instance trees (trac-13180) + cur.nodeType && + + // Support: Firefox <=42 + // Suppress spec-violating clicks indicating a non-primary pointer button (trac-3861) + // https://www.w3.org/TR/DOM-Level-3-Events/#event-type-click + // Support: IE 11 only + // ...but not arrow key "clicks" of radio inputs, which can have `button` -1 (gh-2343) + !( event.type === "click" && event.button >= 1 ) ) { + + for ( ; cur !== this; cur = cur.parentNode || this ) { + + // Don't check non-elements (#13208) + // Don't process clicks on disabled elements (#6911, #8165, #11382, #11764) + if ( cur.nodeType === 1 && !( event.type === "click" && cur.disabled === true ) ) { + matchedHandlers = []; + matchedSelectors = {}; + for ( i = 0; i < delegateCount; i++ ) { + handleObj = handlers[ i ]; + + // Don't conflict with Object.prototype properties (#13203) + sel = handleObj.selector + " "; + + if ( matchedSelectors[ sel ] === undefined ) { + matchedSelectors[ sel ] = handleObj.needsContext ? 
+ jQuery( sel, this ).index( cur ) > -1 : + jQuery.find( sel, this, null, [ cur ] ).length; + } + if ( matchedSelectors[ sel ] ) { + matchedHandlers.push( handleObj ); + } + } + if ( matchedHandlers.length ) { + handlerQueue.push( { elem: cur, handlers: matchedHandlers } ); + } + } + } + } + + // Add the remaining (directly-bound) handlers + cur = this; + if ( delegateCount < handlers.length ) { + handlerQueue.push( { elem: cur, handlers: handlers.slice( delegateCount ) } ); + } + + return handlerQueue; + }, + + addProp: function( name, hook ) { + Object.defineProperty( jQuery.Event.prototype, name, { + enumerable: true, + configurable: true, + + get: isFunction( hook ) ? + function() { + if ( this.originalEvent ) { + return hook( this.originalEvent ); + } + } : + function() { + if ( this.originalEvent ) { + return this.originalEvent[ name ]; + } + }, + + set: function( value ) { + Object.defineProperty( this, name, { + enumerable: true, + configurable: true, + writable: true, + value: value + } ); + } + } ); + }, + + fix: function( originalEvent ) { + return originalEvent[ jQuery.expando ] ? + originalEvent : + new jQuery.Event( originalEvent ); + }, + + special: { + load: { + + // Prevent triggered image.load events from bubbling to window.load + noBubble: true + }, + click: { + + // Utilize native event to ensure correct state for checkable inputs + setup: function( data ) { + + // For mutual compressibility with _default, replace `this` access with a local var. + // `|| data` is dead code meant only to preserve the variable through minification. + var el = this || data; + + // Claim the first handler + if ( rcheckableType.test( el.type ) && + el.click && nodeName( el, "input" ) ) { + + // dataPriv.set( el, "click", ... 
)
+					leverageNative( el, "click", returnTrue );
+				}
+
+				// Return false to allow normal processing in the caller
+				return false;
+			},
+			trigger: function( data ) {
+
+				// For mutual compressibility with _default, replace `this` access with a local var.
+				// `|| data` is dead code meant only to preserve the variable through minification.
+				var el = this || data;
+
+				// Force setup before triggering a click
+				if ( rcheckableType.test( el.type ) &&
+					el.click && nodeName( el, "input" ) ) {
+
+					leverageNative( el, "click" );
+				}
+
+				// Return non-false to allow normal event-path propagation
+				return true;
+			},
+
+			// For cross-browser consistency, suppress native .click() on links
+			// Also prevent it if we're currently inside a leveraged native-event stack
+			_default: function( event ) {
+				var target = event.target;
+				return rcheckableType.test( target.type ) &&
+					target.click && nodeName( target, "input" ) &&
+					dataPriv.get( target, "click" ) ||
+					nodeName( target, "a" );
+			}
+		},
+
+		beforeunload: {
+			postDispatch: function( event ) {
+
+				// Support: Firefox 20+
+				// Firefox doesn't alert if the returnValue field is not set.
+				if ( event.result !== undefined && event.originalEvent ) {
+					event.originalEvent.returnValue = event.result;
+				}
+			}
+		}
+	}
+};
+
+// Ensure the presence of an event listener that handles manually-triggered
+// synthetic events by interrupting progress until reinvoked in response to
+// *native* events that it fires directly, ensuring that state changes have
+// already occurred before other listeners are invoked.
+function leverageNative( el, type, expectSync ) {
+
+	// Missing expectSync indicates a trigger call, which must force setup through jQuery.event.add
+	if ( !expectSync ) {
+		if ( dataPriv.get( el, type ) === undefined ) {
+			jQuery.event.add( el, type, returnTrue );
+		}
+		return;
+	}
+
+	// Register the controller as a special universal handler for all event namespaces
+	dataPriv.set( el, type, false );
+	jQuery.event.add( el, type, {
+		namespace: false,
+		handler: function( event ) {
+			var notAsync, result,
+				saved = dataPriv.get( this, type );
+
+			if ( ( event.isTrigger & 1 ) && this[ type ] ) {
+
+				// Interrupt processing of the outer synthetic .trigger()ed event
+				// Saved data should be false in such cases, but might be a leftover capture object
+				// from an async native handler (gh-4350)
+				if ( !saved.length ) {
+
+					// Store arguments for use when handling the inner native event
+					// There will always be at least one argument (an event object), so this array
+					// will not be confused with a leftover capture object.
+					saved = slice.call( arguments );
+					dataPriv.set( this, type, saved );
+
+					// Trigger the native event and capture its result
+					// Support: IE <=9 - 11+
+					// focus() and blur() are asynchronous
+					notAsync = expectSync( this, type );
+					this[ type ]();
+					result = dataPriv.get( this, type );
+					if ( saved !== result || notAsync ) {
+						dataPriv.set( this, type, false );
+					} else {
+						result = {};
+					}
+					if ( saved !== result ) {
+
+						// Cancel the outer synthetic event
+						event.stopImmediatePropagation();
+						event.preventDefault();
+						return result.value;
+					}
+
+				// If this is an inner synthetic event for an event with a bubbling surrogate
+				// (focus or blur), assume that the surrogate already propagated from triggering the
+				// native event and prevent that from happening again here.
+				// This technically gets the ordering wrong w.r.t. to `.trigger()` (in which the
+				// bubbling surrogate propagates *after* the non-bubbling base), but that seems
+				// less bad than duplication.
+				} else if ( ( jQuery.event.special[ type ] || {} ).delegateType ) {
+					event.stopPropagation();
+				}
+
+			// If this is a native event triggered above, everything is now in order
+			// Fire an inner synthetic event with the original arguments
+			} else if ( saved.length ) {
+
+				// ...and capture the result
+				dataPriv.set( this, type, {
+					value: jQuery.event.trigger(
+
+						// Support: IE <=9 - 11+
+						// Extend with the prototype to reset the above stopImmediatePropagation()
+						jQuery.extend( saved[ 0 ], jQuery.Event.prototype ),
+						saved.slice( 1 ),
+						this
+					)
+				} );
+
+				// Abort handling of the native event
+				event.stopImmediatePropagation();
+			}
+		}
+	} );
+}
+
+jQuery.removeEvent = function( elem, type, handle ) {
+
+	// This "if" is needed for plain objects
+	if ( elem.removeEventListener ) {
+		elem.removeEventListener( type, handle );
+	}
+};
+
+jQuery.Event = function( src, props ) {
+
+	// Allow instantiation without the 'new' keyword
+	if ( !( this instanceof jQuery.Event ) ) {
+		return new jQuery.Event( src, props );
+	}
+
+	// Event object
+	if ( src && src.type ) {
+		this.originalEvent = src;
+		this.type = src.type;
+
+		// Events bubbling up the document may have been marked as prevented
+		// by a handler lower down the tree; reflect the correct value.
+		this.isDefaultPrevented = src.defaultPrevented ||
+				src.defaultPrevented === undefined &&
+
+				// Support: Android <=2.3 only
+				src.returnValue === false ?
+			returnTrue :
+			returnFalse;
+
+		// Create target properties
+		// Support: Safari <=6 - 7 only
+		// Target should not be a text node (#504, #13143)
+		this.target = ( src.target && src.target.nodeType === 3 ) ?
+			src.target.parentNode :
+			src.target;
+
+		this.currentTarget = src.currentTarget;
+		this.relatedTarget = src.relatedTarget;
+
+	// Event type
+	} else {
+		this.type = src;
+	}
+
+	// Put explicitly provided properties onto the event object
+	if ( props ) {
+		jQuery.extend( this, props );
+	}
+
+	// Create a timestamp if incoming event doesn't have one
+	this.timeStamp = src && src.timeStamp || Date.now();
+
+	// Mark it as fixed
+	this[ jQuery.expando ] = true;
+};
+
+// jQuery.Event is based on DOM3 Events as specified by the ECMAScript Language Binding
+// https://www.w3.org/TR/2003/WD-DOM-Level-3-Events-20030331/ecma-script-binding.html
+jQuery.Event.prototype = {
+	constructor: jQuery.Event,
+	isDefaultPrevented: returnFalse,
+	isPropagationStopped: returnFalse,
+	isImmediatePropagationStopped: returnFalse,
+	isSimulated: false,
+
+	preventDefault: function() {
+		var e = this.originalEvent;
+
+		this.isDefaultPrevented = returnTrue;
+
+		if ( e && !this.isSimulated ) {
+			e.preventDefault();
+		}
+	},
+	stopPropagation: function() {
+		var e = this.originalEvent;
+
+		this.isPropagationStopped = returnTrue;
+
+		if ( e && !this.isSimulated ) {
+			e.stopPropagation();
+		}
+	},
+	stopImmediatePropagation: function() {
+		var e = this.originalEvent;
+
+		this.isImmediatePropagationStopped = returnTrue;
+
+		if ( e && !this.isSimulated ) {
+			e.stopImmediatePropagation();
+		}
+
+		this.stopPropagation();
+	}
+};
+
+// Includes all common event props including KeyEvent and MouseEvent specific props
+jQuery.each( {
+	altKey: true,
+	bubbles: true,
+	cancelable: true,
+	changedTouches: true,
+	ctrlKey: true,
+	detail: true,
+	eventPhase: true,
+	metaKey: true,
+	pageX: true,
+	pageY: true,
+	shiftKey: true,
+	view: true,
+	"char": true,
+	code: true,
+	charCode: true,
+	key: true,
+	keyCode: true,
+	button: true,
+	buttons: true,
+	clientX: true,
+	clientY: true,
+	offsetX: true,
+	offsetY: true,
+	pointerId: true,
+	pointerType: true,
+	screenX: true,
+	screenY: true,
+	targetTouches: true,
+	toElement: true,
+	touches: true,
+
+	which: function( event ) {
+		var button = event.button;
+
+		// Add which for key events
+		if ( event.which == null && rkeyEvent.test( event.type ) ) {
+			return event.charCode != null ? event.charCode : event.keyCode;
+		}
+
+		// Add which for click: 1 === left; 2 === middle; 3 === right
+		if ( !event.which && button !== undefined && rmouseEvent.test( event.type ) ) {
+			if ( button & 1 ) {
+				return 1;
+			}
+
+			if ( button & 2 ) {
+				return 3;
+			}
+
+			if ( button & 4 ) {
+				return 2;
+			}
+
+			return 0;
+		}
+
+		return event.which;
+	}
+}, jQuery.event.addProp );
+
+jQuery.each( { focus: "focusin", blur: "focusout" }, function( type, delegateType ) {
+	jQuery.event.special[ type ] = {
+
+		// Utilize native event if possible so blur/focus sequence is correct
+		setup: function() {
+
+			// Claim the first handler
+			// dataPriv.set( this, "focus", ... )
+			// dataPriv.set( this, "blur", ... )
+			leverageNative( this, type, expectSync );
+
+			// Return false to allow normal processing in the caller
+			return false;
+		},
+		trigger: function() {
+
+			// Force setup before trigger
+			leverageNative( this, type );
+
+			// Return non-false to allow normal event-path propagation
+			return true;
+		},
+
+		delegateType: delegateType
+	};
+} );
+
+// Create mouseenter/leave events using mouseover/out and event-time checks
+// so that event delegation works in jQuery.
+// Do the same for pointerenter/pointerleave and pointerover/pointerout
+//
+// Support: Safari 7 only
+// Safari sends mouseenter too often; see:
+// https://bugs.chromium.org/p/chromium/issues/detail?id=470258
+// for the description of the bug (it existed in older Chrome versions as well).
+jQuery.each( {
+	mouseenter: "mouseover",
+	mouseleave: "mouseout",
+	pointerenter: "pointerover",
+	pointerleave: "pointerout"
+}, function( orig, fix ) {
+	jQuery.event.special[ orig ] = {
+		delegateType: fix,
+		bindType: fix,
+
+		handle: function( event ) {
+			var ret,
+				target = this,
+				related = event.relatedTarget,
+				handleObj = event.handleObj;
+
+			// For mouseenter/leave call the handler if related is outside the target.
+			// NB: No relatedTarget if the mouse left/entered the browser window
+			if ( !related || ( related !== target && !jQuery.contains( target, related ) ) ) {
+				event.type = handleObj.origType;
+				ret = handleObj.handler.apply( this, arguments );
+				event.type = fix;
+			}
+			return ret;
+		}
+	};
+} );
+
+jQuery.fn.extend( {
+
+	on: function( types, selector, data, fn ) {
+		return on( this, types, selector, data, fn );
+	},
+	one: function( types, selector, data, fn ) {
+		return on( this, types, selector, data, fn, 1 );
+	},
+	off: function( types, selector, fn ) {
+		var handleObj, type;
+		if ( types && types.preventDefault && types.handleObj ) {
+
+			// ( event ) dispatched jQuery.Event
+			handleObj = types.handleObj;
+			jQuery( types.delegateTarget ).off(
+				handleObj.namespace ?
+					handleObj.origType + "." + handleObj.namespace :
+					handleObj.origType,
+				handleObj.selector,
+				handleObj.handler
+			);
+			return this;
+		}
+		if ( typeof types === "object" ) {
+
+			// ( types-object [, selector] )
+			for ( type in types ) {
+				this.off( type, selector, types[ type ] );
+			}
+			return this;
+		}
+		if ( selector === false || typeof selector === "function" ) {
+
+			// ( types [, fn] )
+			fn = selector;
+			selector = undefined;
+		}
+		if ( fn === false ) {
+			fn = returnFalse;
+		}
+		return this.each( function() {
+			jQuery.event.remove( this, types, fn, selector );
+		} );
+	}
+} );
+
+
+var
+
+	// Support: IE <=10 - 11, Edge 12 - 13 only
+	// In IE/Edge using regex groups here causes severe slowdowns.
+	// See https://connect.microsoft.com/IE/feedback/details/1736512/
+	rnoInnerhtml = /<script|<style|<link/i,
+
+	// checked="checked" or checked
+	rchecked = /checked\s*(?:[^=]|=\s*.checked.)/i,
+	rcleanScript = /^\s*<!(?:\[CDATA\[|--)|(?:\]\]|--)>\s*$/g;
+
+// Prefer a tbody over its parent table for containing new rows
+function manipulationTarget( elem, content ) {
+	if ( nodeName( elem, "table" ) &&
+		nodeName( content.nodeType !== 11 ? content : content.firstChild, "tr" ) ) {
+
+		return jQuery( elem ).children( "tbody" )[ 0 ] || elem;
+	}
+
+	return elem;
+}
+
+// Replace/restore the type attribute of script elements for safe DOM manipulation
+function disableScript( elem ) {
+	elem.type = ( elem.getAttribute( "type" ) !== null ) + "/" + elem.type;
+	return elem;
+}
+function restoreScript( elem ) {
+	if ( ( elem.type || "" ).slice( 0, 5 ) === "true/" ) {
+		elem.type = elem.type.slice( 5 );
+	} else {
+		elem.removeAttribute( "type" );
+	}
+
+	return elem;
+}
+
+function cloneCopyEvent( src, dest ) {
+	var i, l, type, pdataOld, udataOld, udataCur, events;
+
+	if ( dest.nodeType !== 1 ) {
+		return;
+	}
+
+	// 1. Copy private data: events, handlers, etc.
+	if ( dataPriv.hasData( src ) ) {
+		pdataOld = dataPriv.get( src );
+		events = pdataOld.events;
+
+		if ( events ) {
+			dataPriv.remove( dest, "handle events" );
+
+			for ( type in events ) {
+				for ( i = 0, l = events[ type ].length; i < l; i++ ) {
+					jQuery.event.add( dest, type, events[ type ][ i ] );
+				}
+			}
+		}
+	}
+
+	// 2. Copy user data
+	if ( dataUser.hasData( src ) ) {
+		udataOld = dataUser.access( src );
+		udataCur = jQuery.extend( {}, udataOld );
+
+		dataUser.set( dest, udataCur );
+	}
+}
+
+// Fix IE bugs, see support tests
+function fixInput( src, dest ) {
+	var nodeName = dest.nodeName.toLowerCase();
+
+	// Fails to persist the checked state of a cloned checkbox or radio button.
+	if ( nodeName === "input" && rcheckableType.test( src.type ) ) {
+		dest.checked = src.checked;
+
+	// Fails to return the selected option to the default selected state when cloning options
+	} else if ( nodeName === "input" || nodeName === "textarea" ) {
+		dest.defaultValue = src.defaultValue;
+	}
+}
+
+function domManip( collection, args, callback, ignored ) {
+
+	// Flatten any nested arrays
+	args = flat( args );
+
+	var fragment, first, scripts, hasScripts, node, doc,
+		i = 0,
+		l = collection.length,
+		iNoClone = l - 1,
+		value = args[ 0 ],
+		valueIsFunction = isFunction( value );
+
+	// We can't cloneNode fragments that contain checked, in WebKit
+	if ( valueIsFunction ||
+			( l > 1 && typeof value === "string" &&
+				!support.checkClone && rchecked.test( value ) ) ) {
+		return collection.each( function( index ) {
+			var self = collection.eq( index );
+			if ( valueIsFunction ) {
+				args[ 0 ] = value.call( this, index, self.html() );
+			}
+			domManip( self, args, callback, ignored );
+		} );
+	}
+
+	if ( l ) {
+		fragment = buildFragment( args, collection[ 0 ].ownerDocument, false, collection, ignored );
+		first = fragment.firstChild;
+
+		if ( fragment.childNodes.length === 1 ) {
+			fragment = first;
+		}
+
+		// Require either new content or an interest in ignored elements to invoke the callback
+		if ( first || ignored ) {
+			scripts = jQuery.map( getAll( fragment, "script" ), disableScript );
+			hasScripts = scripts.length;
+
+			// Use the original fragment for the last item
+			// instead of the first because it can end up
+			// being emptied incorrectly in certain situations (#8070).
+			for ( ; i < l; i++ ) {
+				node = fragment;
+
+				if ( i !== iNoClone ) {
+					node = jQuery.clone( node, true, true );
+
+					// Keep references to cloned scripts for later restoration
+					if ( hasScripts ) {
+
+						// Support: Android <=4.0 only, PhantomJS 1 only
+						// push.apply(_, arraylike) throws on ancient WebKit
+						jQuery.merge( scripts, getAll( node, "script" ) );
+					}
+				}
+
+				callback.call( collection[ i ], node, i );
+			}
+
+			if ( hasScripts ) {
+				doc = scripts[ scripts.length - 1 ].ownerDocument;
+
+				// Reenable scripts
+				jQuery.map( scripts, restoreScript );
+
+				// Evaluate executable scripts on first document insertion
+				for ( i = 0; i < hasScripts; i++ ) {
+					node = scripts[ i ];
+					if ( rscriptType.test( node.type || "" ) &&
+						!dataPriv.access( node, "globalEval" ) &&
+						jQuery.contains( doc, node ) ) {
+
+						if ( node.src && ( node.type || "" ).toLowerCase() !== "module" ) {
+
+							// Optional AJAX dependency, but won't run scripts if not present
+							if ( jQuery._evalUrl && !node.noModule ) {
+								jQuery._evalUrl( node.src, {
+									nonce: node.nonce || node.getAttribute( "nonce" )
+								}, doc );
+							}
+						} else {
+							DOMEval( node.textContent.replace( rcleanScript, "" ), node, doc );
+						}
+					}
+				}
+			}
+		}
+	}
+
+	return collection;
+}
+
+function remove( elem, selector, keepData ) {
+	var node,
+		nodes = selector ? jQuery.filter( selector, elem ) : elem,
+		i = 0;
+
+	for ( ; ( node = nodes[ i ] ) != null; i++ ) {
+		if ( !keepData && node.nodeType === 1 ) {
+			jQuery.cleanData( getAll( node ) );
+		}
+
+		if ( node.parentNode ) {
+			if ( keepData && isAttached( node ) ) {
+				setGlobalEval( getAll( node, "script" ) );
+			}
+			node.parentNode.removeChild( node );
+		}
+	}
+
+	return elem;
+}
+
+jQuery.extend( {
+	htmlPrefilter: function( html ) {
+		return html;
+	},
+
+	clone: function( elem, dataAndEvents, deepDataAndEvents ) {
+		var i, l, srcElements, destElements,
+			clone = elem.cloneNode( true ),
+			inPage = isAttached( elem );
+
+		// Fix IE cloning issues
+		if ( !support.noCloneChecked && ( elem.nodeType === 1 || elem.nodeType === 11 ) &&
+				!jQuery.isXMLDoc( elem ) ) {
+
+			// We eschew Sizzle here for performance reasons: https://jsperf.com/getall-vs-sizzle/2
+			destElements = getAll( clone );
+			srcElements = getAll( elem );
+
+			for ( i = 0, l = srcElements.length; i < l; i++ ) {
+				fixInput( srcElements[ i ], destElements[ i ] );
+			}
+		}
+
+		// Copy the events from the original to the clone
+		if ( dataAndEvents ) {
+			if ( deepDataAndEvents ) {
+				srcElements = srcElements || getAll( elem );
+				destElements = destElements || getAll( clone );
+
+				for ( i = 0, l = srcElements.length; i < l; i++ ) {
+					cloneCopyEvent( srcElements[ i ], destElements[ i ] );
+				}
+			} else {
+				cloneCopyEvent( elem, clone );
+			}
+		}
+
+		// Preserve script evaluation history
+		destElements = getAll( clone, "script" );
+		if ( destElements.length > 0 ) {
+			setGlobalEval( destElements, !inPage && getAll( elem, "script" ) );
+		}
+
+		// Return the cloned set
+		return clone;
+	},
+
+	cleanData: function( elems ) {
+		var data, elem, type,
+			special = jQuery.event.special,
+			i = 0;
+
+		for ( ; ( elem = elems[ i ] ) !== undefined; i++ ) {
+			if ( acceptData( elem ) ) {
+				if ( ( data = elem[ dataPriv.expando ] ) ) {
+					if ( data.events ) {
+						for ( type in data.events ) {
+							if ( special[ type ] ) {
+								jQuery.event.remove( elem, type );
+
+							// This is a shortcut to avoid jQuery.event.remove's overhead
+							} else {
+								jQuery.removeEvent( elem, type, data.handle );
+							}
+						}
+					}
+
+					// Support: Chrome <=35 - 45+
+					// Assign undefined instead of using delete, see Data#remove
+					elem[ dataPriv.expando ] = undefined;
+				}
+				if ( elem[ dataUser.expando ] ) {
+
+					// Support: Chrome <=35 - 45+
+					// Assign undefined instead of using delete, see Data#remove
+					elem[ dataUser.expando ] = undefined;
+				}
+			}
+		}
+	}
+} );
+
+jQuery.fn.extend( {
+	detach: function( selector ) {
+		return remove( this, selector, true );
+	},
+
+	remove: function( selector ) {
+		return remove( this, selector );
+	},
+
+	text: function( value ) {
+		return access( this, function( value ) {
+			return value === undefined ?
+				jQuery.text( this ) :
+				this.empty().each( function() {
+					if ( this.nodeType === 1 || this.nodeType === 11 || this.nodeType === 9 ) {
+						this.textContent = value;
+					}
+				} );
+		}, null, value, arguments.length );
+	},
+
+	append: function() {
+		return domManip( this, arguments, function( elem ) {
+			if ( this.nodeType === 1 || this.nodeType === 11 || this.nodeType === 9 ) {
+				var target = manipulationTarget( this, elem );
+				target.appendChild( elem );
+			}
+		} );
+	},
+
+	prepend: function() {
+		return domManip( this, arguments, function( elem ) {
+			if ( this.nodeType === 1 || this.nodeType === 11 || this.nodeType === 9 ) {
+				var target = manipulationTarget( this, elem );
+				target.insertBefore( elem, target.firstChild );
+			}
+		} );
+	},
+
+	before: function() {
+		return domManip( this, arguments, function( elem ) {
+			if ( this.parentNode ) {
+				this.parentNode.insertBefore( elem, this );
+			}
+		} );
+	},
+
+	after: function() {
+		return domManip( this, arguments, function( elem ) {
+			if ( this.parentNode ) {
+				this.parentNode.insertBefore( elem, this.nextSibling );
+			}
+		} );
+	},
+
+	empty: function() {
+		var elem,
+			i = 0;
+
+		for ( ; ( elem = this[ i ] ) != null; i++ ) {
+			if ( elem.nodeType === 1 ) {
+
+				// Prevent memory leaks
+				jQuery.cleanData( getAll( elem, false ) );
+
+				// Remove any remaining nodes
+				elem.textContent = "";
+			}
+		}
+
+		return this;
+	},
+
+	clone: function( dataAndEvents, deepDataAndEvents ) {
+		dataAndEvents = dataAndEvents == null ? false : dataAndEvents;
+		deepDataAndEvents = deepDataAndEvents == null ? dataAndEvents : deepDataAndEvents;
+
+		return this.map( function() {
+			return jQuery.clone( this, dataAndEvents, deepDataAndEvents );
+		} );
+	},
+
+	html: function( value ) {
+		return access( this, function( value ) {
+			var elem = this[ 0 ] || {},
+				i = 0,
+				l = this.length;
+
+			if ( value === undefined && elem.nodeType === 1 ) {
+				return elem.innerHTML;
+			}
+
+			// See if we can take a shortcut and just use innerHTML
+			if ( typeof value === "string" && !rnoInnerhtml.test( value ) &&
+				!wrapMap[ ( rtagName.exec( value ) || [ "", "" ] )[ 1 ].toLowerCase() ] ) {
+
+				value = jQuery.htmlPrefilter( value );
+
+				try {
+					for ( ; i < l; i++ ) {
+						elem = this[ i ] || {};
+
+						// Remove element nodes and prevent memory leaks
+						if ( elem.nodeType === 1 ) {
+							jQuery.cleanData( getAll( elem, false ) );
+							elem.innerHTML = value;
+						}
+					}
+
+					elem = 0;
+
+				// If using innerHTML throws an exception, use the fallback method
+				} catch ( e ) {}
+			}
+
+			if ( elem ) {
+				this.empty().append( value );
+			}
+		}, null, value, arguments.length );
+	},
+
+	replaceWith: function() {
+		var ignored = [];
+
+		// Make the changes, replacing each non-ignored context element with the new content
+		return domManip( this, arguments, function( elem ) {
+			var parent = this.parentNode;
+
+			if ( jQuery.inArray( this, ignored ) < 0 ) {
+				jQuery.cleanData( getAll( this ) );
+				if ( parent ) {
+					parent.replaceChild( elem, this );
+				}
+			}
+
+		// Force callback invocation
+		}, ignored );
+	}
+} );
+
+jQuery.each( {
+	appendTo: "append",
+	prependTo: "prepend",
+	insertBefore: "before",
+	insertAfter: "after",
+	replaceAll: "replaceWith"
+}, function( name, original ) {
+	jQuery.fn[ name ] = function( selector ) {
+		var elems,
+			ret = [],
+			insert = jQuery( selector ),
+			last = insert.length - 1,
+			i = 0;
+
+		for ( ; i <= last; i++ ) {
+			elems = i === last ? this : this.clone( true );
+			jQuery( insert[ i ] )[ original ]( elems );
+
+			// Support: Android <=4.0 only, PhantomJS 1 only
+			// .get() because push.apply(_, arraylike) throws on ancient WebKit
+			push.apply( ret, elems.get() );
+		}
+
+		return this.pushStack( ret );
+	};
+} );
+var rnumnonpx = new RegExp( "^(" + pnum + ")(?!px)[a-z%]+$", "i" );
+
+var getStyles = function( elem ) {
+
+	// Support: IE <=11 only, Firefox <=30 (#15098, #14150)
+	// IE throws on elements created in popups
+	// FF meanwhile throws on frame elements through "defaultView.getComputedStyle"
+	var view = elem.ownerDocument.defaultView;
+
+	if ( !view || !view.opener ) {
+		view = window;
+	}
+
+	return view.getComputedStyle( elem );
+};
+
+var swap = function( elem, options, callback ) {
+	var ret, name,
+		old = {};
+
+	// Remember the old values, and insert the new ones
+	for ( name in options ) {
+		old[ name ] = elem.style[ name ];
+		elem.style[ name ] = options[ name ];
+	}
+
+	ret = callback.call( elem );
+
+	// Revert the old values
+	for ( name in options ) {
+		elem.style[ name ] = old[ name ];
+	}
+
+	return ret;
+};
+
+
+var rboxStyle = new RegExp( cssExpand.join( "|" ), "i" );
+
+
+
+( function() {
+
+	// Executing both pixelPosition & boxSizingReliable tests require only one layout
+	// so they're executed at the same time to save the second computation.
+	function computeStyleTests() {
+
+		// This is a singleton, we need to execute it only once
+		if ( !div ) {
+			return;
+		}
+
+		container.style.cssText = "position:absolute;left:-11111px;width:60px;" +
+			"margin-top:1px;padding:0;border:0";
+		div.style.cssText =
+			"position:relative;display:block;box-sizing:border-box;overflow:scroll;" +
+			"margin:auto;border:1px;padding:1px;" +
+			"width:60%;top:1%";
+		documentElement.appendChild( container ).appendChild( div );
+
+		var divStyle = window.getComputedStyle( div );
+		pixelPositionVal = divStyle.top !== "1%";
+
+		// Support: Android 4.0 - 4.3 only, Firefox <=3 - 44
+		reliableMarginLeftVal = roundPixelMeasures( divStyle.marginLeft ) === 12;
+
+		// Support: Android 4.0 - 4.3 only, Safari <=9.1 - 10.1, iOS <=7.0 - 9.3
+		// Some styles come back with percentage values, even though they shouldn't
+		div.style.right = "60%";
+		pixelBoxStylesVal = roundPixelMeasures( divStyle.right ) === 36;
+
+		// Support: IE 9 - 11 only
+		// Detect misreporting of content dimensions for box-sizing:border-box elements
+		boxSizingReliableVal = roundPixelMeasures( divStyle.width ) === 36;
+
+		// Support: IE 9 only
+		// Detect overflow:scroll screwiness (gh-3699)
+		// Support: Chrome <=64
+		// Don't get tricked when zoom affects offsetWidth (gh-4029)
+		div.style.position = "absolute";
+		scrollboxSizeVal = roundPixelMeasures( div.offsetWidth / 3 ) === 12;
+
+		documentElement.removeChild( container );
+
+		// Nullify the div so it wouldn't be stored in the memory and
+		// it will also be a sign that checks already performed
+		div = null;
+	}
+
+	function roundPixelMeasures( measure ) {
+		return Math.round( parseFloat( measure ) );
+	}
+
+	var pixelPositionVal, boxSizingReliableVal, scrollboxSizeVal, pixelBoxStylesVal,
+		reliableTrDimensionsVal, reliableMarginLeftVal,
+		container = document.createElement( "div" ),
+		div = document.createElement( "div" );
+
+	// Finish early in limited (non-browser) environments
+	if ( !div.style ) {
+		return;
+	}
+
+	// Support: IE <=9 - 11 only
+	// Style of cloned element affects source element cloned (#8908)
+	div.style.backgroundClip = "content-box";
+	div.cloneNode( true ).style.backgroundClip = "";
+	support.clearCloneStyle = div.style.backgroundClip === "content-box";
+
+	jQuery.extend( support, {
+		boxSizingReliable: function() {
+			computeStyleTests();
+			return boxSizingReliableVal;
+		},
+		pixelBoxStyles: function() {
+			computeStyleTests();
+			return pixelBoxStylesVal;
+		},
+		pixelPosition: function() {
+			computeStyleTests();
+			return pixelPositionVal;
+		},
+		reliableMarginLeft: function() {
+			computeStyleTests();
+			return reliableMarginLeftVal;
+		},
+		scrollboxSize: function() {
+			computeStyleTests();
+			return scrollboxSizeVal;
+		},
+
+		// Support: IE 9 - 11+, Edge 15 - 18+
+		// IE/Edge misreport `getComputedStyle` of table rows with width/height
+		// set in CSS while `offset*` properties report correct values.
+		// Behavior in IE 9 is more subtle than in newer versions & it passes
+		// some versions of this test; make sure not to make it pass there!
+		reliableTrDimensions: function() {
+			var table, tr, trChild, trStyle;
+			if ( reliableTrDimensionsVal == null ) {
+				table = document.createElement( "table" );
+				tr = document.createElement( "tr" );
+				trChild = document.createElement( "div" );
+
+				table.style.cssText = "position:absolute;left:-11111px";
+				tr.style.height = "1px";
+				trChild.style.height = "9px";
+
+				documentElement
+					.appendChild( table )
+					.appendChild( tr )
+					.appendChild( trChild );
+
+				trStyle = window.getComputedStyle( tr );
+				reliableTrDimensionsVal = parseInt( trStyle.height ) > 3;
+
+				documentElement.removeChild( table );
+			}
+			return reliableTrDimensionsVal;
+		}
+	} );
+} )();
+
+
+function curCSS( elem, name, computed ) {
+	var width, minWidth, maxWidth, ret,
+
+		// Support: Firefox 51+
+		// Retrieving style before computed somehow
+		// fixes an issue with getting wrong values
+		// on detached elements
+		style = elem.style;
+
+	computed = computed || getStyles( elem );
+
+	// getPropertyValue is needed for:
+	// .css('filter') (IE 9 only, #12537)
+	// .css('--customProperty) (#3144)
+	if ( computed ) {
+		ret = computed.getPropertyValue( name ) || computed[ name ];
+
+		if ( ret === "" && !isAttached( elem ) ) {
+			ret = jQuery.style( elem, name );
+		}
+
+		// A tribute to the "awesome hack by Dean Edwards"
+		// Android Browser returns percentage for some values,
+		// but width seems to be reliably pixels.
+		// This is against the CSSOM draft spec:
+		// https://drafts.csswg.org/cssom/#resolved-values
+		if ( !support.pixelBoxStyles() && rnumnonpx.test( ret ) && rboxStyle.test( name ) ) {
+
+			// Remember the original values
+			width = style.width;
+			minWidth = style.minWidth;
+			maxWidth = style.maxWidth;
+
+			// Put in the new values to get a computed value out
+			style.minWidth = style.maxWidth = style.width = ret;
+			ret = computed.width;
+
+			// Revert the changed values
+			style.width = width;
+			style.minWidth = minWidth;
+			style.maxWidth = maxWidth;
+		}
+	}
+
+	return ret !== undefined ?
+
+		// Support: IE <=9 - 11 only
+		// IE returns zIndex value as an integer.
+		ret + "" :
+		ret;
+}
+
+
+function addGetHookIf( conditionFn, hookFn ) {
+
+	// Define the hook, we'll check on the first run if it's really needed.
+	return {
+		get: function() {
+			if ( conditionFn() ) {
+
+				// Hook not needed (or it's not possible to use it due
+				// to missing dependency), remove it.
+				delete this.get;
+				return;
+			}
+
+			// Hook needed; redefine it so that the support test is not executed again.
+			return ( this.get = hookFn ).apply( this, arguments );
+		}
+	};
+}
+
+
+var cssPrefixes = [ "Webkit", "Moz", "ms" ],
+	emptyStyle = document.createElement( "div" ).style,
+	vendorProps = {};
+
+// Return a vendor-prefixed property or undefined
+function vendorPropName( name ) {
+
+	// Check for vendor prefixed names
+	var capName = name[ 0 ].toUpperCase() + name.slice( 1 ),
+		i = cssPrefixes.length;
+
+	while ( i-- ) {
+		name = cssPrefixes[ i ] + capName;
+		if ( name in emptyStyle ) {
+			return name;
+		}
+	}
+}
+
+// Return a potentially-mapped jQuery.cssProps or vendor prefixed property
+function finalPropName( name ) {
+	var final = jQuery.cssProps[ name ] || vendorProps[ name ];
+
+	if ( final ) {
+		return final;
+	}
+	if ( name in emptyStyle ) {
+		return name;
+	}
+	return vendorProps[ name ] = vendorPropName( name ) || name;
+}
+
+
+var
+
+	// Swappable if display is none or starts with table
+	// except "table", "table-cell", or "table-caption"
+	// See here for display values: https://developer.mozilla.org/en-US/docs/CSS/display
+	rdisplayswap = /^(none|table(?!-c[ea]).+)/,
+	rcustomProp = /^--/,
+	cssShow = { position: "absolute", visibility: "hidden", display: "block" },
+	cssNormalTransform = {
+		letterSpacing: "0",
+		fontWeight: "400"
+	};
+
+function setPositiveNumber( _elem, value, subtract ) {
+
+	// Any relative (+/-) values have already been
+	// normalized at this point
+	var matches = rcssNum.exec( value );
+	return matches ?
+
+		// Guard against undefined "subtract", e.g., when used as in cssHooks
+		Math.max( 0, matches[ 2 ] - ( subtract || 0 ) ) + ( matches[ 3 ] || "px" ) :
+		value;
+}
+
+function boxModelAdjustment( elem, dimension, box, isBorderBox, styles, computedVal ) {
+	var i = dimension === "width" ? 1 : 0,
+		extra = 0,
+		delta = 0;
+
+	// Adjustment may not be necessary
+	if ( box === ( isBorderBox ? "border" : "content" ) ) {
+		return 0;
+	}
+
+	for ( ; i < 4; i += 2 ) {
+
+		// Both box models exclude margin
+		if ( box === "margin" ) {
+			delta += jQuery.css( elem, box + cssExpand[ i ], true, styles );
+		}
+
+		// If we get here with a content-box, we're seeking "padding" or "border" or "margin"
+		if ( !isBorderBox ) {
+
+			// Add padding
+			delta += jQuery.css( elem, "padding" + cssExpand[ i ], true, styles );
+
+			// For "border" or "margin", add border
+			if ( box !== "padding" ) {
+				delta += jQuery.css( elem, "border" + cssExpand[ i ] + "Width", true, styles );
+
+			// But still keep track of it otherwise
+			} else {
+				extra += jQuery.css( elem, "border" + cssExpand[ i ] + "Width", true, styles );
+			}
+
+		// If we get here with a border-box (content + padding + border), we're seeking "content" or
+		// "padding" or "margin"
+		} else {
+
+			// For "content", subtract padding
+			if ( box === "content" ) {
+				delta -= jQuery.css( elem, "padding" + cssExpand[ i ], true, styles );
+			}
+
+			// For "content" or "padding", subtract border
+			if ( box !== "margin" ) {
+				delta -= jQuery.css( elem, "border" + cssExpand[ i ] + "Width", true, styles );
+			}
+		}
+	}
+
+	// Account for positive content-box scroll gutter when requested by providing computedVal
+	if ( !isBorderBox && computedVal >= 0 ) {
+
+		// offsetWidth/offsetHeight is a rounded sum of content, padding, scroll gutter, and border
+		// Assuming integer scroll gutter, subtract the rest and round down
+		delta += Math.max( 0, Math.ceil(
+			elem[ "offset" + dimension[ 0 ].toUpperCase() + dimension.slice( 1 ) ] -
+			computedVal -
+			delta -
+			extra -
+			0.5
+
+		// If offsetWidth/offsetHeight is unknown, then we can't determine content-box scroll gutter
+		// Use an explicit zero to avoid NaN (gh-3964)
+		) ) || 0;
+	}
+
+	return delta;
+}
+
+function getWidthOrHeight( elem, dimension, extra ) {
+
+	// Start with computed style
+	var styles = getStyles( elem ),
+
+		// To avoid forcing a reflow, only fetch boxSizing if we need it (gh-4322).
+		// Fake content-box until we know it's needed to know the true value.
+		boxSizingNeeded = !support.boxSizingReliable() || extra,
+		isBorderBox = boxSizingNeeded &&
+			jQuery.css( elem, "boxSizing", false, styles ) === "border-box",
+		valueIsBorderBox = isBorderBox,
+
+		val = curCSS( elem, dimension, styles ),
+		offsetProp = "offset" + dimension[ 0 ].toUpperCase() + dimension.slice( 1 );
+
+	// Support: Firefox <=54
+	// Return a confounding non-pixel value or feign ignorance, as appropriate.
+	if ( rnumnonpx.test( val ) ) {
+		if ( !extra ) {
+			return val;
+		}
+		val = "auto";
+	}
+
+
+	// Support: IE 9 - 11 only
+	// Use offsetWidth/offsetHeight for when box sizing is unreliable.
+	// In those cases, the computed value can be trusted to be border-box.
+	if ( ( !support.boxSizingReliable() && isBorderBox ||
+
+		// Support: IE 10 - 11+, Edge 15 - 18+
+		// IE/Edge misreport `getComputedStyle` of table rows with width/height
+		// set in CSS while `offset*` properties report correct values.
+		// Interestingly, in some cases IE 9 doesn't suffer from this issue.
+		!support.reliableTrDimensions() && nodeName( elem, "tr" ) ||
+
+		// Fall back to offsetWidth/offsetHeight when value is "auto"
+		// This happens for inline elements with no explicit setting (gh-3571)
+		val === "auto" ||
+
+		// Support: Android <=4.1 - 4.3 only
+		// Also use offsetWidth/offsetHeight for misreported inline dimensions (gh-3602)
+		!parseFloat( val ) && jQuery.css( elem, "display", false, styles ) === "inline" ) &&
+
+		// Make sure the element is visible & connected
+		elem.getClientRects().length ) {
+
+		isBorderBox = jQuery.css( elem, "boxSizing", false, styles ) === "border-box";
+
+		// Where available, offsetWidth/offsetHeight approximate border box dimensions.
+		// Where not available (e.g., SVG), assume unreliable box-sizing and interpret the
+		// retrieved value as a content box dimension.
+		valueIsBorderBox = offsetProp in elem;
+		if ( valueIsBorderBox ) {
+			val = elem[ offsetProp ];
+		}
+	}
+
+	// Normalize "" and auto
+	val = parseFloat( val ) || 0;
+
+	// Adjust for the element's box model
+	return ( val +
+		boxModelAdjustment(
+			elem,
+			dimension,
+			extra || ( isBorderBox ? "border" : "content" ),
+			valueIsBorderBox,
+			styles,
+
+			// Provide the current computed size to request scroll gutter calculation (gh-3589)
+			val
+		)
+	) + "px";
+}
+
+jQuery.extend( {
+
+	// Add in style property hooks for overriding the default
+	// behavior of getting and setting a style property
+	cssHooks: {
+		opacity: {
+			get: function( elem, computed ) {
+				if ( computed ) {
+
+					// We should always get a number back from opacity
+					var ret = curCSS( elem, "opacity" );
+					return ret === "" ? "1" : ret;
+				}
+			}
+		}
+	},
+
+	// Don't automatically add "px" to these possibly-unitless properties
+	cssNumber: {
+		"animationIterationCount": true,
+		"columnCount": true,
+		"fillOpacity": true,
+		"flexGrow": true,
+		"flexShrink": true,
+		"fontWeight": true,
+		"gridArea": true,
+		"gridColumn": true,
+		"gridColumnEnd": true,
+		"gridColumnStart": true,
+		"gridRow": true,
+		"gridRowEnd": true,
+		"gridRowStart": true,
+		"lineHeight": true,
+		"opacity": true,
+		"order": true,
+		"orphans": true,
+		"widows": true,
+		"zIndex": true,
+		"zoom": true
+	},
+
+	// Add in properties whose names you wish to fix before
+	// setting or getting the value
+	cssProps: {},
+
+	// Get and set the style property on a DOM Node
+	style: function( elem, name, value, extra ) {
+
+		// Don't set styles on text and comment nodes
+		if ( !elem || elem.nodeType === 3 || elem.nodeType === 8 || !elem.style ) {
+			return;
+		}
+
+		// Make sure that we're working with the right name
+		var ret, type, hooks,
+			origName = camelCase( name ),
+			isCustomProp = rcustomProp.test( name ),
+			style = elem.style;
+
+		// Make sure that we're working with the right name. We don't
+		// want to query the value if it is a CSS custom property
+		// since they are user-defined.
+		if ( !isCustomProp ) {
+			name = finalPropName( origName );
+		}
+
+		// Gets hook for the prefixed version, then unprefixed version
+		hooks = jQuery.cssHooks[ name ] || jQuery.cssHooks[ origName ];
+
+		// Check if we're setting a value
+		if ( value !== undefined ) {
+			type = typeof value;
+
+			// Convert "+=" or "-=" to relative numbers (#7345)
+			if ( type === "string" && ( ret = rcssNum.exec( value ) ) && ret[ 1 ] ) {
+				value = adjustCSS( elem, name, ret );
+
+				// Fixes bug #9237
+				type = "number";
+			}
+
+			// Make sure that null and NaN values aren't set (#7116)
+			if ( value == null || value !== value ) {
+				return;
+			}
+
+			// If a number was passed in, add the unit (except for certain CSS properties)
+			// The isCustomProp check can be removed in jQuery 4.0 when we only auto-append
+			// "px" to a few hardcoded values.
+			if ( type === "number" && !isCustomProp ) {
+				value += ret && ret[ 3 ] || ( jQuery.cssNumber[ origName ] ? "" : "px" );
+			}
+
+			// background-* props affect original clone's values
+			if ( !support.clearCloneStyle && value === "" && name.indexOf( "background" ) === 0 ) {
+				style[ name ] = "inherit";
+			}
+
+			// If a hook was provided, use that value, otherwise just set the specified value
+			if ( !hooks || !( "set" in hooks ) ||
+				( value = hooks.set( elem, value, extra ) ) !== undefined ) {
+
+				if ( isCustomProp ) {
+					style.setProperty( name, value );
+				} else {
+					style[ name ] = value;
+				}
+			}
+
+		} else {
+
+			// If a hook was provided get the non-computed value from there
+			if ( hooks && "get" in hooks &&
+				( ret = hooks.get( elem, false, extra ) ) !== undefined ) {
+
+				return ret;
+			}
+
+			// Otherwise just get the value from the style object
+			return style[ name ];
+		}
+	},
+
+	css: function( elem, name, extra, styles ) {
+		var val, num, hooks,
+			origName = camelCase( name ),
+			isCustomProp = rcustomProp.test( name );
+
+		// Make sure that we're working with the right name. We don't
+		// want to modify the value if it is a CSS custom property
+		// since they are user-defined.
+		if ( !isCustomProp ) {
+			name = finalPropName( origName );
+		}
+
+		// Try prefixed name followed by the unprefixed name
+		hooks = jQuery.cssHooks[ name ] || jQuery.cssHooks[ origName ];
+
+		// If a hook was provided get the computed value from there
+		if ( hooks && "get" in hooks ) {
+			val = hooks.get( elem, true, extra );
+		}
+
+		// Otherwise, if a way to get the computed value exists, use that
+		if ( val === undefined ) {
+			val = curCSS( elem, name, styles );
+		}
+
+		// Convert "normal" to computed value
+		if ( val === "normal" && name in cssNormalTransform ) {
+			val = cssNormalTransform[ name ];
+		}
+
+		// Make numeric if forced or a qualifier was provided and val looks numeric
+		if ( extra === "" || extra ) {
+			num = parseFloat( val );
+			return extra === true || isFinite( num ) ? num || 0 : val;
+		}
+
+		return val;
+	}
+} );
+
+jQuery.each( [ "height", "width" ], function( _i, dimension ) {
+	jQuery.cssHooks[ dimension ] = {
+		get: function( elem, computed, extra ) {
+			if ( computed ) {
+
+				// Certain elements can have dimension info if we invisibly show them
+				// but it must have a current display style that would benefit
+				return rdisplayswap.test( jQuery.css( elem, "display" ) ) &&
+
+					// Support: Safari 8+
+					// Table columns in Safari have non-zero offsetWidth & zero
+					// getBoundingClientRect().width unless display is changed.
+					// Support: IE <=11 only
+					// Running getBoundingClientRect on a disconnected node
+					// in IE throws an error.
+					( !elem.getClientRects().length || !elem.getBoundingClientRect().width ) ?
+						swap( elem, cssShow, function() {
+							return getWidthOrHeight( elem, dimension, extra );
+						} ) :
+						getWidthOrHeight( elem, dimension, extra );
+			}
+		},
+
+		set: function( elem, value, extra ) {
+			var matches,
+				styles = getStyles( elem ),
+
+				// Only read styles.position if the test has a chance to fail
+				// to avoid forcing a reflow.
+ scrollboxSizeBuggy = !support.scrollboxSize() && + styles.position === "absolute", + + // To avoid forcing a reflow, only fetch boxSizing if we need it (gh-3991) + boxSizingNeeded = scrollboxSizeBuggy || extra, + isBorderBox = boxSizingNeeded && + jQuery.css( elem, "boxSizing", false, styles ) === "border-box", + subtract = extra ? + boxModelAdjustment( + elem, + dimension, + extra, + isBorderBox, + styles + ) : + 0; + + // Account for unreliable border-box dimensions by comparing offset* to computed and + // faking a content-box to get border and padding (gh-3699) + if ( isBorderBox && scrollboxSizeBuggy ) { + subtract -= Math.ceil( + elem[ "offset" + dimension[ 0 ].toUpperCase() + dimension.slice( 1 ) ] - + parseFloat( styles[ dimension ] ) - + boxModelAdjustment( elem, dimension, "border", false, styles ) - + 0.5 + ); + } + + // Convert to pixels if value adjustment is needed + if ( subtract && ( matches = rcssNum.exec( value ) ) && + ( matches[ 3 ] || "px" ) !== "px" ) { + + elem.style[ dimension ] = value; + value = jQuery.css( elem, dimension ); + } + + return setPositiveNumber( elem, value, subtract ); + } + }; +} ); + +jQuery.cssHooks.marginLeft = addGetHookIf( support.reliableMarginLeft, + function( elem, computed ) { + if ( computed ) { + return ( parseFloat( curCSS( elem, "marginLeft" ) ) || + elem.getBoundingClientRect().left - + swap( elem, { marginLeft: 0 }, function() { + return elem.getBoundingClientRect().left; + } ) + ) + "px"; + } + } +); + +// These hooks are used by animate to expand properties +jQuery.each( { + margin: "", + padding: "", + border: "Width" +}, function( prefix, suffix ) { + jQuery.cssHooks[ prefix + suffix ] = { + expand: function( value ) { + var i = 0, + expanded = {}, + + // Assumes a single number if not a string + parts = typeof value === "string" ? 
value.split( " " ) : [ value ]; + + for ( ; i < 4; i++ ) { + expanded[ prefix + cssExpand[ i ] + suffix ] = + parts[ i ] || parts[ i - 2 ] || parts[ 0 ]; + } + + return expanded; + } + }; + + if ( prefix !== "margin" ) { + jQuery.cssHooks[ prefix + suffix ].set = setPositiveNumber; + } +} ); + +jQuery.fn.extend( { + css: function( name, value ) { + return access( this, function( elem, name, value ) { + var styles, len, + map = {}, + i = 0; + + if ( Array.isArray( name ) ) { + styles = getStyles( elem ); + len = name.length; + + for ( ; i < len; i++ ) { + map[ name[ i ] ] = jQuery.css( elem, name[ i ], false, styles ); + } + + return map; + } + + return value !== undefined ? + jQuery.style( elem, name, value ) : + jQuery.css( elem, name ); + }, name, value, arguments.length > 1 ); + } +} ); + + +function Tween( elem, options, prop, end, easing ) { + return new Tween.prototype.init( elem, options, prop, end, easing ); +} +jQuery.Tween = Tween; + +Tween.prototype = { + constructor: Tween, + init: function( elem, options, prop, end, easing, unit ) { + this.elem = elem; + this.prop = prop; + this.easing = easing || jQuery.easing._default; + this.options = options; + this.start = this.now = this.cur(); + this.end = end; + this.unit = unit || ( jQuery.cssNumber[ prop ] ? "" : "px" ); + }, + cur: function() { + var hooks = Tween.propHooks[ this.prop ]; + + return hooks && hooks.get ? 
+ hooks.get( this ) : + Tween.propHooks._default.get( this ); + }, + run: function( percent ) { + var eased, + hooks = Tween.propHooks[ this.prop ]; + + if ( this.options.duration ) { + this.pos = eased = jQuery.easing[ this.easing ]( + percent, this.options.duration * percent, 0, 1, this.options.duration + ); + } else { + this.pos = eased = percent; + } + this.now = ( this.end - this.start ) * eased + this.start; + + if ( this.options.step ) { + this.options.step.call( this.elem, this.now, this ); + } + + if ( hooks && hooks.set ) { + hooks.set( this ); + } else { + Tween.propHooks._default.set( this ); + } + return this; + } +}; + +Tween.prototype.init.prototype = Tween.prototype; + +Tween.propHooks = { + _default: { + get: function( tween ) { + var result; + + // Use a property on the element directly when it is not a DOM element, + // or when there is no matching style property that exists. + if ( tween.elem.nodeType !== 1 || + tween.elem[ tween.prop ] != null && tween.elem.style[ tween.prop ] == null ) { + return tween.elem[ tween.prop ]; + } + + // Passing an empty string as a 3rd parameter to .css will automatically + // attempt a parseFloat and fallback to a string if the parse fails. + // Simple values such as "10px" are parsed to Float; + // complex values such as "rotate(1rad)" are returned as-is. + result = jQuery.css( tween.elem, tween.prop, "" ); + + // Empty strings, null, undefined and "auto" are converted to 0. + return !result || result === "auto" ? 0 : result; + }, + set: function( tween ) { + + // Use step hook for back compat. + // Use cssHook if its there. + // Use .style if available and use plain properties where available. 
+ if ( jQuery.fx.step[ tween.prop ] ) { + jQuery.fx.step[ tween.prop ]( tween ); + } else if ( tween.elem.nodeType === 1 && ( + jQuery.cssHooks[ tween.prop ] || + tween.elem.style[ finalPropName( tween.prop ) ] != null ) ) { + jQuery.style( tween.elem, tween.prop, tween.now + tween.unit ); + } else { + tween.elem[ tween.prop ] = tween.now; + } + } + } +}; + +// Support: IE <=9 only +// Panic based approach to setting things on disconnected nodes +Tween.propHooks.scrollTop = Tween.propHooks.scrollLeft = { + set: function( tween ) { + if ( tween.elem.nodeType && tween.elem.parentNode ) { + tween.elem[ tween.prop ] = tween.now; + } + } +}; + +jQuery.easing = { + linear: function( p ) { + return p; + }, + swing: function( p ) { + return 0.5 - Math.cos( p * Math.PI ) / 2; + }, + _default: "swing" +}; + +jQuery.fx = Tween.prototype.init; + +// Back compat <1.8 extension point +jQuery.fx.step = {}; + + + + +var + fxNow, inProgress, + rfxtypes = /^(?:toggle|show|hide)$/, + rrun = /queueHooks$/; + +function schedule() { + if ( inProgress ) { + if ( document.hidden === false && window.requestAnimationFrame ) { + window.requestAnimationFrame( schedule ); + } else { + window.setTimeout( schedule, jQuery.fx.interval ); + } + + jQuery.fx.tick(); + } +} + +// Animations created synchronously will run synchronously +function createFxNow() { + window.setTimeout( function() { + fxNow = undefined; + } ); + return ( fxNow = Date.now() ); +} + +// Generate parameters to create a standard animation +function genFx( type, includeWidth ) { + var which, + i = 0, + attrs = { height: type }; + + // If we include width, step value is 1 to do all cssExpand values, + // otherwise step value is 2 to skip over Left and Right + includeWidth = includeWidth ? 
1 : 0; + for ( ; i < 4; i += 2 - includeWidth ) { + which = cssExpand[ i ]; + attrs[ "margin" + which ] = attrs[ "padding" + which ] = type; + } + + if ( includeWidth ) { + attrs.opacity = attrs.width = type; + } + + return attrs; +} + +function createTween( value, prop, animation ) { + var tween, + collection = ( Animation.tweeners[ prop ] || [] ).concat( Animation.tweeners[ "*" ] ), + index = 0, + length = collection.length; + for ( ; index < length; index++ ) { + if ( ( tween = collection[ index ].call( animation, prop, value ) ) ) { + + // We're done with this property + return tween; + } + } +} + +function defaultPrefilter( elem, props, opts ) { + var prop, value, toggle, hooks, oldfire, propTween, restoreDisplay, display, + isBox = "width" in props || "height" in props, + anim = this, + orig = {}, + style = elem.style, + hidden = elem.nodeType && isHiddenWithinTree( elem ), + dataShow = dataPriv.get( elem, "fxshow" ); + + // Queue-skipping animations hijack the fx hooks + if ( !opts.queue ) { + hooks = jQuery._queueHooks( elem, "fx" ); + if ( hooks.unqueued == null ) { + hooks.unqueued = 0; + oldfire = hooks.empty.fire; + hooks.empty.fire = function() { + if ( !hooks.unqueued ) { + oldfire(); + } + }; + } + hooks.unqueued++; + + anim.always( function() { + + // Ensure the complete handler is called before this completes + anim.always( function() { + hooks.unqueued--; + if ( !jQuery.queue( elem, "fx" ).length ) { + hooks.empty.fire(); + } + } ); + } ); + } + + // Detect show/hide animations + for ( prop in props ) { + value = props[ prop ]; + if ( rfxtypes.test( value ) ) { + delete props[ prop ]; + toggle = toggle || value === "toggle"; + if ( value === ( hidden ? 
"hide" : "show" ) ) { + + // Pretend to be hidden if this is a "show" and + // there is still data from a stopped show/hide + if ( value === "show" && dataShow && dataShow[ prop ] !== undefined ) { + hidden = true; + + // Ignore all other no-op show/hide data + } else { + continue; + } + } + orig[ prop ] = dataShow && dataShow[ prop ] || jQuery.style( elem, prop ); + } + } + + // Bail out if this is a no-op like .hide().hide() + propTween = !jQuery.isEmptyObject( props ); + if ( !propTween && jQuery.isEmptyObject( orig ) ) { + return; + } + + // Restrict "overflow" and "display" styles during box animations + if ( isBox && elem.nodeType === 1 ) { + + // Support: IE <=9 - 11, Edge 12 - 15 + // Record all 3 overflow attributes because IE does not infer the shorthand + // from identically-valued overflowX and overflowY and Edge just mirrors + // the overflowX value there. + opts.overflow = [ style.overflow, style.overflowX, style.overflowY ]; + + // Identify a display type, preferring old show/hide data over the CSS cascade + restoreDisplay = dataShow && dataShow.display; + if ( restoreDisplay == null ) { + restoreDisplay = dataPriv.get( elem, "display" ); + } + display = jQuery.css( elem, "display" ); + if ( display === "none" ) { + if ( restoreDisplay ) { + display = restoreDisplay; + } else { + + // Get nonempty value(s) by temporarily forcing visibility + showHide( [ elem ], true ); + restoreDisplay = elem.style.display || restoreDisplay; + display = jQuery.css( elem, "display" ); + showHide( [ elem ] ); + } + } + + // Animate inline elements as inline-block + if ( display === "inline" || display === "inline-block" && restoreDisplay != null ) { + if ( jQuery.css( elem, "float" ) === "none" ) { + + // Restore the original display value at the end of pure show/hide animations + if ( !propTween ) { + anim.done( function() { + style.display = restoreDisplay; + } ); + if ( restoreDisplay == null ) { + display = style.display; + restoreDisplay = display === "none" ? 
"" : display; + } + } + style.display = "inline-block"; + } + } + } + + if ( opts.overflow ) { + style.overflow = "hidden"; + anim.always( function() { + style.overflow = opts.overflow[ 0 ]; + style.overflowX = opts.overflow[ 1 ]; + style.overflowY = opts.overflow[ 2 ]; + } ); + } + + // Implement show/hide animations + propTween = false; + for ( prop in orig ) { + + // General show/hide setup for this element animation + if ( !propTween ) { + if ( dataShow ) { + if ( "hidden" in dataShow ) { + hidden = dataShow.hidden; + } + } else { + dataShow = dataPriv.access( elem, "fxshow", { display: restoreDisplay } ); + } + + // Store hidden/visible for toggle so `.stop().toggle()` "reverses" + if ( toggle ) { + dataShow.hidden = !hidden; + } + + // Show elements before animating them + if ( hidden ) { + showHide( [ elem ], true ); + } + + /* eslint-disable no-loop-func */ + + anim.done( function() { + + /* eslint-enable no-loop-func */ + + // The final step of a "hide" animation is actually hiding the element + if ( !hidden ) { + showHide( [ elem ] ); + } + dataPriv.remove( elem, "fxshow" ); + for ( prop in orig ) { + jQuery.style( elem, prop, orig[ prop ] ); + } + } ); + } + + // Per-property setup + propTween = createTween( hidden ? 
dataShow[ prop ] : 0, prop, anim ); + if ( !( prop in dataShow ) ) { + dataShow[ prop ] = propTween.start; + if ( hidden ) { + propTween.end = propTween.start; + propTween.start = 0; + } + } + } +} + +function propFilter( props, specialEasing ) { + var index, name, easing, value, hooks; + + // camelCase, specialEasing and expand cssHook pass + for ( index in props ) { + name = camelCase( index ); + easing = specialEasing[ name ]; + value = props[ index ]; + if ( Array.isArray( value ) ) { + easing = value[ 1 ]; + value = props[ index ] = value[ 0 ]; + } + + if ( index !== name ) { + props[ name ] = value; + delete props[ index ]; + } + + hooks = jQuery.cssHooks[ name ]; + if ( hooks && "expand" in hooks ) { + value = hooks.expand( value ); + delete props[ name ]; + + // Not quite $.extend, this won't overwrite existing keys. + // Reusing 'index' because we have the correct "name" + for ( index in value ) { + if ( !( index in props ) ) { + props[ index ] = value[ index ]; + specialEasing[ index ] = easing; + } + } + } else { + specialEasing[ name ] = easing; + } + } +} + +function Animation( elem, properties, options ) { + var result, + stopped, + index = 0, + length = Animation.prefilters.length, + deferred = jQuery.Deferred().always( function() { + + // Don't match elem in the :animated selector + delete tick.elem; + } ), + tick = function() { + if ( stopped ) { + return false; + } + var currentTime = fxNow || createFxNow(), + remaining = Math.max( 0, animation.startTime + animation.duration - currentTime ), + + // Support: Android 2.3 only + // Archaic crash bug won't allow us to use `1 - ( 0.5 || 0 )` (#12497) + temp = remaining / animation.duration || 0, + percent = 1 - temp, + index = 0, + length = animation.tweens.length; + + for ( ; index < length; index++ ) { + animation.tweens[ index ].run( percent ); + } + + deferred.notifyWith( elem, [ animation, percent, remaining ] ); + + // If there's more to do, yield + if ( percent < 1 && length ) { + return 
remaining; + } + + // If this was an empty animation, synthesize a final progress notification + if ( !length ) { + deferred.notifyWith( elem, [ animation, 1, 0 ] ); + } + + // Resolve the animation and report its conclusion + deferred.resolveWith( elem, [ animation ] ); + return false; + }, + animation = deferred.promise( { + elem: elem, + props: jQuery.extend( {}, properties ), + opts: jQuery.extend( true, { + specialEasing: {}, + easing: jQuery.easing._default + }, options ), + originalProperties: properties, + originalOptions: options, + startTime: fxNow || createFxNow(), + duration: options.duration, + tweens: [], + createTween: function( prop, end ) { + var tween = jQuery.Tween( elem, animation.opts, prop, end, + animation.opts.specialEasing[ prop ] || animation.opts.easing ); + animation.tweens.push( tween ); + return tween; + }, + stop: function( gotoEnd ) { + var index = 0, + + // If we are going to the end, we want to run all the tweens + // otherwise we skip this part + length = gotoEnd ? 
animation.tweens.length : 0; + if ( stopped ) { + return this; + } + stopped = true; + for ( ; index < length; index++ ) { + animation.tweens[ index ].run( 1 ); + } + + // Resolve when we played the last frame; otherwise, reject + if ( gotoEnd ) { + deferred.notifyWith( elem, [ animation, 1, 0 ] ); + deferred.resolveWith( elem, [ animation, gotoEnd ] ); + } else { + deferred.rejectWith( elem, [ animation, gotoEnd ] ); + } + return this; + } + } ), + props = animation.props; + + propFilter( props, animation.opts.specialEasing ); + + for ( ; index < length; index++ ) { + result = Animation.prefilters[ index ].call( animation, elem, props, animation.opts ); + if ( result ) { + if ( isFunction( result.stop ) ) { + jQuery._queueHooks( animation.elem, animation.opts.queue ).stop = + result.stop.bind( result ); + } + return result; + } + } + + jQuery.map( props, createTween, animation ); + + if ( isFunction( animation.opts.start ) ) { + animation.opts.start.call( elem, animation ); + } + + // Attach callbacks from options + animation + .progress( animation.opts.progress ) + .done( animation.opts.done, animation.opts.complete ) + .fail( animation.opts.fail ) + .always( animation.opts.always ); + + jQuery.fx.timer( + jQuery.extend( tick, { + elem: elem, + anim: animation, + queue: animation.opts.queue + } ) + ); + + return animation; +} + +jQuery.Animation = jQuery.extend( Animation, { + + tweeners: { + "*": [ function( prop, value ) { + var tween = this.createTween( prop, value ); + adjustCSS( tween.elem, prop, rcssNum.exec( value ), tween ); + return tween; + } ] + }, + + tweener: function( props, callback ) { + if ( isFunction( props ) ) { + callback = props; + props = [ "*" ]; + } else { + props = props.match( rnothtmlwhite ); + } + + var prop, + index = 0, + length = props.length; + + for ( ; index < length; index++ ) { + prop = props[ index ]; + Animation.tweeners[ prop ] = Animation.tweeners[ prop ] || []; + Animation.tweeners[ prop ].unshift( callback ); + } + }, + 
+ prefilters: [ defaultPrefilter ], + + prefilter: function( callback, prepend ) { + if ( prepend ) { + Animation.prefilters.unshift( callback ); + } else { + Animation.prefilters.push( callback ); + } + } +} ); + +jQuery.speed = function( speed, easing, fn ) { + var opt = speed && typeof speed === "object" ? jQuery.extend( {}, speed ) : { + complete: fn || !fn && easing || + isFunction( speed ) && speed, + duration: speed, + easing: fn && easing || easing && !isFunction( easing ) && easing + }; + + // Go to the end state if fx are off + if ( jQuery.fx.off ) { + opt.duration = 0; + + } else { + if ( typeof opt.duration !== "number" ) { + if ( opt.duration in jQuery.fx.speeds ) { + opt.duration = jQuery.fx.speeds[ opt.duration ]; + + } else { + opt.duration = jQuery.fx.speeds._default; + } + } + } + + // Normalize opt.queue - true/undefined/null -> "fx" + if ( opt.queue == null || opt.queue === true ) { + opt.queue = "fx"; + } + + // Queueing + opt.old = opt.complete; + + opt.complete = function() { + if ( isFunction( opt.old ) ) { + opt.old.call( this ); + } + + if ( opt.queue ) { + jQuery.dequeue( this, opt.queue ); + } + }; + + return opt; +}; + +jQuery.fn.extend( { + fadeTo: function( speed, to, easing, callback ) { + + // Show any hidden elements after setting opacity to 0 + return this.filter( isHiddenWithinTree ).css( "opacity", 0 ).show() + + // Animate to the value specified + .end().animate( { opacity: to }, speed, easing, callback ); + }, + animate: function( prop, speed, easing, callback ) { + var empty = jQuery.isEmptyObject( prop ), + optall = jQuery.speed( speed, easing, callback ), + doAnimation = function() { + + // Operate on a copy of prop so per-property easing won't be lost + var anim = Animation( this, jQuery.extend( {}, prop ), optall ); + + // Empty animations, or finishing resolves immediately + if ( empty || dataPriv.get( this, "finish" ) ) { + anim.stop( true ); + } + }; + doAnimation.finish = doAnimation; + + return empty || optall.queue 
=== false ? + this.each( doAnimation ) : + this.queue( optall.queue, doAnimation ); + }, + stop: function( type, clearQueue, gotoEnd ) { + var stopQueue = function( hooks ) { + var stop = hooks.stop; + delete hooks.stop; + stop( gotoEnd ); + }; + + if ( typeof type !== "string" ) { + gotoEnd = clearQueue; + clearQueue = type; + type = undefined; + } + if ( clearQueue ) { + this.queue( type || "fx", [] ); + } + + return this.each( function() { + var dequeue = true, + index = type != null && type + "queueHooks", + timers = jQuery.timers, + data = dataPriv.get( this ); + + if ( index ) { + if ( data[ index ] && data[ index ].stop ) { + stopQueue( data[ index ] ); + } + } else { + for ( index in data ) { + if ( data[ index ] && data[ index ].stop && rrun.test( index ) ) { + stopQueue( data[ index ] ); + } + } + } + + for ( index = timers.length; index--; ) { + if ( timers[ index ].elem === this && + ( type == null || timers[ index ].queue === type ) ) { + + timers[ index ].anim.stop( gotoEnd ); + dequeue = false; + timers.splice( index, 1 ); + } + } + + // Start the next in the queue if the last step wasn't forced. + // Timers currently will call their complete callbacks, which + // will dequeue but only if they were gotoEnd. + if ( dequeue || !gotoEnd ) { + jQuery.dequeue( this, type ); + } + } ); + }, + finish: function( type ) { + if ( type !== false ) { + type = type || "fx"; + } + return this.each( function() { + var index, + data = dataPriv.get( this ), + queue = data[ type + "queue" ], + hooks = data[ type + "queueHooks" ], + timers = jQuery.timers, + length = queue ? 
queue.length : 0; + + // Enable finishing flag on private data + data.finish = true; + + // Empty the queue first + jQuery.queue( this, type, [] ); + + if ( hooks && hooks.stop ) { + hooks.stop.call( this, true ); + } + + // Look for any active animations, and finish them + for ( index = timers.length; index--; ) { + if ( timers[ index ].elem === this && timers[ index ].queue === type ) { + timers[ index ].anim.stop( true ); + timers.splice( index, 1 ); + } + } + + // Look for any animations in the old queue and finish them + for ( index = 0; index < length; index++ ) { + if ( queue[ index ] && queue[ index ].finish ) { + queue[ index ].finish.call( this ); + } + } + + // Turn off finishing flag + delete data.finish; + } ); + } +} ); + +jQuery.each( [ "toggle", "show", "hide" ], function( _i, name ) { + var cssFn = jQuery.fn[ name ]; + jQuery.fn[ name ] = function( speed, easing, callback ) { + return speed == null || typeof speed === "boolean" ? + cssFn.apply( this, arguments ) : + this.animate( genFx( name, true ), speed, easing, callback ); + }; +} ); + +// Generate shortcuts for custom animations +jQuery.each( { + slideDown: genFx( "show" ), + slideUp: genFx( "hide" ), + slideToggle: genFx( "toggle" ), + fadeIn: { opacity: "show" }, + fadeOut: { opacity: "hide" }, + fadeToggle: { opacity: "toggle" } +}, function( name, props ) { + jQuery.fn[ name ] = function( speed, easing, callback ) { + return this.animate( props, speed, easing, callback ); + }; +} ); + +jQuery.timers = []; +jQuery.fx.tick = function() { + var timer, + i = 0, + timers = jQuery.timers; + + fxNow = Date.now(); + + for ( ; i < timers.length; i++ ) { + timer = timers[ i ]; + + // Run the timer and safely remove it when done (allowing for external removal) + if ( !timer() && timers[ i ] === timer ) { + timers.splice( i--, 1 ); + } + } + + if ( !timers.length ) { + jQuery.fx.stop(); + } + fxNow = undefined; +}; + +jQuery.fx.timer = function( timer ) { + jQuery.timers.push( timer ); + 
jQuery.fx.start(); +}; + +jQuery.fx.interval = 13; +jQuery.fx.start = function() { + if ( inProgress ) { + return; + } + + inProgress = true; + schedule(); +}; + +jQuery.fx.stop = function() { + inProgress = null; +}; + +jQuery.fx.speeds = { + slow: 600, + fast: 200, + + // Default speed + _default: 400 +}; + + +// Based off of the plugin by Clint Helfers, with permission. +// https://web.archive.org/web/20100324014747/http://blindsignals.com/index.php/2009/07/jquery-delay/ +jQuery.fn.delay = function( time, type ) { + time = jQuery.fx ? jQuery.fx.speeds[ time ] || time : time; + type = type || "fx"; + + return this.queue( type, function( next, hooks ) { + var timeout = window.setTimeout( next, time ); + hooks.stop = function() { + window.clearTimeout( timeout ); + }; + } ); +}; + + +( function() { + var input = document.createElement( "input" ), + select = document.createElement( "select" ), + opt = select.appendChild( document.createElement( "option" ) ); + + input.type = "checkbox"; + + // Support: Android <=4.3 only + // Default value for a checkbox should be "on" + support.checkOn = input.value !== ""; + + // Support: IE <=11 only + // Must access selectedIndex to make default options select + support.optSelected = opt.selected; + + // Support: IE <=11 only + // An input loses its value after becoming a radio + input = document.createElement( "input" ); + input.value = "t"; + input.type = "radio"; + support.radioValue = input.value === "t"; +} )(); + + +var boolHook, + attrHandle = jQuery.expr.attrHandle; + +jQuery.fn.extend( { + attr: function( name, value ) { + return access( this, jQuery.attr, name, value, arguments.length > 1 ); + }, + + removeAttr: function( name ) { + return this.each( function() { + jQuery.removeAttr( this, name ); + } ); + } +} ); + +jQuery.extend( { + attr: function( elem, name, value ) { + var ret, hooks, + nType = elem.nodeType; + + // Don't get/set attributes on text, comment and attribute nodes + if ( nType === 3 || nType === 8 || 
nType === 2 ) { + return; + } + + // Fallback to prop when attributes are not supported + if ( typeof elem.getAttribute === "undefined" ) { + return jQuery.prop( elem, name, value ); + } + + // Attribute hooks are determined by the lowercase version + // Grab necessary hook if one is defined + if ( nType !== 1 || !jQuery.isXMLDoc( elem ) ) { + hooks = jQuery.attrHooks[ name.toLowerCase() ] || + ( jQuery.expr.match.bool.test( name ) ? boolHook : undefined ); + } + + if ( value !== undefined ) { + if ( value === null ) { + jQuery.removeAttr( elem, name ); + return; + } + + if ( hooks && "set" in hooks && + ( ret = hooks.set( elem, value, name ) ) !== undefined ) { + return ret; + } + + elem.setAttribute( name, value + "" ); + return value; + } + + if ( hooks && "get" in hooks && ( ret = hooks.get( elem, name ) ) !== null ) { + return ret; + } + + ret = jQuery.find.attr( elem, name ); + + // Non-existent attributes return null, we normalize to undefined + return ret == null ? undefined : ret; + }, + + attrHooks: { + type: { + set: function( elem, value ) { + if ( !support.radioValue && value === "radio" && + nodeName( elem, "input" ) ) { + var val = elem.value; + elem.setAttribute( "type", value ); + if ( val ) { + elem.value = val; + } + return value; + } + } + } + }, + + removeAttr: function( elem, value ) { + var name, + i = 0, + + // Attribute names can contain non-HTML whitespace characters + // https://html.spec.whatwg.org/multipage/syntax.html#attributes-2 + attrNames = value && value.match( rnothtmlwhite ); + + if ( attrNames && elem.nodeType === 1 ) { + while ( ( name = attrNames[ i++ ] ) ) { + elem.removeAttribute( name ); + } + } + } +} ); + +// Hooks for boolean attributes +boolHook = { + set: function( elem, value, name ) { + if ( value === false ) { + + // Remove boolean attributes when set to false + jQuery.removeAttr( elem, name ); + } else { + elem.setAttribute( name, name ); + } + return name; + } +}; + +jQuery.each( 
jQuery.expr.match.bool.source.match( /\w+/g ), function( _i, name ) { + var getter = attrHandle[ name ] || jQuery.find.attr; + + attrHandle[ name ] = function( elem, name, isXML ) { + var ret, handle, + lowercaseName = name.toLowerCase(); + + if ( !isXML ) { + + // Avoid an infinite loop by temporarily removing this function from the getter + handle = attrHandle[ lowercaseName ]; + attrHandle[ lowercaseName ] = ret; + ret = getter( elem, name, isXML ) != null ? + lowercaseName : + null; + attrHandle[ lowercaseName ] = handle; + } + return ret; + }; +} ); + + + + +var rfocusable = /^(?:input|select|textarea|button)$/i, + rclickable = /^(?:a|area)$/i; + +jQuery.fn.extend( { + prop: function( name, value ) { + return access( this, jQuery.prop, name, value, arguments.length > 1 ); + }, + + removeProp: function( name ) { + return this.each( function() { + delete this[ jQuery.propFix[ name ] || name ]; + } ); + } +} ); + +jQuery.extend( { + prop: function( elem, name, value ) { + var ret, hooks, + nType = elem.nodeType; + + // Don't get/set properties on text, comment and attribute nodes + if ( nType === 3 || nType === 8 || nType === 2 ) { + return; + } + + if ( nType !== 1 || !jQuery.isXMLDoc( elem ) ) { + + // Fix name and attach hooks + name = jQuery.propFix[ name ] || name; + hooks = jQuery.propHooks[ name ]; + } + + if ( value !== undefined ) { + if ( hooks && "set" in hooks && + ( ret = hooks.set( elem, value, name ) ) !== undefined ) { + return ret; + } + + return ( elem[ name ] = value ); + } + + if ( hooks && "get" in hooks && ( ret = hooks.get( elem, name ) ) !== null ) { + return ret; + } + + return elem[ name ]; + }, + + propHooks: { + tabIndex: { + get: function( elem ) { + + // Support: IE <=9 - 11 only + // elem.tabIndex doesn't always return the + // correct value when it hasn't been explicitly set + // https://web.archive.org/web/20141116233347/http://fluidproject.org/blog/2008/01/09/getting-setting-and-removing-tabindex-values-with-javascript/ + // Use 
proper attribute retrieval(#12072) + var tabindex = jQuery.find.attr( elem, "tabindex" ); + + if ( tabindex ) { + return parseInt( tabindex, 10 ); + } + + if ( + rfocusable.test( elem.nodeName ) || + rclickable.test( elem.nodeName ) && + elem.href + ) { + return 0; + } + + return -1; + } + } + }, + + propFix: { + "for": "htmlFor", + "class": "className" + } +} ); + +// Support: IE <=11 only +// Accessing the selectedIndex property +// forces the browser to respect setting selected +// on the option +// The getter ensures a default option is selected +// when in an optgroup +// eslint rule "no-unused-expressions" is disabled for this code +// since it considers such accessions noop +if ( !support.optSelected ) { + jQuery.propHooks.selected = { + get: function( elem ) { + + /* eslint no-unused-expressions: "off" */ + + var parent = elem.parentNode; + if ( parent && parent.parentNode ) { + parent.parentNode.selectedIndex; + } + return null; + }, + set: function( elem ) { + + /* eslint no-unused-expressions: "off" */ + + var parent = elem.parentNode; + if ( parent ) { + parent.selectedIndex; + + if ( parent.parentNode ) { + parent.parentNode.selectedIndex; + } + } + } + }; +} + +jQuery.each( [ + "tabIndex", + "readOnly", + "maxLength", + "cellSpacing", + "cellPadding", + "rowSpan", + "colSpan", + "useMap", + "frameBorder", + "contentEditable" +], function() { + jQuery.propFix[ this.toLowerCase() ] = this; +} ); + + + + + // Strip and collapse whitespace according to HTML spec + // https://infra.spec.whatwg.org/#strip-and-collapse-ascii-whitespace + function stripAndCollapse( value ) { + var tokens = value.match( rnothtmlwhite ) || []; + return tokens.join( " " ); + } + + +function getClass( elem ) { + return elem.getAttribute && elem.getAttribute( "class" ) || ""; +} + +function classesToArray( value ) { + if ( Array.isArray( value ) ) { + return value; + } + if ( typeof value === "string" ) { + return value.match( rnothtmlwhite ) || []; + } + return []; +} + 
+jQuery.fn.extend( { + addClass: function( value ) { + var classes, elem, cur, curValue, clazz, j, finalValue, + i = 0; + + if ( isFunction( value ) ) { + return this.each( function( j ) { + jQuery( this ).addClass( value.call( this, j, getClass( this ) ) ); + } ); + } + + classes = classesToArray( value ); + + if ( classes.length ) { + while ( ( elem = this[ i++ ] ) ) { + curValue = getClass( elem ); + cur = elem.nodeType === 1 && ( " " + stripAndCollapse( curValue ) + " " ); + + if ( cur ) { + j = 0; + while ( ( clazz = classes[ j++ ] ) ) { + if ( cur.indexOf( " " + clazz + " " ) < 0 ) { + cur += clazz + " "; + } + } + + // Only assign if different to avoid unneeded rendering. + finalValue = stripAndCollapse( cur ); + if ( curValue !== finalValue ) { + elem.setAttribute( "class", finalValue ); + } + } + } + } + + return this; + }, + + removeClass: function( value ) { + var classes, elem, cur, curValue, clazz, j, finalValue, + i = 0; + + if ( isFunction( value ) ) { + return this.each( function( j ) { + jQuery( this ).removeClass( value.call( this, j, getClass( this ) ) ); + } ); + } + + if ( !arguments.length ) { + return this.attr( "class", "" ); + } + + classes = classesToArray( value ); + + if ( classes.length ) { + while ( ( elem = this[ i++ ] ) ) { + curValue = getClass( elem ); + + // This expression is here for better compressibility (see addClass) + cur = elem.nodeType === 1 && ( " " + stripAndCollapse( curValue ) + " " ); + + if ( cur ) { + j = 0; + while ( ( clazz = classes[ j++ ] ) ) { + + // Remove *all* instances + while ( cur.indexOf( " " + clazz + " " ) > -1 ) { + cur = cur.replace( " " + clazz + " ", " " ); + } + } + + // Only assign if different to avoid unneeded rendering. 
+ finalValue = stripAndCollapse( cur ); + if ( curValue !== finalValue ) { + elem.setAttribute( "class", finalValue ); + } + } + } + } + + return this; + }, + + toggleClass: function( value, stateVal ) { + var type = typeof value, + isValidValue = type === "string" || Array.isArray( value ); + + if ( typeof stateVal === "boolean" && isValidValue ) { + return stateVal ? this.addClass( value ) : this.removeClass( value ); + } + + if ( isFunction( value ) ) { + return this.each( function( i ) { + jQuery( this ).toggleClass( + value.call( this, i, getClass( this ), stateVal ), + stateVal + ); + } ); + } + + return this.each( function() { + var className, i, self, classNames; + + if ( isValidValue ) { + + // Toggle individual class names + i = 0; + self = jQuery( this ); + classNames = classesToArray( value ); + + while ( ( className = classNames[ i++ ] ) ) { + + // Check each className given, space separated list + if ( self.hasClass( className ) ) { + self.removeClass( className ); + } else { + self.addClass( className ); + } + } + + // Toggle whole class name + } else if ( value === undefined || type === "boolean" ) { + className = getClass( this ); + if ( className ) { + + // Store className if set + dataPriv.set( this, "__className__", className ); + } + + // If the element has a class name or if we're passed `false`, + // then remove the whole classname (if there was one, the above saved it). + // Otherwise bring back whatever was previously saved (if anything), + // falling back to the empty string if nothing was stored. + if ( this.setAttribute ) { + this.setAttribute( "class", + className || value === false ? 
+ "" : + dataPriv.get( this, "__className__" ) || "" + ); + } + } + } ); + }, + + hasClass: function( selector ) { + var className, elem, + i = 0; + + className = " " + selector + " "; + while ( ( elem = this[ i++ ] ) ) { + if ( elem.nodeType === 1 && + ( " " + stripAndCollapse( getClass( elem ) ) + " " ).indexOf( className ) > -1 ) { + return true; + } + } + + return false; + } +} ); + + + + +var rreturn = /\r/g; + +jQuery.fn.extend( { + val: function( value ) { + var hooks, ret, valueIsFunction, + elem = this[ 0 ]; + + if ( !arguments.length ) { + if ( elem ) { + hooks = jQuery.valHooks[ elem.type ] || + jQuery.valHooks[ elem.nodeName.toLowerCase() ]; + + if ( hooks && + "get" in hooks && + ( ret = hooks.get( elem, "value" ) ) !== undefined + ) { + return ret; + } + + ret = elem.value; + + // Handle most common string cases + if ( typeof ret === "string" ) { + return ret.replace( rreturn, "" ); + } + + // Handle cases where value is null/undef or number + return ret == null ? "" : ret; + } + + return; + } + + valueIsFunction = isFunction( value ); + + return this.each( function( i ) { + var val; + + if ( this.nodeType !== 1 ) { + return; + } + + if ( valueIsFunction ) { + val = value.call( this, i, jQuery( this ).val() ); + } else { + val = value; + } + + // Treat null/undefined as ""; convert numbers to string + if ( val == null ) { + val = ""; + + } else if ( typeof val === "number" ) { + val += ""; + + } else if ( Array.isArray( val ) ) { + val = jQuery.map( val, function( value ) { + return value == null ? "" : value + ""; + } ); + } + + hooks = jQuery.valHooks[ this.type ] || jQuery.valHooks[ this.nodeName.toLowerCase() ]; + + // If set returns undefined, fall back to normal setting + if ( !hooks || !( "set" in hooks ) || hooks.set( this, val, "value" ) === undefined ) { + this.value = val; + } + } ); + } +} ); + +jQuery.extend( { + valHooks: { + option: { + get: function( elem ) { + + var val = jQuery.find.attr( elem, "value" ); + return val != null ? 
+ val : + + // Support: IE <=10 - 11 only + // option.text throws exceptions (#14686, #14858) + // Strip and collapse whitespace + // https://html.spec.whatwg.org/#strip-and-collapse-whitespace + stripAndCollapse( jQuery.text( elem ) ); + } + }, + select: { + get: function( elem ) { + var value, option, i, + options = elem.options, + index = elem.selectedIndex, + one = elem.type === "select-one", + values = one ? null : [], + max = one ? index + 1 : options.length; + + if ( index < 0 ) { + i = max; + + } else { + i = one ? index : 0; + } + + // Loop through all the selected options + for ( ; i < max; i++ ) { + option = options[ i ]; + + // Support: IE <=9 only + // IE8-9 doesn't update selected after form reset (#2551) + if ( ( option.selected || i === index ) && + + // Don't return options that are disabled or in a disabled optgroup + !option.disabled && + ( !option.parentNode.disabled || + !nodeName( option.parentNode, "optgroup" ) ) ) { + + // Get the specific value for the option + value = jQuery( option ).val(); + + // We don't need an array for one selects + if ( one ) { + return value; + } + + // Multi-Selects return an array + values.push( value ); + } + } + + return values; + }, + + set: function( elem, value ) { + var optionSet, option, + options = elem.options, + values = jQuery.makeArray( value ), + i = options.length; + + while ( i-- ) { + option = options[ i ]; + + /* eslint-disable no-cond-assign */ + + if ( option.selected = + jQuery.inArray( jQuery.valHooks.option.get( option ), values ) > -1 + ) { + optionSet = true; + } + + /* eslint-enable no-cond-assign */ + } + + // Force browsers to behave consistently when non-matching value is set + if ( !optionSet ) { + elem.selectedIndex = -1; + } + return values; + } + } + } +} ); + +// Radios and checkboxes getter/setter +jQuery.each( [ "radio", "checkbox" ], function() { + jQuery.valHooks[ this ] = { + set: function( elem, value ) { + if ( Array.isArray( value ) ) { + return ( elem.checked = 
jQuery.inArray( jQuery( elem ).val(), value ) > -1 ); + } + } + }; + if ( !support.checkOn ) { + jQuery.valHooks[ this ].get = function( elem ) { + return elem.getAttribute( "value" ) === null ? "on" : elem.value; + }; + } +} ); + + + + +// Return jQuery for attributes-only inclusion + + +support.focusin = "onfocusin" in window; + + +var rfocusMorph = /^(?:focusinfocus|focusoutblur)$/, + stopPropagationCallback = function( e ) { + e.stopPropagation(); + }; + +jQuery.extend( jQuery.event, { + + trigger: function( event, data, elem, onlyHandlers ) { + + var i, cur, tmp, bubbleType, ontype, handle, special, lastElement, + eventPath = [ elem || document ], + type = hasOwn.call( event, "type" ) ? event.type : event, + namespaces = hasOwn.call( event, "namespace" ) ? event.namespace.split( "." ) : []; + + cur = lastElement = tmp = elem = elem || document; + + // Don't do events on text and comment nodes + if ( elem.nodeType === 3 || elem.nodeType === 8 ) { + return; + } + + // focus/blur morphs to focusin/out; ensure we're not firing them right now + if ( rfocusMorph.test( type + jQuery.event.triggered ) ) { + return; + } + + if ( type.indexOf( "." ) > -1 ) { + + // Namespaced trigger; create a regexp to match event type in handle() + namespaces = type.split( "." ); + type = namespaces.shift(); + namespaces.sort(); + } + ontype = type.indexOf( ":" ) < 0 && "on" + type; + + // Caller can pass in a jQuery.Event object, Object, or just an event type string + event = event[ jQuery.expando ] ? + event : + new jQuery.Event( type, typeof event === "object" && event ); + + // Trigger bitmask: & 1 for native handlers; & 2 for jQuery (always true) + event.isTrigger = onlyHandlers ? 2 : 3; + event.namespace = namespaces.join( "." ); + event.rnamespace = event.namespace ? 
+ new RegExp( "(^|\\.)" + namespaces.join( "\\.(?:.*\\.|)" ) + "(\\.|$)" ) : + null; + + // Clean up the event in case it is being reused + event.result = undefined; + if ( !event.target ) { + event.target = elem; + } + + // Clone any incoming data and prepend the event, creating the handler arg list + data = data == null ? + [ event ] : + jQuery.makeArray( data, [ event ] ); + + // Allow special events to draw outside the lines + special = jQuery.event.special[ type ] || {}; + if ( !onlyHandlers && special.trigger && special.trigger.apply( elem, data ) === false ) { + return; + } + + // Determine event propagation path in advance, per W3C events spec (#9951) + // Bubble up to document, then to window; watch for a global ownerDocument var (#9724) + if ( !onlyHandlers && !special.noBubble && !isWindow( elem ) ) { + + bubbleType = special.delegateType || type; + if ( !rfocusMorph.test( bubbleType + type ) ) { + cur = cur.parentNode; + } + for ( ; cur; cur = cur.parentNode ) { + eventPath.push( cur ); + tmp = cur; + } + + // Only add window if we got to document (e.g., not plain obj or detached DOM) + if ( tmp === ( elem.ownerDocument || document ) ) { + eventPath.push( tmp.defaultView || tmp.parentWindow || window ); + } + } + + // Fire handlers on the event path + i = 0; + while ( ( cur = eventPath[ i++ ] ) && !event.isPropagationStopped() ) { + lastElement = cur; + event.type = i > 1 ? 
+ bubbleType : + special.bindType || type; + + // jQuery handler + handle = ( + dataPriv.get( cur, "events" ) || Object.create( null ) + )[ event.type ] && + dataPriv.get( cur, "handle" ); + if ( handle ) { + handle.apply( cur, data ); + } + + // Native handler + handle = ontype && cur[ ontype ]; + if ( handle && handle.apply && acceptData( cur ) ) { + event.result = handle.apply( cur, data ); + if ( event.result === false ) { + event.preventDefault(); + } + } + } + event.type = type; + + // If nobody prevented the default action, do it now + if ( !onlyHandlers && !event.isDefaultPrevented() ) { + + if ( ( !special._default || + special._default.apply( eventPath.pop(), data ) === false ) && + acceptData( elem ) ) { + + // Call a native DOM method on the target with the same name as the event. + // Don't do default actions on window, that's where global variables be (#6170) + if ( ontype && isFunction( elem[ type ] ) && !isWindow( elem ) ) { + + // Don't re-trigger an onFOO event when we call its FOO() method + tmp = elem[ ontype ]; + + if ( tmp ) { + elem[ ontype ] = null; + } + + // Prevent re-triggering of the same event, since we already bubbled it above + jQuery.event.triggered = type; + + if ( event.isPropagationStopped() ) { + lastElement.addEventListener( type, stopPropagationCallback ); + } + + elem[ type ](); + + if ( event.isPropagationStopped() ) { + lastElement.removeEventListener( type, stopPropagationCallback ); + } + + jQuery.event.triggered = undefined; + + if ( tmp ) { + elem[ ontype ] = tmp; + } + } + } + } + + return event.result; + }, + + // Piggyback on a donor event to simulate a different one + // Used only for `focus(in | out)` events + simulate: function( type, elem, event ) { + var e = jQuery.extend( + new jQuery.Event(), + event, + { + type: type, + isSimulated: true + } + ); + + jQuery.event.trigger( e, null, elem ); + } + +} ); + +jQuery.fn.extend( { + + trigger: function( type, data ) { + return this.each( function() { + 
jQuery.event.trigger( type, data, this ); + } ); + }, + triggerHandler: function( type, data ) { + var elem = this[ 0 ]; + if ( elem ) { + return jQuery.event.trigger( type, data, elem, true ); + } + } +} ); + + +// Support: Firefox <=44 +// Firefox doesn't have focus(in | out) events +// Related ticket - https://bugzilla.mozilla.org/show_bug.cgi?id=687787 +// +// Support: Chrome <=48 - 49, Safari <=9.0 - 9.1 +// focus(in | out) events fire after focus & blur events, +// which is spec violation - http://www.w3.org/TR/DOM-Level-3-Events/#events-focusevent-event-order +// Related ticket - https://bugs.chromium.org/p/chromium/issues/detail?id=449857 +if ( !support.focusin ) { + jQuery.each( { focus: "focusin", blur: "focusout" }, function( orig, fix ) { + + // Attach a single capturing handler on the document while someone wants focusin/focusout + var handler = function( event ) { + jQuery.event.simulate( fix, event.target, jQuery.event.fix( event ) ); + }; + + jQuery.event.special[ fix ] = { + setup: function() { + + // Handle: regular nodes (via `this.ownerDocument`), window + // (via `this.document`) & document (via `this`). 
+ var doc = this.ownerDocument || this.document || this, + attaches = dataPriv.access( doc, fix ); + + if ( !attaches ) { + doc.addEventListener( orig, handler, true ); + } + dataPriv.access( doc, fix, ( attaches || 0 ) + 1 ); + }, + teardown: function() { + var doc = this.ownerDocument || this.document || this, + attaches = dataPriv.access( doc, fix ) - 1; + + if ( !attaches ) { + doc.removeEventListener( orig, handler, true ); + dataPriv.remove( doc, fix ); + + } else { + dataPriv.access( doc, fix, attaches ); + } + } + }; + } ); +} +var location = window.location; + +var nonce = { guid: Date.now() }; + +var rquery = ( /\?/ ); + + + +// Cross-browser xml parsing +jQuery.parseXML = function( data ) { + var xml; + if ( !data || typeof data !== "string" ) { + return null; + } + + // Support: IE 9 - 11 only + // IE throws on parseFromString with invalid input. + try { + xml = ( new window.DOMParser() ).parseFromString( data, "text/xml" ); + } catch ( e ) { + xml = undefined; + } + + if ( !xml || xml.getElementsByTagName( "parsererror" ).length ) { + jQuery.error( "Invalid XML: " + data ); + } + return xml; +}; + + +var + rbracket = /\[\]$/, + rCRLF = /\r?\n/g, + rsubmitterTypes = /^(?:submit|button|image|reset|file)$/i, + rsubmittable = /^(?:input|select|textarea|keygen)/i; + +function buildParams( prefix, obj, traditional, add ) { + var name; + + if ( Array.isArray( obj ) ) { + + // Serialize array item. + jQuery.each( obj, function( i, v ) { + if ( traditional || rbracket.test( prefix ) ) { + + // Treat each array item as a scalar. + add( prefix, v ); + + } else { + + // Item is non-scalar (array or object), encode its numeric index. + buildParams( + prefix + "[" + ( typeof v === "object" && v != null ? i : "" ) + "]", + v, + traditional, + add + ); + } + } ); + + } else if ( !traditional && toType( obj ) === "object" ) { + + // Serialize object item. 
+ for ( name in obj ) { + buildParams( prefix + "[" + name + "]", obj[ name ], traditional, add ); + } + + } else { + + // Serialize scalar item. + add( prefix, obj ); + } +} + +// Serialize an array of form elements or a set of +// key/values into a query string +jQuery.param = function( a, traditional ) { + var prefix, + s = [], + add = function( key, valueOrFunction ) { + + // If value is a function, invoke it and use its return value + var value = isFunction( valueOrFunction ) ? + valueOrFunction() : + valueOrFunction; + + s[ s.length ] = encodeURIComponent( key ) + "=" + + encodeURIComponent( value == null ? "" : value ); + }; + + if ( a == null ) { + return ""; + } + + // If an array was passed in, assume that it is an array of form elements. + if ( Array.isArray( a ) || ( a.jquery && !jQuery.isPlainObject( a ) ) ) { + + // Serialize the form elements + jQuery.each( a, function() { + add( this.name, this.value ); + } ); + + } else { + + // If traditional, encode the "old" way (the way 1.3.2 or older + // did it), otherwise encode params recursively. + for ( prefix in a ) { + buildParams( prefix, a[ prefix ], traditional, add ); + } + } + + // Return the resulting serialization + return s.join( "&" ); +}; + +jQuery.fn.extend( { + serialize: function() { + return jQuery.param( this.serializeArray() ); + }, + serializeArray: function() { + return this.map( function() { + + // Can add propHook for "elements" to filter or add form elements + var elements = jQuery.prop( this, "elements" ); + return elements ? 
jQuery.makeArray( elements ) : this; + } ) + .filter( function() { + var type = this.type; + + // Use .is( ":disabled" ) so that fieldset[disabled] works + return this.name && !jQuery( this ).is( ":disabled" ) && + rsubmittable.test( this.nodeName ) && !rsubmitterTypes.test( type ) && + ( this.checked || !rcheckableType.test( type ) ); + } ) + .map( function( _i, elem ) { + var val = jQuery( this ).val(); + + if ( val == null ) { + return null; + } + + if ( Array.isArray( val ) ) { + return jQuery.map( val, function( val ) { + return { name: elem.name, value: val.replace( rCRLF, "\r\n" ) }; + } ); + } + + return { name: elem.name, value: val.replace( rCRLF, "\r\n" ) }; + } ).get(); + } +} ); + + +var + r20 = /%20/g, + rhash = /#.*$/, + rantiCache = /([?&])_=[^&]*/, + rheaders = /^(.*?):[ \t]*([^\r\n]*)$/mg, + + // #7653, #8125, #8152: local protocol detection + rlocalProtocol = /^(?:about|app|app-storage|.+-extension|file|res|widget):$/, + rnoContent = /^(?:GET|HEAD)$/, + rprotocol = /^\/\//, + + /* Prefilters + * 1) They are useful to introduce custom dataTypes (see ajax/jsonp.js for an example) + * 2) These are called: + * - BEFORE asking for a transport + * - AFTER param serialization (s.data is a string if s.processData is true) + * 3) key is the dataType + * 4) the catchall symbol "*" can be used + * 5) execution will start with transport dataType and THEN continue down to "*" if needed + */ + prefilters = {}, + + /* Transports bindings + * 1) key is the dataType + * 2) the catchall symbol "*" can be used + * 3) selection will start with transport dataType and THEN go to "*" if needed + */ + transports = {}, + + // Avoid comment-prolog char sequence (#10098); must appease lint and evade compression + allTypes = "*/".concat( "*" ), + + // Anchor tag for parsing the document origin + originAnchor = document.createElement( "a" ); + originAnchor.href = location.href; + +// Base "constructor" for jQuery.ajaxPrefilter and jQuery.ajaxTransport +function 
addToPrefiltersOrTransports( structure ) { + + // dataTypeExpression is optional and defaults to "*" + return function( dataTypeExpression, func ) { + + if ( typeof dataTypeExpression !== "string" ) { + func = dataTypeExpression; + dataTypeExpression = "*"; + } + + var dataType, + i = 0, + dataTypes = dataTypeExpression.toLowerCase().match( rnothtmlwhite ) || []; + + if ( isFunction( func ) ) { + + // For each dataType in the dataTypeExpression + while ( ( dataType = dataTypes[ i++ ] ) ) { + + // Prepend if requested + if ( dataType[ 0 ] === "+" ) { + dataType = dataType.slice( 1 ) || "*"; + ( structure[ dataType ] = structure[ dataType ] || [] ).unshift( func ); + + // Otherwise append + } else { + ( structure[ dataType ] = structure[ dataType ] || [] ).push( func ); + } + } + } + }; +} + +// Base inspection function for prefilters and transports +function inspectPrefiltersOrTransports( structure, options, originalOptions, jqXHR ) { + + var inspected = {}, + seekingTransport = ( structure === transports ); + + function inspect( dataType ) { + var selected; + inspected[ dataType ] = true; + jQuery.each( structure[ dataType ] || [], function( _, prefilterOrFactory ) { + var dataTypeOrTransport = prefilterOrFactory( options, originalOptions, jqXHR ); + if ( typeof dataTypeOrTransport === "string" && + !seekingTransport && !inspected[ dataTypeOrTransport ] ) { + + options.dataTypes.unshift( dataTypeOrTransport ); + inspect( dataTypeOrTransport ); + return false; + } else if ( seekingTransport ) { + return !( selected = dataTypeOrTransport ); + } + } ); + return selected; + } + + return inspect( options.dataTypes[ 0 ] ) || !inspected[ "*" ] && inspect( "*" ); +} + +// A special extend for ajax options +// that takes "flat" options (not to be deep extended) +// Fixes #9887 +function ajaxExtend( target, src ) { + var key, deep, + flatOptions = jQuery.ajaxSettings.flatOptions || {}; + + for ( key in src ) { + if ( src[ key ] !== undefined ) { + ( flatOptions[ key ] ? 
target : ( deep || ( deep = {} ) ) )[ key ] = src[ key ]; + } + } + if ( deep ) { + jQuery.extend( true, target, deep ); + } + + return target; +} + +/* Handles responses to an ajax request: + * - finds the right dataType (mediates between content-type and expected dataType) + * - returns the corresponding response + */ +function ajaxHandleResponses( s, jqXHR, responses ) { + + var ct, type, finalDataType, firstDataType, + contents = s.contents, + dataTypes = s.dataTypes; + + // Remove auto dataType and get content-type in the process + while ( dataTypes[ 0 ] === "*" ) { + dataTypes.shift(); + if ( ct === undefined ) { + ct = s.mimeType || jqXHR.getResponseHeader( "Content-Type" ); + } + } + + // Check if we're dealing with a known content-type + if ( ct ) { + for ( type in contents ) { + if ( contents[ type ] && contents[ type ].test( ct ) ) { + dataTypes.unshift( type ); + break; + } + } + } + + // Check to see if we have a response for the expected dataType + if ( dataTypes[ 0 ] in responses ) { + finalDataType = dataTypes[ 0 ]; + } else { + + // Try convertible dataTypes + for ( type in responses ) { + if ( !dataTypes[ 0 ] || s.converters[ type + " " + dataTypes[ 0 ] ] ) { + finalDataType = type; + break; + } + if ( !firstDataType ) { + firstDataType = type; + } + } + + // Or just use first one + finalDataType = finalDataType || firstDataType; + } + + // If we found a dataType + // We add the dataType to the list if needed + // and return the corresponding response + if ( finalDataType ) { + if ( finalDataType !== dataTypes[ 0 ] ) { + dataTypes.unshift( finalDataType ); + } + return responses[ finalDataType ]; + } +} + +/* Chain conversions given the request and the original response + * Also sets the responseXXX fields on the jqXHR instance + */ +function ajaxConvert( s, response, jqXHR, isSuccess ) { + var conv2, current, conv, tmp, prev, + converters = {}, + + // Work with a copy of dataTypes in case we need to modify it for conversion + dataTypes = 
s.dataTypes.slice(); + + // Create converters map with lowercased keys + if ( dataTypes[ 1 ] ) { + for ( conv in s.converters ) { + converters[ conv.toLowerCase() ] = s.converters[ conv ]; + } + } + + current = dataTypes.shift(); + + // Convert to each sequential dataType + while ( current ) { + + if ( s.responseFields[ current ] ) { + jqXHR[ s.responseFields[ current ] ] = response; + } + + // Apply the dataFilter if provided + if ( !prev && isSuccess && s.dataFilter ) { + response = s.dataFilter( response, s.dataType ); + } + + prev = current; + current = dataTypes.shift(); + + if ( current ) { + + // There's only work to do if current dataType is non-auto + if ( current === "*" ) { + + current = prev; + + // Convert response if prev dataType is non-auto and differs from current + } else if ( prev !== "*" && prev !== current ) { + + // Seek a direct converter + conv = converters[ prev + " " + current ] || converters[ "* " + current ]; + + // If none found, seek a pair + if ( !conv ) { + for ( conv2 in converters ) { + + // If conv2 outputs current + tmp = conv2.split( " " ); + if ( tmp[ 1 ] === current ) { + + // If prev can be converted to accepted input + conv = converters[ prev + " " + tmp[ 0 ] ] || + converters[ "* " + tmp[ 0 ] ]; + if ( conv ) { + + // Condense equivalence converters + if ( conv === true ) { + conv = converters[ conv2 ]; + + // Otherwise, insert the intermediate dataType + } else if ( converters[ conv2 ] !== true ) { + current = tmp[ 0 ]; + dataTypes.unshift( tmp[ 1 ] ); + } + break; + } + } + } + } + + // Apply converter (if not an equivalence) + if ( conv !== true ) { + + // Unless errors are allowed to bubble, catch and return them + if ( conv && s.throws ) { + response = conv( response ); + } else { + try { + response = conv( response ); + } catch ( e ) { + return { + state: "parsererror", + error: conv ? 
e : "No conversion from " + prev + " to " + current + }; + } + } + } + } + } + } + + return { state: "success", data: response }; +} + +jQuery.extend( { + + // Counter for holding the number of active queries + active: 0, + + // Last-Modified header cache for next request + lastModified: {}, + etag: {}, + + ajaxSettings: { + url: location.href, + type: "GET", + isLocal: rlocalProtocol.test( location.protocol ), + global: true, + processData: true, + async: true, + contentType: "application/x-www-form-urlencoded; charset=UTF-8", + + /* + timeout: 0, + data: null, + dataType: null, + username: null, + password: null, + cache: null, + throws: false, + traditional: false, + headers: {}, + */ + + accepts: { + "*": allTypes, + text: "text/plain", + html: "text/html", + xml: "application/xml, text/xml", + json: "application/json, text/javascript" + }, + + contents: { + xml: /\bxml\b/, + html: /\bhtml/, + json: /\bjson\b/ + }, + + responseFields: { + xml: "responseXML", + text: "responseText", + json: "responseJSON" + }, + + // Data converters + // Keys separate source (or catchall "*") and destination types with a single space + converters: { + + // Convert anything to text + "* text": String, + + // Text to html (true = no transformation) + "text html": true, + + // Evaluate text as a json expression + "text json": JSON.parse, + + // Parse text as xml + "text xml": jQuery.parseXML + }, + + // For options that shouldn't be deep extended: + // you can add your own custom options here if + // and when you create one that shouldn't be + // deep extended (see ajaxExtend) + flatOptions: { + url: true, + context: true + } + }, + + // Creates a full fledged settings object into target + // with both ajaxSettings and settings fields. + // If target is omitted, writes into ajaxSettings. + ajaxSetup: function( target, settings ) { + return settings ? 
+ + // Building a settings object + ajaxExtend( ajaxExtend( target, jQuery.ajaxSettings ), settings ) : + + // Extending ajaxSettings + ajaxExtend( jQuery.ajaxSettings, target ); + }, + + ajaxPrefilter: addToPrefiltersOrTransports( prefilters ), + ajaxTransport: addToPrefiltersOrTransports( transports ), + + // Main method + ajax: function( url, options ) { + + // If url is an object, simulate pre-1.5 signature + if ( typeof url === "object" ) { + options = url; + url = undefined; + } + + // Force options to be an object + options = options || {}; + + var transport, + + // URL without anti-cache param + cacheURL, + + // Response headers + responseHeadersString, + responseHeaders, + + // timeout handle + timeoutTimer, + + // Url cleanup var + urlAnchor, + + // Request state (becomes false upon send and true upon completion) + completed, + + // To know if global events are to be dispatched + fireGlobals, + + // Loop variable + i, + + // uncached part of the url + uncached, + + // Create the final options object + s = jQuery.ajaxSetup( {}, options ), + + // Callbacks context + callbackContext = s.context || s, + + // Context for global events is callbackContext if it is a DOM node or jQuery collection + globalEventContext = s.context && + ( callbackContext.nodeType || callbackContext.jquery ) ? 
+				jQuery( callbackContext ) :
+				jQuery.event,
+
+			// Deferreds
+			deferred = jQuery.Deferred(),
+			completeDeferred = jQuery.Callbacks( "once memory" ),
+
+			// Status-dependent callbacks
+			statusCode = s.statusCode || {},
+
+			// Headers (they are sent all at once)
+			requestHeaders = {},
+			requestHeadersNames = {},
+
+			// Default abort message
+			strAbort = "canceled",
+
+			// Fake xhr
+			jqXHR = {
+				readyState: 0,
+
+				// Builds headers hashtable if needed
+				getResponseHeader: function( key ) {
+					var match;
+					if ( completed ) {
+						if ( !responseHeaders ) {
+							responseHeaders = {};
+							while ( ( match = rheaders.exec( responseHeadersString ) ) ) {
+								responseHeaders[ match[ 1 ].toLowerCase() + " " ] =
+									( responseHeaders[ match[ 1 ].toLowerCase() + " " ] || [] )
+										.concat( match[ 2 ] );
+							}
+						}
+						match = responseHeaders[ key.toLowerCase() + " " ];
+					}
+					return match == null ? null : match.join( ", " );
+				},
+
+				// Raw string
+				getAllResponseHeaders: function() {
+					return completed ? responseHeadersString : null;
+				},
+
+				// Caches the header
+				setRequestHeader: function( name, value ) {
+					if ( completed == null ) {
+						name = requestHeadersNames[ name.toLowerCase() ] =
+							requestHeadersNames[ name.toLowerCase() ] || name;
+						requestHeaders[ name ] = value;
+					}
+					return this;
+				},
+
+				// Overrides response content-type header
+				overrideMimeType: function( type ) {
+					if ( completed == null ) {
+						s.mimeType = type;
+					}
+					return this;
+				},
+
+				// Status-dependent callbacks
+				statusCode: function( map ) {
+					var code;
+					if ( map ) {
+						if ( completed ) {
+
+							// Execute the appropriate callbacks
+							jqXHR.always( map[ jqXHR.status ] );
+						} else {
+
+							// Lazy-add the new callbacks in a way that preserves old ones
+							for ( code in map ) {
+								statusCode[ code ] = [ statusCode[ code ], map[ code ] ];
+							}
+						}
+					}
+					return this;
+				},
+
+				// Cancel the request
+				abort: function( statusText ) {
+					var finalText = statusText || strAbort;
+					if ( transport ) {
+						transport.abort( finalText );
+					}
+					done( 0, finalText );
+					return this;
+				}
+			};
+
+		// Attach deferreds
+		deferred.promise( jqXHR );
+
+		// Add protocol if not provided (prefilters might expect it)
+		// Handle falsy url in the settings object (#10093: consistency with old signature)
+		// We also use the url parameter if available
+		s.url = ( ( url || s.url || location.href ) + "" )
+			.replace( rprotocol, location.protocol + "//" );
+
+		// Alias method option to type as per ticket #12004
+		s.type = options.method || options.type || s.method || s.type;
+
+		// Extract dataTypes list
+		s.dataTypes = ( s.dataType || "*" ).toLowerCase().match( rnothtmlwhite ) || [ "" ];
+
+		// A cross-domain request is in order when the origin doesn't match the current origin.
+		if ( s.crossDomain == null ) {
+			urlAnchor = document.createElement( "a" );
+
+			// Support: IE <=8 - 11, Edge 12 - 15
+			// IE throws exception on accessing the href property if url is malformed,
+			// e.g. http://example.com:80x/
+			try {
+				urlAnchor.href = s.url;
+
+				// Support: IE <=8 - 11 only
+				// Anchor's host property isn't correctly set when s.url is relative
+				urlAnchor.href = urlAnchor.href;
+				s.crossDomain = originAnchor.protocol + "//" + originAnchor.host !==
+					urlAnchor.protocol + "//" + urlAnchor.host;
+			} catch ( e ) {
+
+				// If there is an error parsing the URL, assume it is crossDomain,
+				// it can be rejected by the transport if it is invalid
+				s.crossDomain = true;
+			}
+		}
+
+		// Convert data if not already a string
+		if ( s.data && s.processData && typeof s.data !== "string" ) {
+			s.data = jQuery.param( s.data, s.traditional );
+		}
+
+		// Apply prefilters
+		inspectPrefiltersOrTransports( prefilters, s, options, jqXHR );
+
+		// If request was aborted inside a prefilter, stop there
+		if ( completed ) {
+			return jqXHR;
+		}
+
+		// We can fire global events as of now if asked to
+		// Don't fire events if jQuery.event is undefined in an AMD-usage scenario (#15118)
+		fireGlobals = jQuery.event && s.global;
+
+		// Watch for a new set of requests
+		if ( fireGlobals && jQuery.active++ === 0 ) {
+			jQuery.event.trigger( "ajaxStart" );
+		}
+
+		// Uppercase the type
+		s.type = s.type.toUpperCase();
+
+		// Determine if request has content
+		s.hasContent = !rnoContent.test( s.type );
+
+		// Save the URL in case we're toying with the If-Modified-Since
+		// and/or If-None-Match header later on
+		// Remove hash to simplify url manipulation
+		cacheURL = s.url.replace( rhash, "" );
+
+		// More options handling for requests with no content
+		if ( !s.hasContent ) {
+
+			// Remember the hash so we can put it back
+			uncached = s.url.slice( cacheURL.length );
+
+			// If data is available and should be processed, append data to url
+			if ( s.data && ( s.processData || typeof s.data === "string" ) ) {
+				cacheURL += ( rquery.test( cacheURL ) ? "&" : "?" ) + s.data;
+
+				// #9682: remove data so that it's not used in an eventual retry
+				delete s.data;
+			}
+
+			// Add or update anti-cache param if needed
+			if ( s.cache === false ) {
+				cacheURL = cacheURL.replace( rantiCache, "$1" );
+				uncached = ( rquery.test( cacheURL ) ? "&" : "?" ) + "_=" + ( nonce.guid++ ) +
+					uncached;
+			}
+
+			// Put hash and anti-cache on the URL that will be requested (gh-1732)
+			s.url = cacheURL + uncached;
+
+		// Change '%20' to '+' if this is encoded form body content (gh-2658)
+		} else if ( s.data && s.processData &&
+			( s.contentType || "" ).indexOf( "application/x-www-form-urlencoded" ) === 0 ) {
+			s.data = s.data.replace( r20, "+" );
+		}
+
+		// Set the If-Modified-Since and/or If-None-Match header, if in ifModified mode.
+		if ( s.ifModified ) {
+			if ( jQuery.lastModified[ cacheURL ] ) {
+				jqXHR.setRequestHeader( "If-Modified-Since", jQuery.lastModified[ cacheURL ] );
+			}
+			if ( jQuery.etag[ cacheURL ] ) {
+				jqXHR.setRequestHeader( "If-None-Match", jQuery.etag[ cacheURL ] );
+			}
+		}
+
+		// Set the correct header, if data is being sent
+		if ( s.data && s.hasContent && s.contentType !== false || options.contentType ) {
+			jqXHR.setRequestHeader( "Content-Type", s.contentType );
+		}
+
+		// Set the Accepts header for the server, depending on the dataType
+		jqXHR.setRequestHeader(
+			"Accept",
+			s.dataTypes[ 0 ] && s.accepts[ s.dataTypes[ 0 ] ] ?
+				s.accepts[ s.dataTypes[ 0 ] ] +
+					( s.dataTypes[ 0 ] !== "*" ? ", " + allTypes + "; q=0.01" : "" ) :
+				s.accepts[ "*" ]
+		);
+
+		// Check for headers option
+		for ( i in s.headers ) {
+			jqXHR.setRequestHeader( i, s.headers[ i ] );
+		}
+
+		// Allow custom headers/mimetypes and early abort
+		if ( s.beforeSend &&
+			( s.beforeSend.call( callbackContext, jqXHR, s ) === false || completed ) ) {
+
+			// Abort if not done already and return
+			return jqXHR.abort();
+		}
+
+		// Aborting is no longer a cancellation
+		strAbort = "abort";
+
+		// Install callbacks on deferreds
+		completeDeferred.add( s.complete );
+		jqXHR.done( s.success );
+		jqXHR.fail( s.error );
+
+		// Get transport
+		transport = inspectPrefiltersOrTransports( transports, s, options, jqXHR );
+
+		// If no transport, we auto-abort
+		if ( !transport ) {
+			done( -1, "No Transport" );
+		} else {
+			jqXHR.readyState = 1;
+
+			// Send global event
+			if ( fireGlobals ) {
+				globalEventContext.trigger( "ajaxSend", [ jqXHR, s ] );
+			}
+
+			// If request was aborted inside ajaxSend, stop there
+			if ( completed ) {
+				return jqXHR;
+			}
+
+			// Timeout
+			if ( s.async && s.timeout > 0 ) {
+				timeoutTimer = window.setTimeout( function() {
+					jqXHR.abort( "timeout" );
+				}, s.timeout );
+			}
+
+			try {
+				completed = false;
+				transport.send( requestHeaders, done );
+			} catch ( e ) {
+
+				// Rethrow post-completion exceptions
+				if ( completed ) {
+					throw e;
+				}
+
+				// Propagate others as results
+				done( -1, e );
+			}
+		}
+
+		// Callback for when everything is done
+		function done( status, nativeStatusText, responses, headers ) {
+			var isSuccess, success, error, response, modified,
+				statusText = nativeStatusText;
+
+			// Ignore repeat invocations
+			if ( completed ) {
+				return;
+			}
+
+			completed = true;
+
+			// Clear timeout if it exists
+			if ( timeoutTimer ) {
+				window.clearTimeout( timeoutTimer );
+			}
+
+			// Dereference transport for early garbage collection
+			// (no matter how long the jqXHR object will be used)
+			transport = undefined;
+
+			// Cache response headers
+			responseHeadersString = headers || "";
+
+			// Set readyState
+			jqXHR.readyState = status > 0 ? 4 : 0;
+
+			// Determine if successful
+			isSuccess = status >= 200 && status < 300 || status === 304;
+
+			// Get response data
+			if ( responses ) {
+				response = ajaxHandleResponses( s, jqXHR, responses );
+			}
+
+			// Use a noop converter for missing script
+			if ( !isSuccess && jQuery.inArray( "script", s.dataTypes ) > -1 ) {
+				s.converters[ "text script" ] = function() {};
+			}
+
+			// Convert no matter what (that way responseXXX fields are always set)
+			response = ajaxConvert( s, response, jqXHR, isSuccess );
+
+			// If successful, handle type chaining
+			if ( isSuccess ) {
+
+				// Set the If-Modified-Since and/or If-None-Match header, if in ifModified mode.
+				if ( s.ifModified ) {
+					modified = jqXHR.getResponseHeader( "Last-Modified" );
+					if ( modified ) {
+						jQuery.lastModified[ cacheURL ] = modified;
+					}
+					modified = jqXHR.getResponseHeader( "etag" );
+					if ( modified ) {
+						jQuery.etag[ cacheURL ] = modified;
+					}
+				}
+
+				// if no content
+				if ( status === 204 || s.type === "HEAD" ) {
+					statusText = "nocontent";
+
+				// if not modified
+				} else if ( status === 304 ) {
+					statusText = "notmodified";
+
+				// If we have data, let's convert it
+				} else {
+					statusText = response.state;
+					success = response.data;
+					error = response.error;
+					isSuccess = !error;
+				}
+			} else {
+
+				// Extract error from statusText and normalize for non-aborts
+				error = statusText;
+				if ( status || !statusText ) {
+					statusText = "error";
+					if ( status < 0 ) {
+						status = 0;
+					}
+				}
+			}
+
+			// Set data for the fake xhr object
+			jqXHR.status = status;
+			jqXHR.statusText = ( nativeStatusText || statusText ) + "";
+
+			// Success/Error
+			if ( isSuccess ) {
+				deferred.resolveWith( callbackContext, [ success, statusText, jqXHR ] );
+			} else {
+				deferred.rejectWith( callbackContext, [ jqXHR, statusText, error ] );
+			}
+
+			// Status-dependent callbacks
+			jqXHR.statusCode( statusCode );
+			statusCode = undefined;
+
+			if ( fireGlobals ) {
+				globalEventContext.trigger( isSuccess ? "ajaxSuccess" : "ajaxError",
+					[ jqXHR, s, isSuccess ? success : error ] );
+			}
+
+			// Complete
+			completeDeferred.fireWith( callbackContext, [ jqXHR, statusText ] );
+
+			if ( fireGlobals ) {
+				globalEventContext.trigger( "ajaxComplete", [ jqXHR, s ] );
+
+				// Handle the global AJAX counter
+				if ( !( --jQuery.active ) ) {
+					jQuery.event.trigger( "ajaxStop" );
+				}
+			}
+		}
+
+		return jqXHR;
+	},
+
+	getJSON: function( url, data, callback ) {
+		return jQuery.get( url, data, callback, "json" );
+	},
+
+	getScript: function( url, callback ) {
+		return jQuery.get( url, undefined, callback, "script" );
+	}
+} );
+
+jQuery.each( [ "get", "post" ], function( _i, method ) {
+	jQuery[ method ] = function( url, data, callback, type ) {
+
+		// Shift arguments if data argument was omitted
+		if ( isFunction( data ) ) {
+			type = type || callback;
+			callback = data;
+			data = undefined;
+		}
+
+		// The url can be an options object (which then must have .url)
+		return jQuery.ajax( jQuery.extend( {
+			url: url,
+			type: method,
+			dataType: type,
+			data: data,
+			success: callback
+		}, jQuery.isPlainObject( url ) && url ) );
+	};
+} );
+
+jQuery.ajaxPrefilter( function( s ) {
+	var i;
+	for ( i in s.headers ) {
+		if ( i.toLowerCase() === "content-type" ) {
+			s.contentType = s.headers[ i ] || "";
+		}
+	}
+} );
+
+
+jQuery._evalUrl = function( url, options, doc ) {
+	return jQuery.ajax( {
+		url: url,
+
+		// Make this explicit, since user can override this through ajaxSetup (#11264)
+		type: "GET",
+		dataType: "script",
+		cache: true,
+		async: false,
+		global: false,
+
+		// Only evaluate the response if it is successful (gh-4126)
+		// dataFilter is not invoked for failure responses, so using it instead
+		// of the default converter is kludgy but it works.
+		converters: {
+			"text script": function() {}
+		},
+		dataFilter: function( response ) {
+			jQuery.globalEval( response, options, doc );
+		}
+	} );
+};
+
+
+jQuery.fn.extend( {
+	wrapAll: function( html ) {
+		var wrap;
+
+		if ( this[ 0 ] ) {
+			if ( isFunction( html ) ) {
+				html = html.call( this[ 0 ] );
+			}
+
+			// The elements to wrap the target around
+			wrap = jQuery( html, this[ 0 ].ownerDocument ).eq( 0 ).clone( true );
+
+			if ( this[ 0 ].parentNode ) {
+				wrap.insertBefore( this[ 0 ] );
+			}
+
+			wrap.map( function() {
+				var elem = this;
+
+				while ( elem.firstElementChild ) {
+					elem = elem.firstElementChild;
+				}
+
+				return elem;
+			} ).append( this );
+		}
+
+		return this;
+	},
+
+	wrapInner: function( html ) {
+		if ( isFunction( html ) ) {
+			return this.each( function( i ) {
+				jQuery( this ).wrapInner( html.call( this, i ) );
+			} );
+		}
+
+		return this.each( function() {
+			var self = jQuery( this ),
+				contents = self.contents();
+
+			if ( contents.length ) {
+				contents.wrapAll( html );
+
+			} else {
+				self.append( html );
+			}
+		} );
+	},
+
+	wrap: function( html ) {
+		var htmlIsFunction = isFunction( html );
+
+		return this.each( function( i ) {
+			jQuery( this ).wrapAll( htmlIsFunction ? html.call( this, i ) : html );
+		} );
+	},
+
+	unwrap: function( selector ) {
+		this.parent( selector ).not( "body" ).each( function() {
+			jQuery( this ).replaceWith( this.childNodes );
+		} );
+		return this;
+	}
+} );
+
+
+jQuery.expr.pseudos.hidden = function( elem ) {
+	return !jQuery.expr.pseudos.visible( elem );
+};
+jQuery.expr.pseudos.visible = function( elem ) {
+	return !!( elem.offsetWidth || elem.offsetHeight || elem.getClientRects().length );
+};
+
+
+
+
+jQuery.ajaxSettings.xhr = function() {
+	try {
+		return new window.XMLHttpRequest();
+	} catch ( e ) {}
+};
+
+var xhrSuccessStatus = {
+
+		// File protocol always yields status code 0, assume 200
+		0: 200,
+
+		// Support: IE <=9 only
+		// #1450: sometimes IE returns 1223 when it should be 204
+		1223: 204
+	},
+	xhrSupported = jQuery.ajaxSettings.xhr();
+
+support.cors = !!xhrSupported && ( "withCredentials" in xhrSupported );
+support.ajax = xhrSupported = !!xhrSupported;
+
+jQuery.ajaxTransport( function( options ) {
+	var callback, errorCallback;
+
+	// Cross domain only allowed if supported through XMLHttpRequest
+	if ( support.cors || xhrSupported && !options.crossDomain ) {
+		return {
+			send: function( headers, complete ) {
+				var i,
+					xhr = options.xhr();
+
+				xhr.open(
+					options.type,
+					options.url,
+					options.async,
+					options.username,
+					options.password
+				);
+
+				// Apply custom fields if provided
+				if ( options.xhrFields ) {
+					for ( i in options.xhrFields ) {
+						xhr[ i ] = options.xhrFields[ i ];
+					}
+				}
+
+				// Override mime type if needed
+				if ( options.mimeType && xhr.overrideMimeType ) {
+					xhr.overrideMimeType( options.mimeType );
+				}
+
+				// X-Requested-With header
+				// For cross-domain requests, seeing as conditions for a preflight are
+				// akin to a jigsaw puzzle, we simply never set it to be sure.
+				// (it can always be set on a per-request basis or even using ajaxSetup)
+				// For same-domain requests, won't change header if already provided.
+				if ( !options.crossDomain && !headers[ "X-Requested-With" ] ) {
+					headers[ "X-Requested-With" ] = "XMLHttpRequest";
+				}
+
+				// Set headers
+				for ( i in headers ) {
+					xhr.setRequestHeader( i, headers[ i ] );
+				}
+
+				// Callback
+				callback = function( type ) {
+					return function() {
+						if ( callback ) {
+							callback = errorCallback = xhr.onload =
+								xhr.onerror = xhr.onabort = xhr.ontimeout =
+								xhr.onreadystatechange = null;
+
+							if ( type === "abort" ) {
+								xhr.abort();
+							} else if ( type === "error" ) {
+
+								// Support: IE <=9 only
+								// On a manual native abort, IE9 throws
+								// errors on any property access that is not readyState
+								if ( typeof xhr.status !== "number" ) {
+									complete( 0, "error" );
+								} else {
+									complete(
+
+										// File: protocol always yields status 0; see #8605, #14207
+										xhr.status,
+										xhr.statusText
+									);
+								}
+							} else {
+								complete(
+									xhrSuccessStatus[ xhr.status ] || xhr.status,
+									xhr.statusText,
+
+									// Support: IE <=9 only
+									// IE9 has no XHR2 but throws on binary (trac-11426)
+									// For XHR2 non-text, let the caller handle it (gh-2498)
+									( xhr.responseType || "text" ) !== "text" ||
+									typeof xhr.responseText !== "string" ?
+										{ binary: xhr.response } :
+										{ text: xhr.responseText },
+									xhr.getAllResponseHeaders()
+								);
+							}
+						}
+					};
+				};
+
+				// Listen to events
+				xhr.onload = callback();
+				errorCallback = xhr.onerror = xhr.ontimeout = callback( "error" );
+
+				// Support: IE 9 only
+				// Use onreadystatechange to replace onabort
+				// to handle uncaught aborts
+				if ( xhr.onabort !== undefined ) {
+					xhr.onabort = errorCallback;
+				} else {
+					xhr.onreadystatechange = function() {
+
+						// Check readyState before timeout as it changes
+						if ( xhr.readyState === 4 ) {
+
+							// Allow onerror to be called first,
+							// but that will not handle a native abort
+							// Also, save errorCallback to a variable
+							// as xhr.onerror cannot be accessed
+							window.setTimeout( function() {
+								if ( callback ) {
+									errorCallback();
+								}
+							} );
+						}
+					};
+				}
+
+				// Create the abort callback
+				callback = callback( "abort" );
+
+				try {
+
+					// Do send the request (this may raise an exception)
+					xhr.send( options.hasContent && options.data || null );
+				} catch ( e ) {
+
+					// #14683: Only rethrow if this hasn't been notified as an error yet
+					if ( callback ) {
+						throw e;
+					}
+				}
+			},
+
+			abort: function() {
+				if ( callback ) {
+					callback();
+				}
+			}
+		};
+	}
+} );
+
+
+
+
+// Prevent auto-execution of scripts when no explicit dataType was provided (See gh-2432)
+jQuery.ajaxPrefilter( function( s ) {
+	if ( s.crossDomain ) {
+		s.contents.script = false;
+	}
+} );
+
+// Install script dataType
+jQuery.ajaxSetup( {
+	accepts: {
+		script: "text/javascript, application/javascript, " +
+			"application/ecmascript, application/x-ecmascript"
+	},
+	contents: {
+		script: /\b(?:java|ecma)script\b/
+	},
+	converters: {
+		"text script": function( text ) {
+			jQuery.globalEval( text );
+			return text;
+		}
+	}
+} );
+
+// Handle cache's special case and crossDomain
+jQuery.ajaxPrefilter( "script", function( s ) {
+	if ( s.cache === undefined ) {
+		s.cache = false;
+	}
+	if ( s.crossDomain ) {
+		s.type = "GET";
+	}
+} );
+
+// Bind script tag hack transport
+jQuery.ajaxTransport( "script", function( s ) {
+
+	// This transport only deals with cross domain or forced-by-attrs requests
+	if ( s.crossDomain || s.scriptAttrs ) {
+		var script, callback;
+		return {
+			send: function( _, complete ) {
+				script = jQuery( "