First part of book for review #49

Merged: 29 commits, Mar 25, 2024
Changes from all commits
0c7faf6
Rephrasing fake tree nn exercise
uschiLaa Feb 26, 2024
9aeca85
Merge branch 'pdf' of https://github.com/dicook/mulgar_book into pdf
uschiLaa Feb 26, 2024
79dd3f1
Wrapping up the NN section
uschiLaa Feb 26, 2024
d4e822f
Data description for risk_MSA
uschiLaa Feb 28, 2024
8eb28de
full pdf version
dicook Feb 28, 2024
0dae551
table in mclust chapter fixed
dicook Feb 28, 2024
dedec26
exercises to intro to classification
dicook Mar 4, 2024
6482ae4
tidied up Appendix A
dicook Mar 4, 2024
c7d41bb
added tidymodels as background reading
dicook Mar 4, 2024
e849d88
Removing commented section in intro
uschiLaa Mar 5, 2024
e1b032a
Small fixes in 2 - notation
uschiLaa Mar 5, 2024
c9dc57b
wrapped .hidden in a hidden when pdf
dicook Mar 5, 2024
77c23c2
added caption to shadow puppets figure and a paragraph
dicook Mar 5, 2024
c966a44
shadow puppets, testing format
dicook Mar 5, 2024
5faaf71
may have solved Ursulas problem with PCA chapter
dicook Mar 5, 2024
b1c6e78
trying to resolve ursula problem with pca chapter
dicook Mar 6, 2024
c3e9842
Adding fa icon to figure caption for still shots
uschiLaa Mar 7, 2024
3f7b98d
Small fixes in pca section
uschiLaa Mar 7, 2024
94ec05f
intro to dimension reduction
dicook Mar 9, 2024
12fb4dc
updated gifs for nn chapter with training/test set
dicook Mar 11, 2024
13af336
Small fixes NLDR chapter
uschiLaa Mar 17, 2024
295569b
Small fixes in 6 and 7
uschiLaa Mar 17, 2024
2a39743
PCA chapter revised
dicook Mar 18, 2024
c657da1
removed extra code from spin-and-brush
dicook Mar 18, 2024
6e35850
done with NLDR chapter
dicook Mar 19, 2024
2ff2916
done with data chapter
dicook Mar 19, 2024
ec88e0d
project continued from NLDR into spin-and-brush
dicook Mar 19, 2024
31bbcc7
Small fixes in the appenix chapters
uschiLaa Mar 20, 2024
368bcb9
chapters 1-4, pages 1-56 for review
dicook Mar 25, 2024
53 changes: 20 additions & 33 deletions 1-intro.qmd
@@ -5,9 +5,11 @@ High-dimensional data means that we have a large number of numeric features or v
\index{variable}\index{feature}
\index{projection}

- ![](images/shadow_puppets.png){width=450 fig-align="center" fig-env="figure*" fig-cap="Viewing high dimensions using low-dimensional displays is like playing shadow puppets, looking at the shadows to guess what the shape is." fig-alt="Three images, each with a hand or two hands, illustrating making shadows of a bird in flight, snail and dog."}
+ ![Viewing high dimensions using low-dimensional displays is like playing shadow puppets, looking at the shadows to guess what the shape is.](images/shadow_puppets.png){#fig-shadow-puppets width=450 fig-alt="Three images, each with a hand or two hands, illustrating making shadows of a bird in flight, snail and dog."}


One approach to visualising high-dimensional data and models is to use linear projections, as done in a tour. You can think of projections of high-dimensional data like shadows (@fig-shadow-puppets). Unlike shadow puppets, though, the object stays fixed, and with multiple projections we can obtain a *view of the object from all sides*.
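The arithmetic behind the shadow analogy is a single matrix multiplication: a 1D projection of the data is the product of the data matrix with a unit vector. As a rough, language-agnostic sketch (in Python/NumPy rather than the book's R, with made-up cluster data standing in for `simple_clusters`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical clusters in 2D, mimicking the simple_clusters example
x = np.vstack([
    rng.normal(loc=[-1, -1], scale=0.3, size=(100, 2)),
    rng.normal(loc=[1, 1], scale=0.3, size=(100, 2)),
])

# A 1D projection is x @ a, where a is a unit (basis) vector
a_clustered = np.array([0.707, 0.707])   # along x1 = x2: most clustered
a_flat = np.array([0.707, -0.707])       # along x1 = -x2: no clusters

p1 = x @ a_clustered  # bimodal: the two clusters separate
p2 = x @ a_flat       # unimodal: the two clusters overlap

# The clustered direction spreads the two cluster means far apart
print(abs(p1[:100].mean() - p1[100:].mean()))  # large (~2.8)
print(abs(p2[:100].mean() - p2[100:].mean()))  # near 0
```

Spinning the projection vector `a` continuously, and watching the density of the projected points, is exactly the 1D tour illustrated in the figures below.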


## Getting familiar with tours

@@ -24,7 +26,7 @@ s_p <- ggplot(simple_clusters, aes(x=x1, y=x2)) +
annotate("text", x=2.0, y=2.2, label="(0.707, 0.707)", angle=45) +
annotate("text", x=2.2, y=2.0, label="most clustered", angle=45) +
geom_abline(intercept=0, slope=-1) +
- annotate("text", x=-1.6, y=1.8, label="(-0.707, 0.707)", angle=-45) +
+ annotate("text", x=-1.6, y=1.8, label="(0.707, -0.707)", angle=-45) +
annotate("text", x=-1.8, y=1.6, label="no clusters", angle=-45) +
geom_abline(intercept=0, slope=0) +
annotate("text", x=-1.6, y=0.15, label="(1, 0)") +
@@ -104,7 +106,7 @@ How a tour can be used to explore high-dimensional data illustrated using (a) 2D

```{r fig-explain-1D-pdf, eval=knitr::is_latex_output()}
#| echo: false
- #| fig-cap: "How a tour can be used to explore high-dimensional data illustrated using (a) 2D data with two clusters and (b,c,d) 1D projections from a tour shown as a density plot. Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. When it is at `x1=-x2 (0.707, -0.707)` there is no clustering."
+ #| fig-cap: "How a tour can be used to explore high-dimensional data illustrated using (a) 2D data with two clusters and (b,c,d) 1D projections from a tour shown as a density plot. Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. When it is at `x1=-x2 (0.707, -0.707)` there is no clustering. {{< fa play-circle >}}"
#| fig-width: 8
#| fig-height: 8
#| out-width: 100%
@@ -228,7 +230,7 @@ How a tour can be used to explore high-dimensional data illustrated by showing a

```{r fig-explain-2D-pdf, eval=knitr::is_latex_output()}
#| echo: false
- #| fig-cap: "How a tour can be used to explore high-dimensional data illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b). Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object."
+ #| fig-cap: "How a tour can be used to explore high-dimensional data illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b). Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object. {{< fa play-circle >}}"
#| fig-width: 8
#| fig-height: 8
#| out-width: 100%
@@ -530,7 +532,7 @@ Two 5D datasets shown as tours of 2D projections. Can you see clusters of points

![Outliers](images/outlier-intro.png){#fig-tour-clusters width=200}

- Frames from 2D tours on two 5D datasets, with clusters of points in (a) and two outliers with a plane in (b). This figure is best viewed in the HTML version of the book.
+ Frames from 2D tours on two 5D datasets, with clusters of points in (a) and two outliers with a plane in (b). This figure is best viewed in the HTML version of the book. {{< fa play-circle >}}
:::

:::
@@ -674,13 +676,6 @@ render_gif(plane_outliers[,1:5],
The movement of points gives further clues about the structure of the data in high dimensions. In the data with clustering, we can often see a group of points moving differently from the others. Because there are three clusters, you should see three distinct movement patterns. It is similar with outliers, except these may be individual points moving alone, differently from all others. This can be seen in the static plot: one point (top left) has a movement pattern upwards, whereas most of the other observations near it are moving down towards the right.
:::

<!--
![Movement pattern indicates clustering as seen in a grand tour.](gifs/trails-clusters.gif){#fig-clusters-trails-tour fig-alt="" width="300"}

![Movement pattern indicates an outlier as seen in a grand tour.](gifs/trails-outlier.gif){#fig-outlier-trails-tour fig-alt="" width="300"}

-->


This type of visualisation is useful for many activities in dealing with high-dimensional data, including:

@@ -711,13 +706,13 @@ With computer graphics, the capability of animating plots to show more than a si

The methods in this book primarily emerge from @As85's grand tour method. The algorithm provided the first smooth and continuous sequence of low dimensional projections, and guaranteed that all possible low dimensional projections were likely to be shown. The algorithm was refined in @BA86b (and documented in detail in @BCAH05) to make it *efficiently* show all possible projections. Since then there have been numerous varieties of tour algorithms developed to focus on specific tasks in exploring high dimensional data, and these are documented in @tours2022.
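At its core, each frame of a tour is a projection of the data onto a randomly chosen low-dimensional orthonormal basis; the grand tour then interpolates smoothly between successive bases along geodesics, which is omitted in this sketch. A minimal illustration of generating one random frame (in Python/NumPy rather than the R used by `tourr`, so names and details here are illustrative only):

```python
import numpy as np

def random_frame(d, k=2, rng=None):
    """Random d x k orthonormal basis: projects d-D data to k-D.

    Sampling a Gaussian matrix and orthonormalising it (QR) gives a
    frame whose spanned k-plane is uniformly distributed, which is the
    property the grand tour needs so that all projections are
    eventually shown.
    """
    if rng is None:
        rng = np.random.default_rng()
    a = rng.normal(size=(d, k))
    q, _ = np.linalg.qr(a)
    return q

rng = np.random.default_rng(1)
frame = random_frame(5, 2, rng)

# Columns are orthonormal: frame.T @ frame is the 2x2 identity
print(np.allclose(frame.T @ frame, np.eye(2)))  # True

# Projecting 5-D data gives one 2-D "view" (a single tour frame)
x = rng.normal(size=(200, 5))
view = x @ frame  # shape (200, 2)
```

A tour animation is then a sequence of such views, with in-between frames interpolated so the points appear to move continuously.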

- This book is an evolution from @CS07. One of the difficulties in working on interactive and dynamic graphics research has been the rapid change in technology. Programming languages have changed a little (FORTRAN to C to java to python) but graphics toolkits and display devices have changed a lot! The tour software used in this book evolved from XGobi, which was written in C and used the X Window System, which was then rewritten in GGobi using gtk. The video library has engaging videos of these software systems There have been several other short-lived implementations, including orca [@orca], written in java, and cranvas [@cranvas], written in R with a back-end provided by wrapper functions to qt libraries.
+ This book is an evolution from @CS07. One of the difficulties in working on interactive and dynamic graphics research has been the rapid change in technology. Programming languages have changed a little (FORTRAN to C to Java to Python) but graphics toolkits and display devices have changed a lot! The tour software used in this book evolved from XGobi, which was written in C and used the X Window System, and which was then rewritten as GGobi using gtk. The video library has engaging videos of these software systems. There have been several other short-lived implementations, including orca [@orca], written in Java, and cranvas [@cranvas], written in R with a back-end provided by wrapper functions to `qt` libraries.

Although attempts were made with these ancestor systems to connect the data plots to a statistical analysis system, these were always limited. With the emergence of R, having graphics in the data analysis workflow has been much easier, albeit at the cost of the interactivity with graphics that matches the old systems. We are mostly using the R package, `tourr` [@tourr] for examples in this book. It provides the machinery for running a tour, and has the flexibility that it can be ported, modified, and used as a regular element of data analysis.

## Exercises {-}

- 1. Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the `cube.solid.random` function of the `geozoo` package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations?
+ 1. Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the `cube.solid.random()` function of the `geozoo` package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations?
2. Use the `geozoo` package to generate samples from different shapes and use them to get a better understanding of how shapes appear in a grand tour. You can start with exploring the conic spiral in 3D, a torus in 4D and points along the wire frame of a cube in 5D.
3. For each of the challenge data sets, `c1`, ..., `c7` from the `mulgar` package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships).

@@ -733,31 +728,23 @@ cube3 <- cube.solid.random(3, 500)$points
cube5 <- cube.solid.random(5, 500)$points
cube10 <- cube.solid.random(10, 500)$points

- animate(cube3)
- animate(cube5)
- animate(cube10)
+ animate_xy(cube3, axes="bottomleft")
+ animate_xy(cube5, axes="bottomleft")
+ animate_xy(cube10, axes="bottomleft")
```

::: {.content-hidden when-format="pdf"}
::: {.hidden}
Answer 1. Each of the projections has a boxy shape, which gets less distinct as the dimension increases.

As the dimension increases, the points tend to concentrate in the centre of the plot window, with a smattering of points at the edges.
:::
:::

```{r}
#| eval: false
#| echo: false
# Answer to Q3
library(tourr)
library(mvtnorm)

s1 <- diag(5)
s2 <- diag(5)
s2[3,4] <- 0.7
s2[4,3] <- 0.7
s3 <- s2
s3[1,2] <- 0.7
s3[2,1] <- 0.7

set.seed(1234)
d1 <- as.data.frame(rmvnorm(500, sigma = s1))
d2 <- as.data.frame(rmvnorm(500, sigma = s2))
d3 <- as.data.frame(rmvnorm(500, sigma = s3))

library(mulgar)
animate_xy(c1)
render_gif(c1,
Binary file added 1-intro_files/figure-html/fig-density-1.png
33 changes: 27 additions & 6 deletions 10-model-based.qmd
@@ -2,13 +2,13 @@

\index{cluster analysis!model-based}

- Model-based clustering @FR02 fits a multivariate normal mixture model to the data. It uses the EM algorithm to fit the parameters for the mean, variance--covariance of each population, and the mixing proportion. The variance-covariance matrix is re-parameterised using an eigen-decomposition
+ Model-based clustering [@FR02] fits a multivariate normal mixture model to the data. It uses the EM algorithm to fit the parameters for the mean, variance-covariance of each population, and the mixing proportion. The variance-covariance matrix is re-parameterised using an eigen-decomposition

$$
\Sigma_k = \lambda_kD_kA_kD_k^\top, ~~~k=1, \dots, g ~~\mbox{(number of clusters)}
$$

- \noindent resulting in several model choices, ranging from simple to complex, as shown in @tbl-covariances.
+ \noindent resulting in several model choices, ranging from simple to complex, as shown in `r ifelse(knitr::is_html_output(), '@tbl-covariances-html', '@tbl-covariances-pdf')`.
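The re-parameterisation separates the variance-covariance into volume ($\lambda_k$), orientation ($D_k$) and shape ($A_k$). A small numeric sketch (in Python/NumPy rather than the book's R, with an invented 2×2 covariance) recovering these components from $\Sigma$:

```python
import numpy as np

# A hypothetical 2x2 cluster covariance (correlated, so ellipsoidal)
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigen-decomposition: sigma = D @ diag(eigvals) @ D.T
eigvals, D = np.linalg.eigh(sigma)            # D: orientation
lam = np.prod(eigvals) ** (1 / len(eigvals))  # lambda: volume (geometric mean)
A = np.diag(eigvals / lam)                    # A: shape, normalised to det(A) = 1

# Check the factorisation: sigma == lambda * D A D^T, with det(A) == 1
recon = lam * D @ A @ D.T
print(np.allclose(recon, sigma))        # True
print(np.isclose(np.linalg.det(A), 1))  # True
```

The model names in the table (e.g. EII, VVV, EVE) record whether each of these three components is Equal across clusters, Variable, or fixed to the Identity.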

```{r echo=knitr::is_html_output()}
#| label: mc-libraries
@@ -24,15 +24,32 @@ library(colorspace)
library(tourr)
```

- ```{r}
- #| label: tbl-covariances
+ ::: {.content-visible when-format="html"}
+
+ ```{r eval=knitr::is_html_output()}
+ #| label: tbl-covariances-html
#| tbl-cap: "Parameterizations of the covariance matrix."
#| echo: FALSE
#| message: FALSE
- readr::read_csv('misc/mclust-covariances.csv') %>%
+ readr::read_csv('misc/mclust-covariances-html.csv') %>%
knitr::kable(align = c('c', 'c', 'c', 'c', 'c', 'c')) %>%
kableExtra::kable_styling(full_width = FALSE)
```
:::

::: {.content-visible when-format="pdf"}
```{r eval=knitr::is_latex_output()}
#| label: tbl-covariances-pdf
#| tbl-cap: "Parameterizations of the covariance matrix."
#| echo: FALSE
#| message: FALSE
readr::read_csv('misc/mclust-covariances-latex.csv') %>%
knitr::kable(align = c('c', 'c', 'c', 'c', 'c', 'c'),
format="latex", booktabs = T,
escape = FALSE) %>%
kableExtra::kable_styling(full_width = FALSE)
```
:::

\noindent Note the distribution descriptions "spherical" and "ellipsoidal". These are descriptions of the shape of the variance-covariance for a multivariate normal distribution. A standard multivariate normal distribution has a variance-covariance matrix with zeros in the off-diagonal elements, which corresponds to spherically shaped data. When the variances (diagonals) are different or the variables are correlated, then the shape of data from a multivariate normal is ellipsoidal.

@@ -60,6 +77,9 @@ ggplot(penguins_sub, aes(x=bl,
theme(aspect.ratio = 1)
```

To draw ellipses in any dimension, a reasonable procedure is to sample points uniformly on a sphere, and then transform these into an ellipse using a square root of the variance-covariance matrix. The `mulgar` function `mc_ellipse()` does this for each cluster in the fitted model.
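A minimal sketch of this sphere-to-ellipse idea (in Python/NumPy rather than `mulgar`'s R, and assuming a Cholesky square root of the variance-covariance matrix as the transformation — the function name and details here are illustrative, not `mc_ellipse()` itself):

```python
import numpy as np

def ellipse_points(mean, sigma, n=500, rng=None):
    """Points on the surface of the ellipse defined by sigma.

    Sample uniformly on the unit sphere, then map through a square
    root of the variance-covariance matrix (here Cholesky), so the
    sphere is stretched and rotated into the matching ellipsoid.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = len(mean)
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # uniform on sphere
    L = np.linalg.cholesky(sigma)
    return mean + u @ L.T

sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
pts = ellipse_points(np.zeros(2), sigma)

# Every point satisfies x^T sigma^{-1} x == 1 (on the ellipse boundary)
q = (pts @ np.linalg.inv(sigma) * pts).sum(axis=1)
print(np.allclose(q, 1.0))  # True
```

The same code works in any dimension, which is what makes the ellipses viewable in a tour rather than only in 2D scatterplots.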


```{r}
#| label: fig-penguins-bl-fl-mc
#| message: FALSE
@@ -68,6 +88,7 @@ ggplot(penguins_sub, aes(x=bl,
#| fig-height: 4
#| out-width: 100%
#| fig-cap: "Summary plots from model-based clustering: (a) BIC values for clusters 2-9 of top four models, (b) variance-covariance ellipses and cluster means (+) corresponding to the best model. The best model is three-cluster EVE, which has differently shaped variance-covariances albeit the same volume and orientation."
# Fit the model, plot BIC, construct and plot ellipses
penguins_BIC <- mclustBIC(penguins_sub[,c(1,3)])
ggmc <- ggmcbic(penguins_BIC, cl=2:9, top=4) +
scale_color_discrete_divergingx(palette = "Roma") +
@@ -96,7 +117,7 @@ ggell <- ggplot() +
ggmc + ggell + plot_layout(ncol=2)
```

- @fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the `mclust` package output using the `ggmcbic` and `mc_ellipse` functions fro the `mulgar` package.
+ @fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the `mclust` package output using the `ggmcbic()` and `mc_ellipse()` functions from the `mulgar` package.

## Examining the model in high dimensions

37 changes: 36 additions & 1 deletion 12-summary-clust.qmd
@@ -239,6 +239,41 @@ limn_tour_link(

![Highlighting the penguins where the methods disagree so we can see where these observations are located relative to the two clusters.](images/compare-clusters2.png){#fig-compare-clusters2}

Linking the confusion matrix with the tour can also be accomplished with `crosstalk` and `detourr`.

```{r}
#| eval: false
#| echo: true
library(crosstalk)
library(plotly)
library(viridis)
p_cl_shared <- SharedData$new(penguins_cl)

detour_plot <- detour(p_cl_shared, tour_aes(
projection = bl:bm,
colour = cl_w)) |>
tour_path(grand_tour(2),
max_bases=50, fps = 60) |>
show_scatter(alpha = 0.7, axes = FALSE,
width = "100%", height = "450px")

conf_mat <- plot_ly(p_cl_shared,
x = ~cl_mc_j,
y = ~cl_w_j,
color = ~cl_w,
colors = viridis_pal(option = "D")(3),
height = 450) |>
highlight(on = "plotly_selected",
off = "plotly_doubleclick") |>
add_trace(type = "scatter",
mode = "markers")

bscols(
detour_plot, conf_mat,
widths = c(5, 6)
)
```

## Exercises {-}

1. Compare the results of the four-cluster model-based clustering with those of the four-cluster Ward's linkage clustering of the penguins data.
@@ -248,7 +283,7 @@

## Project {-}

- Most of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox will be useful in comparing and contrasting different methods, understanding to what extend a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. To explore this we will use survey data that examines the risk taking behavior of tourists. The data was collected in Australia in 2015 [@risk-survey] and includes six types of risks (recreational, health, career, financial, safety and social) with responses on a scale from 1 (never) to 5 (very often). The data is available in `risk_MSA.rds` from the book web site.
+ Most of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox will be useful in comparing and contrasting different methods, understanding to what extent a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. To explore this we will use survey data that examines the risk taking behavior of tourists; this is the `risk_MSA` data, see the Appendix for details.

1. We first examine the data in a grand tour. Do you notice that each variable was measured on a discrete scale?
2. Next we explore different solutions from hierarchical clustering of the data. For comparison we will keep the number of clusters fixed to 6 and we will perform the hierarchical clustering with different combinations of distance functions (Manhattan distance and Euclidean distance) and linkage (single, complete and Ward linkage). Which combinations make sense based on what we know about the method and the data?
Expand Down
22 changes: 22 additions & 0 deletions 13-intro-class.qmd
@@ -121,3 +121,25 @@ print(class1 + class2 + class3 + class4 + plot_layout(ncol=2))
```

@fig-sup-example shows some 2D examples where the two classes are (a) linearly separable, (b) not completely separable but linearly different, (c) non-linearly separable and (d) not completely separable but with a non-linear difference. We can also see that in (a) only the horizontal variable would be important for the model, because the two classes are completely separable in this direction. Although the classes in (c) are separable, most models would have difficulty capturing the separation. It is for this reason that it is important to understand the boundary between classes produced by a fitted model. In each of (b), (c) and (d) it is likely that some observations would be misclassified. Identifying these cases, and inspecting where they are in the data space, is important for understanding the model's future performance.

## Exercises {-}

1. For the penguins data, use the tour to decide if the species are separable, and if the boundaries between species are linear or non-linear.
2. Using just the variables `se`, `maxt`, `mint` and `log_dist_road`, and the "accident" and "lightning" causes from the `bushfires` data, use the tour to decide whether the two classes are separable, and whether the boundary might be linear or non-linear.

```{r eval=FALSE}
#| echo: false
b_sub <- bushfires |>
select(se, maxt, mint, log_dist_road, cause) |>
filter(cause %in% c("accident", "lightning")) |>
rename(ldr = log_dist_road) |>
mutate(cause = factor(cause))
animate_xy(b_sub[,-5], col=b_sub$cause, rescale=TRUE)
animate_xy(b_sub[,-5], guided_tour(lda_pp(b_sub$cause)), col=b_sub$cause, rescale=TRUE)
```

::: {.content-hidden}
Q1 answer: Gentoo and the other species are separable. Chinstrap and Adelie are not separable. All boundaries are linear.

Q2 answer: Not separable, but the boundary could be linear.
:::