First part of book for review #49

Merged: 29 commits, Mar 25, 2024
Changes from all commits
0c7faf6
Rephrasing fake tree nn exercise
uschiLaa Feb 26, 2024
9aeca85
Merge branch 'pdf' of https://github.com/dicook/mulgar_book into pdf
uschiLaa Feb 26, 2024
79dd3f1
Wrapping up the NN section
uschiLaa Feb 26, 2024
d4e822f
Data description for risk_MSA
uschiLaa Feb 28, 2024
8eb28de
full pdf version
dicook Feb 28, 2024
0dae551
table in mclust chapter fixed
dicook Feb 28, 2024
dedec26
exercises to intro to classification
dicook Mar 4, 2024
6482ae4
tidied up Appendix A
dicook Mar 4, 2024
c7d41bb
added tidymodels as background reading
dicook Mar 4, 2024
e849d88
Removing commented section in intro
uschiLaa Mar 5, 2024
e1b032a
Small fixes in 2 - notation
uschiLaa Mar 5, 2024
c9dc57b
wrapped .hidden in a hidden when pdf
dicook Mar 5, 2024
77c23c2
added caption to shadow puppets figure and a paragraph
dicook Mar 5, 2024
c966a44
shadow puppets, testing format
dicook Mar 5, 2024
5faaf71
may have solved Ursulas problem with PCA chapter
dicook Mar 5, 2024
b1c6e78
trying to resolve ursula problem with pca chapter
dicook Mar 6, 2024
c3e9842
Adding fa icon to figure caption for still shots
uschiLaa Mar 7, 2024
3f7b98d
Small fixes in pca section
uschiLaa Mar 7, 2024
94ec05f
intro to dimension reduction
dicook Mar 9, 2024
12fb4dc
updated gifs for nn chapter with training/test set
dicook Mar 11, 2024
13af336
Small fixes NLDR chapter
uschiLaa Mar 17, 2024
295569b
Small fixes in 6 and 7
uschiLaa Mar 17, 2024
2a39743
PCA chapter revised
dicook Mar 18, 2024
c657da1
removed extra code from spin-and-brush
dicook Mar 18, 2024
6e35850
done with NLDR chapter
dicook Mar 19, 2024
2ff2916
done with data chapter
dicook Mar 19, 2024
ec88e0d
project continued from NLDR into spin-and-brush
dicook Mar 19, 2024
31bbcc7
Small fixes in the appenix chapters
uschiLaa Mar 20, 2024
368bcb9
chapters 1-4, pages 1-56 for review
dicook Mar 25, 2024
53 changes: 20 additions & 33 deletions 1-intro.qmd
@@ -5,9 +5,11 @@ High-dimensional data means that we have a large number of numeric features or v
\index{variable}\index{feature}
\index{projection}

- ![](images/shadow_puppets.png){width=450 fig-align="center" fig-env="figure*" fig-cap="Viewing high dimensions using low-dimensional displays is like playing shadow puppets, looking at the shadows to guess what the shape is." fig-alt="Three images, each with a hand or two hands, illustrating making shadows of a bird in flight, snail and dog."}
+ ![Viewing high dimensions using low-dimensional displays is like playing shadow puppets, looking at the shadows to guess what the shape is.](images/shadow_puppets.png){#fig-shadow-puppets width=450 fig-alt="Three images, each with a hand or two hands, illustrating making shadows of a bird in flight, snail and dog."}


One approach to visualising high-dimensional data and models is to use linear projections, as done in a tour. You can think of projections of high-dimensional data like shadows (@fig-shadow-puppets). Unlike shadow puppets, though, the object stays fixed, and with multiple projections we can obtain a *view of the object from all sides*.
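The arithmetic behind the shadow analogy is a single matrix multiplication: a 1D projection of the data is the product of the data matrix with a unit vector. As a rough, language-agnostic sketch (in Python/NumPy rather than the book's R, with made-up cluster data standing in for `simple_clusters`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical clusters in 2D, mimicking the simple_clusters example
x = np.vstack([
    rng.normal(loc=[-1, -1], scale=0.3, size=(100, 2)),
    rng.normal(loc=[1, 1], scale=0.3, size=(100, 2)),
])

# A 1D projection is x @ a, where a is a unit (basis) vector
a_clustered = np.array([0.707, 0.707])   # along x1 = x2: most clustered
a_flat = np.array([0.707, -0.707])       # along x1 = -x2: no clusters

p1 = x @ a_clustered  # bimodal: the two clusters separate
p2 = x @ a_flat       # unimodal: the two clusters overlap

# The clustered direction spreads the two cluster means far apart
print(abs(p1[:100].mean() - p1[100:].mean()))  # large (~2.8)
print(abs(p2[:100].mean() - p2[100:].mean()))  # near 0
```

Spinning the projection vector `a` continuously, and watching the density of the projected points, is exactly the 1D tour illustrated in the figures below.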


## Getting familiar with tours

@@ -24,7 +26,7 @@ s_p <- ggplot(simple_clusters, aes(x=x1, y=x2)) +
annotate("text", x=2.0, y=2.2, label="(0.707, 0.707)", angle=45) +
annotate("text", x=2.2, y=2.0, label="most clustered", angle=45) +
geom_abline(intercept=0, slope=-1) +
- annotate("text", x=-1.6, y=1.8, label="(-0.707, 0.707)", angle=-45) +
+ annotate("text", x=-1.6, y=1.8, label="(0.707, -0.707)", angle=-45) +
annotate("text", x=-1.8, y=1.6, label="no clusters", angle=-45) +
geom_abline(intercept=0, slope=0) +
annotate("text", x=-1.6, y=0.15, label="(1, 0)") +
@@ -104,7 +106,7 @@ How a tour can be used to explore high-dimensional data illustrated using (a) 2D

```{r fig-explain-1D-pdf, eval=knitr::is_latex_output()}
#| echo: false
- #| fig-cap: "How a tour can be used to explore high-dimensional data illustrated using (a) 2D data with two clusters and (b,c,d) 1D projections from a tour shown as a density plot. Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. When it is at `x1=-x2 (0.707, -0.707)` there is no clustering."
+ #| fig-cap: "How a tour can be used to explore high-dimensional data illustrated using (a) 2D data with two clusters and (b,c,d) 1D projections from a tour shown as a density plot. Imagine spinning a line around the centre of the data plot, with points projected orthogonally onto the line. With this data, when the line is at `x1=x2 (0.707, 0.707)` or `(-0.707, -0.707)` the clustering is the strongest. When it is at `x1=-x2 (0.707, -0.707)` there is no clustering. {{< fa play-circle >}}"
#| fig-width: 8
#| fig-height: 8
#| out-width: 100%
@@ -228,7 +230,7 @@ How a tour can be used to explore high-dimensional data illustrated by showing a

```{r fig-explain-2D-pdf, eval=knitr::is_latex_output()}
#| echo: false
- #| fig-cap: "How a tour can be used to explore high-dimensional data illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b). Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object."
+ #| fig-cap: "How a tour can be used to explore high-dimensional data illustrated by showing a sequence of random 2D projections of 3D data (a). The data has a donut shape with the hole revealed in a single 2D projection (b). Data usually arrives with a given number of observations, and when we plot it like this using a scatterplot, it is like shadows of a transparent object. {{< fa play-circle >}}"
#| fig-width: 8
#| fig-height: 8
#| out-width: 100%
@@ -530,7 +532,7 @@ Two 5D datasets shown as tours of 2D projections. Can you see clusters of points

![Outliers](images/outlier-intro.png){#fig-tour-clusters width=200}

- Frames from 2D tours on two 5D datasets, with clusters of points in (a) and two outliers with a plane in (b). This figure is best viewed in the HTML version of the book.
+ Frames from 2D tours on two 5D datasets, with clusters of points in (a) and two outliers with a plane in (b). This figure is best viewed in the HTML version of the book. {{< fa play-circle >}}
:::

:::
@@ -674,13 +676,6 @@ render_gif(plane_outliers[,1:5],
The movement of points gives further clues about the structure of the data in high dimensions. In the data with clustering, we can often see a group of points moving differently from the others. Because there are three clusters, you should see three distinct movement patterns. It is similar with outliers, except these may be individual points moving alone, differently from all others. This can be seen in the static plot: one point (top left) has a movement pattern upwards, whereas most of the other observations near it are moving down towards the right.
:::

<!--
![Movement pattern indicates clustering as seen in a grand tour.](gifs/trails-clusters.gif){#fig-clusters-trails-tour fig-alt="" width="300"}

![Movement pattern indicates an outlier as seen in a grand tour.](gifs/trails-outlier.gif){#fig-outlier-trails-tour fig-alt="" width="300"}

-->


This type of visualisation is useful for many activities in dealing with high-dimensional data, including:

@@ -711,13 +706,13 @@ With computer graphics, the capability of animating plots to show more than a si

The methods in this book primarily emerge from @As85's grand tour method. The algorithm provided the first smooth and continuous sequence of low dimensional projections, and guaranteed that all possible low dimensional projections were likely to be shown. The algorithm was refined in @BA86b (and documented in detail in @BCAH05) to make it *efficiently* show all possible projections. Since then there have been numerous varieties of tour algorithms developed to focus on specific tasks in exploring high dimensional data, and these are documented in @tours2022.
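At its core, each frame of a tour is a projection of the data onto a randomly chosen low-dimensional orthonormal basis; the grand tour then interpolates smoothly between successive bases along geodesics, which is omitted in this sketch. A minimal illustration of generating one random frame (in Python/NumPy rather than the R used by `tourr`, so names and details here are illustrative only):

```python
import numpy as np

def random_frame(d, k=2, rng=None):
    """Random d x k orthonormal basis: projects d-D data to k-D.

    Sampling a Gaussian matrix and orthonormalising it (QR) gives a
    frame whose spanned k-plane is uniformly distributed, which is the
    property the grand tour needs so that all projections are
    eventually shown.
    """
    if rng is None:
        rng = np.random.default_rng()
    a = rng.normal(size=(d, k))
    q, _ = np.linalg.qr(a)
    return q

rng = np.random.default_rng(1)
frame = random_frame(5, 2, rng)

# Columns are orthonormal: frame.T @ frame is the 2x2 identity
print(np.allclose(frame.T @ frame, np.eye(2)))  # True

# Projecting 5-D data gives one 2-D "view" (a single tour frame)
x = rng.normal(size=(200, 5))
view = x @ frame  # shape (200, 2)
```

A tour animation is then a sequence of such views, with in-between frames interpolated so the points appear to move continuously.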

- This book is an evolution from @CS07. One of the difficulties in working on interactive and dynamic graphics research has been the rapid change in technology. Programming languages have changed a little (FORTRAN to C to java to python) but graphics toolkits and display devices have changed a lot! The tour software used in this book evolved from XGobi, which was written in C and used the X Window System, which was then rewritten in GGobi using gtk. The video library has engaging videos of these software systems There have been several other short-lived implementations, including orca [@orca], written in java, and cranvas [@cranvas], written in R with a back-end provided by wrapper functions to qt libraries.
+ This book is an evolution from @CS07. One of the difficulties in working on interactive and dynamic graphics research has been the rapid change in technology. Programming languages have changed a little (FORTRAN to C to Java to Python) but graphics toolkits and display devices have changed a lot! The tour software used in this book evolved from XGobi, which was written in C and used the X Window System, and which was then rewritten as GGobi using gtk. The video library has engaging videos of these software systems. There have been several other short-lived implementations, including orca [@orca], written in Java, and cranvas [@cranvas], written in R with a back-end provided by wrapper functions to `qt` libraries.

Although attempts were made with these ancestor systems to connect the data plots to a statistical analysis system, these were always limited. With the emergence of R, having graphics in the data analysis workflow has been much easier, albeit at the cost of the interactivity with graphics that matches the old systems. We are mostly using the R package, `tourr` [@tourr] for examples in this book. It provides the machinery for running a tour, and has the flexibility that it can be ported, modified, and used as a regular element of data analysis.

## Exercises {-}

- 1. Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the `cube.solid.random` function of the `geozoo` package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations?
+ 1. Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the `cube.solid.random()` function of the `geozoo` package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations?
2. Use the `geozoo` package to generate samples from different shapes and use them to get a better understanding of how shapes appear in a grand tour. You can start with exploring the conic spiral in 3D, a torus in 4D and points along the wire frame of a cube in 5D.
3. For each of the challenge data sets, `c1`, ..., `c7` from the `mulgar` package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships).

@@ -733,31 +728,23 @@ cube3 <- cube.solid.random(3, 500)$points
cube5 <- cube.solid.random(5, 500)$points
cube10 <- cube.solid.random(10, 500)$points

- animate(cube3)
- animate(cube5)
- animate(cube10)
+ animate_xy(cube3, axes="bottomleft")
+ animate_xy(cube5, axes="bottomleft")
+ animate_xy(cube10, axes="bottomleft")
```

::: {.content-hidden when-format="pdf"}
::: {.hidden}
Answer 1. Each of the projections has a boxy shape, which gets less distinct as the dimension increases.

As the dimension increases, the points tend to concentrate in the centre of the plot window, with a smattering of points at the edges.
:::
:::

```{r}
#| eval: false
#| echo: false
# Answer to Q3
library(tourr)
library(mvtnorm)

s1 <- diag(5)
s2 <- diag(5)
s2[3,4] <- 0.7
s2[4,3] <- 0.7
s3 <- s2
s3[1,2] <- 0.7
s3[2,1] <- 0.7

set.seed(1234)
d1 <- as.data.frame(rmvnorm(500, sigma = s1))
d2 <- as.data.frame(rmvnorm(500, sigma = s2))
d3 <- as.data.frame(rmvnorm(500, sigma = s3))

library(mulgar)
animate_xy(c1)
render_gif(c1,
Binary file added 1-intro_files/figure-html/fig-density-1.png
33 changes: 27 additions & 6 deletions 10-model-based.qmd
@@ -2,13 +2,13 @@

\index{cluster analysis!model-based}

- Model-based clustering @FR02 fits a multivariate normal mixture model to the data. It uses the EM algorithm to fit the parameters for the mean, variance--covariance of each population, and the mixing proportion. The variance-covariance matrix is re-parameterised using an eigen-decomposition
+ Model-based clustering [@FR02] fits a multivariate normal mixture model to the data. It uses the EM algorithm to fit the parameters for the mean, variance-covariance of each population, and the mixing proportion. The variance-covariance matrix is re-parameterised using an eigen-decomposition

$$
\Sigma_k = \lambda_kD_kA_kD_k^\top, ~~~k=1, \dots, g ~~\mbox{(number of clusters)}
$$

- \noindent resulting in several model choices, ranging from simple to complex, as shown in @tbl-covariances.
+ \noindent resulting in several model choices, ranging from simple to complex, as shown in `r ifelse(knitr::is_html_output(), '@tbl-covariances-html', '@tbl-covariances-pdf')`.
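The re-parameterisation separates the variance-covariance into volume ($\lambda_k$), orientation ($D_k$) and shape ($A_k$). A small numeric sketch (in Python/NumPy rather than the book's R, with an invented 2×2 covariance) recovering these components from $\Sigma$:

```python
import numpy as np

# A hypothetical 2x2 cluster covariance (correlated, so ellipsoidal)
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigen-decomposition: sigma = D @ diag(eigvals) @ D.T
eigvals, D = np.linalg.eigh(sigma)            # D: orientation
lam = np.prod(eigvals) ** (1 / len(eigvals))  # lambda: volume (geometric mean)
A = np.diag(eigvals / lam)                    # A: shape, normalised to det(A) = 1

# Check the factorisation: sigma == lambda * D A D^T, with det(A) == 1
recon = lam * D @ A @ D.T
print(np.allclose(recon, sigma))        # True
print(np.isclose(np.linalg.det(A), 1))  # True
```

The model names in the table (e.g. EII, VVV, EVE) record whether each of these three components is Equal across clusters, Variable, or fixed to the Identity.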

```{r echo=knitr::is_html_output()}
#| label: mc-libraries
@@ -24,15 +24,32 @@ library(colorspace)
library(tourr)
```

- ```{r}
- #| label: tbl-covariances
+ ::: {.content-visible when-format="html"}
+
+ ```{r eval=knitr::is_html_output()}
+ #| label: tbl-covariances-html
#| tbl-cap: "Parameterizations of the covariance matrix."
#| echo: FALSE
#| message: FALSE
- readr::read_csv('misc/mclust-covariances.csv') %>%
+ readr::read_csv('misc/mclust-covariances-html.csv') %>%
knitr::kable(align = c('c', 'c', 'c', 'c', 'c', 'c')) %>%
kableExtra::kable_styling(full_width = FALSE)
```
:::

::: {.content-visible when-format="pdf"}
```{r eval=knitr::is_latex_output()}
#| label: tbl-covariances-pdf
#| tbl-cap: "Parameterizations of the covariance matrix."
#| echo: FALSE
#| message: FALSE
readr::read_csv('misc/mclust-covariances-latex.csv') %>%
knitr::kable(align = c('c', 'c', 'c', 'c', 'c', 'c'),
format="latex", booktabs = T,
escape = FALSE) %>%
kableExtra::kable_styling(full_width = FALSE)
```
:::

\noindent Note the distribution descriptions "spherical" and "ellipsoidal". These are descriptions of the shape of the variance-covariance for a multivariate normal distribution. A standard multivariate normal distribution has a variance-covariance matrix with zeros in the off-diagonal elements, which corresponds to spherically shaped data. When the variances (diagonals) are different or the variables are correlated, then the shape of data from a multivariate normal is ellipsoidal.

@@ -60,6 +77,9 @@ ggplot(penguins_sub, aes(x=bl,
theme(aspect.ratio = 1)
```

To draw ellipses in any dimension, a reasonable procedure is to sample points uniformly on a sphere, and then transform these into an ellipse using a square root of the variance-covariance matrix. The `mulgar` function `mc_ellipse()` does this for each cluster in the fitted model.
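A minimal sketch of this sphere-to-ellipse idea (in Python/NumPy rather than `mulgar`'s R, and assuming a Cholesky square root of the variance-covariance matrix as the transformation — the function name and details here are illustrative, not `mc_ellipse()` itself):

```python
import numpy as np

def ellipse_points(mean, sigma, n=500, rng=None):
    """Points on the surface of the ellipse defined by sigma.

    Sample uniformly on the unit sphere, then map through a square
    root of the variance-covariance matrix (here Cholesky), so the
    sphere is stretched and rotated into the matching ellipsoid.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = len(mean)
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # uniform on sphere
    L = np.linalg.cholesky(sigma)
    return mean + u @ L.T

sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
pts = ellipse_points(np.zeros(2), sigma)

# Every point satisfies x^T sigma^{-1} x == 1 (on the ellipse boundary)
q = (pts @ np.linalg.inv(sigma) * pts).sum(axis=1)
print(np.allclose(q, 1.0))  # True
```

The same code works in any dimension, which is what makes the ellipses viewable in a tour rather than only in 2D scatterplots.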


```{r}
#| label: fig-penguins-bl-fl-mc
#| message: FALSE
@@ -68,6 +88,7 @@ ggplot(penguins_sub, aes(x=bl,
#| fig-height: 4
#| out-width: 100%
#| fig-cap: "Summary plots from model-based clustering: (a) BIC values for clusters 2-9 of top four models, (b) variance-covariance ellipses and cluster means (+) corresponding to the best model. The best model is three-cluster EVE, which has differently shaped variance-covariances albeit the same volume and orientation."
# Fit the model, plot BIC, construct and plot ellipses
penguins_BIC <- mclustBIC(penguins_sub[,c(1,3)])
ggmc <- ggmcbic(penguins_BIC, cl=2:9, top=4) +
scale_color_discrete_divergingx(palette = "Roma") +
@@ -96,7 +117,7 @@ ggell <- ggplot() +
ggmc + ggell + plot_layout(ncol=2)
```

- @fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the `mclust` package output using the `ggmcbic` and `mc_ellipse` functions fro the `mulgar` package.
+ @fig-penguins-bl-fl-mc summarises the results. All models agree that three clusters is the best. The different variance-covariance models for three clusters have similar BIC values with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the `mclust` package output using the `ggmcbic()` and `mc_ellipse()` functions from the `mulgar` package.

## Examining the model in high dimensions

37 changes: 36 additions & 1 deletion 12-summary-clust.qmd
@@ -239,6 +239,41 @@ limn_tour_link(

![Highlighting the penguins where the methods disagree so we can see where these observations are located relative to the two clusters.](images/compare-clusters2.png){#fig-compare-clusters2}

Linking the confusion matrix with the tour can also be accomplished with `crosstalk` and `detourr`.

```{r}
#| eval: false
#| echo: true
library(crosstalk)
library(plotly)
library(viridis)
p_cl_shared <- SharedData$new(penguins_cl)

detour_plot <- detour(p_cl_shared, tour_aes(
projection = bl:bm,
colour = cl_w)) |>
tour_path(grand_tour(2),
max_bases=50, fps = 60) |>
show_scatter(alpha = 0.7, axes = FALSE,
width = "100%", height = "450px")

conf_mat <- plot_ly(p_cl_shared,
x = ~cl_mc_j,
y = ~cl_w_j,
color = ~cl_w,
colors = viridis_pal(option = "D")(3),
height = 450) |>
highlight(on = "plotly_selected",
off = "plotly_doubleclick") |>
add_trace(type = "scatter",
mode = "markers")

bscols(
detour_plot, conf_mat,
widths = c(5, 6)
)
```

## Exercises {-}

1. Compare the results of the four-cluster model-based clustering with those of the four-cluster Ward's linkage clustering of the penguins data.
@@ -248,7 +283,7 @@

## Project {-}

- Most of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox will be useful in comparing and contrasting different methods, understanding to what extend a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. To explore this we will use survey data that examines the risk taking behavior of tourists. The data was collected in Australia in 2015 [@risk-survey] and includes six types of risks (recreational, health, career, financial, safety and social) with responses on a scale from 1 (never) to 5 (very often). The data is available in `risk_MSA.rds` from the book web site.
+ Most of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox will be useful in comparing and contrasting different methods, understanding to what extent a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. To explore this we will use survey data that examines the risk taking behavior of tourists; this is the `risk_MSA` data, see the Appendix for details.

1. We first examine the data in a grand tour. Do you notice that each variable was measured on a discrete scale?
2. Next we explore different solutions from hierarchical clustering of the data. For comparison we will keep the number of clusters fixed to 6 and we will perform the hierarchical clustering with different combinations of distance functions (Manhattan distance and Euclidean distance) and linkage (single, complete and Ward linkage). Which combinations make sense based on what we know about the method and the data?
Expand Down
22 changes: 22 additions & 0 deletions 13-intro-class.qmd
@@ -121,3 +121,25 @@ print(class1 + class2 + class3 + class4 + plot_layout(ncol=2))
```

@fig-sup-example shows some 2D examples where the two classes are (a) linearly separable, (b) not completely separable but linearly different, (c) non-linearly separable and (d) not completely separable but with a non-linear difference. We can also see that in (a) only the horizontal variable would be important for the model, because the two classes are completely separable in this direction. Although the classes in (c) are separable, most models would have difficulty capturing the separation. It is for this reason that it is important to understand the boundary between classes produced by a fitted model. In each of (b), (c) and (d) it is likely that some observations would be misclassified. Identifying these cases, and inspecting where they are in the data space, is important for understanding the model's future performance.

## Exercises {-}

1. For the penguins data, use the tour to decide if the species are separable, and if the boundaries between species are linear or non-linear.
2. Using just the variables `se`, `maxt`, `mint` and `log_dist_road`, and the "accident" and "lightning" causes from the `bushfires` data, use the tour to decide whether the two classes are separable, and whether the boundary might be linear or non-linear.

```{r eval=FALSE}
#| echo: false
b_sub <- bushfires |>
select(se, maxt, mint, log_dist_road, cause) |>
filter(cause %in% c("accident", "lightning")) |>
rename(ldr = log_dist_road) |>
mutate(cause = factor(cause))
animate_xy(b_sub[,-5], col=b_sub$cause, rescale=TRUE)
animate_xy(b_sub[,-5], guided_tour(lda_pp(b_sub$cause)), col=b_sub$cause, rescale=TRUE)
```

::: {.content-hidden}
Q1 answer: Gentoo and the other species are separable. Chinstrap and Adelie are not separable. All boundaries are linear.

Q2 answer: Not separable, but the boundary could be linear.
:::