Commit

update year

zingale committed May 10, 2024
1 parent 451df2d commit f64e0bb
Showing 23 changed files with 3,198 additions and 36 deletions.
15 changes: 15 additions & 0 deletions content/11-machine-learning/README
@@ -0,0 +1,15 @@
Note: on older machines, tensorflow generates an illegal instruction
and crashes Python on import. The issue is the CPU instruction set it
was compiled for. The solution seems to be to drop down to
tensorflow 1.5:

https://github.com/tensorflow/tensorflow/issues/17411
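
For a pip-based install, pinning the version along these lines should
work (the exact version spec may need adjusting):

    pip install "tensorflow==1.5.*"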

On my system, I needed to make sure I got numpy from pip (instead of
the Fedora package manager).



clustering examples:

https://laxmikants.github.io/blog/neural-network-using-make-moons-dataset/
337 changes: 337 additions & 0 deletions content/11-machine-learning/gradient-descent.ipynb


18 changes: 18 additions & 0 deletions content/11-machine-learning/ideas.txt
@@ -0,0 +1,18 @@
Packages to try:

-- keras
-- tensorflow
-- scikit-learn

Some nice tutorials:

-- https://github.com/zotroneneis/machine_learning_basics


https://elitedatascience.com/keras-tutorial-deep-learning-in-python


types of machine learning:

https://towardsdatascience.com/the-mostly-complete-chart-of-neural-networks-explained-3fb6f2367464

747 changes: 747 additions & 0 deletions content/11-machine-learning/keras-clustering.ipynb


1,044 changes: 1,044 additions & 0 deletions content/11-machine-learning/keras-mnist.ipynb


246 changes: 246 additions & 0 deletions content/11-machine-learning/machine-learning-basics.ipynb


83 changes: 83 additions & 0 deletions content/11-machine-learning/machine-learning-libraries.md
@@ -0,0 +1,83 @@
# Diving Deeper into Machine Learning

We've focused on neural networks, trained on labeled data
to learn the trends in our data. This is an example
of _supervised learning_.

Broadly speaking, there are
three main [approaches to machine learning](https://en.wikipedia.org/wiki/Machine_learning#Approaches):

* [Supervised learning](https://en.wikipedia.org/wiki/Supervised_learning)

This uses labeled pairs (input and output) to train the model
to learn how to predict the outputs from the inputs.

* [Unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning)

No labeled data is provided. Instead, the machine learning
algorithm seeks to find structure in the data on its own. The goal
is to learn patterns and features well enough to be able to produce
new data.

* [Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning)

As with unsupervised learning, no labeled data is used,
but the model is "rewarded" when it does something right,
and the model tries to maximize rewards (think: self-driving
cars).

## Libraries

There are a number of popular libraries that implement machine learning algorithms.
Their features and performance vary quite a bit. A comparison of their
features is provided by Wikipedia: [Comparison of deep learning software](https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software).

Some additional comparisons are provided here: https://ritza.co/articles/scikit-learn-vs-tensorflow-vs-pytorch-vs-keras/

* [TensorFlow](https://www.tensorflow.org/)

This is an open source machine learning library released by Google. It has support
for CPUs, GPUs, and [TPUs](https://en.wikipedia.org/wiki/Tensor_Processing_Unit),
and provides all the features you need to build deep learning workflows:
[TensorFlow features](https://en.wikipedia.org/wiki/TensorFlow#Features).

* [PyTorch](https://pytorch.org/)

This is a machine learning library built on the Torch library, originally
developed by Facebook.

* [scikit-learn](https://scikit-learn.org/stable/)

This is a Python library developed for machine learning. It has a lot of
sample datasets that provide a nice means to learn how different methods work.
It is designed to work with NumPy and SciPy.

General recommendations on the web seem to be to use Scikit-learn to get
started with machine learning and to explore ideas, but to switch to
one of the other packages for computationally-intensive work.

Scikit-learn provides some nice sample datasets:

https://scikit-learn.org/stable/datasets/toy_dataset.html

as well as generators for
datasets:

https://scikit-learn.org/stable/datasets/sample_generators.html
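
For example, one of the generators can produce a small toy dataset in a
couple of lines (a minimal sketch; the parameter values here are just
illustrative):

```python
# Sketch: generate a toy dataset with one of scikit-learn's generators.
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
print(X.shape, y.shape)   # (200, 2) (200,)
```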

There are also tools that provide higher-level interfaces to these libraries:

* [Keras](https://keras.io/)

Keras is built on top of TensorFlow and provides a nice Python interface that
hides a lot of the implementation details in TensorFlow.

## Keras / TensorFlow

We'll focus on Keras and TensorFlow.

There are a large number of examples provided by Keras:

https://keras.io/examples/

You should be able to install keras via pip or conda.
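
For instance, a small dense network can be defined in just a few lines (a
sketch, assuming TensorFlow's bundled `tensorflow.keras`; the layer sizes
and choices here are arbitrary):

```python
# Sketch: a small dense network defined with the Keras Sequential API.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(3,)),               # 3 input features
    layers.Dense(8, activation="sigmoid"),  # hidden layer
    layers.Dense(2, activation="sigmoid"),  # 2 outputs
])
model.compile(optimizer="sgd", loss="mse")
model.summary()
```
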
3 changes: 3 additions & 0 deletions content/11-machine-learning/machine-learning.md
@@ -0,0 +1,3 @@
# Machine Learning

We'll look at a popular library for machine learning.
Binary file added content/11-machine-learning/model.png
144 changes: 144 additions & 0 deletions content/11-machine-learning/neural-net-basics.md
@@ -0,0 +1,144 @@
# Artificial Neural Network Basics

## Neural networks

When we talk about machine learning, we often mean an [_artificial
neural
network_](https://en.wikipedia.org/wiki/Artificial_neural_network). A
neural network mimics the action of neurons in your brain. We'll
follow the notation from _Computational Methods for Physics_ by
Franklin.

Basic idea:

* Create a nonlinear fitting routine with free parameters
* Train the network on data with known inputs and outputs to set the parameters
* Use the trained network on new data to predict the outcome

We can think of a neural network as a map that takes a set of
$N_\mathrm{in}$ parameters and returns a set of $N_\mathrm{out}$
parameters, which we can express as:

$${\bf z} = {\bf A} {\bf x}$$

where

$${\bf x} = (x_1, x_2, \ldots, x_{N_\mathrm{in}})$$

are the inputs,

$${\bf z} = (z_1, z_2, \ldots, z_{N_\mathrm{out}})$$

are the outputs, and
${\bf A}$ is an $N_\mathrm{out} \times N_\mathrm{in}$ matrix.

Our goal is to determine the matrix elements of ${\bf A}$.
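
As a concrete (made-up) illustration of this map, with $N_\mathrm{in} = 3$
and $N_\mathrm{out} = 2$:

```python
# Sketch: the map z = A x for N_in = 3 inputs and N_out = 2 outputs
# (the numbers here are made up).
import numpy as np

A = np.array([[0.2, -0.5, 1.0],
              [0.7,  0.1, -0.3]])    # 2 x 3 matrix of weights
x = np.array([1.0, 2.0, 3.0])        # inputs
z = A @ x                            # outputs
print(z)
```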

## Nomenclature

We can visualize a neural network as:

![NN diagram](nn_fig.png)

* Neural networks are divided into _layers_

* There is always an _input layer_—it doesn't do any processing.

* There is always an _output layer_.

* Within a layer there are neurons or _nodes_.

* For input, there will be one node for each input variable. In this figure,
there are 3 nodes on the input layer.

* The output layer will have as many nodes as are needed to convey the answer
we are seeking from the network. In this case, there are 2 nodes on the
output layer.

* Every node in the first layer connects to every node in the next layer

* The _weight_ associated with the _connection_ can vary—these are the matrix elements.

```{note}
This is called a _dense layer_. There are alternate types of layers
we can explore where the nodes are connected differently.
```

* In this example, the processing is done in layer 2 (the output)

* When you train the neural network, you are adjusting the weights of the connections between the nodes

* Some connections might have zero weight

* This mimics nature—a single neuron can connect to several (or many) other neurons.

## Universal approximation theorem

A neural network can be designed to approximate any function, $f(x)$. For this to work, there must be a source of non-linearity in the network—this is a result of the [universal approximation theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem).

We use a nonlinear [_activation function_](https://en.wikipedia.org/wiki/Activation_function) that is applied in a layer. It has
the form:

$$g({\bf v}) = \left ( \begin{array}{c} g(v_0) \\ g(v_1) \\ \vdots \\ g(v_{n-1}) \end{array} \right )$$

```{note}
The activation function, $g({\bf v})$, works element-by-element on the vector ${\bf v}$.
```

Then our neural network has the form: ${\bf z} = g({\bf A x})$

We want to choose a function $g(\xi)$ that is differentiable. A common choice is the _sigmoid function_:

$$g(\xi) = \frac{1}{1 + e^{-\xi}}$$

```{figure} sigmoid.png
---
align: center
---
The sigmoid function
```
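
The sigmoid is easy to explore directly (a sketch; this is not the script
used to make the figure above):

```python
# Sketch: the sigmoid activation function, applied element-by-element.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

xi = np.linspace(-10, 10, 200)
plt.plot(xi, sigmoid(xi))
plt.xlabel(r"$\xi$")
plt.ylabel(r"$g(\xi)$")
plt.show()
```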

```{note}
There are [many choices for the activation function](https://en.wikipedia.org/wiki/Activation_function), each with
different properties. Often the choice of activation function is empirical, made by experimenting with the
performance of the network.
```

## Basic algorithm



* Training

* Loop over the $T$ pairs $({\bf x}^k, {\bf y}^k)$ for $k = 1, \ldots, T$

* Predict the output for ${\bf x}^k$ as:

$$z_i = g([{\bf A x}^k]_i) = g \left ( \sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j \right )$$

* Require that ${\bf z}$ match ${\bf y}^k$ as closely as possible.

This is a minimization problem, where we are minimizing:

\begin{align*}
f(A_{ij}) &= \| g({\bf A x}^k) - {\bf y}^k \|^2 \\
&= \sum_{i=1}^{N_\mathrm{out}} \left [ g\left (\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j \right ) - y^k_i \right ]^2
\end{align*}

We call this function the _cost function_ or _loss function_.

```{note}
This is one possible choice for the cost function, $f(A_{ij})$, but [many others exist](https://en.wikipedia.org/wiki/Loss_function).
```

* Update the matrix ${\bf A}$ based on the training pair $({\bf x}^k, {\bf y}^k)$.

* Using the network

With the trained ${\bf A}$, we can now use the network on data we haven't seen before, ${\boldsymbol \chi}$:

$$z_i = g([{\bf A} {\boldsymbol \chi}]_i) = g \left ( \sum_{j=1}^{N_\mathrm{in}} A_{ij} \chi_j \right )$$
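
Putting the pieces together, here is a minimal sketch of the forward pass
and the cost for a single training pair (the sizes and data are made up):

```python
# Sketch: forward pass z = g(A x^k) and cost for one training pair (x^k, y^k).
import numpy as np

def g(v):
    """Sigmoid activation, applied element-by-element."""
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
N_in, N_out = 3, 2
A = rng.normal(size=(N_out, N_in))   # current weights
xk = rng.normal(size=N_in)           # training input x^k
yk = rng.uniform(size=N_out)         # training output y^k

z = g(A @ xk)                        # network prediction
f = np.sum((z - yk)**2)              # cost for this pair
print(z, f)
```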

There are a lot of details that we still need to figure out involving the training and minimization.
We'll start with minimization: a common minimization technique used with
neural networks is [_gradient descent_](https://en.wikipedia.org/wiki/Gradient_descent).
86 changes: 86 additions & 0 deletions content/11-machine-learning/neural-net-derivation.md
@@ -0,0 +1,86 @@
# Deriving the Learning Correction

For gradient descent, we need to derive the update to the matrix
${\bf A}$ based on training on a set of our data, $({\bf x}^k, {\bf y}^k)$.

Let's start with our cost function:

$$f(A_{ij}) = \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k)^2 = \sum_{i=1}^{N_\mathrm{out}}
\Biggl [ g\biggl (\underbrace{\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j}_{\equiv \alpha_i} \biggr ) - y^k_i \Biggr ]^2$$

where we'll refer to the product ${\boldsymbol \alpha} \equiv {\bf A x}^k$ to help simplify notation.

We can compute the derivative with respect to a single matrix
element, $A_{pq}$ by applying the chain rule:

$$\frac{\partial f}{\partial A_{pq}} =
2 \sum_{i=1}^{N_\mathrm{out}} (z_i - y^k_i) \left . \frac{\partial g}{\partial \xi} \right |_{\xi=\alpha_i} \frac{\partial \alpha_i}{\partial A_{pq}}$$


with

$$\frac{\partial \alpha_i}{\partial A_{pq}} = \sum_{j=1}^{N_\mathrm{in}} \frac{\partial A_{ij}}{\partial A_{pq}} x^k_j = \sum_{j=1}^{N_\mathrm{in}} \delta_{ip} \delta_{jq} x^k_j = \delta_{ip} x^k_q$$

and for $g(\xi)$, we will assume the sigmoid function, so

$$\frac{\partial g}{\partial \xi}
= \frac{\partial}{\partial \xi} \frac{1}{1 + e^{-\xi}}
= - (1 + e^{-\xi})^{-2} (- e^{-\xi})
= g(\xi) \frac{e^{-\xi}}{1 + e^{-\xi}} = g(\xi) (1 - g(\xi))$$

which gives us:

\begin{align*}
\frac{\partial f}{\partial A_{pq}} &= 2 \sum_{i=1}^{N_\mathrm{out}}
(z_i - y^k_i) z_i (1 - z_i) \delta_{ip} x^k_q \\
&= 2 (z_p - y^k_p) z_p (1- z_p) x^k_q
\end{align*}

where we used the fact that the $\delta_{ip}$ means that only a single term contributes to the sum.

Note that:

* $e_p^k \equiv (z_p - y_p^k)$ is the error on the output layer,
and the correction is proportional to the error (as we would
expect).

* The $k$ superscripts here remind us that this is the result of
only a single pair of data from the training set.

Now ${\bf z}$ and ${\bf y}^k$ are both vectors of size $N_\mathrm{out} \times 1$ and ${\bf x}^k$ is a vector of size $N_\mathrm{in} \times 1$, so we can write this expression for the matrix as a whole as:

$$\frac{\partial f}{\partial {\bf A}} = 2 ({\bf z} - {\bf y}^k) \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$$

where the operator $\circ$ represents _element-by-element_ multiplication (the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices))).
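
As a sanity check on this expression, here is a short sketch comparing the
analytic gradient against a finite-difference estimate (the sizes and data
are made up):

```python
# Sketch: check df/dA = 2 (z - y^k) o z o (1 - z) . (x^k)^T against
# finite differences.
import numpy as np

def g(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(42)
N_in, N_out = 4, 3
A = rng.normal(size=(N_out, N_in))
x = rng.normal(size=N_in)
y = rng.uniform(size=N_out)

def cost(A):
    z = g(A @ x)
    return np.sum((z - y)**2)

# analytic gradient -- the outer product gives an N_out x N_in matrix
z = g(A @ x)
grad = np.outer(2 * (z - y) * z * (1 - z), x)

# finite-difference estimate
eps = 1.e-6
fd = np.zeros_like(A)
for p in range(N_out):
    for q in range(N_in):
        Ap = A.copy()
        Ap[p, q] += eps
        fd[p, q] = (cost(Ap) - cost(A)) / eps

print(np.max(np.abs(grad - fd)))   # should be small
```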

## Performing the update

We could do the update like we just saw with our gradient descent
example: take a single data point, $({\bf x}^k, {\bf y}^k)$, and
do the full minimization, continually estimating the correction,
$\partial f/\partial {\bf A}$, and updating ${\bf A}$ until we
reach a minimum. The problem with this is that $({\bf x}^k, {\bf y}^k)$ is only one point in our training data, and there is no
guarantee that if we minimize completely for point $k$ we will
also be at a minimum for point $k+1$.

Instead, we take multiple passes through the training data (called _epochs_) and apply only a single push in the direction that gradient
descent suggests, scaled by a _learning rate_, $\eta$.

The overall minimization appears as:

<div style="border: solid; padding: 10px; width: 80%; margin: 0 auto; background: #eeeeee">
* Loop over epochs

* Loop over the training data, $\{ ({\bf x}^0, {\bf y}^0), ({\bf x}^1, {\bf y}^1), \ldots \}$. We'll refer to the current training
pair as $({\bf x}^k, {\bf y}^k)$

* Propagate ${\bf x}^k$ through the network, getting the output
${\bf z} = g({\bf A x}^k)$

* Compute the error on the output layer, ${\bf e}^k = {\bf z} - {\bf y}^k$

* Update the matrix ${\bf A}$ according to:

$${\bf A} \leftarrow {\bf A} - 2 \,\eta\, {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$$
</div>
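
A minimal sketch of this loop for a single layer (the data, learning rate,
and epoch count are all made up):

```python
# Sketch: gradient-descent training of a single-layer network with a
# sigmoid activation.
import numpy as np

def g(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
N_in, N_out, T = 3, 2, 50
xs = rng.normal(size=(T, N_in))       # training inputs x^k
ys = rng.uniform(size=(T, N_out))     # training outputs y^k

A = rng.normal(size=(N_out, N_in))
eta = 0.1

for epoch in range(100):
    for xk, yk in zip(xs, ys):
        z = g(A @ xk)                 # propagate x^k through the network
        e = z - yk                    # error on the output layer
        A -= 2 * eta * np.outer(e * z * (1 - z), xk)   # update A
```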
