Glossary

This glossary tries to index the terms used throughout the course. It is a work in progress and contributions are welcome.

Table of Contents

1x1 Convolutions

A 1x1 convolution is a convolution whose kernel has a height and width of 1. This is useful because the depth dimension of the operation can still be used to reduce (or increase) the dimensionality of the input. So, for instance, if we have a batch of 100 x 100 images with 3 color channels, we can define a 1x1 convolution which reduces the 3 color channels to just 1 channel of information. This is often applied before a much more expensive operation to reduce the overall number of parameters.
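
As a minimal sketch of the idea (the placeholder and variable names here are purely illustrative), a 1x1 convolution reducing 3 channels to 1 might look like this in TensorFlow:

import tensorflow as tf

# A hypothetical batch of 100 x 100 images with 3 color channels.
x = tf.placeholder(tf.float32, [None, 100, 100, 3])

# A 1x1 convolution kernel mapping the 3 input channels down to 1 output channel.
W = tf.Variable(tf.random_normal([1, 1, 3, 1], stddev=0.1))

# The spatial dimensions are unchanged; only the depth is reduced.
h = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')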

1-D Gaussian Kernel

The image below depicts a 1-D Gaussian Kernel:

![1-D Gaussian Kernel](imgs/1d-gaussian.png)

In TensorFlow, the 1-D Gaussian can be computed by specifying two parameters: the mean and the standard deviation, which is commonly denoted by the name sigma.

import tensorflow as tf

# A range of values to evaluate the Gaussian over
# (assumed here, e.g. 100 points from -3 to 3).
x = tf.linspace(-3.0, 3.0, 100)

mean = 0.0
sigma = 1.0
z = (tf.exp(tf.negative(tf.pow(x - mean, 2.0) /
                        (2.0 * tf.pow(sigma, 2.0)))) *
     (1.0 / (sigma * tf.sqrt(2.0 * 3.1415))))

See also the 2-D Gaussian Kernel.

2-D Gaussian Kernel

Like the 1-D Gaussian Kernel, the 2-D Gaussian Kernel has its peak in the middle and its values decrease exponentially as you move away from the center. When the 1-D Gaussian is matrix multiplied by its own transpose, the result is a 2-D Gaussian, which can be depicted as such:

![2-D Gaussian Kernel](imgs/2d-gaussian.png)

Following from the definition of the 1-D Gaussian Kernel, the 2-D Gaussian Kernel can be computed in TensorFlow as such:

# Let's store the number of values in our Gaussian curve.
ksize = z.get_shape().as_list()[0]

# Let's multiply the two to get a 2d gaussian
z_2d = tf.matmul(tf.reshape(z, [ksize, 1]), tf.reshape(z, [1, ksize]))

Activation Function

The activation function, also known as the non-linearity, describes the non-linear operation in a Neural Network. Neural Networks gain the power to describe very complex functions by performing a series of linear and nonlinear operations. These two operations, a linear operation followed by a nonlinearity, are typically grouped together in a single layer, and a neural network is composed of many layers. Typical activation functions include the sigmoid, TanH, or ReLu, as shown below:

![Activation Functions](imgs/activation.png)

This graph depicts three activation functions. Any value on the horizontal (x) axis is transformed "nonlinearly" into the corresponding value on the vertical (y) axis.
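
As a small sketch, these three activations can be applied in TensorFlow like so:

import tensorflow as tf

# A range of input values to pass through each nonlinearity.
x = tf.linspace(-6.0, 6.0, 100)

# The three activation functions depicted in the graph above.
y_sigmoid = tf.nn.sigmoid(x)  # squashes values into the range (0, 1)
y_tanh = tf.nn.tanh(x)        # squashes values into the range (-1, 1)
y_relu = tf.nn.relu(x)        # sets negative values to 0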

Accuracy

In classification tasks, the accuracy describes how well a network does at predicting the correct class.

In TensorFlow, we might calculate it like so, assuming we have the true (one-hot encoded) labels in Y, and the predicted output of the network in Y_pred:

predicted_y = tf.argmax(Y_pred, 1)
actual_y = tf.argmax(Y, 1)

# We can then measure the accuracy by seeing whenever these are equal.
correct_prediction = tf.equal(predicted_y, actual_y)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

Adversarial Network

See Generative Adversarial Network.

Adversarial Training

This describes the process of training a network like the Adversarial Network. It is usually composed of two networks whose loss functions are constructed such that they try to fool one another. It is related to game theory in that there should be an equilibrium that allows both models to get stronger together. If that equilibrium is broken, meaning one of the two networks is stronger than the other, then it is difficult to continue training either of them towards anything useful.

ANN

Abbreviation of Artificial Neural Network.

Artificial Intelligence

Artificial Neural Network

Autoencoders

An autoencoder describes a network which encodes its input to some latent encoding layer of smaller dimensions, and then decodes this latent layer back to the original input space dimensions. The purpose of such a network is usually to compress the information in a large dataset such that the innermost layer, the one just following the encoder, retains as much of the information necessary to reconstitute the original dataset as possible. For instance, an image of 256 x 256 x 3 dimensions may be encoded to merely 2 values describing any image's latent encoding. The decoder is then capable of taking these 2 values and creating an image resembling the original image, depending on how well the network is trained/performs.
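
As a minimal sketch of the idea (the layer sizes and variable names here are just illustrative, and a real autoencoder would typically stack several convolutional or fully connected layers), a tiny fully connected autoencoder might look like:

import tensorflow as tf

n_input = 64 * 64 * 3   # flattened input images
n_latent = 2            # size of the latent encoding layer

X = tf.placeholder(tf.float32, [None, n_input])

# Encoder: compress the input down to 2 latent values.
W_enc = tf.Variable(tf.random_normal([n_input, n_latent], stddev=0.01))
b_enc = tf.Variable(tf.zeros([n_latent]))
z = tf.nn.tanh(tf.matmul(X, W_enc) + b_enc)

# Decoder: reconstruct the original input from the latent values.
W_dec = tf.Variable(tf.random_normal([n_latent, n_input], stddev=0.01))
b_dec = tf.Variable(tf.zeros([n_input]))
X_rec = tf.matmul(z, W_dec) + b_dec

# Train by minimizing the reconstruction error (an l2 loss here).
cost = tf.reduce_mean(tf.reduce_sum(tf.square(X - X_rec), 1))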

Back-prop

Abbreviation of Back-propagation.

Back-propagation

This describes the process of propagating the training signal, or error, of a neural network backwards through the network to compute the gradient of each parameter using the chain rule of calculus. This process is used together with an optimization technique such as Gradient Descent.

Back-propagation Through Time

Batch Dimension

The "batch" dimension is often the first, but not necessarily the first, dimension of a Tensor. For example, a 10 x 256 x 256 x 3 dimension Tensor has 10 images of 256 x 256 x 3 dimensions. The batch dimensions indexes all the observations in a "mini-batch", or a small subset of examples from a larger dataset. This is used during Mini Batch Gradient Descent to train a network on the entire contents of a larger dataset.

Batch Normalization

Batch Normalization describes a technique for regularization which effectively smooths the gradient updates during back-propagation. It is suggested by the authors of the technique that it should be applied just before the activation function of a layer.

Batches

Batches describe the individual mini-batches in mini batch gradient descent used during training.

Bias

The bias describes an additive shift applied after a layer's weighted sum (the b in xW + b), allowing the output to be shifted independently of the input.

Blur

Blurring is a technique which effectively smooths a signal, reducing "high-frequencies", or sharp discontinuities, in a signal. It is often used as a technique for regularization, for instance during Deep Dream or Style Net, on the overall activations of a gradient or on the final result.

Celeb Dataset

The Celeb Dataset (CelebA) is a dataset of over 200,000 images of celebrity faces: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html. It is a popular choice for image networks as it contains a version of the dataset which has been "frontally aligned", meaning a computer vision technique was used to find the faces in various photos and align and warp them so that the faces appear to be looking straight into the camera. This effectively reduces the overall invariances in the dataset, making it a simpler dataset to learn.

Char-RNN

Char-RNN [2] implements a Character Language Model as described in [1], capable of predicting characters in a sequence, one at a time.

[1]. Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv Preprint arXiv:1308.0850, 1–43. Retrieved from http://arxiv.org/abs/1308.0850
[2]. https://github.com/karpathy/char-rnn

Character Language Model

Described in [1], the basic idea is to take one character at a time and try to predict the next character in sequence. Given enough sequences, the model is capable of generating entirely new sequences all on its own. See Char-RNN for an example.

[1]. Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv Preprint arXiv:1308.0850, 1–43. Retrieved from http://arxiv.org/abs/1308.0850

Checkpoint

In TensorFlow, a checkpoint describes a model and its weights during the process of training. These are often created every so many iterations or epochs during training.
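
A rough sketch of saving and restoring a checkpoint with tf.train.Saver (the variable and file names here are just examples):

import tensorflow as tf

# A toy variable so the Saver has something to store.
W = tf.Variable(tf.zeros([10, 10]), name='W')

# A Saver writes (and restores) the current values of the graph's variables.
saver = tf.train.Saver()

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

# Save a checkpoint, e.g. every so many iterations during training.
saver.save(sess, './model.ckpt', global_step=100)

# Later, restore the weights from that checkpoint.
saver.restore(sess, './model.ckpt-100')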

Classification

Classification Network

Clip

Complex Cell

Computational Graph

Describes the overall set of operations involved in computing a neural network, including the core operations of the neural network, its loss and optimization functions, gradients, operations involved in saving/restoring and monitoring weights. In TensorFlow, a graph can be created like so:

g = tf.Graph()

However, a default graph is automatically registered and can be obtained by saying:

g = tf.get_default_graph()

And all of its operations can be listed like so:

[print(op.name) for op in g.get_operations()];

Computer Vision

Conditional Probability

Content Features

Content Loss

Context Managers

Convolution

A very common operation in Deep Learning is convolution. Think of it as a way of filtering information. For instance, with a Gaussian kernel, convolution acts in a way that allows the Gaussian kernel to be the lens through which we see our data. At every location we tell it to filter, it averages the surrounding image values, weighted by the kernel's values. The Gaussian kernel is basically saying: take a lot of the center, and then decreasingly less as you go farther away from the center. The effect of convolving an image with this type of kernel is that the entire image is blurred. If you would like an interactive exploration of convolution, this website is great:

http://setosa.io/ev/image-kernels/
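
Following on from the Gaussian Kernel entries above (and assuming z_2d and ksize are defined as in the 2-D Gaussian Kernel entry), convolving an image with the Gaussian kernel to blur it might look like:

import tensorflow as tf

# A hypothetical batch of grayscale images: [batch, height, width, channels].
img = tf.placeholder(tf.float32, [None, 100, 100, 1])

# tf.nn.conv2d expects kernels shaped [height, width, in_channels, out_channels].
kernel = tf.reshape(z_2d, [ksize, ksize, 1, 1])

# Convolving the image with the Gaussian kernel blurs it.
blurred = tf.nn.conv2d(img, kernel, strides=[1, 1, 1, 1], padding='SAME')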

Convolutional Autoencoder

Convolutional Networks

A neural network which employs convolution.

Convolve

The act of performing a convolution, i.e. a convolution is performed by convolving an image with a kernel.

Covariance

Covariance Matrix

Cost

Cost measures the overall performance of a network and is used during optimization to train the parameters of a network. It can sometimes be used interchangeably with loss. In this course, I've chosen to use cost for the final measure obtained by averaging the loss across a batch, while loss is reserved for single-observation measures such as l2, l1, or cross-entropy.

Cross Entropy

Cross entropy is an information theoretic measure of the distance (the term is used loosely here) between two vectors when they represent probabilities. In TensorFlow, it can be computed as:

cross_entropy = -tf.reduce_sum(Y * tf.log(Y_pred + 1e-12))

assuming that Y is the "true" distribution, and Y_pred is a predicted distribution that is the output from a Neural Network. This distribution should be probabilistic, meaning its sum adds to 1, and there are no negative values. This can be achieved with a softmax layer.

Cross Validation

Cross validation is a common technique for validation.

Dataset

Datasets describe the data used for training, validating, and testing a machine learning model. There are a ton of datasets out there that current machine learning researchers use. For instance, http://deeplearning.net/datasets/ includes MNIST, CalTech, CelebNet, LFW, CIFAR, MS Coco, Illustration2Vec, and there are tons more. These are primarily image based, but if you are interested in finding more, just do a quick search or drop a message on the forums if you're looking for something in particular.

- MNIST
- CalTech
- CelebNet
- ImageNet: http://www.image-net.org/
- LFW
- CIFAR10
- CIFAR100
- MS Coco: http://mscoco.org/home/
- WLFDB: http://wlfdb.stevenhoi.com/
- Flickr 8k: http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/KCCA.html
- Flickr 30k

Dataset Augmentation

DCGAN

Abbreviation of Deep Convolutional Generative Adversarial Network.

Decoder

Deep Convolutional Networks

Deep Convolutional Generative Adversarial Network

Deep Dream

Deep Dreaming

Deep Learning vs. Machine Learning

Deep Learning is a type of Machine Learning algorithm that uses Neural Networks to learn. The type of learning is "Deep" because it is composed of many layers of Neural Networks.

Denoising Autoencoder

Deprocessing

Deviation

Discriminator

Distributed Representation

Dot Product

DRAW

Dropout

Early Stopping

Embedding

An embedding typically describes a transformation of input data prior to further learning. For instance, with language models, individual letters may be transformed to a one-hot encoding where each letter is represented by a single feature and a binary value of 0 or 1 denoting which letter it is.
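
For instance, a hypothetical encoding of three letters ('a', 'c', 'b') as indices into a 26-letter alphabet can be turned into one-hot vectors like so:

import tensorflow as tf

# Indices of the letters 'a', 'c', and 'b' in a 26-letter alphabet.
letters = tf.constant([0, 2, 1])

# Each letter becomes a 26-dimensional vector with a single 1
# marking which letter it is.
one_hot = tf.one_hot(letters, depth=26)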

Encoder

Epoch

Equilibrium

Error

Example

Feed-forward Neural Network

Filter

Filter is often used in the field of Signal Processing to describe a similar idea as the convolution kernel.

Fine Tuning

Fine tuning describes the process of loading an existing trained model and continuing the training of this model. Often this continuation can lead to improvements in the model's performance when used alongside model improvements, training modifications, or dataset augmentation. The continuation of the training may be on an entirely different dataset in which case it is also related to Transfer Learning.

Forward Propagation

Forward propagation, or forward prop, or fprop, describes the process of computing all nodes from the input to the outputs of a computational graph. For instance, in an object recognition neural network, the forward prop describes all operations from the layers connected to an input image all the way to the final output layer describing which object it is likely to be.

Fully Connected

Fully connected layers, sometimes denoted as affine or linear layers, are layers which perform a matrix multiplication of an input vector. Mathematically, a row vector, x, is right multiplied by a matrix, W.
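
A minimal sketch of such a layer (the sizes here are arbitrary):

import tensorflow as tf

# A batch of input row vectors with 64 features each.
x = tf.placeholder(tf.float32, [None, 64])

# Weights and bias mapping 64 input features to 32 output features.
W = tf.Variable(tf.random_normal([64, 32], stddev=0.1))
b = tf.Variable(tf.zeros([32]))

# Right-multiply the input by the weight matrix and add the bias.
h = tf.matmul(x, W) + b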

Gabor

![Gabor](imgs/gabor.png)

GAN

Abbreviation of Generative Adversarial Networks.

Gaussian

Gaussian Kernel

Generalized Matrix Multiplication

Generative Adversarial Networks

This network, described by [1], is composed of two networks: a Generator and a Discriminator. Together, they are known as the Generative Adversarial Network. The basic idea is that the generator is trying to create things which look like the training data; so for images, more images that look like the training data. The discriminator has to guess whether what it is given is a real training example or the output of the generator. By training one after another, you ensure neither is ever too strong, and both grow stronger together. The discriminator is also learning a distance function! This is pretty cool because we no longer need to measure pixel-based distance; the distance function is learned entirely!

The Generative Adversarial Network, or GAN for short, is in a way very similar to an autoencoder, or at least its implementation is. The discriminator is a lot like the encoder part of the network, except it reduces the input down to a single value, yes or no, 0 or 1, denoting whether the input is a true training example or a generated one.

And the generator network is exactly like the decoder of the autoencoder, except there is nothing feeding into its inner layer; it is just on its own. From whatever vector of hidden values it starts off with, it will generate a new example meant to look just like the training data. One pitfall of this model is that there is no explicit encoding of an input, meaning you can't take an input and find what latent vector would generate it. However, there are recent extensions to this model, such as the VAEGAN, which make it more like the autoencoder framework and allow it to do this.

![Generative Adversarial Network](imgs/gan-1.png)

[1]. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Networks, 1–9. Retrieved from http://arxiv.org/abs/1406.2661

Generator

One component of the Generative Adversarial Network, built using a decoder from a latent feature layer to a Tensor of B x H x W x C images.

Gradient

import tensorflow as tf

# Let's create a simple line with a slope of 5.  The slope is also "gradient" of the line.
# The delta y / delta x = slope = gradient for this simple linear equation.
x = tf.Variable(0.0)
y = x * 5

# This is saying, give us an operation which defines the gradient of y with respect to x.
g = tf.gradients(y, x)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
g[0].eval()
# prints 5

Gradient Clipping

Gradient Descent

![Gradient Descent](imgs/gradient-descent.png)

Graph Definition

Graphs

GRU

Guided Hallucinations

Hidden Layer

Histogram Equalization

Histograms

Hyperparameters

Image Inpainting

Image Labels

Inception Module

Inception Network

Inference

Information Theory

A field of study popularized in the 1940s by Claude Shannon, Norbert Wiener, Alan Turing, et al.

Invariances

We usually call the factors which we want a representation not to vary with "invariances". That just means we are trying not to vary based on some factor; we are invariant to it. For instance, an object could appear on one side of an image or the other. We call that translation invariance. Or it could be seen from one angle or another; that's called rotation invariance. Or it could be closer to the camera or farther away; that would be scale invariance. There are plenty of other types of invariances, such as perspective, brightness, or exposure in the case of photographic images. Many researchers/scientists/philosophers will have other definitions of this term.

Kernel

LAPGAN

Laplacian Pyramid

Latent Encoding

Latent Feature Arithmetic

Latent-Space

Layer

A layer is a convenience for grouping together a common set of operations. For instance, a fully-connected layer usually consists of a matrix multiplication and a bias addition. A convolution layer consists of a convolution with a kernel and a bias addition.

Learning From Data

Learning Rate

![Learning Rate](imgs/learning-rate.png)

The learning rate describes how far along the gradient we should move our parameters.

Linear Regression

Loading a Pretrained Network

Local Minima/Optima

Long Short Term Memory

Loss

The loss helps to define the training objective of a neural network (the term is also used throughout the machine learning literature, such as in energy-based optimization). The loss tries to assess the performance of a network, for instance by determining how close a prediction is to a known target. Typically, the objective is to minimize the loss, though an objective could just as easily be thought of as the maximization of a loss.

Typical losses for unsupervised learning are the l2 or l1 losses. The l2 loss is defined by the 2nd norm of the activations; simply, it is the square of the values. The l1 loss is similarly defined by the 1st norm, or the absolute value of the activations. The cross-entropy loss is typically used in classification tasks. There are numerous other losses, however, such as the hinge loss, log loss, ranking losses, or losses based on adversarial processes, to name a few. The loss is generally summed across all possible features, and then averaged across all observations in a mini-batch to produce a single cost to be optimized.

The final cost is then used to optimize the parameters in a neural network using an optimization algorithm such as gradient descent and backpropagation.
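
For example, an l1 or l2 loss summed across features and averaged across a mini-batch might be written like so (the placeholder shapes here are arbitrary):

import tensorflow as tf

# Hypothetical true values and network predictions for a batch of observations.
Y = tf.placeholder(tf.float32, [None, 10])
Y_pred = tf.placeholder(tf.float32, [None, 10])

# l2 loss: squared difference per feature; l1 loss: absolute difference.
loss_l2 = tf.square(Y - Y_pred)
loss_l1 = tf.abs(Y - Y_pred)

# Sum across features, then average across the mini-batch to get a single cost.
cost = tf.reduce_mean(tf.reduce_sum(loss_l2, 1))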

LSTM

Abbreviation for Long Short Term Memory.

Machine Learning

Manifold

Matrix

Matrix Inverse

Matrix Multiplication

Max Pooling

Mean

Mini Batch

Mini Batch Gradient Descent

MNIST

Models

Network

Network Labels

Neural Network

Nonlinearities

Nonlinearities are another way to describe the activation functions in a neural network. They are called nonlinearities since their outputs are not linear transformations of an input, i.e. not directly proportional to the input. Visually, the function is not a straight line but one with curves.

Norm

Normalization

Objective

One-Hot Encoding

Operations

Optimization

Optimizers

Overfitting

Overfitting describes what happens when a model tends to model noise more than the underlying cause of the data. This can easily happen in neural networks as there are many parameters. A common technique for ensuring this does not happen is to use regularization.

Preprocess

Preprocessing

Pretrained Networks

Priming

Probabilistic Sampling

Protobuf

Rectified Linear Unit

A common type of Activation Function which performs the nonlinear operation in TensorFlow as:

tf.maximum(0.0, x)

It is linear except for a change of slope at 0 and can effectively learn nonlinear patterns with less computation than a sigmoid or tanh function requires. It can also lead to sparse activations, meaning not all the weights in a network are active.

There are also many extensions to ReLus, such as Leaky ReLus, Parametric ReLus, and Noisy ReLus.
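
For example, a Leaky ReLu can be written with the same tf.maximum idea, letting a small fraction of negative values through instead of zeroing them (the slope of 0.2 is just an example):

import tensorflow as tf

x = tf.linspace(-6.0, 6.0, 100)

# Standard ReLu: negative values become 0.
relu = tf.maximum(0.0, x)

# Leaky ReLu: negative values are scaled by a small slope instead.
alpha = 0.2
leaky_relu = tf.maximum(alpha * x, x)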

Recurrent Neural Networks

Regression

Regularization

Regularization aids in cases of overfitting, or when parameters are tuned in a way that describes noise in the data rather than an underlying cause. This can easily happen in neural networks as there are many parameters. Some common techniques for regularizing networks include applying l2-norm penalties to weights, dropout, batch normalization, and increased mini-batch sizes.
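
As a small sketch of the l2-norm penalty (the regularization strength of 1e-4 and the shapes here are arbitrary), the penalty is simply added to the data-fitting cost before optimization:

import tensorflow as tf

# Hypothetical weights of a layer and a data-fitting cost computed elsewhere.
W = tf.Variable(tf.random_normal([64, 32], stddev=0.1))
data_cost = tf.placeholder(tf.float32)

# l2-norm penalty on the weights, scaled by a small regularization strength.
l2_penalty = 1e-4 * tf.reduce_sum(tf.square(W))

# The total cost that gets optimized.
cost = data_cost + l2_penalty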

Reinforcement Learning

ReLu

Abbreviation of Rectified Linear Unit.

RNN

Abbreviation of Recurrent Neural Network.

RL

Abbreviation of Reinforcement Learning.

Saturation

Describes what many non-linear activation functions do when their inputs are pushed toward the extremes: the outputs cluster around certain values. In the case of the sigmoid, input values are saturated to 0 or 1, whereas in tanh, the values are saturated at -1 or 1. Saturation can also describe what happens to "dead" neurons, whose values are mostly all the same, either all 0s, Infinity, or NaNs. These neurons often die or become saturated as a result of very large gradients, poor initialization, or large learning rates.

Scalar

A scalar is simply a single value, e.g. -2, 0, 1, 1.4, 500.

Sessions

In order to actually compute anything in TensorFlow, we use a tf.Session to manage and evaluate the computational graph, tf.Graph.

Sigmoid

The sigmoid function is a common activation function: y = 1 / (1 + exp(-x)). It has a range of (0, 1) and is an s-shaped function. In TensorFlow, it can be used by calling tf.nn.sigmoid.

Simple Cell

Softmax

\begin{equation} softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{k}{e^{x_j}}} \end{equation}

The softmax scales a vector exponentially and ensures that its sum adds to 1. This has the effect of rescaling any vector such that it can be treated as a probability distribution over possible classes. It should be used with a loss that allows you to measure the loss of a probability distribution such as cross entropy. In Tensorflow, there are at least 2 ways to use the softmax layer.

  1. Directly with tf.nn.softmax, on the outputs of a sigmoid transformation (ensuring values are between 0-1).

  2. While computing the cross entropy given unscaled outputs (i.e. no non-linearity such as a sigmoid is computed) using tf.nn.softmax_cross_entropy_with_logits.
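
A sketch of both approaches, assuming hypothetical unscaled network outputs (logits) and one-hot encoded labels:

import tensorflow as tf

# Unscaled network outputs and one-hot encoded true labels for 10 classes.
logits = tf.placeholder(tf.float32, [None, 10])
Y = tf.placeholder(tf.float32, [None, 10])

# 1. Apply the softmax directly and compute the cross entropy ourselves.
Y_pred = tf.nn.softmax(logits)
cross_entropy = -tf.reduce_sum(Y * tf.log(Y_pred + 1e-12))

# 2. Let TensorFlow combine the softmax and cross entropy into one
#    numerically more stable operation on the unscaled outputs.
cross_entropy_2 = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits))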

Softmax Layer

A layer which transforms a batch of features using the Softmax transformation.

Sparse

Standard Deviation

Stochastic

Stochastic Mini Batch Gradient Descent

Style Features

Style Loss

Style Net

Supervised Learning

In probabilistic terms, a supervised learning algorithm tries to optimize the joint probability p(x, y) or the conditional probability p(y|x). For example, it may try to optimize the prediction of an image label given the pixels of the image.

TanH

Temperature

Tensor

A Tensor describes a geometric object which is an N-dimensional array. For instance, a 1-dimensional Tensor is also known as a Vector, and a 2-dimensional Tensor is also known as a Matrix. Thinking of Tensors as geometric objects allows us to think about linear relationships between Tensors, such as the dot product or the cross product. In TensorFlow, Tensors are described by a name, which states what operation they are the result of, their shape, e.g. (100, 100), and their dtype, such as tf.float32 or tf.int32.

Tensor Shapes

A Tensor's shape describes the number of elements in each dimension. It is similar to a numpy array's shape and can be accessed as x.get_shape() or x.get_shape().as_list().
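
For example:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 256, 256, 3])

# The static shape as a TensorShape object, and as a plain Python list.
print(x.get_shape())            # (?, 256, 256, 3)
print(x.get_shape().as_list())  # [None, 256, 256, 3]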

Tensorboard

Tensorboard is a web application provided by Tensorflow for monitoring the training of a computational graph. It requires you to write Summary objects such as tf.scalar_summary or tf.histogram_summary, and to create a tf.train.SummaryWriter for writing summaries to a given destination, e.g. path/to/log-directory. It can then be started from the command line like so: tensorboard --logdir=path/to/log-directory, and then accessed in a web browser at the default port of 6006, e.g. http://localhost:6006.
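
As a rough sketch, here is how this might look using the TensorFlow 1.x names of these summary functions (tf.summary.scalar and tf.summary.FileWriter correspond to the older tf.scalar_summary and tf.train.SummaryWriter mentioned above; the log directory path is just an example):

import tensorflow as tf

# A toy value to monitor during training.
cost = tf.Variable(0.0, name='cost')

# Create a summary for the cost and a writer for the log directory.
cost_summary = tf.summary.scalar('cost', cost)
writer = tf.summary.FileWriter('path/to/log-directory')

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

# During training, evaluate the summary and write it out with the current step.
summary_str = sess.run(cost_summary)
writer.add_summary(summary_str, 0)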

Testing

Total Variation Loss

Training

Training Error

Training Parameters

Training vs. Testing

Transpose

Unsupervised Learning

In probabilistic terms, unsupervised learning tries to model p(x), the probability of observing some data x.

Unsupervised vs. Supervised Learning

Machine learning research in deep networks generally performs one of two types of learning. You either have a lot of data and you want the computer to reason about it, maybe to encode the data using less data, and just explore what patterns there might be. That's useful for clustering data, reducing the dimensionality of the data, or even for generating new data. That's generally known as unsupervised learning. In the supervised case, you actually know what you want out of your data. You have something like a label or a class that is paired with every single piece of data. There are other types of learning, such as reinforcement learning, though these are not discussed in this course.

VAEGAN

Validation

Validation describes a method of measuring the performance of an algorithm. There are many kinds of validation, for instance k-fold validation or cross validation. In Deep Learning, there is often enough data to perform a simpler form of validation which holds out portions of a dataset for training, validation, and testing. While a model is being trained, only the portion of the dataset designated for training, e.g. 80% of the entire dataset, is used. While training the model over iterations, another portion of the dataset, e.g. 10%, is used to monitor the progress of training. Finally, once the model is finished training, the last portion of the dataset can be used to assess a final test performance. Ideally, the process is repeated over a number of k-folds, e.g. k=2 or k=5, such that each of the k partitions of the dataset is used as a test set at some point. However, Deep Learning datasets are often large enough to assume that a test partition has enough variance, and in practice it is also often far too painful to consider training a model k times.
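
A sketch of the simpler hold-out form of validation, splitting a hypothetical dataset into 80% training, 10% validation, and 10% testing portions:

import numpy as np

# A hypothetical dataset of 1000 observations with 64 features each.
X = np.random.rand(1000, 64)

# Shuffle, then hold out 80% for training, 10% for validation, 10% for testing.
idxs = np.random.permutation(len(X))
n_train = int(len(X) * 0.8)
n_valid = int(len(X) * 0.1)
X_train = X[idxs[:n_train]]
X_valid = X[idxs[n_train:n_train + n_valid]]
X_test = X[idxs[n_train + n_valid:]]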

Validation Error

Validation error describes the error of a validation set during training. Ideally, this error stays close to the training error, and both drop over more and more iterations. If this is not the case, for instance if the training error continues to drop but not the validation error, it is possible the model is overfitting. See validation for more details.

Vanishing Gradient

Describes the problem in deep networks whose gradients "vanish", becoming closer and closer to 0 and eventually reaching 0. This happens because back-propagation requires gradients to be chained and multiplied together. If the gradients are very small, their repeated multiplication makes the propagated gradient smaller and smaller at each layer, eventually becoming 0. The reverse of this is the exploding gradient, which occurs when gradients are greater than 1 and quickly grow toward infinity in a similar manner.

Some solutions to this problem have been addressed for recurrent neural networks such as using gating mechanisms (e.g., see LSTM or GRU). Other solutions include making the network shallower or using batch normalization.

Variable

Variance

Variational Auto-Encoding Generative Adversarial Network

Variational Autoencoders

Variational Layer

Vector

VGG Network

Xavier Initialization

Describes a weight initialization procedure, described by [1], which sets the weights of an M x N matrix to values uniformly sampled in the range [-sqrt(6) / sqrt(M + N), sqrt(6) / sqrt(M + N)].

[1]. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010.
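
A minimal sketch of this initialization for a hypothetical M x N weight matrix:

import math
import tensorflow as tf

# A hypothetical fully connected layer with M inputs and N outputs.
M, N = 256, 128

# Uniformly sample weights in [-sqrt(6)/sqrt(M + N), sqrt(6)/sqrt(M + N)].
limit = math.sqrt(6.0) / math.sqrt(M + N)
W = tf.Variable(tf.random_uniform([M, N], minval=-limit, maxval=limit))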


Thanks to Golan Levin for suggesting the idea.