Anomaly Detection Using Gaussian Distribution

Gaussian (Normal) Distribution

The normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.

Let's say:

$x \in \mathbb{R}$

If x is normally distributed then it may be displayed as follows.

[Figure: Gaussian distribution]

$\mu$ - mean value,

$\sigma^2$ - variance.

$x \sim \mathcal{N}(\mu, \sigma^2)$ - "~" means that "x is distributed as ..."

Then the Gaussian probability density (how likely a value x is under a distribution with a given mean and variance) is given by:

$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
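As a minimal sketch of the density formula, the snippet below evaluates it at two points. The name `gaussian_pdf` and the chosen values of the mean and variance are assumptions made for illustration only; they are not taken from the repository files.

```octave
% Univariate Gaussian density p(x; mu, sigma^2) as an anonymous function.
% gaussian_pdf is a hypothetical name used only for this illustration.
gaussian_pdf = @(x, mu, sigma2) exp(-(x - mu) .^ 2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2);

mu = 0;      % mean (assumed value)
sigma2 = 1;  % variance (assumed value)

p_at_mean = gaussian_pdf(0, mu, sigma2);  % density at the mean, the peak of the curve
p_far = gaussian_pdf(4, mu, sigma2);      % density four standard deviations from the mean

fprintf('p(0) = %g\n', p_at_mean);
fprintf('p(4) = %g\n', p_far);
```

The density is highest at the mean and falls off quickly with distance from it, which is what makes low density values usable as an anomaly signal.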

Estimating Parameters for a Gaussian

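We may use the following formulas to estimate the Gaussian parameters (mean and variance) for the i-th feature:

$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$$

$$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$$

$$i = 1, 2, \dots, n$$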

m - number of training examples.

n - number of features.
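The estimates above can be sketched in a few lines. The data matrix `X` below is made up (one row per training example, one column per feature), and this is only an illustration, not necessarily how `estimate_gaussian.m` is written.

```octave
% Toy training set: m = 4 examples (rows), n = 2 features (columns). Values are made up.
X = [1 10; 2 12; 3 14; 4 16];
m = size(X, 1);

% Per-feature mean: 1 x n row vector.
mu = sum(X, 1) / m;

% Per-feature variance: 1 x n row vector. Note the division by m (not m - 1),
% matching the formula above. X - mu relies on automatic broadcasting.
sigma2 = sum((X - mu) .^ 2, 1) / m;

disp(mu);
disp(sigma2);
```

For this toy matrix the result should agree with the built-ins `mean(X)` and `var(X, 1)`, where the second argument of `var` selects normalization by m.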

Density Estimation

So we have a training set:

$$\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$$

$x \in \mathbb{R}^n$

We assume that each feature of the training set is normally distributed:

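$x_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$

$x_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$

$x_n \sim \mathcal{N}(\mu_n, \sigma_n^2)$

Then:

$$p(x) = p(x_1; \mu_1, \sigma_1^2) \, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$$

$$p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2)$$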
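To illustrate the product of per-feature densities, here is a minimal sketch for a single example. The parameter values and the example `x` are assumed for illustration; treating the features as independent is the modelling assumption stated above.

```octave
% Assumed per-feature parameters for n = 2 features (illustrative values).
mu = [2.5 13];
sigma2 = [1.25 5];

% One example with n = 2 features (made up).
x = [3 12];

% Density of each feature under its own Gaussian: 1 x n row vector.
per_feature = exp(-(x - mu) .^ 2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2);

% Joint density p(x) as the product over features.
p = prod(per_feature);

fprintf('p(x) = %g\n', p);
```

With many features the product can underflow toward zero, so a sum of logarithms, `sum(log(per_feature))`, is a common alternative when only comparisons against a threshold are needed.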

Anomaly Detection Algorithm

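  1. Choose features $x_i$ that might be indicative of anomalous examples, using the training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$.
  2. Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$ using the formulas:

$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$$

$$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$$

  3. Given a new example $x$, compute $p(x)$:

$$p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$$

Anomaly if $p(x) < \varepsilon$

$\varepsilon$ - probability threshold.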
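The three steps can be sketched end to end as follows. All data values and the threshold `epsilon` are made up for illustration; in practice the threshold would be chosen on labelled validation data rather than fixed by hand.

```octave
% Step 1: made-up training data, m = 5 examples with n = 2 features each.
X = [4.8 5.1; 5.2 4.9; 5.0 5.0; 4.9 5.2; 5.1 4.8];
m = size(X, 1);

% Step 2: fit per-feature mean and variance.
mu = sum(X, 1) / m;
sigma2 = sum((X - mu) .^ 2, 1) / m;

% Step 3: compute p(x) for new examples (one per row).
X_new = [5.0 5.1; 9.0 1.0];
P = exp(-(X_new - mu) .^ 2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2);
p = prod(P, 2);  % product across features, one value per example

% Flag examples whose density falls below the threshold.
epsilon = 1e-3;  % hypothetical threshold, chosen only for this example
is_anomaly = p < epsilon;

disp(is_anomaly');
```

The first new example lies close to the training data and is not flagged, while the second lies far from it and is. Since $p(x)$ is a density rather than a probability, values above 1 are possible when the features are tightly clustered.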

Algorithm Evaluation

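The algorithm may be evaluated using the F1 score.

The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 (perfect precision and recall) and its worst at 0.

[Figure: F1 Score]

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Where:

$$\text{precision} = \frac{tp}{tp + fp}$$

$$\text{recall} = \frac{tp}{tp + fn}$$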

tp - number of true positives.

fp - number of false positives.

fn - number of false negatives.
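As a small worked example of these definitions, the snippet below computes the counts and the F1 score for made-up label vectors, where 1 marks an anomaly and 0 a normal example. The variable names are illustrative.

```octave
% Made-up ground truth and predictions (1 = anomaly, 0 = normal).
y_true = [0 0 1 1 0 1 0 0];
y_pred = [0 1 1 0 0 1 0 0];

tp = sum((y_pred == 1) & (y_true == 1));  % anomalies correctly flagged
fp = sum((y_pred == 1) & (y_true == 0));  % normal examples wrongly flagged
fn = sum((y_pred == 0) & (y_true == 1));  % anomalies that were missed

precision = tp / (tp + fp);
recall = tp / (tp + fn);
f1 = 2 * precision * recall / (precision + recall);

fprintf('precision = %g, recall = %g, F1 = %g\n', precision, recall, f1);
```

True negatives do not appear in the formula, which is why F1 is often preferred over plain accuracy when anomalies are rare.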

Files

  • demo.m - main file that you should run from the Octave console in order to see the demo.
  • server_params.mat - training data set.
  • estimate_gaussian.m - function that estimates the parameters of a Gaussian distribution using the data in X.
  • multivariate_gaussian.m - function that computes the probability density function of the multivariate Gaussian distribution.
  • select_threshold.m - function that finds the best threshold (epsilon) to use for selecting outliers.
  • visualize_fit.m - function that visualizes the data set and its estimated distribution.
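One plausible way to pick the threshold is to scan candidate values of epsilon over labelled validation data and keep the one with the highest F1 score. The sketch below illustrates that idea only; it is not the code in `select_threshold.m`, and the densities `p_val`, the labels `y_val`, and the number of candidates are all made up.

```octave
% Made-up validation densities and labels (1 = anomaly, 0 = normal).
p_val = [0.30 0.25 0.001 0.28 0.0005 0.22 0.02];
y_val = [0 0 1 0 1 0 0];

best_epsilon = 0;
best_f1 = 0;

% Try 1000 evenly spaced candidate thresholds between the smallest and largest density.
candidates = linspace(min(p_val), max(p_val), 1000);
for epsilon = candidates
  pred = p_val < epsilon;  % flagged as anomalies at this threshold
  tp = sum(pred & (y_val == 1));
  fp = sum(pred & (y_val == 0));
  fn = sum(~pred & (y_val == 1));
  if tp == 0
    continue;  % F1 is zero or undefined without any true positives
  end
  precision = tp / (tp + fp);
  recall = tp / (tp + fn);
  f1 = 2 * precision * recall / (precision + recall);
  if f1 > best_f1
    best_f1 = f1;
    best_epsilon = epsilon;
  end
end

fprintf('best epsilon = %g, best F1 = %g\n', best_epsilon, best_f1);
```

Because the comparison is strict, the first candidate that reaches the best score is kept, so ties resolve toward the smaller threshold.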

Demo visualizations

[Figure: Demo visualization]
