Skip to content

Latest commit

 

History

History

linear-regression

Linear Regression

Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, that output variable (y) can be calculated from a linear combination of the input variables (x).

Linear Regression

On the image above there is an example of dependency between input variable x and output variable y. The red line in the above graph is referred to as the best fit straight line. Based on the given data points (training examples), we try to plot a line that models the points the best. In the real world scenario we normally have more than one input variable.

Features (variables)

Each training example consists of features (variables) that describe this example (i.e. number of rooms, the square of the apartment etc.)

Features

n - number of features

Rn+1 - vector of n+1 real numbers

Parameters

Parameters of the hypothesis we want our algorithm to learn in order to be able to do predictions (i.e. predict the price of the apartment).

Parameters

Hypothesis

The equation that gets features and parameters as an input and predicts the value as an output (i.e. predict the price of the apartment based on its size and number of rooms).

Hypothesis

For convenience of notation, define X0 = 1

Cost Function

Function that shows how accurate the predictions of the hypothesis are with current set of parameters.

Cost Function

xi - input (features) of ith training example

yi - output of ith training example

m - number of training examples

Batch Gradient Descent

Gradient descent is an iterative optimization algorithm for finding the minimum of a cost function described above. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

Picture below illustrates the steps we take going down of the hill to find local minimum.

Gradient Descent

The direction of the step is defined by derivative of the cost function in current point.

Gradient Descent

Once we decided what direction we need to go we need to decide what the size of the step we need to take.

Gradient Descent

We need to simultaneously update Theta for j = 0, 1, ..., n

Gradient Descent

Gradient Descent

alpha - the learning rate, the constant that defines the size of the gradient descent step

x-i-j - jth feature value of the ith training example

x-i - input (features) of ith training example

yi - output of ith training example

m - number of training examples

n - number of features

When we use term "batch" for gradient descent it means that each step of gradient descent uses all the training examples (as you might see from the formula above).

Feature Scaling

To make linear regression and gradient descent algorithm work correctly we need to make sure that features are on a similar scale.

Feature Scaling

For example "apartment size" feature (e.g. 120 m2) is much bigger than the "number of rooms" feature (e.g. 2).

In order to scale the features we need to do mean normalization

Mean Normalization

x-i-j - jth feature value of the ith training example

mu-j - average value of jth feature in training set

s-j - the range (max - min) of jth feature in training set.

Polynomial Regression

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x.

Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the hypothesis function is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

Polynomial Regression

Example of a cubic polynomial regression, which is a type of linear regression.

You may form polynomial regression by adding new polynomial features.

For example if the price of the apartment is in non-linear dependency of its size then you might add several new size-related features.

Polynomial Regression

Normal Equation

There is a closed-form solution to linear regression exists and it looks like the following:

Normal Equation

Using this formula does not require any feature scaling, and you will get an exact solution in one calculation: there is no “loop until convergence” like in gradient descent.

Regularization

Overfitting Problem

If we have too many features, the learned hypothesis may fit the training set very well:

overfitting

But it may fail to generalize to new examples (let's say predict prices on new example of detecting if new messages are spam).

overfitting

Solution to Overfitting

Here are couple of options that may be addressed:

  • Reduce the number of features
    • Manually select which features to keep
    • Model selection algorithm
  • Regularization
    • Keep all the features, but reduce magnitude/values of model parameters (thetas).
    • Works well when we have a lot of features, each of which contributes a bit to predicting y.

Regularization works by adding regularization parameter to the cost function:

Cost Function

Note that you should not regularize the parameter theta zero.

regularization parameter - regularization parameter

In this case the gradient descent formula will look like the following:

Gradient Descent

Files

Demo visualizations

Demo visualization

References