Linear Regression

Written by Juzer Shakir


Description

A mathematical intuition and quick guide to how the Linear Regression algorithm works, with links to other study materials to understand the concepts more concretely.

Notations

  • m 👉 Number of training examples.
  • x 👉 "input" variable / features.
  • y 👉 "output" variable / "target" variable.
  • n 👉 Number of feature variables (x).
  • (x, y) 👉 One training example.
  • xi , yi 👉 The ith training example.
  • xij 👉 The value of the jth feature / column in the ith training example.

Definition

A linear equation that models a function such that if we give it any x, it will predict a value y, where x and y are the input and output variables respectively. Both are numerical and continuous values.

It is one of the simplest and best-known algorithms used in machine learning.

Flowchart


The above flowchart shows that we take our training set and feed it to a learning algorithm, which learns the patterns in the data and outputs a function H(x). We then give any x value to that function and it outputs an estimated y value.

For historical reasons, this function H(x) is called the hypothesis function.


Univariate Linear Regression

Definition for Univariate Linear Regression

When we have one feature / variable x as the input to the function to predict y, we call this a Univariate Linear Regression problem.

Formula for Univariate Linear Regression

H(x) = θ0 + θ1x

Another way of representing this formula, in a form we are more familiar with:

H(x) = b + mx

Where :

  • b = θ0 👉 y intercept
  • m = θ1 👉 slope
  • x = x 👉 feature / input variable
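
To make the formula concrete, here is a minimal sketch in Python (the names hypothesis, theta0 and theta1 are just illustrative):

```python
def hypothesis(x, theta0, theta1):
    """Univariate hypothesis: H(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Example: with a y-intercept of 2 and a slope of 0.5, x = 4 is predicted as 4.0.
print(hypothesis(4, theta0=2, theta1=0.5))  # -> 4.0
```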

Source


Help ✍🏼

Cost Function for Univariate Linear Regression

All that said, how do we figure out the best possible straight-line fit to the data we feed in?

This is where the Cost Function helps us:

The best-fit line through our data is the one with the least distance between the predicted y values and the actual y values of the training set.

Formula for Cost Function

J(θ0, θ1) = (1 / 2m) × Σ ( h(xi) - yi )²   , where the sum runs over i = 1 ... m

Where :

  • h(xi) 👉 hypothesis function
  • yi 👉 actual values of y
  • 1/m 👉 gives Mean of Squared Errors
  • 1/2 👉 The mean is halved as a convenience for computing Gradient Descent: the derivative of the squared term produces a factor of 2 that cancels the 1/2.

The above formula takes the difference between each predicted value and the actual value in the training set, squares it, averages these squared errors over all the examples, and multiplies the result by 1/2.

This cost function is also called the Squared Error Function or Mean Squared Error.

🙋‍ Why do we take squares of the errors?
The MSE function is commonly used, is a reasonable choice, and works well for most regression problems.

Let's substitute the MSE function into the function J :

J(θ0, θ1) = (1 / 2m) × Σ ( θ0 + θ1xi - yi )²

Help ✍🏼
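
As a quick sketch of what computing this cost looks like in code (assuming NumPy; compute_cost and the other names are illustrative):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Mean squared error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)                                # number of training examples
    predictions = theta0 + theta1 * x         # h(xi) for every example
    squared_errors = (predictions - y) ** 2   # (h(xi) - yi)^2
    return squared_errors.sum() / (2 * m)     # halved mean of the squared errors

# Tiny illustrative dataset: y = 2x fits perfectly, so the cost is 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(x, y, theta0=0.0, theta1=2.0))  # -> 0.0
```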

Gradient Descent for Univariate Linear Regression

So now we have our hypothesis function and a way of measuring how well it fits the data. Next we need to estimate the parameters of the hypothesis function, and that's where Gradient Descent comes in.
Gradient Descent is used to minimize the cost function J; minimizing J is the same as minimizing the MSE, which gives the best possible fit line to our data.

Formula for Gradient Descent

Repeat until convergence (updating all θj simultaneously):

θj := θj - α × ∂/∂θj J(θ0, θ1)

Where :

  • := 👉 The assignment operator.
  • α 👉 Alpha, the learning rate. If it is too high, Gradient Descent may fail to converge; if it is too low, descending will be slow.
  • j 👉 The index of the parameter / feature being updated.
  • ∂/∂θj J(θ0, θ1) 👉 The partial derivative of the MSE cost function with respect to θj.


Additional Resources ✍🏼



Now let's apply Gradient Descent to minimize our MSE function.
In order to apply Gradient Descent, we need to work out the partial derivative term, so let's solve the partial derivatives of the cost function J:

∂/∂θ0 J(θ0, θ1) = (1/m) × Σ ( h(xi) - yi )
∂/∂θ1 J(θ0, θ1) = (1/m) × Σ ( h(xi) - yi ) × xi

Now let's plug these two derivatives into our Gradient Descent update:

θ0 := θ0 - α × (1/m) × Σ ( h(xi) - yi )
θ1 := θ1 - α × (1/m) × Σ ( h(xi) - yi ) × xi

Note : 🚩

  • The Cost Function for Linear Regression is always a convex (bowl-shaped) function, so it has no local minima, only one global minimum; Gradient Descent therefore always converges to the global minimum (given a suitable learning rate α).
  • The above hypothesis function has 2 parameters, θ0 and θ1, so Gradient Descent runs an update for each of them — once for the feature's coefficient and once for the base (y-intercept) — to reach the minimum of J. So if we have n features, Gradient Descent updates all n+1 parameters.
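
A minimal sketch of these simultaneous updates in Python (assuming NumPy; the learning rate, iteration count, and toy data are illustrative):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Run gradient descent for H(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y   # h(xi) - yi for every example
        grad0 = error.sum() / m             # partial derivative w.r.t. theta0
        grad1 = (error * x).sum() / m       # partial derivative w.r.t. theta1
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])          # generated by y = 1 + 2x
print(gradient_descent(x, y))               # approximately (1.0, 2.0)
```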

Multivariate Linear Regression

Linear Algebra ✍🏼

Definition for Multivariate Linear Regression

It is the same as Univariate Linear Regression, except that there is more than one feature variable (x) to predict the target variable (y).

Formula for Multivariate Linear Regression

Our hypothesis function for n = 4 :

H(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
Where :

  • θ0 👉 y intercept
  • And the rest are the features x (with their parameters θ) that help predict the value of y.

Intuition:
In order to develop an intuition about this function, imagine that it represents the price of a house (y) based on the given features (x). Then we can think of this function as:

  • θ0 as the basic price of a house.
  • θ1 as price/m2.
  • x1 as area of a house (m2).
  • θ2 as price/floor.
  • x2 as number of floors.
  • etc (You get the idea)


Let's collect all the parameters into a vector θ:

θ0, θ1, θ2, θ3.......θn = θ


And let's collect all the features into a vector x:

x0, x1, x2, x3.......xn = x


Where :

  • θ 👉 will be an n+1 dimensional vector, because θ0 does not correspond to a feature.
  • x0 👉 is added just for convenience so that we can write the hypothesis as the matrix multiplication H(x) = θᵀx; we set x0 = 1, so it does not change the predicted value.
  • x 👉 will also be an n+1 dimensional vector.
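
A minimal NumPy sketch of the vectorized hypothesis H(x) = θᵀx, reusing the house-price intuition from above (all numbers are made up for illustration):

```python
import numpy as np

# Parameter vector [theta0, theta1, theta2] -- n + 1 values.
theta = np.array([50.0, 0.2, 10.0])

# One training example with n = 2 features, with x0 = 1 prepended
# so that the y-intercept theta0 is picked up by the dot product.
x = np.array([1.0, 1500.0, 2.0])    # [x0, area in m2, number of floors]

# H(x) = theta transposed times x
prediction = theta @ x
print(prediction)                    # -> 50 + 0.2 * 1500 + 10 * 2 = 370.0
```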


Cost Function for Multivariate Linear Regression

So,

J(θ0, θ1, θ2, θ3.......θn) = J(θ)


Where,

J(θ) = (1 / 2m) × Σ ( h(xi) - yi )²   , with the sum running over i = 1 ... m
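
A sketch of J(θ) computed in vectorized form (assuming NumPy; X is the m × (n+1) matrix of examples whose first column is all ones):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = (1 / 2m) * sum((X @ theta - y) ** 2)."""
    m = len(y)
    errors = X @ theta - y              # h(xi) - yi for all examples at once
    return (errors @ errors) / (2 * m)  # halved mean of the squared errors

# Tiny illustrative data: the first column of X is x0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(X, y, theta=np.array([0.0, 2.0])))  # -> 0.0
```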
Gradient Descent for Multivariate Linear Regression



Applying Gradient Descent to minimize our MSE function, after solving the partial derivative of J(θ), we get:

Repeat until convergence (simultaneously for j = 0 ... n):

θj := θj - α × (1/m) × Σ ( h(xi) - yi ) × xij

For example, if we had 2 features, this is how Gradient Descent would run on each parameter:

θ0 := θ0 - α × (1/m) × Σ ( h(xi) - yi ) × xi0
θ1 := θ1 - α × (1/m) × Σ ( h(xi) - yi ) × xi1
θ2 := θ2 - α × (1/m) × Σ ( h(xi) - yi ) × xi2
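
A minimal vectorized sketch of this update rule (assuming NumPy; X again carries a leading column of ones for x0, and the hyperparameters and toy data are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Simultaneously update all n + 1 parameters in theta."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m   # (1/m) * sum((h(xi) - yi) * xij) for every j
        theta -= alpha * gradient              # theta_j := theta_j - alpha * gradient_j
    return theta

# Tiny illustrative data generated from y = 1 + 2 * x1 + 3 * x2.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0]])
y = np.array([3.0, 4.0, 6.0, 11.0])
print(gradient_descent(X, y))  # approximately [1. 2. 3.]
```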

Feature Scaling and Mean Normalization

We can speed up Gradient Descent by having each of our input values in roughly the same range. This is because θ descends quickly on small ranges and slowly on large ranges, and it will oscillate inefficiently down to the optimum (minimum) when the variables' ranges are very uneven.


The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally between :

-1 ≤ xi ≤ 1

OR

-0.5 ≤ xi ≤ 0.5


These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges.
This can make Gradient Descent run much faster and converge in far fewer iterations.



Two techniques to help with this are Feature Scaling and Mean Normalization.

  • Feature Scaling involves dividing the input values by the range (i.e. max value - min value) of the input variable, resulting in new values that span roughly a range of 1.

  • Mean Normalization involves subtracting the average value of a feature from each of its values and then dividing by the range (or by the standard deviation), resulting in new values centered around zero:

xj := ( xj - μj ) / sj

Where:

  • μj 👉 The average of feature j.
  • sj 👉 Either the range or the standard deviation of feature j.
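
A minimal NumPy sketch of Mean Normalization applied to one feature column (the house sizes are made-up values; dividing by the standard deviation instead of the range works the same way):

```python
import numpy as np

def mean_normalize(feature):
    """Return (feature - mean) / range for one feature column."""
    mu = feature.mean()                  # average of the feature, mu_j
    s = feature.max() - feature.min()    # range of the feature, s_j
    return (feature - mu) / s

# Illustrative house sizes in m2: widely spread raw values...
sizes = np.array([890.0, 1400.0, 2100.0, 3200.0])
print(mean_normalize(sizes))  # ...become small values centered around 0
```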

Bias - Variance

High Bias

Consider a problem of predicting y. The figure below shows the result of fitting a hypothesis function θ0 + θ1x to a dataset.

We see that the data doesn't really lie on a straight line and so the fit is not very good.
This figure is an instance of Underfitting, in which the data clearly shows structure not captured by the model h(x).
Underfitting, or high bias, is when the form of our h(x) function maps poorly to the trend of the data. It is usually caused by a function that is too simple or one that uses too few features.


Just Right

If we add an extra feature x2 and fit the hypothesis function θ0 + θ1x + θ2x2, then we obtain a slightly better fit to the data.




High Variance

If we add even more features, it might intuitively seem that the model could perform much better; however, there is also a danger in adding too many features. The figure below shows the result of fitting a 4th-order polynomial, where our h(x) is θ0 + θ1x + θ2x2 + θ3x3 + θ4x4.



We see that even though the fitted curve passes through the data perfectly, we would not expect it to be a very good predictor of, for example, housing prices (y) for different areas (x).


Overfitting, or High Variance, is caused by an h(x) function that fits the available data but does not generalize well (is unable to accurately predict) to new data. It is usually caused by using too many features or by a complicated h(x) function that creates lots of unnecessary curves and angles unrelated to the data.
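
As a rough illustration of under- versus overfitting (a sketch using NumPy's polyfit on a small synthetic dataset; the degrees 1 and 4 mirror the hypotheses discussed above):

```python
import numpy as np

# Synthetic, slightly noisy data that follows a gentle upward trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.1, 2.6, 3.4, 3.5])

# Degree-1 fit (theta0 + theta1 * x) may underfit a curved trend,
# while the degree-4 fit passes through every point but wiggles between them.
straight_line = np.polyfit(x, y, deg=1)
fourth_order = np.polyfit(x, y, deg=4)

new_x = 6.0  # a point outside the training data
print(np.polyval(straight_line, new_x))  # a plausible continuation of the trend
print(np.polyval(fourth_order, new_x))   # the quartic bends sharply away from the trend
```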


Resolving High Variance

There are 2 main options to address the issue of Overfitting:

  • Regularization
    • Keep all features, but reduce the magnitude of parameters θj.
    • It works well when we have a lot of slightly useful features.
  • Reduce number of features
    • Manually select which features to keep.
    • Use a model selection algorithm.

Cost Function for Regularization

If our hypothesis function overfits, as in figure 3 for example, we can reduce the weight that some of the terms in our function carry by increasing their cost.
For example, let's say we wanted to make the following function more quadratic:
θ0 + θ1x + θ2x2 + θ3x3 + θ4x4
We want to eliminate the influence of θ3x3 and θ4x4. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function.


We've added two extra terms at the end to inflate the cost of θ3 and θ4. Now, in order for the cost function to get close to 0, we will have to reduce the values of θ3 and θ4 to near 0. This will in turn greatly reduce the values of θ3x3 and θ4x4 in our hypothesis function.


As a result, we see that the new H(x) looks like a quadratic function but fits the data better than in figure 3, because the extra terms θ3x3 and θ4x4 are now very small.



We can also regularize all of our parameters in a single summation:

J(θ) = (1 / 2m) × [ Σ ( h(xi) - yi )² + λ × Σ θj² ]

where the first sum runs over the m training examples (i = 1 ... m) and the second over the parameters (j = 1 ... n).
The λ, or lambda, is the regularization parameter. It determines how much the costs of our θ parameters are inflated.
Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth the function too much, causing High Bias or Underfitting.

Note:
We penalize the parameters θ1 .... θn, but we do not penalize θ0; it is treated differently.
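
A sketch of this regularized cost in NumPy (X has a leading column of ones; note that theta[0] is skipped in the penalty term, and the value of lambda_ is illustrative):

```python
import numpy as np

def regularized_cost(X, y, theta, lambda_=1.0):
    """J(theta) with an L2 penalty on theta_1 ... theta_n (theta_0 is not penalized)."""
    m = len(y)
    errors = X @ theta - y
    penalty = lambda_ * np.sum(theta[1:] ** 2)      # skip theta[0]
    return (errors @ errors + penalty) / (2 * m)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 2.0])
print(regularized_cost(X, y, theta, lambda_=10.0))  # penalty only: 10 * 2^2 / (2 * 3)
```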



Gradient Descent for Regularization

We will modify our Gradient Descent function to separate out θ0 from the rest of the parameters, because we do not want to penalize θ0:

θ0 := θ0 - α × (1/m) × Σ ( h(xi) - yi ) × xi0

θj := θj - α × [ (1/m) × Σ ( h(xi) - yi ) × xij + (λ/m) × θj ]     (for j = 1 ... n)

The term λ/m times θj performs our regularization.

With some manipulation, our updated Gradient Descent rule for θj can also be represented as:

θj := θj × ( 1 - α × λ/m ) - α × (1/m) × Σ ( h(xi) - yi ) × xij

The first factor, ( 1 - α × λ/m ), will always be less than 1. Intuitively, you can see it as shrinking the value of θj by some small amount on every update, while the second term is exactly the same as in Gradient Descent without regularization.
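
A minimal sketch of this regularized update (assuming NumPy; the shrink factor ( 1 - α × λ/m ) is applied to every parameter except theta[0], matching the note above; the hyperparameters and toy data are illustrative):

```python
import numpy as np

def regularized_gradient_descent(X, y, alpha=0.1, lambda_=1.0, iterations=2000):
    """Gradient descent with L2 regularization; theta[0] is not shrunk."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m           # same gradient as before
        shrink = np.full(n_plus_1, 1 - alpha * lambda_ / m)
        shrink[0] = 1.0                                # theta_0 is treated differently
        theta = theta * shrink - alpha * gradient
    return theta

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])                     # generated by y = 1 + 2x
print(regularized_gradient_descent(X, y, lambda_=0.1)) # close to [1, 2], slightly shrunk
```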