Linear regression
analysis is used to predict the value of a dependent variable based on the value of an independent variable.
- The market price of a house vs. the square footage of a house. Can we predict how much a house will sell for, given its size?
- The tax rate of a country vs. its GDP. Can we predict taxation based on a country’s GDP?
- The amount of chips left in the bag vs. number of chips taken. Can we predict how much longer this bag of chips will last, given how much people at this party have been eating?
The line we fit is:

y = mx + b

- m is the slope
- b is the intercept
- y is a given point on the y-axis, and it corresponds to a given x on the x-axis

Gradient descent finds the best m and b by repeatedly computing the gradients of the mean squared error loss:

b_gradient = -(2/N) * Σ (y_i - (m * x_i + b))
m_gradient = -(2/N) * Σ x_i * (y_i - (m * x_i + b))

- N is the number of points in the dataset
- m is the current gradient (slope) guess
- b is the current intercept guess
Multiple Linear Regression
uses two or more independent variables to predict the values of the dependent variable.
- Predicting the housing prices with many factors.
- Medical research to analyze the relationship between a dependent variable and various independent variables
y = m1*x1 + m2*x2 + m3*x3 + … + mn*xn + b

Here, m1, m2, m3, … mn refer to the coefficients, and b refers to the intercept that you want to find.
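The prediction itself is just a weighted sum of the features plus the intercept; a small sketch (the function name `predict` and the numbers are made up for illustration):

```python
def predict(coefficients, intercept, features):
    """y = m1*x1 + m2*x2 + ... + mn*xn + b"""
    return sum(m * x for m, x in zip(coefficients, features)) + intercept

# Two hypothetical features with made-up coefficients
price = predict([3, 50], 10, [2, 1])  # 3*2 + 50*1 + 10 = 66
```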
Training set
: the data used to fit the model
Test set
: the data partitioned away at the very start of the experiment (to provide an unbiased evaluation of the model)
One technique for evaluating the accuracy of a multiple linear regression model uses the residual e: the difference between the actual value y and the predicted value ŷ. The equation is:
e = y - ŷ
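Applied across a test set, this gives one residual per observation (the helper name `residuals` is mine, not from the notes):

```python
def residuals(actual, predicted):
    """e = y - ŷ for each pair of actual and predicted values."""
    return [y - y_hat for y, y_hat in zip(actual, predicted)]

errors = residuals([3, 5, 8], [2.5, 5, 9])  # [0.5, 0, -1]
```

Residuals near zero mean the model's predictions track the actual values closely.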
Logistic regression
is a supervised machine learning algorithm that predicts the probability, ranging from 0 to 1, of a datapoint belonging to a specific category, or class.
- Disease identification — Is a tumor malignant?
- Real or spam email?
- Customer conversion — Will a customer arriving on a sign-up page enroll in a service?
Once we have this probability, we need to make a decision about what class a datapoint belongs to. This is where the classification threshold
comes in!
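A minimal sketch of turning a probability into a class, assuming the usual sigmoid output and a default 0.5 threshold (the function names are mine):

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Assign class 1 when the probability meets the threshold, else class 0."""
    return 1 if probability >= threshold else 0

label = classify(sigmoid(1.2))        # probability ~0.77, default threshold -> 1
strict = classify(sigmoid(1.2), 0.9)  # a stricter threshold flips the decision -> 0
```

Raising the threshold trades more false negatives for fewer false positives, which matters in cases like disease identification.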
K-Nearest Neighbors (KNN)
is a classification algorithm. The central idea is that data points with similar attributes tend to fall into similar categories.
- Customer Segmentation: Group customers with similar characteristics together.
- Anomaly Detection: KNN can be used to detect anomalies or outliers in a dataset.
- Breast Cancer Classifier
Euclidean Distance
is used to measure the distance between points and find the nearest ones.
- Normalize the data
- Find the k nearest neighbors
- Classify the new point based on those neighbors
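The steps above can be sketched as follows, assuming the data has already been normalized (the function names and toy dataset are illustrative):

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(point, dataset, k):
    """dataset is a list of (features, label) pairs, assumed already normalized."""
    # Find the k nearest neighbors by Euclidean distance
    nearest = sorted(dataset, key=lambda item: euclidean_distance(point, item[0]))[:k]
    # Classify by majority vote among those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

dataset = [((0, 0), "red"), ((0, 1), "red"), ((5, 5), "green"), ((5, 6), "green")]
label = knn_classify((1, 1), dataset, k=3)  # two red neighbors outvote one green
```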
Decision trees
are machine learning models that try to find patterns in the features of data points.
In the example below, the red points represent students who didn't get an A on a test, and the green points represent students who did.
- Credit Risk Assessment: Decision trees can be used to assess the credit risk of individuals or businesses.
- Disease Diagnosis: In the field of healthcare, decision trees can aid in diagnosing diseases.
- Product Recommendation: Decision trees can be utilized in recommendation systems.
Gini impurity
is a measure of the impurity or disorder in a set of data, used in decision tree algorithms. It quantifies how likely a randomly selected element from the set would be classified incorrectly if it were randomly labeled according to the distribution of labels in the set.
Gini = 1 - Σ p(i)²

where p(i) represents the probability of an element belonging to class i.
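The formula translates directly into code (the function name `gini_impurity` is mine):

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class probabilities; 0 means the set is pure."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

pure = gini_impurity(["A", "A", "A"])              # one class only -> 0
mixed = gini_impurity(["A", "A", "no A", "no A"])  # evenly split -> 0.5
```

Decision tree algorithms pick the split that lowers impurity the most.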
Naive Bayes classifier
is a supervised machine learning algorithm that leverages Bayes’ Theorem to make predictions and classifications.
Bayes’ Theorem:

P(A|B) = P(B|A) * P(A) / P(B)

This equation finds the probability of A given B.
- Spam Detection: Naive Bayes can be used to classify emails as either spam or non-spam.
- Sentiment Analysis: Naive Bayes can be employed to determine the sentiment of a given text, such as a product review or a social media post.
- Document Categorization: Naive Bayes can be used to automatically categorize documents into predefined categories.
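Taking the spam example, Bayes' Theorem can be applied as below; the probabilities are made-up numbers purely for illustration:

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)"""
    return p_b_given_a * p_a / p_b

# Hypothetical figures: 20% of mail is spam, the word "offer" appears in
# 50% of spam messages and in 12.5% of all messages overall.
p_spam_given_offer = bayes(0.5, 0.2, 0.125)  # 0.8
```

The "naive" part of the classifier is that, with many words, it multiplies each word's probability as if the words were independent.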
Support Vector Machine
makes classifications by defining a decision boundary and then seeing what side of the boundary an unclassified point falls on. Decision boundaries are easiest to wrap your head around when the data has two features.
Binary Classification:
Multi-Classification:
- Text Classification: SVMs are commonly used for text classification tasks, such as sentiment analysis, spam detection, or topic categorization.
- Image Classification: SVMs can be used for image classification tasks, such as recognizing objects or distinguishing between different types of images.
- Handwritten Digit Recognition: SVMs have been successfully applied to recognize handwritten digits.
Support vectors
are the points in the training set closest to the decision boundary.
Margin
is the distance between a support vector and the decision boundary.
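For a linear boundary with two features, "what side a point falls on" and "how far it is from the boundary" are both given by one signed-distance formula (function names are mine, not from the notes):

```python
import math

def distance_to_boundary(point, w, b):
    """Signed distance from a point to the hyperplane w·x + b = 0.
    The sign tells you which side of the boundary the point is on."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return score / math.sqrt(sum(wi ** 2 for wi in w))

# Boundary is the vertical line x = 2: w = (1, 0), b = -2
positive_side = distance_to_boundary((4, 0), (1, 0), -2)  # +2.0
negative_side = distance_to_boundary((1, 3), (1, 0), -2)  # -1.0
```

A support vector is simply the training point whose absolute distance here is smallest; the SVM chooses w and b to make that margin as large as possible.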
Random forest
is an ensemble machine learning technique. A random forest
contains many decision trees that all work together to classify new points.
- Feature Importance: Random Forest can provide insights into the importance of different features in a dataset.
- Missing Value Imputation: Random Forest can be used for imputing missing values in a dataset.
- Recommender Systems: Random Forest can be used in recommender systems to suggest items or content to users based on their preferences.
Random forests
create different trees using a process known as bagging
, which is short for bootstrapped aggregating
.
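Both halves of bagging can be sketched briefly: the bootstrap (sampling the training data with replacement for each tree) and the aggregation (majority vote). The stand-in "trees" below are trivial functions, purely for illustration:

```python
import random
from collections import Counter

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) points with replacement -- the 'bootstrap' in bagging."""
    return [rng.choice(dataset) for _ in dataset]

def bagged_predict(models, point):
    """Aggregate the trees' answers by majority vote."""
    votes = Counter(model(point) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)  # some points repeat, some are left out
# Stand-in "trees": two vote A, one votes B, so the forest says A
models = [lambda p: "A", lambda p: "A", lambda p: "B"]
forest_answer = bagged_predict(models, None)
```

Because each tree sees a different bootstrap sample (and, in practice, a random subset of features), the trees disagree in useful ways and the vote is more robust than any single tree.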
Boosting
is a sequential learning technique where each of the base models builds off of the previous model. Each subsequent model aims to improve the performance of the final ensemble model by attempting to fix the errors in the previous stage.
There are two important decisions that need to be made to perform boosted ensembling:
- Sequential Fitting Method
- Aggregation Method
Adaptive Boosting
(or AdaBoost
) is a sequential ensembling method that can be used for both classification and regression. It can use any base machine learning model, though it is most commonly used with decision trees.
For AdaBoost
, the Sequential Fitting Method
is accomplished by updating the weight attached to each of the training dataset observations as we proceed from one base model to the next. The Aggregation Method
is a weighted sum of those base models where the model weight is dependent on the error of that particular estimator.
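The weight update can be sketched as follows, using the standard AdaBoost formulas (α = ½·ln((1−error)/error); function and variable names are mine):

```python
import math

def adaboost_reweight(weights, correct):
    """Raise the weights of misclassified observations, lower the rest,
    then renormalize. Returns (new_weights, model_weight_alpha)."""
    error = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

# Four equally weighted observations; the base model got the last one wrong
new_weights, alpha = adaboost_reweight([0.25] * 4, [True, True, True, False])
```

After the update, the misclassified observation carries half of the total weight, so the next base model is pushed to get it right; alpha is then used as that model's weight in the final weighted sum.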
Gradient Boosting
is a sequential ensembling method that can be used for both classification and regression. It can use any base machine learning model, though it is most commonly used with decision trees, known as Gradient Boosted Trees.
For Gradient Boost
, the Sequential Fitting Method
is accomplished by fitting a base model to the negative gradient of the error in the previous stage. The Aggregation Method
is a weighted sum of those base models where the model weight is constant.
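For squared error, the negative gradient is just the residual y − ŷ, so each stage fits a base model to the current residuals. A sketch with the weakest possible base model, one that always predicts the mean (all names here are mine, not from the notes):

```python
def fit_gradient_boost(xs, ys, n_stages, learning_rate, fit_base):
    """Each stage fits a base model to the residuals, which are the
    negative gradient of squared error with respect to the predictions."""
    models = []
    predictions = [0.0] * len(ys)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, predictions)]
        model = fit_base(xs, residuals)
        models.append(model)
        predictions = [p + learning_rate * model(x) for p, x in zip(predictions, xs)]
    return models

def boosted_predict(models, x, learning_rate):
    # Constant model weight: every stage contributes learning_rate * its output
    return sum(learning_rate * model(x) for model in models)

def fit_mean(xs, targets):
    """Toy base model: always predict the mean of its targets."""
    mean = sum(targets) / len(targets)
    return lambda x: mean

models = fit_gradient_boost([1, 2, 3], [1, 2, 3], n_stages=60,
                            learning_rate=0.5, fit_base=fit_mean)
prediction = boosted_predict(models, 2, 0.5)  # converges toward the target mean, 2
```

In practice the base models are shallow decision trees rather than a constant, which is what makes the ensemble Gradient Boosted Trees.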