
Module 2: Classification

Classification:

The separation of data into two or more categories, or (a point’s classification) the category a data point is put into.

Examples:

  • Bank wants to differentiate between those who will repay loans and those who will default
  • Security agency wants to differentiate between those who are regular visitors and those who are potential terrorists
  • Automated email filter wants to differentiate between genuine and spam email
  • A legal document system wants to differentiate between documents that are relevant or irrelevant to a certain case.
  • Surgeon wants to know whether an organ is safe to transplant or carries an infectious disease
  • Political consultant wants to differentiate between supportive, opposed, and undecided voters
  • Paleontologist wants to differentiate between the many species of dinosaurs

Classification questions require data to answer.

  • For loan applicants, companies may collect data on: Income, Credit History, Age, Family Size, Assets, Liabilities, etc.

[Figure: loan-applicant data separated by a classifying line]

  • The line that separates the data is a Classifier

Classifier

A boundary that separates the data into two or more categories. Also (more generally) an algorithm that performs classification.

How do we know what line/classifier to draw?

  • We want to find a line that is as far away from the two different types of data as possible. This would result in a lower likelihood of misclassification.
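A minimal sketch of this idea, assuming scikit-learn and made-up toy points (not from the notes): a linear SVM picks exactly this farthest-away line, and the gap between the two classes works out to 2/||w||.

```python
# A sketch: a linear SVM chooses the separating line farthest from both
# classes. With a (nearly) hard margin, the gap between classes is 2/||w||.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],      # class 0 cluster (illustrative)
              [5, 5], [6, 5], [5, 6]])     # class 1 cluster (illustrative)
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin
print(2 / np.linalg.norm(clf.coef_))         # width of the margin
```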

What if there's no line that can separate the data perfectly?

  • In this case, we use a soft classifier, as opposed to a hard classifier that separates the data perfectly.
  • Pre-tilt: fewer mistakes but more near-mistakes. Overall: higher total mistakes.
  • Post-tilt: more mistakes but fewer near-mistakes. Overall: lower total mistakes.
[Figures: classifier before and after tilting]

What if the cost of misclassification is different for different types of data? I.e., what if giving out a bad loan has a higher cost than mistakenly turning away a good applicant, so that accepting red as green (Type 1: false positive) costs more than rejecting green as red (Type 2: false negative)?

[Figure: pregnancy-test illustration of false positives and false negatives]

  • Hence, if giving out a bad loan is twice as costly as mistakenly turning away a good applicant, we will shift the classifier/line away from the red data and closer to the green data.

[Figure: classifier shifted toward the green data]
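A minimal sketch of this shift, assuming scikit-learn and made-up data: doubling the weight on the costly class (here class 0, the "red" defaulters) pushes the fitted boundary away from it.

```python
# A sketch: penalizing errors on class 0 twice as much shifts the linear
# boundary away from class 0, mirroring "a bad loan costs twice as much".
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)  # 0 = red, 1 = green

plain = SVC(kernel="linear").fit(X, y)
costly = SVC(kernel="linear", class_weight={0: 2.0}).fit(X, y)  # red errors cost 2x

# The weighted model's boundary sits farther from class 0 than the plain one.
print(plain.coef_, plain.intercept_)
print(costly.coef_, costly.intercept_)
```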

Data Terminology

Ex. 1: Loan Applicants · Ex. 2: Daily Car Sales
[Figures: example data tables for loan applicants and daily car sales]
  • Row - Data point refers to each row. E.g., if we're looking at loan applications, then each applicant is a data point; if we are looking at future sales, then each day is a data point.


  • Column
    • Attribute, feature, covariate, predictor - a property measured for each data point
    • Response/Outcome - the answer for each data point. (For loan applicants, it's the observation of whether they repaid the full loan or not, or the fraction of the loan they repaid from zero to 100%. For sales, the response could be the number of sales recorded on each day.)

Structured data:

  • Quantitative data

    • Numbers with a meaning

      • e.g.: age, sales, temperature, income

      • Higher means more, lower means less

  • Categorical data

    • Numbers without meaning

      • e.g. zip codes

      • higher/lower is not meaningful

    • Non-numeric

      • Ex: Hair color: Black, brown, red, blonde, gray
  • Binary data (subset of categorical data)

    • Can take only two values
      • E.g. M/F, Repaid in full (Y/N), ON/OFF
      • In some cases we can treat this as a quantitative measure
  • Unrelated data

    • No relationship between data points
    • e.g. Each loan applicant is a different person with no relationship to most other loan applicants.
  • Time series data

    • Same data recorded over time
    • Often recorded at equal intervals
    • E.g. Daily sales, stock prices, child's height on each birthday
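A minimal sketch of these data types in code, assuming pandas; the column names and values are made up for illustration.

```python
# A sketch of the data types above, using made-up loan-applicant columns.
import pandas as pd

applicants = pd.DataFrame({
    "income": [52000, 71500, 33200],                           # quantitative: higher means more
    "zip_code": pd.Categorical(["30318", "30332", "30309"]),   # categorical: numbers without meaning
    "hair_color": pd.Categorical(["black", "red", "gray"]),    # categorical: non-numeric
    "repaid_in_full": [1, 0, 1],                               # binary: can be treated as 0/1 quantitative
})
print(applicants.dtypes)

# Time series: the same quantity recorded at (often equal) intervals.
daily_sales = pd.Series([12, 15, 9],
                        index=pd.date_range("2019-11-18", periods=3, freq="D"))
print(daily_sales)
```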

Support Vector Machines

Maximal Margin Classifier

[Figure: maximal margin classifier with the margin touching the nearest points]

What's the issue with Maximal Margin Classifiers?

  • High Variance from Outliers - in the event that there is an outlier, the Maximal Margin Classifier may be squeezed up against one of the classes, and hence may perform poorly when given new data (test data), resulting in high variance

[Figure: maximal margin classifier distorted by an outlier]

Maximal Margin vs. Soft Margin:

  • Maximal Margin: lower bias, higher variance
  • Soft Margin: higher bias, lower variance

[Figures: maximal margin vs. soft margin classifiers]

How do we test whether one soft-margin is better than another?

  • The answer is simple: we use Cross Validation to determine how many misclassifications and observations to allow inside the Soft Margin to get the best classification (minimum error), as sketched below.
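A minimal sketch of that cross-validation, assuming scikit-learn and synthetic data; note that in scikit-learn the margin's tolerance for violations is controlled by the penalty C rather than a direct count of allowed misclassifications.

```python
# A sketch: pick the soft-margin penalty C by 5-fold cross-validation.
# Smaller C allows more margin violations (softer); larger C allows fewer.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # C with the fewest CV errors
```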

Soft Margin Classifier / Support Vector Classifier - a model used to classify observations using a Soft Margin

The name Support Vector Classifier comes from the fact that the observations on the edge and within the Soft Margin are called Support Vectors.

[Figures: support vectors on the edge of and inside the Soft Margin]
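A minimal sketch, assuming scikit-learn and made-up data, showing that a fitted soft-margin classifier exposes these Support Vectors directly:

```python
# A sketch: the support vectors are the points on or inside the soft margin;
# only they determine where the boundary lies.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(clf.support_vectors_))   # how many points support the margin
print(clf.support_vectors_[:3])    # a few of them
```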

What happens when the data is 3-Dimensional?

  • Support Vector Classifier forms a plane, instead of a line

[Figure: separating plane in 3-D]

Mathematical jargon:

  • If the data are 1-Dimensional, the Support Vector Classifier is a single point on a 1-Dimensional number line. The point is called a "flat affine 0-Dimensional subspace"

[Figure: point classifier on a 1-D number line]

  • If the data are 2-Dimensional, the Support Vector Classifier is a line in a 2-Dimensional space. The line is called a "flat affine 1-Dimensional subspace"

[Figure: line classifier in 2-D space]

  • If the data are 3-Dimensional, the Support Vector Classifier is a 2-Dimensional plane in a 3-Dimensional space. The plane is called a "flat affine 2-Dimensional subspace"

[Figure: plane classifier in 3-D space]

  • And when the data are in 4 or more Dimensions, the Support Vector Classifier is a hyperplane. psst! In mathematical jargon, a hyperplane is a "flat affine subspace".
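In symbols (a standard textbook definition, not from these notes): a hyperplane in $n$ dimensions is the flat set

$$\{\, x \in \mathbb{R}^n : w \cdot x + b = 0 \,\}$$

where $w$ is a normal vector and $b$ an offset; for $n = 2$ this reduces to the familiar line $w_1 x_1 + w_2 x_2 + b = 0$.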

However, Support Vector Classifiers don't perform well when one class is grouped in the middle of the other.

[Figure: 1-D data with one class nested in the middle of the other]

Support Vector Machines

  1. Start with data in a relatively low dimension
  2. Move the data into a higher dimension
  3. Find a Support Vector Classifier that separates the higher-dimensional data into two groups.


In order to make the mathematics possible, Support Vector Machines use something called Kernel Functions to systematically find Support Vector Classifiers in higher dimensions.

  • In essence, the algorithm iteratively increases d, the dimensionality of the data, and hence of the Support Vector Classifier.
  • We then use cross-validation to find an optimal value for d, as sketched below.
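A minimal sketch, assuming scikit-learn's polynomial kernel as the Kernel Function and made-up 1-D data with one class grouped in the middle (as in the earlier figure):

```python
# A sketch: a polynomial kernel implicitly lifts 1-D data into d dimensions;
# cross-validation picks the degree d that classifies best.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(300, 1))        # 1-D data (illustrative)
y = (np.abs(x[:, 0]) < 1.5).astype(int)      # one class grouped in the middle

grid = GridSearchCV(SVC(kernel="poly", coef0=1),
                    {"degree": [1, 2, 3, 4]}, cv=5)
grid.fit(x, y)
print(grid.best_params_)                     # the cross-validated degree d
```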

Kernel Trick

  • The Kernel Trick reduces the amount of computation required for Support Vector Machines by avoiding the math that transforms the data from low to high dimensions...

  • ...and it makes calculating relationships in the infinite dimensions used by the Radial Kernel possible.
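A small worked sketch (the points and the degree-2 polynomial kernel are chosen for illustration): the kernel value equals the high-dimensional dot product without the transform ever being computed.

```python
# A sketch of the Kernel Trick: (a*b + 1)^2 equals the dot product of the
# explicitly transformed points, so the transform never has to be computed.
import numpy as np

def phi(x):
    # Explicit degree-2 mapping of a scalar: (sqrt(2)*x, x**2, 1)
    return np.array([np.sqrt(2) * x, x ** 2, 1.0])

a, b = 2.0, 3.0
explicit = phi(a) @ phi(b)      # dot product in the higher dimension
kernel = (a * b + 1) ** 2       # same number, computed in the original dimension
print(explicit, kernel)         # both print 49.0
```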


Bias/Variance Trade Off

[Figure: true relationship (blue curve) vs. straight regression line (red)]

  • The blue curved line represents the true relationship of the data points. The red line represents a linear regression. As you can see, no matter how you slice-or-dice the regression, you will never be able to capture the true relationship, hence it has a large amount of bias.
  • The inability of a machine learning method (like linear regression) to capture the true relationship is called bias.

[Figure: squiggly line hugging the training data]

  • Consider the Squiggly Line: it hugs the training set along the arc of the true relationship. This means that it contains little bias.


We compare the Straight Line with the Squiggly Line using a sums-of-squares formula that shows how 'biased' a model is.

  • Compared to the Straight Line, the Squiggly Line has a sum of squares of 0; hence it is lower, and supposedly the 'better' model with less bias.

However, that was with the training set of the data, not the test set.

[Figure: straight and squiggly fits evaluated on the test set]

Using the test set of the data, we can see that even though the Squiggly Line won the training-set contest, it lost the test-set contest.


In Machine Learning, the difference in fit between data sets is called Variance.


  • The Squiggly Line has low bias, since it is flexible and can adapt to the curve in the relationship between weight and height, but it has high variability, because it results in vastly different Sums of Squares for different datasets. Since this line fits the training set well but not the test set, we say that the line is overfit (see the sketch after this list).
  • In contrast, the Straight Line has relatively high bias, since it cannot capture the curve in the relationship between weight and height, but it has relatively low variance, because the Sums of Squares are very similar for different datasets.
  • This shows the advantages and disadvantages of using a complicated (Squiggly) and a simple (Straight) model.
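A minimal sketch of this contest, assuming synthetic weight/height data: fit a straight line and a high-degree "squiggly" polynomial, then compare their sums of squared errors on the training and test sets.

```python
# A sketch: the squiggly (high-degree) fit wins on training data but loses
# on test data (low bias, high variance), while the straight line underfits.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
weight = rng.uniform(1, 10, 40)
height = np.sqrt(weight) + rng.normal(0, 0.1, 40)    # curved true relationship
w_tr, w_te, h_tr, h_te = train_test_split(weight, height, random_state=1)

for degree in (1, 10):                               # straight vs. squiggly
    coeffs = np.polyfit(w_tr, h_tr, degree)
    sse_train = np.sum((h_tr - np.polyval(coeffs, w_tr)) ** 2)
    sse_test = np.sum((h_te - np.polyval(coeffs, w_te)) ** 2)
    print(degree, sse_train, sse_test)
```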

How do we find the best model then?

  • Three commonly used methods for finding the sweet spot between simple and complicated models are: regularization, boosting and bagging.
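As one example of the first method, a minimal sketch (scikit-learn and synthetic data assumed) of regularization: ridge regression shrinks coefficients, trading a little bias for lower variance, with the penalty strength chosen by cross-validation.

```python
# A sketch: ridge regression penalizes large coefficients; the penalty alpha
# is chosen by cross-validation, landing between underfitting and overfitting.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)  # only 2 features matter

model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
print(model.alpha_)      # chosen penalty strength
print(model.coef_)       # shrunken coefficients
```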