
baranylcn/SolvingOutliersProblem


OUTLIERS

An outlier is a data point that lies far from the other values in a dataset, for example far from the average of the group. Outliers may also be exceptional cases that stand apart from individual samples of a population. More generally, an outlier is an individual that is markedly different from the norm in some respect.

Titanic Dataset

image

This is the boxplot of the "age" variable:

image

As the boxplot shows, there are outliers in this variable.

How to find outlier thresholds?

Formula:

image

image

The formula is wrapped in a function and applied to the dataset.

Now, we can grab the outliers and take a look.
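The formula image is not reproduced in this text, so the sketch below assumes the common IQR rule: with IQR = Q3 - Q1, the lower threshold is Q1 - 1.5 * IQR and the upper threshold is Q3 + 1.5 * IQR. The function names, the factor `k`, and the sample ages are illustrative assumptions, not necessarily what SolvingOutliersProblem.py uses.

```python
from statistics import quantiles


def outlier_thresholds(values, k=1.5):
    """Return (low, up) limits from the IQR rule (assumed formula)."""
    # method="inclusive" interpolates linearly between sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def grab_outliers(values, k=1.5):
    """Return the values that fall outside the thresholds."""
    low, up = outlier_thresholds(values, k)
    return [v for v in values if v < low or v > up]


# Hypothetical ages, not taken from the Titanic dataset
ages = [22, 38, 26, 35, 54, 2, 27, 14, 80, 31]
low, up = outlier_thresholds(ages)
print("thresholds:", low, up)
print("outliers:", grab_outliers(ages))
```

A larger `k` widens the limits and flags fewer points. Some workflows use quantiles other than the quartiles for the same reason.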

How to Solve the Outlier Problem?

-Trimming (Deletion)

Outliers can be deleted. We have implemented this approach, but it is generally not recommended because it often results in data loss.
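A minimal sketch of trimming, assuming the lower and upper thresholds have already been computed. The function name, the sample ages, and the threshold values are hypothetical.

```python
def remove_outliers(values, low, up):
    """Keep only the values inside [low, up]."""
    return [v for v in values if low <= v <= up]


# Hypothetical ages and thresholds, for illustration only
ages = [22, 38, 26, 35, 54, 2, 27, 14, 80, 31]
low, up = 1.625, 58.625

trimmed = remove_outliers(ages, low, up)
print(len(ages) - len(trimmed), "value(s) dropped")
```

On a full table, each dropped value takes its whole row with it, which is where the data loss comes from.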

-Imputation

As with missing data, outliers can be replaced with assigned values instead of being deleted. This avoids the data loss caused by deletion. The assigned value can be a representative statistic such as the mean, median, or mode, or any fixed value.
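As an illustration, the sketch below replaces every value outside the thresholds with the median of the remaining values. The choice of the median, the function name, and the sample data are assumptions made for the example.

```python
from statistics import median


def impute_outliers(values, low, up):
    """Replace values outside [low, up] with the median of the inliers."""
    inliers = [v for v in values if low <= v <= up]
    fill = median(inliers)
    return [v if low <= v <= up else fill for v in values]


# Hypothetical ages and thresholds, for illustration only
ages = [22, 38, 26, 35, 54, 2, 27, 14, 80, 31]
low, up = 1.625, 58.625

imputed = impute_outliers(ages, low, up)
print(imputed)
```

The median is computed from the inliers only, so the outliers being replaced do not influence the fill value.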

-Data suppression (re-assignment with thresholds)

Here, data suppression means re-assigning outliers with the thresholds: values below the lower threshold are set to the lower threshold, and values above the upper threshold are set to the upper threshold. No observations are lost, and the extreme values no longer dominate the variable.
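Re-assignment with thresholds can be sketched as a simple clamp. The function name, the sample ages, and the threshold values below are hypothetical.

```python
def replace_with_thresholds(values, low, up):
    """Clamp every value into [low, up]."""
    return [min(max(v, low), up) for v in values]


# Hypothetical ages and thresholds, for illustration only
ages = [22, 38, 26, 35, 54, 2, 27, 14, 80, 31]
low, up = 1.625, 58.625

capped = replace_with_thresholds(ages, low, up)
print(capped)
```

Unlike trimming, the result has the same length as the input, and only the out-of-range values change.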

Multivariate Outlier Analysis: Local Outlier Factor (LOF)

When we looked at the variables separately, we detected outliers. But if we look at the variables together, can we find outlying observations that no single variable reveals? Consider, for example, a person who has been married 3 times at the age of 18.

Being 18 years old is not unusual on its own, and neither is having been married 3 times, but being 18 years old and married 3 times can be an outlier.

"LocalOutlierFactor" was applied to the dataset. A score close to -1 indicates that the observation is an inlier. The further the score falls below -1, the more likely the observation is an outlier.
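To show where such scores come from, here is a small from-scratch sketch of the Local Outlier Factor computation. It is not the scikit-learn implementation, and it simplifies ties by keeping exactly `k` neighbours. It returns the plain LOF value, where about 1 means inlier. scikit-learn's `negative_outlier_factor_` stores the same quantity with the opposite sign, which is why inliers sit near -1 there. The (age, number of marriages) points are hypothetical.

```python
from math import dist


def local_outlier_factor(points, k=2):
    """Return one LOF score per point (about 1 = inlier, larger = outlier)."""
    n = len(points)
    # k nearest neighbours of each point as sorted (distance, index) pairs
    neighbours = []
    for i, p in enumerate(points):
        d = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append(d[:k])
    # distance from each point to its k-th nearest neighbour
    k_dist = [nb[-1][0] for nb in neighbours]

    def lrd(i):
        # local reachability density: inverse of the mean reachability distance
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]
    # LOF: mean density of the neighbours divided by the point's own density
    return [
        sum(density[j] for _, j in neighbours[i]) / (k * density[i])
        for i in range(n)
    ]


# Hypothetical (age, number of marriages) observations
people = [
    (18, 0), (19, 0), (20, 0), (19, 1),   # young, few marriages
    (50, 3), (51, 3), (52, 3), (51, 2),   # older, several marriages
    (18, 3),                              # ordinary values, unusual combination
]
scores = local_outlier_factor(people, k=2)
for person, score in zip(people, scores):
    print(person, round(score, 2))
```

In this toy data neither the age 18 nor the value 3 is rare on its own, yet `(18, 3)` receives the highest score because it sits in a much sparser neighbourhood than its nearest neighbours. The features are left unscaled here to keep the sketch short.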

Elbow Method

image

The sorted scores were plotted, and the point where the slope changes most sharply was identified on the graph. The score at that point was chosen as the threshold.
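Reading the elbow off a plot is a visual judgement. As a rough automated stand-in, the sketch below sorts the scores and picks the point with the largest change in slope between consecutive points. The function name and the scores are hypothetical, and this heuristic is an assumption, not necessarily how the threshold was chosen in SolvingOutliersProblem.py.

```python
def elbow_index(sorted_scores):
    """Index of the point where the slope between neighbours changes most."""
    slopes = [b - a for a, b in zip(sorted_scores, sorted_scores[1:])]
    changes = [abs(s2 - s1) for s1, s2 in zip(slopes, slopes[1:])]
    # changes[i] describes the bend at sorted_scores[i + 1]
    return changes.index(max(changes)) + 1


# Hypothetical negative outlier factors (closer to -1 means inlier)
scores = [-1.2, -4.2, -1.0, -2.4, -1.15, -3.1, -1.05, -1.3, -1.1]

ordered = sorted(scores)
threshold = ordered[elbow_index(ordered)]
outliers = [s for s in ordered if s < threshold]
print("threshold:", threshold)
print("outlier scores:", outliers)
```

Observations whose score is below the threshold are treated as outliers. It is still worth checking the choice against the plot.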

Individual values may not look like outliers when each variable is examined on its own, but we found outliers based on the relationship between the variables.

Note: When working with tree-based methods, these values should not be changed, since such models are relatively insensitive to outliers.

You can find the mentioned titles and explanations in detail in SolvingOutliersProblem.py
