
DataPreprocessing-Framework-ML

Also called: Wrangling (munging | transformation | manipulation) | Cleaning | Pre-processing | Feature Engineering

Data preparation is the most important part of a machine learning project, yet it is the least discussed and the most time consuming

  • Machine learning algorithms have expectations regarding
    • data types
    • scale
    • probability distribution and
    • relationships between input variables; the data must be changed to meet these expectations
  • Data preprocessing is challenging because each dataset is unique and different:
    • Datasets differ in number of variables (tens, hundreds, thousands, or more),
    • Types of variables (numeric, nominal, ordinal, boolean),
    • Scale of variables,
    • Drift in values over time, etc.

Projects differ, but the steps on the path to a good or even best result are generally the same from project to project

  • Sometimes referred to as the applied machine learning process or the data science process

Applied ML Process

Step 1: Define Problem
1.1. Gather data from problem domain
1.2. Discuss project with subject matter experts
1.3. Select those variables to be used as inputs and outputs for a predictive model

1.4. Review data that has been collected
1.5. Summarize collected data using statistical methods
1.6. Visualize collected data using plots and charts

Step 2: Data Preparation [Tasks]
Transform the collected raw data to make it more suitable for modeling
2.1. Data Cleaning
2.2. Feature Selection
2.3. Data Transformation
2.4. Feature Engineering
2.5. Dimensionality Reduction

Step 3: Evaluate Models
3.1. Select a performance metric for model evaluation
3.2. Select model evaluation procedure
3.3. Select algorithms to Evaluate
3.4. Establish a baseline to compare other models against
3.5. Use a resampling technique to split the data
* k-fold cross-validation is often used
3.6. Get the most out of well-performing models by tuning
* Hyperparameters
* Combining predictive models into ensembles

Step 4: Finalize Model

  • Select the best performing model
  • Save it and use it to make predictions in production

Data Preparation Techniques with Common Tasks

2.1. Data Cleaning:
  • Identifying and correcting mistakes or errors in data | Data can be mistyped, corrupted, duplicated | Messy, noisy, corrupt, or erroneous values must be addressed
  • Might involve
    • Removing a row or a column
    • Replacing observations with new values
2.1.1. Common Data Cleaning Operations (a code sketch follows this list):
  • Using statistics to
    • Define Normal data and
    • Identify Outliers
  • Identifying & Removing columns which have
    • Same value or
    • No variance
  • Identifying and Removing
    • Duplicate rows of data
  • Marking empty values as missing
  • Imputing missing values using
    • Statistics or
    • A learned model
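
A minimal sketch of these operations, using pandas and scikit-learn; the toy DataFrame and its column names are hypothetical:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "a": [1.0, 1.0, None, 4.0, 100.0],  # contains a missing value
    "b": [7, 7, 7, 7, 7],               # single value -> no variance
    "c": [0.1, 0.1, 0.3, 0.4, 0.5],
})

# Identify and remove duplicate rows
df = df.drop_duplicates()

# Identify and remove columns that hold a single value (no variance)
single_valued = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(columns=single_valued)

# Use statistics to define normal data and identify outliers (3-sigma rule)
mean, std = df["a"].mean(), df["a"].std()
outlier_mask = (df["a"] - mean).abs() > 3 * std

# Impute missing values using a statistic (here, the column median)
df[["a"]] = SimpleImputer(strategy="median").fit_transform(df[["a"]])
```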

2.2. Feature Selection:

Identifying those input variables that are most relevant to the target variable
Feature selection techniques are generally grouped into Supervised (having targets) and Unsupervised (not having targets)

  • Supervised techniques are further divided into
    • Intrinsic : Automatically select features as part of model fitting [Trees]
    • Wrapper : Explicitly choose the features which result in the best performing model [Recursive Feature Elimination]
    • Filter : Score each input feature and allow a subset to be selected [Feature Importance, Statistics]
      • Statistical methods such as correlation are popular for scoring input features

2.2.1. Common Feature Selection Use Cases (a code sketch follows this list):
  1. Categorical Inputs for a Classification Target Variable
  2. Numerical Inputs for a Classification Target Variable
  3. Numerical Inputs for a Regression Target Variable
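
A minimal sketch of the filter approach for these use cases, assuming scikit-learn's SelectKBest on synthetic data; the choice of scoring function per case follows common practice:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif, f_regression

X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# Case 2: numerical inputs, classification target -> ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (100, 5)

# Case 1: categorical inputs, classification target -> chi-squared test
#   SelectKBest(score_func=chi2, k=5)        # inputs must be non-negative codes
# Case 3: numerical inputs, regression target -> correlation-based F-test
#   SelectKBest(score_func=f_regression, k=5)
```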

2.3. Data Transformation:
Changing the scale, type, or distribution of variables

  1. Numeric Data Type: Number values
  • Integer : Integers with no fractional part
  • Float : Floating point values
  2. Categorical Data Type: Label values
  • Ordinal : Labels with a rank ordering
  • Nominal : Labels with no rank ordering
  • Boolean : Values True and False

NOTE: An important consideration with data transforms is that operations are generally performed separately on each variable
Discretization Transform: Encode (convert) a numeric variable as an ordinal variable
Ordinal Transform: Encode a categorical variable as an integer variable
One Hot Transform: Encode a categorical variable as binary (boolean) variables, required for most classification tasks
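
A minimal sketch of these three encoders with scikit-learn (the sparse_output argument assumes scikit-learn >= 1.2; the toy arrays are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
heights = np.array([[1.2], [3.4], [2.2], [5.1]])

# Ordinal transform: categorical labels -> integers
print(OrdinalEncoder().fit_transform(colors))

# One hot transform: categorical labels -> binary (boolean) variables
print(OneHotEncoder(sparse_output=False).fit_transform(colors))

# Discretization transform: numeric variable -> ordinal bins
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(binner.fit_transform(heights))
```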

If the data has a Gaussian probability distribution, it may be more useful to shift the data to a standard Gaussian with a mean of zero and a standard deviation of one
Normalization Transform: Scale a variable to the range 0 to 1
Standardization Transform: Scale a variable to a standard Gaussian
Power Transform: Change the distribution of a variable so that a nearly Gaussian, but skewed or shifted, distribution is made more Gaussian
Quantile Transform: Force a probability distribution, such as Uniform or Gaussian, onto a variable with an unusual natural distribution
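
A minimal sketch of these four transforms with scikit-learn, applied to a synthetic skewed column:

```python
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, PowerTransformer,
                                   QuantileTransformer, StandardScaler)

rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=(100, 1))  # skewed, non-Gaussian column

X_norm = MinMaxScaler().fit_transform(X)     # normalization: range 0 to 1
X_std = StandardScaler().fit_transform(X)    # standardization: mean 0, std 1
X_pow = PowerTransformer().fit_transform(X)  # make the skewed data more Gaussian
X_qt = QuantileTransformer(output_distribution="normal",
                           n_quantiles=100).fit_transform(X)  # force a Gaussian
```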

2.4. Feature Engineering:
Deriving new variables from available data
Techniques to reuse (sketched after this list):

  1. Adding a boolean flag variable for some state
  2. Adding a group or global summary statistic, such as a mean
  3. Adding new variables for each component of a compound variable, such as a date-time
  4. Polynomial Transform: Create copies of numerical input variables that are raised to a power
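
A minimal sketch of these techniques with pandas and scikit-learn; the DataFrame, column names, and the 100-unit threshold are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "amount": [10.0, 250.0, 40.0, 300.0],
    "store": ["A", "A", "B", "B"],
    "timestamp": pd.to_datetime(
        ["2021-01-04 09:00", "2021-01-05 17:30",
         "2021-01-04 11:15", "2021-01-06 20:45"]),
})

# 1. Boolean flag variable for some state
df["is_large_order"] = df["amount"] > 100

# 2. Group summary statistic, such as a per-group mean
df["store_mean_amount"] = df.groupby("store")["amount"].transform("mean")

# 3. New variables for each component of a compound variable (date-time)
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["hour"] = df["timestamp"].dt.hour

# 4. Polynomial transform: raise numeric inputs to a power
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["amount"]])  # columns: amount, amount^2
```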

2.5. Dimensionality Reduction:
Creating a compact projection of the data into a lower-dimensional space that still preserves the most important properties of the original data
A common approach to dimensionality reduction is to use a Matrix Factorization Technique, for example (sketched after this list):

  1. Principal Component Analysis (PCA)
  2. Singular Value Decomposition (SVD)
  Model-based methods:
  3. Linear Discriminant Analysis (LDA)
  4. Autoencoders
  These techniques remove linear dependencies between input variables
  Sometimes Manifold Learning Algorithms can also be used:
  5. Self-Organizing Maps (SOM)
  6. t-Distributed Stochastic Neighbor Embedding (t-SNE)
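
A minimal sketch using PCA, the most common matrix factorization technique, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=20, random_state=7)

# Project the 20 input variables onto 5 principal components that
# preserve most of the variance in the original data
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 5)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```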


Data Processing without Leakage

A naive approach to preparing data is to apply a transform to the entire dataset before evaluating the performance of the model
This results in a problem referred to as data leakage, where knowledge of the hold-out test set leaks into the dataset used to train the model
Careful application of data preprocessing is required depending on the model evaluation scheme used, such as

  1. train-test-split
  2. k-fold cross validation
  • Data preparation must be fit on the training set only in order to avoid data leakage

Problem with Naive Data Processing
This could happen when test data is leaked into the training set, or when data from the future is leaked to the past
For example

  • Consider the case where we want to normalize data, that is, scale input variables to the range 0-1
  • Normalizing input variables requires that we first calculate the minimum and maximum values of each variable, before using these values to scale the variables
  • The dataset is then split into train and test datasets, but the examples in the training dataset know something about the data in the test dataset; they have been scaled by the global minimum and maximum values, so they know more about the global distribution of the variable than they should

For data preprocessing with train_test_split, preparation must be fit on the training dataset only (a code sketch follows these steps)

  1. Split Data
  2. Fit Data Preparation on Training Dataset
  3. Apply Data Preparation to Train and Test Datasets
  4. Evaluate Models
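
A minimal sketch of these four steps on synthetic data; MinMaxScaler and LogisticRegression are stand-ins for whatever preparation and model a project uses:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, random_state=1)

# 1. Split the data first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# 2. Fit data preparation on the training dataset only
scaler = MinMaxScaler()
scaler.fit(X_train)

# 3. Apply data preparation to the train and test datasets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# 4. Evaluate the model
model = LogisticRegression()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```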

Data preprocessing with k-Fold Cross Validation
Define a sequence (list) of data preparation steps, ending with the model, then fit and evaluate it as a single unit so preparation is re-fit within each fold
Each step in the list is a tuple with 2 elements (see the sketch below)
1st element: name of the step (a string)
2nd element: configured object of the step, such as a Transform or a Model
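
A minimal sketch with scikit-learn's Pipeline, whose steps are exactly such (name, object) tuples; the scaler and model are stand-ins and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, random_state=1)

# Each step is a 2-element tuple: (name string, configured transform/model)
steps = [("scaler", MinMaxScaler()), ("model", LogisticRegression())]
pipeline = Pipeline(steps=steps)

# The scaler is re-fit on each training fold only, avoiding data leakage
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
print(scores.mean())
```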



I am trying to understand

  • Technique to prepare data so that it avoids data leakage, which leads to incorrect model evaluation
  • Technique to identify and handle problems with messy data, such as outliers and missing values
  • Technique to identify and remove irrelevant and redundant input variables with feature selection methods
  • Technique to know which feature selection method to choose based on data types of variables
  • Technique to scale range of input variables using normalization and standardization technique
  • Technique to encode categorical variables as numbers and numeric variables as categories
  • Technique to transform probability distribution of input variables
  • Technique to transform a dataset with different variable types and how to transform target variables
  • Technique to project variables into a lower-dimensional space that captures salient data relationships

Part 1: Basic

  • Importance of data preparation, its techniques, and best practices to use in order to avoid data leakage

Part 2: Data Cleaning

  • Transform messy data into clean data by identifying outliers and handling missing values with statistical and modeling techniques

Part 3: Feature Selection

  • Statistical and Modeling techniques for feature selection and feature importance with how to choose technique to use for different variable types.

Part 4: Data Transforms

  • Transform variable types and variable probability distributions with a suite of standard data transform algorithms

Part 5: Advanced Transforms

  • Handle some of the trickier aspects of data transforms, such as handling multiple variable types at once, transforming targets, and saving transforms after choosing a final model

Part 6: Dimensionality Reduction

  • Remove input variables by projecting data into a lower dimensional space with dimensionality-reduction algorithms