Feature Engineering - Machine Learning

Summary

  • Imputation

    • numerical variable
      • mean or median imputation
      • arbitrary value imputation
      • end of tail imputation
    • categorical variables
      • frequent category imputation
      • add a missing category
    • both
      • complete case analysis
      • add a missing indicator
      • random sample imputation
      • iterative imputation (multivariate)
      • K-nearest neighbor imputation
  • techniques for data encoding (categorical variable)

    • traditional techniques
      • one-hot encoding
      • count or frequency encoding
      • ordinal or label encoding
    • monotonic relationship
      • ordered label encoding
      • mean (target) encoding
      • probability ratio encoding
      • weight of evidence $\left(\text{WOE} = \ln \left(\frac{p(1)}{p(0)}\right)\right)$
      • Catboost encoder - target-based
      • Leave-one-out encoder (LOO/LOOE) - target-based
      • James-Stein Encoder $\left( \widehat{x}^k = (1-B) \cdot \frac{n^+}{n} + B \cdot \frac{y^+}{y} \right)$ - target-based
    • alternative techniques
      • rare labels encoding
      • binary encoding
  • Most common-used transformer

    • logarithmic transformation $\left(f(x) = \ln(x), x > 0 \right)$
    • square root transformation $\left(f(x) = \sqrt{x}, x \ge 0 \right)$
    • reciprocal transformation $\left(f(x) = \frac1x, x \ne 0 \right)$
    • exponential or power transformation $\left(f(x) = x^2, x^3, \dots, x^n, \exp(x) \right)$
    • Box-Cox transformation $\left(x_i^{(\lambda)} = \begin{cases} (x_i^\lambda - 1) / \lambda & \text{if } \lambda \ne 0 \\ \ln(x_i) & \text{if } \lambda = 0 \end{cases} \right)$
    • Yeo-Johnson transformation $\left(x_i^{(\lambda)} = \begin{cases} [(x_i + 1)^\lambda - 1] / \lambda & \text{if } \lambda \ne 0, x_i \ge 0 \\ \ln(x_i + 1) & \text{if } \lambda = 0, x_i \ge 0 \\ -[(-x_i + 1)^{2-\lambda} - 1]/(2-\lambda) & \text{if } \lambda \ne 2, x_i < 0 \\ -\ln(-x_i + 1) & \text{if } \lambda = 2, x_i < 0 \end{cases}\right)$
  • Variable discretization approaches

    • supervised approach
      • discretization w/ decision tree
    • unsupervised approaches
      • equal-width discretization ($\text{width} = \frac{\max(x) - \min(x)}{N}$)
      • equal-frequency discretization
      • K-means discretization
    • other
      • custom discretization
  • Outlier Detection

    • visualization plots like box plot and scatter plot
    • normal distribution ($\mu \pm 3 \times \text{s.d.}$)
    • Inter-quartile range (IQR) proximity rule (upper bound = $Q_3(x) + 1.5 \times \text{IQR}$, lower bound = $Q_1(x) - 1.5 \times \text{IQR}$)
    • Density-Based Spatial Clustering of Application w/ Noise (DBSCAN)
    • Isolation Forest - tree-based
    • Local Outlier Factor (LOF)
  • Handling outliers

    • trimming: simply removing the outliers from dataset
    • imputing: treating outliers as missing data and applying missing data imputation techniques
    • discretization: placing outliers in edge bins w/ higher or lower values of the distribution
    • censoring: capping the variable distribution at the maximum and minimum values
  • Scaling methods

    • mean normalization $\left(\overline{x} = \frac{x - \mu}{\max(x) - \min(x)} \right)$
    • standardization $\left(\overline{x} = \frac{x - \mu}{\text{std}(x)} \right)$
    • robust scaling (scaling to median and IQR) $\left(\overline{x} = \frac{x - \text{median}(x)}{Q_3(x) - Q_1(x)} \right)$
    • robust to maximum and minimum $\left(\overline{x} = \frac{x - \min(x)}{\max(x) - \min(x)} \right)$
    • scale to absolute maximum $\left(\overline{x} = \frac{x}{\max(x)}\right)$
    • scale to unit norm $\left(\overline{x} = \frac{x}{|x|} \right)$

Overview

General

  • Feature engineering

    • the process of using data domain knowledge to create features or variables
    • purpose: making ML algorithms work effectively
    • very time consuming process
    • a number of processes
      • filling missing values within a variable
      • encoding categorical variables into numbers
      • variable transformation
      • creating or extracting new features from the ones available in the dataset
  • Feature engineering

    • simply making data better suited to the problem at hand
  • Purposes of feature engineering

    • raw data: messy and unsuitable for training a model
    • solution: data exploration and cleaning
    • involving changing data types and removing or imputing missing values
    • requirements: a certain understanding of the data acquired through exploration
    • solving these challenges and building high-performing models
    • solutions
      • removing outliers or specific features
      • creating features from the data that represent the underlying problem better
    • model performance often hinging on how the input features are engineered
  • Reasons for feature engineering

    • improving a model's predictive performance
    • reducing computational or data needs
    • improving interpretability of the results
  • Principle of feature engineering

    • useful feature: relationship to the target that your model is able to learn
    • linear model: transforming the features to make features' relationship to the target linear
    • key idea: a transformation applied to a feature becoming in essence a part of model itself
    • high return on time invested in feature engineering
  • Tips to discovering new features

    • understand the features: referring to data documentation if available
    • acquire domain knowledge: research the problem domain
    • study previous work
    • use data visualization:
      • revealing pathologies in the distribution of a feature
      • simplifying complicated relationships
      • a must step for feature engineering process
  • Feature Engineering vs. Feature Selection

    • feature engineering:
      • creating new features from the existing ones
      • helping the ML model make more effective and accurate predictions
    • feature selection
      • selecting from the feature pool
      • helping ML models to predict on target variables more efficiently
      • typical ML pipeline: completing feature engineering then feature selection
  • Categorical encoding w/ category_encoders lib

    • the process of transforming a categorical column into one (or more) numerical column(s)

    • Python library: category_encoders

      !pip install category_encoders
      
      import category_encoders as ce
      
      ce.OrdinalEncoder().fit_transform(x)
    • classification of encoders

      Classification of encoding methods for categorical variables
    • local demo notebook

Variables Types

  • Variables

    • any characteristic, number, or quantity measured or counted
    • major types of variables
      • numerical variables
      • categorical variables
      • datetime variables
      • mixed variables
    • get the type of each variable from a Pandas dataframe
  • Numerical variables

    • (predictably) numbers
    • categories of numerical variables
      • continuous variables
      • discrete variables
  • Continuous variables

    • an uncountable set of values
    • probably containing any value within a given range
    • visualization
      • density plot
      • histogram
      • box plot
      • scatter plot
  • Discrete variables

    • a finite number of values
    • integers, counts
    • visualization
      • count plot
      • pie chart
  • Categorical variables

    • selected from a group of categories
    • a.k.a. labels
    • categories: ordinal & nominal
    • ordinal variables: variables whose categories can be meaningfully ordered
    • nominal variables: not a natural order in the labels
    • in some scenarios, categorical variables coded as numbers when the data was recorded
  • Dates and Times

    • particular type of categorical variable
    • containing dates, times, or dates and times
    • usually not working w/ datetime variables in their raw format
      • date variables containing a considerable number of different categories
      • able to extract much more information from datetime variables by preprocessing them correctly
    • date variable issues
      • containing dates not present in the dataset used to train the learning model
      • containing dates placed in the future, w.r.t. the dates in the training dataset
  • Mixed variables

    • containing both numbers and labels
    • occurring in a given dataset, especially when filling its values

Common Issues in Datasets

  • General issues

    • missing data
    • categorical variable - cardinality
    • categorical variable - rare labels
    • linear model assumptions
    • variable distribution
    • outliers
    • feature magnitude
  • Missing data

    • when no value is stored for a particular observation in a variable
    • basically just the absence of data
    • data missing for multiple reasons: the value was lost or never existed
    • many features not mandatory
    • solution: missing data imputation techniques
    • issues:
      • probably distort the original variable distribution
      • alter the way variables interact w/ each other
      • affect the machine learning model's performance $\gets$ many models make assumptions about the variable distribution
    • carefully choosing the right missing data imputation technique
    • main mechanisms lead to missing data
      • missing data completely at random (MCAR)
      • missing data at random (MAR): the probability of an observation being missing depends on available information
      • missing data not at random (MNAR): a mechanism or a reason why values introduced in the dataset
  • Categorical variable - cardinality

    • labels: the values of a categorical variable selected from a group of categories
    • cardinality: the number of different labels
    • cardinality on models: issues w/ multiple labels in a categorical variable
    • high cardinality: a variable w/ a large number of distinct labels
  • Categorical variable - rare labels

    • rare labels: appear only in a small proportion of the observation in a dataset
    • impacts and considerations on rare labels
      • causing overfitting and generalization problems
      • hard to understand the role of the rare label in the final prediction
      • removing rare labels may improve model performance
  • Linear model assumptions

    • linearity
      • the relationship btw the variables ($X$s) and the target ($Y$) is linear and assessed w/ scatter plots

        $$ Y = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \cdots + \beta_n \cdot x_n $$

    • homoscedasticity

      • homogeneity of variance: independent variables w/ the same variance
      • tests and plots to determine homogeneity
        • residual plot
        • Levene's test
        • Bartlett's test
      • not homoscedastic:
        • performing non-linear transformations (e.g., logarithm transformation)
        • feature scaling to improve the homogeneity of variance
    • normality
      • assessment: histograms and Q-Q plots
      • Q-Q plot: if the variable is normally distributed, the values of the variable fall along a 45-degree line when plotted against the theoretical quantiles
      • variable not normal distribution: non-linear transformation to fix
    • independent: observations independent of each other
  • Variable/Probability distribution

    • a function describing the likelihood of obtaining the possible values that a variable can take
    • properties of probability distribution
      • $P(x)$: the likelihood that the random variable takes a specific value of $x$
      • unitary: $\sum P(x) = 1$
      • non-negative: $P(x) \in [0, 1]$
    • different probability distributions
      • discrete, like Binomial and Poisson
      • continuous, like Gaussian, skewed, and many others
    • distributions and model performance
      • linear models assumption: the independent variables w/ normal distribution
      • other models not assume the distribution of variables, a better spread of the values may improve their performance
  • Outliers

    • a data point significantly different from the remaining data
    • Ref: How to Make Your Machine Learning Models Robust to Outliers
    • an observation deviating so much from the other observations
    • should outliers be removed?
      • depending on context
      • sometimes deserving special attention
      • sometimes safely ignored entirely
    • algorithms most susceptible to outliers: mostly linear models
    • detecting outliers
      • using an extreme value analysis w/ a normal distribution to detect outliers
      • approximately 99.7% of the observations of a normally-distributed variable lie within the $\text{mean} \pm 3 \cdot \text{standard deviations}$
      • values outside mean $\pm 3 \times$ standard deviations considered outliers
    • visualization: box plot
  • Feature magnitude

    • examples
      • a dataset w/ a column for age and another one for income
      • the number of rooms in a given house and its price
    • feature magnitude matters
      • the scale of the variable directly influences the regression coefficient
      • variable w/ a more significant magnitude range (e.g., income) dominate over the ones w/ a smaller magnitude range (age)
      • gradient descent converges faster when features are on similar scale
      • feature scaling helps decrease the time to find support vectors for SVMs
      • Euclidean distances are sensitive to feature magnitude
    • models affected by feature magnitude
      • linear and logistic regression
      • neural networks
      • support vector machines
      • K-nearest neighbors
      • K-means clustering
      • linear discriminant analysis (LDA)
      • principal component analysis (PCA)
    • tree-based models insensitive to feature magnitude
      • classification and regression trees
      • random forests
      • gradient-boosted trees

Mutual Information

  • Handling features
    • issue: hundreds and thousands of features w/o description
    • procedure to resolve
      • constructing a ranking w/ a feature utility metric, a function measuring associations btw a feature and a target
      • choosing a smaller set of the most useful features to develop initially and having more confidence to spend time on them

Mutual information

  • metric used to measure associations btw a feature and a target

  • a lot like correlation to measure a relationship btw two quantities

  • MI detecting any kind of relationship while correlation only detecting linear relationship

  • a great general-purpose metric and especially useful at the start of feature development

  • advantages

    • easy to use and interpret
    • computationally efficient
    • theoretically well-founded
    • resistant to overfitting
    • able to detect any kind of relationship
  • Mutual information and measurement

    • MI describing relationships in terms of uncertainty
    • mutual information (MI) btw two quantities: a measure of the extent to which knowledge of one quantity reduces uncertainty about the other
    • scikit-learn algorithm for MI
      • two mutual information metrics in feature_selection module
      • continuous features
        • float dtype
        • real value targets: mutual_info_regression
      • categorical features
        • object or categorical dtype
        • treated as discrete by giving them a label encoding
        • categorical targets: mutual_info_classif
    • data visualization: a great toolbox for feature ranking, e.g., bar chart
  • Mutual information scores

    • MI = 0.0

      • least possible value
      • independent: unable to tell anything about the other
    • MI maximum value

      • theory: no upper bound
      • practice: MI > 2.0 uncommon
      • MI: a logarithmic quantity
      Left: Mutual information increases as the dependence between feature and target becomes tighter. Right: Mutual information can capture any kind of association (not just linear, like correlation.)
  • Considerations when using mutual information

    • relative potential:
      • MI helping to understand the relative potential of a feature
      • the potential as a predictor of the target
    • univariate metric
      • possible for a feature very informative when interacting w/ other features
      • not so informative for the feature itself
      • MI unable to detect interaction btw features
    • feature and model
      • the usefulness of a feature depending on the model it is used w/
      • a feature probably only useful to the extent that its relationship w/ the target can be learned by the model
      • a feature w/ high MI score $\nRightarrow$ model able to do anything w/ that information
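  • Python sketch (hedged): ranking numeric features against a continuous target w/ mutual_info_regression; the file name and column names below are placeholders, not from these notes (use mutual_info_classif for categorical targets)

    import pandas as pd
    from sklearn.feature_selection import mutual_info_regression

    df = pd.read_csv("dataset.csv")            # placeholder file
    X = df[["feature_a", "feature_b"]]         # placeholder numeric features
    y = df["target"]                           # placeholder continuous target

    # higher score = stronger (possibly non-linear) association w/ the target
    mi = mutual_info_regression(X, y, random_state=0)
    mi = pd.Series(mi, index=X.columns).sort_values(ascending=False)
    print(mi)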

Imputing Missing Values

Overview

  • Data imputation
    • the act of replacing missing data w/ statistical estimates of the missing values
    • goal: producing a complete dataset to use in the process of training ML models
    • python library:
      • sklearn.impute: transformers for missing value imputation
      • feature-engine: simplify the process of imputing missing values
    • classification of methods
      • numerical variable
        • mean or median imputation
        • arbitrary value imputation
        • end of tail imputation
      • categorical variables
        • frequent category imputation
        • add a missing category
      • both
        • complete case analysis
        • add a missing indicator
        • random sample imputation

Mean and Median Imputation

  • Mean and median imputation
    • replacing all occurrences of missing values (NA) within a variable w/ the mean and median of the variable
    • scenarios
      • suitable for numerical variables
      • missing completely at random (MCAR)
      • no more than 5% of the variable containing missing data
    • applied to both training and test sets
    • considerations
      • normal distribution: the mean and median approximately the same
      • skewed distribution: median as a better representation
    • assumption
      • missing data at random
      • missing observations most likely like the majority of the observations in the variable
    • advantages
      • easy to implement
      • easy way of obtaining complete datasets
      • used in production
    • limitations
      • distortion of the original variable distribution and variance
      • distortion of the covariance w/ the remaining dataset variable
      • the higher the percentage of missing values, the higher the distortions
    • Python: from sklearn.impute import SimpleImputer
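    • a minimal sketch of median imputation w/ SimpleImputer, learned on the train set and reused on the test set (toy frames and the age column are illustrative only)

      import numpy as np
      import pandas as pd
      from sklearn.impute import SimpleImputer

      # toy train/test frames w/ missing values (illustrative data only)
      X_train = pd.DataFrame({"age": [25, 30, np.nan, 40]})
      X_test = pd.DataFrame({"age": [np.nan, 35]})

      imputer = SimpleImputer(strategy="median")               # or strategy="mean"
      X_train[["age"]] = imputer.fit_transform(X_train[["age"]])
      X_test[["age"]] = imputer.transform(X_test[["age"]])     # reuse the train median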

Arbitrary Value Imputation

  • Arbitrary value imputation
    • replacing all occurrences of missing values (NA) within a variable w/ an arbitrary value
    • arbitrary value different from the mean and median and not within the normal values of the variable
    • typical arbitrary values: 0, 999, -999 (or other combinations of 9s), or -1 (for positive distributions)
    • scenarios
      • suitable for numerical and categorical variables
    • assumptions
      • data not missing at random
    • advantages
      • easy to implement
      • a fast way to obtain complete datasets
      • used in production, i.e., during model deployment
      • capturing the importance of a value being "missing", if existed
    • limitations
      • distortion of the original variable distribution and variance
      • distortion of the covariance w/ the remaining dataset variable
      • arbitrary value at the end of the distribution $\to$ mask or create outliers
      • carefully not choose an arbitrary value too similar to the mean or median (or any other typical value of the variable distribution)
      • the higher the percentage of NA, the higher the distortion
    • Python: from sklearn.impute import SimpleImputer
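    • a minimal sketch of arbitrary value imputation w/ SimpleImputer and a constant fill value (toy data; -999 is just one possible arbitrary value)

      import numpy as np
      import pandas as pd
      from sklearn.impute import SimpleImputer

      X = pd.DataFrame({"income": [2500, np.nan, 4000]})       # illustrative data
      imputer = SimpleImputer(strategy="constant", fill_value=-999)
      X[["income"]] = imputer.fit_transform(X[["income"]])     # NA -> -999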

End of Tail Imputation

  • End of tail imputation
    • roughly equivalent to arbitrary value imputation
    • automatically selecting the arbitrary values at the end of the variable distributions
    • scenarios
      • suitable for numerical variables
    • ways to select arbitrary values
      • normal distribution: using the $\mu \pm 3 \cdot \text{s.d.}$
      • skewed distribution: using the IQR proximity rule
    • replacing missing data calculated only on the train set
    • normal distribution
      • most of the observations (~99.7%) of a normally-distributed variable lie within $\pm 3 \times$ s.d. of the mean
      • the selected value = $\mu \pm 3 \times$ s.d.
    • skewed distribution
      • general approach: calculate the quantiles and the inter-quantile range (IQR)
        • IQR = 75th Quantile - 25th Quantile
        • upper limit = 75th Quantile + IQR x 3
        • lower limit = 25th Quantile - IQR x 3
      • selected value for imputation: upper limit or lower limit
    • Python: from feature_engine.imputation import EndTailImputer
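    • a hedged, manual sketch of the two selection rules using pandas only (toy series; in practice the replacement value is learned on the train set)

      import numpy as np
      import pandas as pd

      s = pd.Series([2.0, 3.5, np.nan, 4.1, 5.0])        # illustrative variable

      # Gaussian rule (roughly normal variable): mean + 3 * s.d.
      gaussian_value = s.mean() + 3 * s.std()

      # IQR proximity rule (skewed variable): 75th quantile + 3 * IQR
      iqr = s.quantile(0.75) - s.quantile(0.25)
      iqr_value = s.quantile(0.75) + 3 * iqr

      s_imputed = s.fillna(gaussian_value)               # or s.fillna(iqr_value)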

Frequent Category Imputation

  • Frequent category imputation
    • a.k.a. mode imputation
    • replacing all occurrences of missing values (NA) within a variable w/ the mode, or the most frequent value
    • scenarios
      • suitable for numerical and categorical variables
      • in practice, using the technique w/ categorical variables
      • used w/ data missing completely at random (MCAR)
      • no more than 5% of the variable contains missing data
    • the mode learned from the train set and applied to both train and test sets
    • assumption
      • missing data at random
      • missing observations most likely like the majority of the observations (i.e., the mode)
    • advantages
      • easy to implement
      • a fast way to obtain a complete dataset
      • used in production
    • limitations
      • distort the relation of the most frequent label w/ other variables within dataset
      • may lead to an over-representation of the most frequent label if a lot of missing observations existed
    • Python: from sklearn.impute import SimpleImputer

Missing Category Imputation

  • Missing category imputation
    • treating missing data as an additional label or category of the variable
    • create a new label or category by filling the missing observations w/ a Missing category
    • most widely used method of missing data imputation for categorical variables
    • advantages
      • easy to implement
      • fast way of obtaining complete datasets
      • integrated into production
      • capturing the importance of "missingness"
      • no assumption made on the data
    • limitations: small number of missing data $\to$ creating an additional category just adding another rare label to the variable
    • Python: from sklearn.impute import SimpleImputer

Complete Case Analysis

  • Complete case analysis (CCA)
    • discarding observations where values in any of the variables are missing
    • keep only those observations for which there's information in all of the dataset variables
    • observations w/ any missing data excluded
    • scenarios
      • missing data complete at random (MCAR)
      • no more than 5% of the total dataset containing missing data
    • assumption: missing data at random
    • advantages
      • simple to implement
      • no data manipulation required
      • preserving the distribution of the variables
    • limitation
      • excluding a significant fraction of the original dataset (if missing data significant)
      • excluding informative observations for the analysis (if data not missing at random)
      • create a biased dataset if the complete cases differ from the original data (if MAR or MNAR)
      • if used in production, the model does not know how to handle incoming observations w/ missing data
    • Python: data.dropna(inplace=True)

Missing Indicator

  • Missing indicator
    • an additional binary variable indicating whether the data was missing for an observation (1) or not (0)
    • goal: capture observations where data is missing
    • used together w/ methods assuming MAR
      • mean, median, mode imputation
      • random sample imputation
    • scenario: suitable for categorical and numerical variables
    • assumptions
      • NOT missing at random
      • predictive missing data
    • advantages
      • easy to implement
      • capture the importance of missing data
      • integrated into production
    • limitations
      • expanding the feature space
      • original variable still requiring to be imputed
      • many missing indicators may end up being identical or very highly correlated
    • Python: from sklearn.impute import MissingIndicator
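    • a minimal sketch adding a missing indicator next to a median-imputed column (toy data; SimpleImputer(add_indicator=True) combines both steps)

      import numpy as np
      import pandas as pd
      from sklearn.impute import MissingIndicator, SimpleImputer

      X = pd.DataFrame({"age": [25, np.nan, 40]})                 # illustrative data

      # binary flag: 1 where the value was missing, 0 otherwise
      X["age_missing"] = MissingIndicator().fit_transform(X[["age"]])[:, 0].astype(int)
      X[["age"]] = SimpleImputer(strategy="median").fit_transform(X[["age"]])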

Random Sample Imputation

  • Random sample imputation
    • taking a random observation from the pool of available observations of the variable and using those randomly selected values to fill in the missing one
    • scenario:
      • suitable for numerical and categorical variables
    • assumptions
      • missing data at random
      • replacing the missing values w/ values drawn from the same distribution as the original variable
    • advantages
      • easy to implement
      • a fast way of obtaining complete dataset
      • used in production
      • preserving the variance of the variable
    • limitations
      • randomness
      • relationship btw imputed variables and other variables probably affected if a lot of missing values
      • requiring massive memory for deployment to store the original training set to extract values from and replace the missing values w/ the randomly selected values
    • Python: from feature_engine.imputation import RandomSampleImputer

Iterative Imputation

  • Iterative imputation
    • a multivariate imputer that estimates each feature from all the other ones in a round-robin manner
    • using a strategy for imputing missing values by modeling each feature w/ missing values as a function of the other features
    • determining missing values by discovering patterns from the other variables
    • using round-robin at each step
      1. choosing a feature as output $y$ and all the other feature columns as input $x$
      2. training a regressor and fitting it on $(x, y)$ for known $y$
      3. the regressor used to predict the missing values of $y$
      4. repeating until the defined max_iter reached
    • IterativeImputer still experimental in scikit-learn
    • Python: from sklearn.experimental import enable_iterative_imputer & from sklearn.impute import IterativeImputer
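    • a minimal sketch of IterativeImputer on a toy frame (illustrative data; the API is still experimental, hence the enable import)

      import numpy as np
      import pandas as pd
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer

      X = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                        "b": [2.1, np.nan, 6.3, 8.0]})            # illustrative data

      imputer = IterativeImputer(max_iter=10, random_state=0)     # round-robin regression
      X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)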

K-Nearest Neighbor Imputation

  • K-nearest neighbor (KNN) imputing
    • using the famous KNN algorithm to predict the missing values from the neighbors
    • any missing value approximated by the values of the nearest points based on the other variables
    • Python: from sklearn.impute import KNNImputer
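    • a minimal sketch of KNNImputer on a toy frame (illustrative data and neighbor count)

      import numpy as np
      import pandas as pd
      from sklearn.impute import KNNImputer

      X = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                        "b": [2.0, 4.0, 6.0, np.nan]})            # illustrative data

      # each missing value filled from the 2 nearest rows (nan-aware Euclidean distance)
      X_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(X),
                               columns=X.columns)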

Encoding Categorical Variables

Overview

  • Categorical encoding

    • permanently replacing category strings w/ numerical representations
    • goal: producing variables used to train machine learning models and build predictive features from categories
    • techniques for data transformation
      • traditional techniques
        • one-hot encoding
        • count or frequency encoding
        • ordinal or label encoding
      • monotonic relationship
        • ordered label encoding
        • mean encoding
        • probability ratio encoding
        • weight of evidence
      • alternative techniques
        • rare labels encoding
        • binary encoding
    • Python library: category_encoders - containing a lot of basic and advanced methods for categorical variable encoding
  • Supervised feature encoding engineering

    • a method of encoding categories as integer number
    • example: one-hot or label encoding
  • Target encoding

    • any kind of encoding replacing a feature's categories w/ some number derived from the target

    • simple and effective version: applying a group aggregation, like the mean

    • Automobiles: average price of each vehicle's make

      autos["make_encoded"] = autos.groupby("make")["price"].transform("mean")
    • mean encoding: applying a group aggregation w/ mean

    • other encodings: likelihood encoding, impact encoding, and leave-one-out encoding

One-Hot Encoding

  • One-hot encoding
    • consisting of encoding each categorical variable w/ a set of boolean variables, that take values of 0 or 1
    • the value indicating if a category is present for each observation
    • multiple variants
    • one-hot encoding into $k-1$ variables
      • creating $k-1$ binary variables, where $k$ is the number of distinct categories
      • using one less dimension and still represent the data fully
      • e.g., medical test w/ $k=2$ (positive/negative), creating only one ($k - 1 =1$) binary variable
      • most ML algorithms considering the entire dataset while training
      • encoding categorical variables into $k-1$ binary values better $\to$ avoid introducing redundant information
    • one-hot encoding into $k$ variables:
      • occasions better to encode variables into $k$ variables
        • building tree-based algorithms
        • making feature selection w/ recursive algorithms
        • interested in determining the importance of every single category
    • one-hot encoding of most frequent categories
      • only considering the most frequent categories in a variable
      • avoid overextending the feature space
    • advantages
      • not assuming the distribution of categories of the categorical variable
      • keeping all the information of the categorical variable
      • suitable for linear models
    • limitations
      • expanding the feature space
      • not adding extra information while encoding
      • many dummy variables probably identical $\to$ introducing redundant information
    • Python: data_with_k_df = pd.get_dummies(data_df)
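    • a minimal sketch of both variants w/ pd.get_dummies (toy column; drop_first=True gives the $k-1$ encoding)

      import pandas as pd

      data_df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})   # illustrative

      data_with_k_df = pd.get_dummies(data_df)                             # k dummy columns
      data_with_k_minus_1_df = pd.get_dummies(data_df, drop_first=True)    # k - 1 columns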

Integer (Label) Encoding

  • Integer (Label) Encoding
    • replacing the categories w/ digits from $1$ to $n$ (or $0$ to $n-1$, depending on the implementation)
    • $n$: the number of the variable's distinct categories (the cardinality)
    • the numbers assigned arbitrarily
    • advantages
      • straightforward to implement
      • not expanding the feature space
      • working well enough w/ tree-based algorithms
      • allowing agile benchmarking of ML models
    • limitations
      • not adding extra information while encoding
      • not suitable for linear models
      • not handling new categories in the test set automatically
      • creating an order relationship btw the categories
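    • a minimal sketch w/ scikit-learn's OrdinalEncoder (toy column; the assigned integers carry no real order)

      import pandas as pd
      from sklearn.preprocessing import OrdinalEncoder

      X = pd.DataFrame({"city": ["London", "Paris", "London", "Rome"]})   # illustrative

      # each distinct category mapped to an arbitrary integer (0 to n-1)
      X["city_encoded"] = OrdinalEncoder().fit_transform(X[["city"]])[:, 0]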

Count or Frequency Encoding

  • Count or frequency encoding
    • replacing categories w/ the count or percentage of observations showing that category in the dataset
    • capturing the representation of each label in the dataset
    • advantages
      • straightforward to implement
      • not expanding the feature space
      • working well w/ tree-based algorithms
    • limitations
      • not suitable for linear models
      • not handling new categories in the test set automatically
      • losing valuable information if two different categories occur w/ the same number of observations
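    • a minimal pandas sketch of count and frequency encoding (toy column; counts should be learned on the train set only)

      import pandas as pd

      X = pd.DataFrame({"city": ["London", "Paris", "London", "Rome"]})   # illustrative

      counts = X["city"].value_counts()                  # learned on the train set
      X["city_count"] = X["city"].map(counts)            # count encoding
      X["city_freq"] = X["city"].map(counts / len(X))    # frequency encoding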

Ordered Label Encoding

  • Ordered label encoding
    • replacing categories w/ integers from 1 to n
    • $n$: the number of distinct categories in the variable (the cardinality)
    • using the target mean information of each category to decide how to assign these numbers
    • advantages
      • straightforward to implement
      • not expanding the feature space
      • creating a monotonic relationship btw categories and the target
    • limitation: probably leading to overfitting

Mean (Target) Encoding

  • Mean (target) encoding
    • replacing the category w/ the mean target value for that category
    • procedure
      • grouping each category alone
      • for each group, calculating the mean of the target in the corresponding observations
      • assigning mean to that category
      • encoded the category w/ the mean of the target
    • advantages
      • straightforward to implement
      • not expanding the feature space
      • creating a monotonic relationship btw categories and the target
    • limitations
      • probably leading to overfitting
      • probably leading to a possible loss of information if two categories have the same mean of the target
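    • a minimal pandas sketch of mean (target) encoding (toy data; the means should be learned on the train set only to limit overfitting)

      import pandas as pd

      train = pd.DataFrame({"city": ["London", "Paris", "London", "Rome"],
                            "target": [1, 0, 0, 1]})               # illustrative data

      means = train.groupby("city")["target"].mean()               # per-category target mean
      train["city_encoded"] = train["city"].map(means)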

Smoothing

  • Issues of encoding

    • unknown categories
      • creating a special risk of overfitting
      • required to be trained on an independent "encoding" split
      • imputation: filling in missing values for any categories
    • rare categories
      • any statistics on this group unlikely very accurate
      • solution: smoothing
  • Smoothing technique

    • blending the in-category average w/ the overall average

    • rare categories: less weight on their category average

    • missing categories: the overall average

    • pseudocode

      encoding = weight * in_category + (1 - weight) * overall

    • weight

      • a value btw 0 and 1 calculated from the category frequency
      • determining weight by computing m-estimate: $\text{weight } = n / (n + m)$
        • $n$: the total number of times the category occurred in the data
        • $m$: hyperparameter to determine the "smoothing factor"
      • value for $m \to$ how noisy expecting the categories to be
        • target values varying a great deal $\implies$ choosing a larger value for $m$
        • target values relatively stable $\implies$ choosing a smaller value
    • larger values of $m$ $\to$ more weight on the overall estimate (a worked sketch follows at the end of this section)

      M-estimate w/ categories count and smoothing factor
  • Use cases for target encoding

    • high-cardinality features:
      • a feature w/ large number of categories: troublesome to encode
      • one-hot encoding:
        • generating too many features, and alternatives probably not appropriate for that feature
      • target encoding: deriving numbers for the categories w/ the relationship w/ the target
    • domain-motivated feature
      • prior experience suggesting a categorical feature probably important, even if it scored poorly w/ a feature metric
      • target encoding revealing a feature's true information
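  • Python sketch (hedged): a manual m-estimate smoothing of a target encoding, following the weight = n / (n + m) formula above (toy data; m is a hypothetical smoothing factor)

    import pandas as pd

    train = pd.DataFrame({"make": ["a", "a", "a", "b", "c"],
                          "price": [10, 12, 11, 30, 5]})           # illustrative data
    m = 5.0                                                        # smoothing factor

    overall = train["price"].mean()                                # overall average
    stats = train.groupby("make")["price"].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + m)                 # n / (n + m)
    encoding = weight * stats["mean"] + (1 - weight) * overall     # blended estimate
    train["make_encoded"] = train["make"].map(encoding)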

Weight of Evidence Encoding

  • Weight of evidence encoding (WOE)
    • used to encode categorical variables for classification

    • applying the natural logarithm ($\ln$) of the probability that the target equals 1 divided by the probability that the target equals 0

    • math formula

      $$ \text{WOE} = \ln\left(\frac{p(1)}{p(0)}\right) $$

      • $p(1)$: the probability of the target being 1
      • $p(0)$: the probability of the target being 0
    • WOE value

      • WOE > 0: the probability of the target being 1 is higher
      • WOE < 0: the probability of the target being 0 is higher
    • creating an excellent visual representation of the variable

    • observation: category favoring the target being 0 or 1

    • advantages

      • creating a monotonic relationship btw the target and the variables
      • ordering the categories on the 'logistic' scale, natural for logistic regression
      • comparing the transformed variables because they are on the same scale $\to$ determine which one is more predictive
    • limitations

      • probably lead to overfitting
      • not defined when the denominator is 0
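    • a minimal pandas sketch of WOE following the formula above (toy data; undefined when a category has $p(1) = 0$ or $p(0) = 0$)

      import numpy as np
      import pandas as pd

      train = pd.DataFrame({"city": ["London", "London", "London", "Paris", "Paris"],
                            "target": [1, 1, 0, 0, 1]})            # illustrative data

      p1 = train.groupby("city")["target"].mean()                  # P(target = 1) per category
      woe = np.log(p1 / (1 - p1))                                  # WOE = ln(p(1)/p(0))
      train["city_woe"] = train["city"].map(woe)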

Probability Ratio Encoding

  • Probability ratio encoding
    • suitable for classification problems only, where the target is binary
    • similar to WOE, but not applying the natural logarithm
    • for each category, calculating the mean of the target, i.e., the probability of the target being 1
      • $P(1)$: the probability of the target being 1
      • $P(0)$: the probability of the target being 0
    • calculating the ratio = P(1)/P(0) and replacing the categories by that ratio
    • advantages
      • capturing information within the category, and therefore creating more predictive features
      • creating a monotonic relationship btw the variables and the target, suitable for linear models
      • not expanding the feature space
    • limitations
      • likely to cause overfitting
      • not defined when the denominator is 0

Rare Label Encoding

  • Rare label encoding
    • rare label: appearing only in a tiny proportion of the observations in a dataset
    • causing some issues, especially w/ overfitting and generalization
    • solution: group those rare labels into a new category like other or rare

Binary Encoding

  • Binary encoding
    • using binary code
    • procedure
      • converting each integer to binary code
      • each binary digits gets one column in the dataset
    • $n$ unique categories $\implies$ binary encoding results in only $\lceil \log_2 n \rceil$ features
    • advantages
      • straightforward to implement
      • not expanding the feature space too much
    • limitations
      • some loss of information during encoding
      • lacking the human-readable sense
    • Python: from category_encoders import BinaryEncoder

Catboost Encoder

  • Catboost encoder
    • similar to target encoding
    • replacing the category w/ the mean target value for that category
    • relying on the order of observations in the dataset
    • the target statistic for each row calculated only from the rows before it
    • similar to leave-one-out encoding, but the values computed sequentially following the row order
    • procedure
      • repeating training numerous times on shuffled copies of the dataset
      • averaging the results
    • Python: from category_encoders import CatBoostEncoder

Leave-One-Out Encoder

  • Leave-one-out encoder (LOO/LOOE)
    • an example of target-based encoding
    • preventing target data leakage, unlike other target-based methods
    • consisting of calculating the mean target of a given category $k$ for observation $j$ w/o using the corresponding target of $j$
    • calculating the per-category means w/ the typical target encoder
    • Python: from category_encoders import LeaveOneOutEncoder

James-Stein Encoder

  • James-Stein encoder
    • another example of a target-based encoder, defined for normal distribution

    • shrinking the average toward the overall average

    • intended to improve the estimation of the category's mean target by shrinking it towards a more central (pooled) average

    • getting the mean target for category $k$

      $$ \widehat{x}^k = (1 - B) \cdot \frac{n^+}{n} + B \cdot \frac{y^+}{y} $$

      • $\frac{n^+}{n}$: the estimation of the category's mean target
      • $\frac{y^+}{y}$: the central average of the mean target
      • $B$: a hyperparameter, representing the power of shrinking
    • Python: from category_encoders import JamesSteinEncoder

Transforming Variables

Overview

  • Transforming variables

    • assumption of linear and logistic regression: normally distributed variables
    • in practice, real datasets more often following a skewed distribution
    • purpose:
      • mapping skewed distribution to a normal distribution
      • increasing the performance of models
    • tools to estimate normality: histogram and Q-Q plot
    • most common-used methods
      • logarithmic transformation
      • square root transformation
      • reciprocal transformation
      • exponential or power transformation
      • Box-Cox transformation
      • Yeo-Johnson transformation
    • Python: from sklearn.preprocessing import FunctionTransformer
  • Q-Q plot

    • variable following a normal distribution $\implies$ the variable's values fall in a 45-degree line against the theoretical quantiles

    • Python snippet

      # import the libraries
      import matplotlib.pyplot as plt
      import scipy.stats as stats
      import pandas as pd
      
      # read data
      data_df = pd.read_csv("dataset.csv")
      
      # create and show the plot
      stats.probplot(data_df["variable"], dist="norm", plot=plt)
      plt.show()
      
  • Representing feature relationships

    • relationships among numerical features usually expressed through mathematical formulas
    • ratio:
      • features describing a car's engine in Automobile dataset
      • a variety of formulas for creating potentially useful new feature
      • e.g., stroke ratio: a measure of how efficient an engine is vs. how performant it is
    • combination
      • complicated formulation among features
      • the more complicated a combination is, the more difficult it will be for a model to learn
      • e.g., an engine's "displacement" as a measure of its power
    • data visualization
      • able to suggest transformations
      • often a "reshaping" of a feature through powers or logarithms
      • e.g., highly skewed distribution of Windspeed in US Accidents
  • Counting features

    • features describing presence or absence
    • representing such features w/ binary (1 for presence, 0 for absence) or Boolean (True or False) values
    • dealing such features in sets
    • new "counts" features: aggregating such features
    • able to create Boolean values w/ dataframe built-in methods
  • Manipulating structured data

    • complex strings usually broken into simpler pieces
    • common examples of structured data
      • ID numbers: '123-45-6789'
      • Phone numbers: '(999) 555-0123'
      • Street addresses: '8241 Kaggle Ln., Goose City, NV'
      • Internet addresses: 'http://www.kaggle.com'
      • Product codes: '0 36000 29145 2'
      • Dates and times: 'Mon Sep 30 07:06:05 2013'
    • able to apply string methods, like split, directly to columns
    • able to join simple features into a composed feature

Logarithmic Transformation

  • Logarithmic transformation
    • formula: $ f(x) = \ln(x), x > 0$
    • simplest and most popular among the different types of transformations
    • involving a substantial transformation that significantly affects distribution shape
    • making extremely skewed distribution less skewed, especially for right-skewed distributions
    • constraint: only for strictly positive numbers
    • Python: logarithm_transformer = FunctionTransformer(np.log, validate=True)

Square Root Transformation

  • Square root transformation
    • formula: $f(x) = \sqrt{x}, x \ge 0$
    • simple transformation w/ average effect on distribution shape
    • weaker than logarithmic transformation
    • used for reducing right-skewed distributions
    • advantage: able to apply to zero values
    • constraint: only for non-negative numbers
    • Python: sqrt_transformer = FunctionTransformer(np.sqrt, validate=True)
    • alternative: cubic root function

Reciprocal Transformation

  • Reciprocal transformation
    • formula: $f(x) = \frac{1}{x}, x \ne 0$
    • a powerful transformation w/ a radical effect
    • positive reciprocal: reversing the order among values of the same sign $\to$ large values become smaller
    • negative reciprocal: preserving the order among values of the same sign
    • constraint: not defined for zero
    • Python: reciprocal_transformer = FunctionTransformer(np.reciprocal, validate=True)
    • alternative: negative reciprocal function

Exponential or Power Transformation

  • Exponential or Power transformation
    • formula:

      $$ \begin{align*} f(x) &= x^2 \\ g(x) &= x^3 \\ h(x) &= x^n \\ k(x) &= \exp(x) \end{align*} $$

    • a reasonable effect on distribution shape

    • applying power transformation (power of two usually) to reduce left skewness

    • Python: exponential_transformer = FunctionTransformer(lambda x: x**(3), validate=True)

Box-Cox Transformation

  • Box-Cox transformation
    • formula ($x_i > 0$):

      $$ x_i^{(\lambda)} = \begin{cases} \frac{x_i^{\lambda}-1}{\lambda} & \text{if } \lambda \ne 0 \\ \ln(x_i) & \text{if } \lambda = 0 \end{cases} $$

    • one of the most successful transformations

    • evolution of the exponential transformation by looking through various exponents instead of trying them manually

    • process

      • searching and evaluating all the other transformations and choosing the best one
      • hyperparameter ($\lambda$): varying over the range (-5, 5)
      • examining all values of $\lambda$
      • choosing the optimal value (resulting in the best approximation to a normal distribution)
    • constraint: only for positive numbers

    • Python: boxcox_transformer = PowerTransformer(method='box-cox', standardize=False)

Yeo-Johnson Transformation

  • Yeo-Johnson transformation
    • formula

      $$ x_i^{(\lambda)} = \begin{cases} [(x_i + 1)^\lambda - 1] / \lambda & \text{if } \lambda \ne 0, x_i \ge 0 \\ \ln(x_i + 1) & \text{if } \lambda = 0, x_i \ge 0 \\ -[(-x_i + 1)^{2-\lambda} - 1]/(2-\lambda) & \text{if } \lambda \ne 2, x_i < 0 \\ -\ln(-x_i + 1) & \text{if } \lambda = 2, x_i < 0 \end{cases} $$

    • an adjustment to the Box-Cox transformation

    • able to apply to negative numbers

    • Python: yeo_johnson_transformer = PowerTransformer(method='yeo-johnson', standardize=False)
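    • a minimal sketch w/ scikit-learn's PowerTransformer (toy column including a negative value; method='box-cox' would require strictly positive data)

      import pandas as pd
      from sklearn.preprocessing import PowerTransformer

      X = pd.DataFrame({"x": [0.5, 1.2, 3.4, -0.7, 10.0]})         # illustrative data

      transformer = PowerTransformer(method="yeo-johnson", standardize=False)
      X["x_yj"] = transformer.fit_transform(X[["x"]])[:, 0]
      print(transformer.lambdas_)                                  # fitted lambda per column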

Group Transforms

  • Group transforms

    • aggregating information across multiple rows grouped by some category
    • good practice: category interaction $\to$ group transform over the category
    • aggregation function to combine two features
      • grouping categorical feature
      • aggregating feature values
    • built-in dataframe method as aggregation function, e.g., mean, max, min, median, var, std, count
    • respecting data splits
      • when using training and validation splits, preserve their independence
      • best practice
        • creating a grouped feature using only the training set
        • joining it to the validation set
        • using the validation set's merge method after creating a unique set of values w/ drop_duplicates on the training set (see the sketch at the end of this section)
  • Tips for creating features

    • linear models
      • learning sums and differences naturally
      • unable to learn anything more complex
    • ratio:
      • difficult for most models to learn
      • ratio combinations leading to some easy performance gains
    • normalization
      • linear models and Neural Nets generally doing better w/ normalized features
      • NN: features scaled to values not too far from 0
      • tree-based models benefiting from normalization only to a limited extent
    • tree models
      • learning to approximate almost any combination of features
      • combinations especially important when data is limited
    • counts:
      • especially helpful for tree models
      • tree models w/o natural way of aggregating information across many features at once
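  • Python sketch (hedged): a group transform that respects the train/validation split, as referenced above (toy frames and column names are illustrative)

    import pandas as pd

    train = pd.DataFrame({"state": ["CA", "CA", "NY"], "income": [70, 90, 60]})  # illustrative
    valid = pd.DataFrame({"state": ["CA", "NY"]})

    # group transform computed on the train set only
    train["avg_income_by_state"] = train.groupby("state")["income"].transform("mean")

    # join the learned values onto the validation set
    mapping = train.drop_duplicates("state")[["state", "avg_income_by_state"]]
    valid = valid.merge(mapping, on="state", how="left")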

Variable Discretization

Overview

  • Variable Discretization

    • transforming a continuous variable into a discrete one
    • essentially creating a set of contiguous intervals spanning the variable's value range
    • binning = discretization, bin = interval
    • approaches
      • supervised approach
        • discretization w/ decision tree
      • unsupervised approaches
        • equal-width discretization
        • equal-frequency discretization
        • K-means discretization
      • other
        • custom discretization
  • Using the newly-created discrete variable

    • usually encoding w/ ordinal, i.e., integer encoding as 1, 2, 3, etc.
    • two major methods
      • using the value of the interval straight away if using intervals as numbers
      • treating numbers as categories, applying any of the encoding technique that creates a monotone relationship w/ the target
    • advantageous way of encoding bins: treating bins as categories to use an encoding technique that creates a monotone relationship w/ the target

Equal-Width Discretization

  • Equal-width discretization
    • the simplest form of discretization
    • dividing the range of possible values into $N$ bins of the same width
    • width of intervals: $\text{width} = \frac{\max - \min}{N}$
    • $N$ parameter:
      • the number of intervals
      • determined experimentally - no rules of thumb here
    • considerations
      • not improving the values spread
      • handling outliers
      • creating a discrete variable
      • useful when combined w/ categorical encoding
    • Python: from sklearn.preprocessing import KBinsDiscretizer
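    • a minimal sketch w/ KBinsDiscretizer (toy data; strategy='quantile' and strategy='kmeans' give the equal-frequency and k-means variants described below)

      import pandas as pd
      from sklearn.preprocessing import KBinsDiscretizer

      X = pd.DataFrame({"age": [22, 25, 31, 35, 41, 58, 63, 70]})  # illustrative data

      discretizer = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
      X["age_bin"] = discretizer.fit_transform(X[["age"]])[:, 0]   # bin index per row
      print(discretizer.bin_edges_)                                # equal-width edges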

Equal-Frequency Discretization

  • Equal-frequency discretization
    • dividing the scope of possible values of the variable into $N$ bins
    • each bin holding the same number (or approximately the same number) of observation
    • considerations
      • the interval boundaries corresponding to quantiles
      • improving the value spread
      • handling outliers
      • disturbing the relationship w/ the target
      • useful when combined w/ categorical encoding
    • Python: from sklearn.preprocessing import KBinsDiscretizer

K-Means Discretization

  • K-means discretization

    • consisting of applying k-means clustering to the continuous variable
    • bin = cluster
    • reviewing the k-means algorithm
      1. creating $K$ random points as cluster centers
      2. associating every data point w/ the closest center (using some distance metric, like Euclidean distance)
      3. re-computing each center position in the center of its associated points
      4. repeat step 2 & 3 until convergence
    • tutorials about k-means
    • considerations
      • not improving the values spread
      • handling outliers, though outliers may influence the centroid
      • creating a discrete variable
      • useful when combined w/ categorical encoding
    • Python: from sklearn.preprocessing import KBinsDiscretizer
  • Unsupervised learning algorithms

    • not making use of a target
    • purpose:
      • learning some property of the data
      • representing the structure of the features in a certain way
    • a "feature discovery" technique in terms of feature engineering
  • Clustering

    • the assigning of data points to groups
    • groups based on how similar the points are to each other
    • making "birds of a feather flock together"
    • used for feature engineering: an attempt to discover
      • groups of customers representing a market segment
      • geographic areas sharing similar weather patterns
    • adding a feature of cluster labels $\to$ untangle complicated relationships of space and proximity
  • Feature w/ clustered labels

    • clustering: like a traditional "binning" or "discretization" transform
    • multiple features:
      • a.k.a. vector quantization
      • multi-dimensional binning
    • motivation for adding cluster labels
      • clusters breaking up complicated relationships across features in simple chunks
      • applying a divide-and-conquer strategy to handle different clusters
      • learning the simpler chunks one-by-one instead of learning the complicated whole
  • Clustering algorithms

    • classification
      • how they measure "similarity" or "proximity"
      • what kinds of features they work with
    • k-means: intuitive and easy to apply in a feature engineering context
    • selection of algorithm: depending on application
  • K-means clustering

    • measuring similarity using ordinary straight-line distance (Euclidean distance)
    • creating clusters by placing a number of points, called centroids, inside the feature space
    • each point assigned to the cluster of whichever centroid it is closest to
    • $k$: the parameter about how many centroids
    • Voronoi tessellation
      • imagining each centroid capturing points through a sequence of radiating circles
      • a line formed where the sets of circles from competing centroids overlap
      • showing which cluster future data would be assigned to
  • K-means w/ scikit-learn's implementation

    • hyperparameters: n_clusters, max_iter, n_init
    • procedure
      • init: randomly initializing n_clusters centroids
      • assign points to the nearest cluster centroid
      • move each centroid to minimize the distance to its points
      • repeat the above 2 steps until the centroids converged or reaching the maximum iteration (max_iter)
    • issue: initial random position of the centroids $\to$ poor clustering
    • solution:
      • repeat the algorithm a number of times (n_init)
      • return the clustering w/ the least total distance btw each point and its centroid, the optimal clustering
    • increasing the max_iter for a large number of clusters
    • increasing n_init for a complex dataset
    • sensitive to scale:
      • rescale or normalize data w/ extreme values
      • depending on domain knowledge and predicting target
      • rule of thumb: features
        • already directly comparable, e.g., test results at different times $\to$ not rescale
        • not on comparable scales, e.g., height and weight $\to$ usually benefit from rescaling
        • not clear $\to$ use common sense
      • features w/ larger values weighted more heavily
      • comparing different schemes through cross-validation probably helpful
    • best partitioning for a set of features depending on
      • model used
      • what to predict
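  • Python sketch (hedged): cluster labels as a feature w/ scikit-learn's KMeans (toy data; features scaled first because k-means is scale sensitive)

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    X = pd.DataFrame({"height": [0.4, 0.5, 1.1, 1.2],
                      "diameter": [0.3, 0.35, 0.9, 1.0]})          # illustrative data

    X_scaled = StandardScaler().fit_transform(X)                   # rescale before clustering
    kmeans = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
    X["cluster"] = kmeans.fit_predict(X_scaled)                    # cluster label as a feature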

Discretization w/ Decision Trees

  • Discretization w/ Decision Trees
    • consisting of a decision tree to identify the optimal bins
    • a decision tree making a decision $\to$ assigning an observation to one of $N$ end leaves
    • generating a discrete output, the predictions at each of its $N$ leaves
    • procedure
      • training a decision tree of limited depth (2, 3, or 4) using only the variable to discretize and the target
      • replacing the variable's value w/ the output returned by the tree
    • considerations
      • not improving the values spread
      • handling outliers since trees are robust to outliers
      • creating a discrete variable
      • prone to overfitting
      • costing some time to tune the parameters effectively (e.g., tree depth, the minimum number of samples in one partition, the minimum information gain)
      • observations within each bin more similar to each other
      • creating a monotonic relationship
    • Python: from sklearn.tree import DecisionTreeClassifier
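    • a minimal sketch of tree-based discretization on a single variable w/ a binary target (toy data; the leaf probabilities become the discretized values)

      import pandas as pd
      from sklearn.tree import DecisionTreeClassifier

      train = pd.DataFrame({"age": [22, 25, 31, 35, 41, 58, 63, 70],
                            "target": [0, 0, 0, 1, 1, 1, 1, 0]})   # illustrative data

      tree = DecisionTreeClassifier(max_depth=2)                   # shallow tree = few bins
      tree.fit(train[["age"]], train["target"])
      train["age_tree"] = tree.predict_proba(train[["age"]])[:, 1] # one value per leaf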

Custom Discretization

  • Custom discretization
    • engineering variables in a custom environment (i.e., for a particular business use case)
    • determining the intervals into which the variable is divided so that they make sense for the business
    • example: Age divided into groups like [0-10] as kids, [10-25] as teenagers, and so on
    • Python: labels = ['0-10', '10-25', '25-65', '>65']
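    • a minimal sketch of the age grouping above w/ pd.cut (toy data)

      import numpy as np
      import pandas as pd

      data_df = pd.DataFrame({"Age": [4, 17, 32, 70]})             # illustrative data

      bins = [0, 10, 25, 65, np.inf]
      labels = ['0-10', '10-25', '25-65', '>65']
      data_df["Age_group"] = pd.cut(data_df["Age"], bins=bins, labels=labels)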

Principal Component Analysis

  • Principal Component Analysis and feature engineering

    • a partitioning of the variation in the data
    • a great tool to help to discover important relationship in the data
    • used to create more informative features
    • typically applied to standardized data
    • variation meaning
      • standardized data: correlation
      • non-standardized data: covariance
  • Visualization for Principal Component Analysis

    • axes of variation

      • describing the ways the abalone tend to differ from one another
      • axes: perpendicular lines along the natural dimensions of the data
      • each axis for one original feature
    • idea of PCA: instead of describing the data w/ the original features, describing it w/ axes of variation

    • dataset: Abalone data set

      • physical measurements taken from several thousand Tasmanian abalone
      • only focusing on Height and Diameter of their shells
    • axes of variation for abalone

      • Size component
        • the longer axis
        • small height and small diameter (lower left) contrasted w/ large height and large diameter (upper right)
      • Shape component
        • the shorter axis
        • small height and large diameter (flat shape) contrasted w/ large height and small diameter (round shape)
  • PCA as new features

    • new features from PCA: linear combinations (weighted sums) of the original features

      df["Size"] = 0.707 * X["Height"] + 0.707 * X["Diameter"]
      df["Shape"] = 0.707 * X["Height"] - 0.707 * X["Diameter"]

      • principal components of the data: Size, Shape
      • loadings: weights, 0.707
    • number of principal components = the number of features in the original dataset

    • component's loadings expressed through signs and magnitudes

      • table of loadings

        | Features \ Components | Size (PC1) | Shape (PC2) |
        | --- | --- | --- |
        | Height | 0.707 | 0.707 |
        | Diameter | 0.707 | -0.707 |
      • Size component: Height and Diameter varying in the same direction (same sign)

      • Shape component: Height and Diameter varying in opposite direction (opposite sign)

      • all loadings w/ the same magnitude $\to$ features contributing equally

  • Percent of explained variance

    • PCA represents the amount of variation in each component
    • more variation in the data along the Size component than along the Shape component
    • making a precise comparison through each component's percent of explained variance
    • Size component: the majority of variation btw Height and Diameter
    • the amount of variance in a component
      • not necessarily correspond to how good it is as a predictor
      • depending on what to predict
  • Ways to use PCA for feature engineering

    • use as a descriptive technique
      • computing the MI scores for the components
      • what kind of variation most predictive of the target
      • ideas for kinds of features to create
        • Size: product of Height and Diameter
        • Shape: ratio of Height and Diameter
      • try clustering on one or more of the high scoring components
    • use components themselves as features
      • the components exposing the variational structure of the data directly
      • often more informative than the original features
      • use cases
        • dimensionality reduction
          • useful when features are highly redundant, in particular multicollinear
          • partitioning out the redundancy into one or more near-zero variance components
        • anomaly detection
          • unusual variation often showing up in the low-variance components
          • unusual variation: not apparent from the original features
          • components highly informative in an anomaly or outlier detection task
        • noise reduction
          • sensor readings often w/ common background noise
          • able to collect the (informative) signal into a smaller number of features while leaving out the noise
          • boosting the signal-to-noise ratio
        • decorrelation
          • ML sometimes struggling w/ highly-correlated features
          • transforming correlated features into uncorrelated components
  • PCA best practices

    • only working w/ numeric features, including continuous quantities or counts
    • sensitive to scale: standardizing data before applying PCA
    • removing or constraining outliers, which can exert undue influence on the results (a minimal sketch follows)
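
    • a minimal sketch of PCA as a feature-engineering step w/ scikit-learn, using a small numeric DataFrame standing in for the abalone Height/Diameter columns (values are illustrative):

      ```python
      import pandas as pd
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import StandardScaler

      # illustrative stand-in for the abalone measurements
      X = pd.DataFrame({'Height': [0.09, 0.12, 0.15, 0.11, 0.14],
                        'Diameter': [0.26, 0.35, 0.47, 0.30, 0.41]})

      # standardize first: PCA is sensitive to scale
      X_std = StandardScaler().fit_transform(X)

      pca = PCA()
      components = pca.fit_transform(X_std)   # the new features (PC1, PC2, ...)

      # loadings: one row per original feature, one column per component
      loadings = pd.DataFrame(pca.components_.T,
                              index=X.columns,
                              columns=[f'PC{i + 1}' for i in range(X.shape[1])])
      print(loadings)
      print(pca.explained_variance_ratio_)    # percent of explained variance per component
      ```

    • the columns of `components` can be appended to the training data as new features, or the loadings inspected for feature ideas, as described above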

Handling Outliers

Overview

  • Outliers
    • a data point significantly different from the remaining data
    • an observation deviating so much from the other observations
    • arousing suspicion that a different mechanism produced it
    • handling outliers
      • trimming: simply removing the outliers from dataset
      • imputing: treating outliers as missing data and applying missing data imputation techniques
      • discretization: placing outliers in edge bins w/ higher or lower values of the distribution
      • censoring: capping the variable distribution at the maximum and minimum values

Detection

  • Detecting Outliers
    • using visualization plots like box plot and scatter plot
      • box plot: black points as outliers
      • scatter plot: most points located in center but one far from center might be outlier
    • using a normal distribution (mean and s.d.)
      • about 99.7% of the data lie within 3 s.d. of the mean

IQR Proximity Rule

  • Inter-quartile range (IQR) proximity rule
    • Interquartile range (IQR)
      • used to build boxplot graphs
      • dividing data into four parts, each part a quartile
      • IQR = the difference btw the 3rd quartile Q3 (75%) and the 1st quartile Q1 (25%)
    • outliers defined w/ the IQR (see the sketch below)
      • below Q1 - 1.5 x IQR
      • above Q3 + 1.5 x IQR
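
    • a minimal sketch computing both the Gaussian (mean ± 3 s.d.) and the IQR bounds for an illustrative numerical variable:

      ```python
      import numpy as np
      import pandas as pd

      # illustrative data; x stands for any numerical variable
      rng = np.random.default_rng(0)
      x = pd.Series(rng.normal(loc=50, scale=5, size=1000))

      # Gaussian approximation: flag values outside mean +/- 3 standard deviations
      gauss_lower = x.mean() - 3 * x.std()
      gauss_upper = x.mean() + 3 * x.std()

      # IQR proximity rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
      q1, q3 = x.quantile(0.25), x.quantile(0.75)
      iqr = q3 - q1
      iqr_lower, iqr_upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

      outliers = (x < iqr_lower) | (x > iqr_upper)
      ```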

DBSCAN

  • Density-Based Spatial Clustering of Application w/ Noise (DBSCAN)
    • a clustering algorithm used to group points in the same clusters
    • choosing two hyperparameters
      • epsilon > 0 for the distances btw points: the maximum distance btw two examples for one to be considered in the neighborhood of the other
      • min_samples $\in \Bbb{N}$: serving as the number of samples in a neighborhood for a point to be considered as a core point
    • algorithm
      1. randomly selecting a point not assigned to a cluster
      2. determining if it belongs to a cluster by seeing if there are at least min_samples points around it within epsilon distance
      3. creating a cluster of this point w/ all other samples within epsilon distance to it
      4. finding all points that are within epsilon distance of each point in that cluster and adding them to the same cluster
      5. finding all points that are within epsilon distance of all recently added points and adding these to the same cluster
      6. repeating steps 1~5
    • all points not reachable from any other point are considered outliers
    • Python: from sklearn.cluster import DBSCAN (see the sketch below)
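
    • a minimal sketch on 2-D toy data (the eps and min_samples values are illustrative and need tuning):

      ```python
      import numpy as np
      from sklearn.cluster import DBSCAN

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),     # dense cluster
                     np.array([[5.0, 5.0], [-6.0, 4.0]])])  # two far-away points

      db = DBSCAN(eps=0.5, min_samples=5).fit(X)

      # points labelled -1 are not reachable from any cluster, i.e. flagged as outliers
      outlier_mask = db.labels_ == -1
      ```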

Isolation Forests

  • Isolation forests
    • built on the foundation of decision trees and using tree ensemble methods
    • the algorithm examining how quickly a point is isolated
    • normal points: requiring more partitions to isolate
    • outliers
      • isolated quickly in the first splits
      • less frequent than regular observations
      • lying further away from the regular observations in the feature space
      • w/ random partitioning identified closer to the root of the tree
    • Python: from sklearn.ensemble import IsolationForest (see the sketch below)
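
    • a minimal sketch on toy data (the contamination value is an assumption to tune per dataset):

      ```python
      import numpy as np
      from sklearn.ensemble import IsolationForest

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 1, size=(200, 2)),
                     np.array([[8.0, 8.0], [-9.0, 7.0]])])  # two obvious outliers

      iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
      labels = iso.fit_predict(X)        # +1 for inliers, -1 for outliers
      outlier_mask = labels == -1
      ```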

Local Outlier Factor

  • Local outlier factor (LOF)
    • measuring the local density deviation of a given sample, taking only its neighbors into consideration and not the global data distribution
    • outlier: the density around that point significantly different from the density around its neighbors
    • algorithm
      1. calculating the distances btw a randomly selected point and every other point
      2. finding the distance to the $k$-th nearest neighbor (the farthest of the $k$ closest points)
      3. finding the other $k$ closest points, like a normal KNN
      4. calculating the point density (local reachability density) using the inverse of the average distance btw that point and its neighbors (the lower the density, the farther the point is from its neighbors)
      5. calculating the LOF, essentially the average local reachability density of the neighbors divided by the point's own local reachability density
    • interpretation of the final LOF score
      • LOF(k) = 1: similar density as neighbors
      • LOF(k) < 1: higher density than neighbors (inlier)
      • LOF(k) > 1: lower density than neighbors (outlier)
    • Python: from sklearn.neighbors import LocalOutlierFactor (see the sketch below)
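
    • a minimal sketch on toy data (n_neighbors plays the role of $k$):

      ```python
      import numpy as np
      from sklearn.neighbors import LocalOutlierFactor

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 1, size=(200, 2)),
                     np.array([[6.0, 6.0]])])   # one obvious outlier

      lof = LocalOutlierFactor(n_neighbors=20)
      labels = lof.fit_predict(X)                 # +1 inlier, -1 outlier
      lof_scores = -lof.negative_outlier_factor_  # roughly the LOF score; > 1 suggests an outlier
      outlier_mask = labels == -1
      ```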

Trimming

  • Trimming outliers
    • merely removing outliers from the dataset
    • deciding on a metric to determine outliers
    • considerations
      • fast method
      • removing a significant amount of data
    • Python: outliers = np.where(data_df[variable] > upper, True, np.where(data_df[variable] < lower, True, False)) & data_df = data_df.loc[~outliers]

Censoring

  • Censoring outliers
    • setting the maximum and/or the minimum of the distribution at any arbitrary value
    • values bigger or smaller than the arbitrarily chosen value are replaced by the value
    • considerations about capping
      • not removing data
      • distorting the distributions of the variables
    • ways to choose the capping values (see the sketch below)
      • arbitrarily replacing the outliers
      • inter-quartile range proximity rule
      • Gaussian approximation
      • using quantiles
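
    • a minimal sketch of censoring w/ caps chosen by the IQR proximity rule (the income column and values are illustrative):

      ```python
      import pandas as pd

      data_df = pd.DataFrame({'income': [25, 30, 32, 28, 27, 31, 400]})
      variable = 'income'

      # choose the caps w/ the IQR proximity rule
      q1, q3 = data_df[variable].quantile(0.25), data_df[variable].quantile(0.75)
      iqr = q3 - q1
      lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

      # cap (censor) values beyond the bounds instead of removing them
      data_df[variable] = data_df[variable].clip(lower=lower, upper=upper)
      ```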

Imputer

Transformation

Feature Scaling

Overview

  • Feature scaling
    • methods used to normalize the range of values of the independent variables
    • ways to set the feature value range within a similar scale
    • concerns
      • the scale of the variable directly influencing the regression coefficient
      • variables w/ a larger magnitude dominating over the ones w/ a smaller magnitude range
      • gradient descent converging faster when features are on similar scales
      • feature scaling helping decrease the time to find the support vectors of SVMs
      • Euclidean distances are sensitive to feature magnitude
    • algorithms sensitive to feature magnitude
      • linear and logistic regression
      • Neural networks
      • support vector machine
      • KNN
      • K-means clustering
      • linear discriminant analysis (LDA)
      • principal component analysis (PCA)
    • algorithm insensitive to feature magnitude
      • classification and regression trees
      • random forest
      • gradient boosted trees
    • scaling methods
      • mean normalization
      • standardization
      • robust scaling (scaling to median and IQR)
      • scaling to minimum and maximum values (min-max scaling)
      • scale to absolute maximum
      • scale to unit norm

Mean Normalization

  • Mean normalization
    • centering the variable at 0 and rescaling the variable's value range to the range -1 and 1

    • scaling formula:

      $$\overline{x} = \frac{X - \text{mean}(X)}{\max(X) - \min(X)}$$

    • not normalizing the variable distribution

    • characteristics

      • centering the mean at 0
      • different resulting variance
      • modifying the shape of original distribution
      • constraining the resulting minimum and maximum values within the range [-1, 1]
      • preserving outliers if present (see the sketch below)
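
    • to my knowledge scikit-learn ships no dedicated mean-normalization transformer, so a minimal pandas sketch (illustrative columns and values):

      ```python
      import pandas as pd

      df = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                         'x2': [10.0, 20.0, 40.0, 80.0]})

      # mean normalization: center on the column mean, rescale by the column value range
      df_scaled = (df - df.mean()) / (df.max() - df.min())
      ```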

Standardization

  • Standardization
    • centering the variable at 0 and standardizing the variance to 1

    • scaling formula:

      $$\overline{x} = \frac{X - \text{mean}(X)}{\text{std}(X)}$$

    • not normalizing the variable distribution

    • characteristics

      • scaling the variance at 1
      • centering the mean at 0
      • preserving the shape of the original distribution
      • preserving outliers if present
      • minimum and maximum values varying
    • Python: from sklearn.preprocessing import StandardScaler (see the sketch below)
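
    • a minimal sketch on a toy array (the same fit/transform pattern applies to RobustScaler, MinMaxScaler, and MaxAbsScaler below):

      ```python
      import numpy as np
      from sklearn.preprocessing import StandardScaler

      X_train = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 150.0], [4.0, 400.0]])
      X_test = np.array([[2.5, 120.0]])

      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)  # each column: mean 0, unit variance
      X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
      ```

    • fitting the scaler on the training set only and reusing it on the test set avoids data leakage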

Robust Scaling

  • Robust scaling (scaling to median and IQR)
    • using median instead of mean

    • scaling formula

      $$\overline{x} = \frac{X - \text{median}(X)}{Q_3(X) - Q_1(X)}$$

    • characteristics

      • centering the median at 0
      • resulting variance varying across variables
      • not preserving the shape of the original distribution
      • minimum and maximum values varying
      • robust to outliers
    • Python: from sklearn.preprocessing import RobustScaler

Min-Max Scaling

  • Min-Max scaling
    • compressing the value between 0 and 1

    • scaling formula:

      $$\overline{x} = \frac{X - \min(X)}{\max(X) - \min(X)}$$

    • not normalizing the variable distribution

    • characteristics

      • not centering the mean at 0
      • making the variance vary across variables
      • not maintaining the shape of the original distribution
      • maximum and minimum values in the range of [0, 1]
      • sensitive to outliers
    • Python: from sklearn.preprocessing import MinMaxScaler

Maximum Absolute Scaling

  • Maximum absolute scaling
    • scaling the variable btw -1 and 1

    • scaling formula:

      $$\overline{x} = \frac{X}{\max(|X|)}$$

    • characteristics

      • the resulting mean not centered
      • not scaling the variance
      • sensitive to outliers
    • Python: from sklearn.preprocessing import MaxAbsScaler

Scaling to Vector Unit Norm

  • Scaling to vector unit norm
    • scale to vector unit norm

    • scaling formula:

      $$\overline{x} = \frac{x}{\|x\|}$$

    • the norm $\|x\|$: typically the Manhattan ($\ell_1$) or Euclidean ($\ell_2$) length of the observation's feature vector

    • characteristics

      • the length of the resulting vector is 1
      • normalizing each observation's feature vector (each row) rather than each feature across observations
      • sensitive to outliers
      • recommended for text classification and clustering
    • Python: from sklearn.preprocessing import Normalizer (see the sketch below)
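
    • a minimal sketch on a toy array showing that Normalizer rescales each row, unlike the column-wise scalers above:

      ```python
      import numpy as np
      from sklearn.preprocessing import Normalizer

      X = np.array([[4.0, 1.0, 2.0],
                    [1.0, 3.0, 9.0],
                    [5.0, 7.0, 5.0]])

      normalizer = Normalizer(norm='l2')   # 'l1' and 'max' are also available
      X_unit = normalizer.fit_transform(X)

      row_norms = np.linalg.norm(X_unit, axis=1)   # every row now has length 1
      ```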

Handling Date-Time and Mixed Variables

Date and Time Variables

  • Engineering variables of date and time

    • date and time: good resource of information
    • each number corresponding to a specific part of the date and time
    • date-time variables in many formats, e.g.,
      • time of birth: 19:45:57
      • birthday date: 16-08-1995, 16-04-1997
      • invoice date: 03-06-2020 19:47:29
  • Time/date components: pd.Series.dt

    Date / Time Components

    | Property | Description |
    | --- | --- |
    | year | The year of the datetime |
    | month | The month of the datetime |
    | day | The days of the datetime |
    | hour | The hour of the datetime |
    | minute | The minutes of the datetime |
    | second | The seconds of the datetime |
    | microsecond | The microseconds of the datetime |
    | nanosecond | The nanoseconds of the datetime |
    | date | Returns datetime.date (does not contain timezone information) |
    | time | Returns datetime.time (does not contain timezone information) |
    | timetz | Returns datetime.time as local time with timezone information |
    | dayofyear | The ordinal day of year |
    | weekofyear | The week ordinal of the year |
    | week | The week ordinal of the year |
    | dayofweek | The number of the day of the week with Monday=0, Sunday=6 |
    | weekday | The number of the day of the week with Monday=0, Sunday=6 |
    | quarter | Quarter of the date: Jan-Mar = 1, Apr-Jun = 2, etc. |
    | days_in_month | The number of days in the month of the datetime |
    | is_month_start | Logical indicating if first day of month (defined by frequency) |
    | is_month_end | Logical indicating if last day of month (defined by frequency) |
    | is_quarter_start | Logical indicating if first day of quarter (defined by frequency) |
    | is_quarter_end | Logical indicating if last day of quarter (defined by frequency) |
    | is_year_start | Logical indicating if first day of year (defined by frequency) |
    | is_year_end | Logical indicating if last day of year (defined by frequency) |
    | is_leap_year | Logical indicating if the date belongs to a leap year |
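
    • a minimal sketch of the .dt accessor (the invoice_date column and values are illustrative):

      ```python
      import pandas as pd

      df = pd.DataFrame({'invoice_date': pd.to_datetime(
          ['03-06-2020 19:47:29', '16-08-1995 07:15:00'], dayfirst=True)})

      # extract individual date/time components as new features
      df['year'] = df['invoice_date'].dt.year
      df['month'] = df['invoice_date'].dt.month
      df['dayofweek'] = df['invoice_date'].dt.dayofweek        # Monday=0, Sunday=6
      df['hour'] = df['invoice_date'].dt.hour
      df['is_month_end'] = df['invoice_date'].dt.is_month_end
      ```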

Mixed Variables

  • Engineering mixed variables types

    • solution: extracting the categorical part in one variable and the numerical part in a different variable
    • two special formats in a mixed variable
      • labels and numbers in different observations
      • labels and numbers in the same observation
  • Labels and numbers in different observations

    • either numbers or labels in their values

    • example

      Example of mixed numerical & labels w/ different observations
    • coercion resulting in a lot of NaN values $\to$ applying missing data imputation techniques

    • Python: data_df[mixed_num] = pd.to_numeric(data_df[mixed], errors='coerce', downcast='integer')

  • Labels and numbers in the same observation

    • variables containing both numbers and labels in their values
    • tricky to extract categorical and numerical values
    • depending on a number of factors, e.g., number of letters, locations, etc.
    • Python: data_df[mixed_num] = data_df[mixed].str.extract(r'(\d+)')
    • Regular Expressions (regex): detecting patterns in mixed variables to easily extract the categorical and numerical parts (see the sketch below)
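
    • a minimal sketch splitting an illustrative mixed variable (e.g., cabin-like codes) into its numerical and categorical parts:

      ```python
      import pandas as pd

      data_df = pd.DataFrame({'mixed': ['A23', 'B7', '35', 'C150']})

      # labels and numbers in the same observation: split w/ regular expressions
      data_df['mixed_num'] = data_df['mixed'].str.extract(r'(\d+)', expand=False).astype(float)
      data_df['mixed_cat'] = data_df['mixed'].str.extract(r'([A-Za-z]+)', expand=False)

      # labels and numbers in different observations: coerce labels to NaN, then impute
      data_df['as_number'] = pd.to_numeric(data_df['mixed'], errors='coerce')
      ```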

Periodicity

  • Cyclical feature problem
    • cyclical or periodic data
      • data following a cycle
      • e.g., hours, minutes, seconds, days of the month, days of weeks, and months
    • preserving the cyclical info in datasets so that models learn accurately and behave correctly
    • one solution: projecting the cyclical feature on a circle, specifically the unit circle
    • unit circle: using $\cos$ and $\sin$ functions to express the periodicity
    • the $\sin$ and $\cos$ as new created features to transform the cyclical feature
    • Python: data_df['payment_hour_sin'] = np.sin(data_df['payment_hour'] * (2. * np.pi / 24.)) & data_df['payment_hour_cos'] = np.cos(data_df['payment_hour'] * (2. * np.pi / 24.))

Advanced Feature Engineering

Automated feature engineering

  • Deep feature synthesis

    • automatically generating features for relational datasets
    • following relationships in the data to a base field
    • sequentially applying mathematical functions along that path to create the final feature
  • Featuretools

    • an open-source framework for implementing automated feature engineering
    • a comprehensive tool intended to fast-forward the feature generation process
    • components
      • deep feature synthesis: the backbone of featuretools
      • entities: multiple entities result in an EntitySet
      • feature primitives: Deep Feature Synthesis applied to an EntitySet - transformations or aggregations like count or average (see the sketch below)
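
    • a minimal sketch of deep feature synthesis, assuming the featuretools >= 1.0 API and two hypothetical tables (customers, transactions):

      ```python
      import pandas as pd
      import featuretools as ft

      # hypothetical relational data
      customers = pd.DataFrame({'customer_id': [1, 2]})
      transactions = pd.DataFrame({'transaction_id': [10, 11, 12],
                                   'customer_id': [1, 1, 2],
                                   'amount': [25.0, 40.0, 10.0]})

      es = ft.EntitySet(id='retail')
      es = es.add_dataframe(dataframe_name='customers', dataframe=customers, index='customer_id')
      es = es.add_dataframe(dataframe_name='transactions', dataframe=transactions, index='transaction_id')
      es = es.add_relationship('customers', 'customer_id', 'transactions', 'customer_id')

      # stack aggregation primitives along the relationship to build customer-level features
      feature_matrix, feature_defs = ft.dfs(entityset=es,
                                            target_dataframe_name='customers',
                                            agg_primitives=['mean', 'count'])
      ```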

Geospatial data

  • Geospatial feature
    • represented as longitude and latitude
    • features influencing the predictive model's results by a large margin if well-engineered
    • procedure
      • visualizing the features to obtain valuable insight
      • exploring different methods to extract and design new features

Resampling imbalanced data

  • Resampling
    • issue: classes not represented equally
    • causing problems for some algorithms
    • resampling the data to reduce this effect on machine learning algorithms (see the sketch below)
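
    • a minimal sketch of random over-sampling, assuming the imbalanced-learn package (not part of scikit-learn) and a synthetic dataset:

      ```python
      from collections import Counter

      from imblearn.over_sampling import RandomOverSampler
      from sklearn.datasets import make_classification

      # synthetic binary problem w/ a 95/5 class split
      X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
      print(Counter(y))

      ros = RandomOverSampler(random_state=0)
      X_res, y_res = ros.fit_resample(X, y)   # duplicate minority samples until classes are balanced
      print(Counter(y_res))
      ```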

Examples

  • Concrete Formulations - Counting Feature Used

    • task: illustrating how adding a few synthetic features to a dataset can improve the predictive performance of a random forest model
    • dataset: Concrete
      • containing a variety of concrete formulations and the resulting product's compressive strength
      • compressive strength: a measure of how much load that kind of concrete can bear
  • 1985 Automobiles - Mutual Information

    • dataset: Automobile dataset
    • goal: predicting a car's price (the target) from 23 of the car's features
    • task: ranking the features w/ mutual information and investigating the results by data visualization
  • Ames House Price - Mutual Information

    • dataset: Ames data set
    • task: identify initial set of features w/
      • mutual information
      • interaction plots
  • Ames House Price - creating features

    • dataset: Ames dataset
    • task: developing mathematical transforms
      • features describing areas
      • same units (square-feet)
      • using XGBoost (a tree-based model) $\to$ focusing on ratios and sums
    • create mathematical transforms (see the sketch after this list)
      • LivLotRatio: the ratio of GrLivArea to LotArea
      • Spaciousness: the sum of FirstFlrSF and SecondFlrSF divided by TotRmsAbvGrd
      • TotalOutsideSF: the sum of WoodDeckSF, OpenPorchSF, EnclosedPorch, Threeseasonporch, and ScreenPorch
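
    • a minimal sketch of the transforms above (a tiny dummy frame w/ invented values stands in for the Ames data; column names are taken from the list):

      ```python
      import pandas as pd

      X = pd.DataFrame({'GrLivArea': [1710, 1262], 'LotArea': [8450, 9600],
                        'FirstFlrSF': [856, 1262], 'SecondFlrSF': [854, 0],
                        'TotRmsAbvGrd': [8, 6], 'WoodDeckSF': [0, 298],
                        'OpenPorchSF': [61, 0], 'EnclosedPorch': [0, 0],
                        'Threeseasonporch': [0, 0], 'ScreenPorch': [0, 0]})

      X_new = pd.DataFrame()
      X_new['LivLotRatio'] = X['GrLivArea'] / X['LotArea']
      X_new['Spaciousness'] = (X['FirstFlrSF'] + X['SecondFlrSF']) / X['TotRmsAbvGrd']
      X_new['TotalOutsideSF'] = (X['WoodDeckSF'] + X['OpenPorchSF'] + X['EnclosedPorch']
                                 + X['Threeseasonporch'] + X['ScreenPorch'])
      ```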
  • California Housing - K-Means

    • data set: California Housing
      • Latitude and Longitude: natural candidates for k-means clustering
      • MedInc: creating economic segments in different regions of California
    • training w/ K-means
  • 1985 Automobiles - PCA

    • dataset: Automobile
    • task: descriptive technique to discover features
  • PCA for feature engineering

    • dataset: Ames
    • task:
      • using PCA results to discover one or more new features
      • new features to improve the performance of the model
        • inspired by the loadings
        • using the components themselves as features
  • MovieLens1M - Target Encoding

    • dataset: MovieLens1M
      • 1 million movie ratings by users of the MovieLens website
      • features describing each user and movie
    • tasks:
      • identifying features for encoding
      • applying M-estimate encoding
  • Ames - Target Encoding

    • dataset: Ames
    • task: encode Ratings w/ SalePrice
  • House Prices - XGBoost

    • dataset: Ames
    • task: predict SalePrice w/ XGBoost