-
Filter Methods
- Basic feature
- constant feature
- quasi-constant feature
- duplicated features
- Correlation filter
- Pearson's correlation coefficient
$\left( r_{xy} = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2} \sqrt{\sum_{i=1}^n (y_i - \overline{y})^2}} \right)$
- Spearman's rank correlation coefficient
$\left( \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \right)$
- Kendall's rank correlation coefficient
$\left( \tau = \frac{2(n_c - n_d)}{n(n-1)} \right)$
- Statistical and Ranking filter
- Mutual information
$\left( I(X; Y) = \sum_{x, y} P_{XY}(x, y) \log \frac{P_{XY}(x, y)}{P_X(x)P_Y(y)} \right)$
- Chi-squared score
- ANOVA univariate test
- Univariate ROC-AUC / RMSE
-
Wrapper Methods
- Forward feature addition, sequential forward feature selection (SFS)
- Backward feature elimination, sequential backward feature selection (SBS)
- Exhaustive feature selection
- LRS or Plus-L, Minus-R
- Sequential floating
- step floating forward selection (SFFS)
- step floating backward selection (SFBS)
- Bidirectional search (BDS)
-
Embedded methods
- Regularization
- Tree-based feature importance
- Feature importance
- Random forests
- Permutation importance
-
Hybrid Methods
- Filter & wrapper
- Embedded & wrapper
- Recursive feature elimination
- Recursive feature addition
-
Advanced methods
- Dimensionality reduction
- Principal component analysis (PCA)
- Linear discriminant analysis (LDA)
- Heuristic search algorithms
- Genetic algorithm (GA)
- Simulated annealing
- Deep learning
- Autoencoder for feature selection (AEFS)
-
Feature selection
- a.k.a. variable selection, attribute selection or subset selection
- the process by which a data scientist automatically or manually selects a subset of relevant features to use in ML model building
- selecting the best subset of attributes
- most important
- high contribution when making predictions
- a critical process in any ML pipeline
- designed to remove irrelevant, redundant, and noisy features
- preserving a small subset of features from the primary feature space
- impact: model's performance
- advantages
- reducing computational complexity
- improving model accuracy
- increasing model interpretability
-
- not always true: the more data features you have, the better the resulting model is going to be
- irrelevant features
- redundant features
- result: overfitting
- reasons to select features
- simpler models easier to interpret: much easier to understand the output of a model w/ fewer variables
- shorter training time
- enhanced generalization by reducing overfitting
- variable redundancy
-
Feature selection vs. feature engineering
- feature engineering:
- creating new features
- helping the ML model make more effective and accurate predictions
- feature selection:
- selecting features from the feature pool
- helping ML models more efficiently make predictions on target variables
-
Feature selection vs. Dimensionality reduction
- dimensionality reduction
- often lumped together w/ feature selection
- using unsupervised algorithms to reduce the number of features in a dataset
- differences
- feature selection: a process to select and exclude some features w/o modifying them at all
- dimensionality reduction:
- modifying or transforming features into a lower dimension
- creating a whole new feature space that looks approximately like the first one, but smaller in terms of dimensions
-
Procedure of feature selection
- process
- combination of a search technique for proposing a new feature subset
- an evaluation measure that scores how well the different feature subsets perform
- computational expensive
- looking for the best combination of feature subsets from all the available features
-
- filter methods
- wrapper methods
- embedded methods
-
Filter methods
- definitions:
- selecting features from a dataset independently of any ML algorithm
- relying only on the characteristics of these variables
- filtered out of the data before learning begins
- advantages
- used in any ML algorithm
- computationally inexpensive
- good for eliminating irrelevant, redundant, constant, duplicated, and correlated features
- types
- univariate
- multivariate
- methods proposed for univariate and multivariate filter-based feature selection:
- basic feature methods
- correlation filter methods
- statistical & ranking filter methods
-
Univariate filter methods
- evaluating and ranking a single feature according to certain criteria
- treating each feature individually and independently of the feature space
- procedures
- ranking features according to certain criteria
- selecting the highest ranking features according to those criteria
- issue: not considering each feature's relation to the other ones in the dataset
-
Multivariate filter methods
- evaluating the entire feature space
- considering the relations to other ones in the dataset
-
Constant features
- showing a single value in all the observations in the dataset
- no information
- Python:
from sklearn.feature_selection import VarianceThreshold
-
Quasi-constant features
- a value occupying the majority of the records
- hyperparameter: threshold
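A minimal sketch of both filters w/ `VarianceThreshold`; the toy DataFrame and the 0.2 threshold are illustrative assumptions (threshold=0 removes only constant features):

```python
# sketch: remove constant and quasi-constant features w/ VarianceThreshold
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X_train = pd.DataFrame({
    "f1": [1, 1, 1, 1],   # constant: variance 0
    "f2": [0, 0, 0, 1],   # quasi-constant: variance ~0.19
    "f3": [3, 1, 4, 2],   # informative: variance 1.25
})

# keeps only features whose variance is strictly above the threshold
selector = VarianceThreshold(threshold=0.2)
selector.fit(X_train)
kept = X_train.columns[selector.get_support()]
print(list(kept))  # ['f3']
```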
-
Duplicated features: Python snippet
# transpose the features so duplicated columns become duplicated rows
train_features_T = train_features.T
# select the duplicated feature column names
duplicated_columns = train_features_T[train_features_T.duplicated()].index.values
-
Correlation filter methods
- correlation:
- a measure of the linear relationship btw two quantitative variables
- a measure of how strongly one variable depends on another
- high correlation w/ target
- a useful property to predict one from another
- goal: features highly correlated w/ the target, especially for linear ML models
- high correlation btw variables:
- providing redundant information in regards to the target
- making an accurate prediction on the target w/ just one of the redundant variables
- removing redundant ones to reduce the dimensionality w/o adding noise
- methods to measure the correlation btw variables
- Pearson correlation coefficient
- Spearman's rank correlation coefficient
- Kendall's rank correlation coefficient
-
Pearson's correlation coefficient
- popular measure used in ML
- used to summarize the strength of the linear relationship btw two data variables
- $r_{xy} \in [-1, 1]$
    - $r_{xy} = 1$: the values of one variable increase as the values of the other increase
    - $r_{xy} = -1$: the values of one variable decrease as the values of the other increase
    - $r_{xy} = 0$: no linear correlation btw them
- assumptions
    - both variables normally distributed
    - straight-line relationship btw the two variables
    - data equally distributed around the regression line
- Pearson correlation coefficient
$$ r_{xy} = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2} \sqrt{\sum_{i=1}^n (y_i - \overline{y})^2}} $$
-
Spearman's rank correlation coefficient
- for variables w/ a nonlinear relationship that may be stronger or weaker across the distribution of the variables
- a non-parametric test
- used to measure the degree of association btw two variables w/ a monotonic function
- $\rho \in [-1, 1]$
- Pearson correlation assessing linear relationships, while Spearman's correlation assessing monotonic relationships (whether linear or not)
- suitable for both continuous and discrete ordinal variables
- not carrying any assumptions about the distribution of the data
- Spearman's rank correlation coefficient
$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$
-
Kendall's rank correlation coefficient
- a non-parametric test that measures the strength of the ordinal association btw two variables
- calculating a normalized score for the number of matching or concordant rankings btw the two data samples
- $\tau \in [-1, 1]$
    - $\tau = 1$ (high): observations w/ a similar rank btw two variables
    - $\tau = -1$ (low): observations w/ a dissimilar rank btw two variables
- best suited for discrete data
- Kendall's rank correlation
$$ \tau = \frac{2(n_c - n_d)}{n(n-1)} $$
-
Python related functions
- `dataframe.corr()` function
- `DataFrame.duplicated()` function
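A minimal sketch of a multivariate correlation filter built on `corr()`; the helper name and the 0.8 threshold are illustrative assumptions, and `method="spearman"` or `"kendall"` can be passed to `corr()` for the rank-based coefficients:

```python
# sketch: drop one feature from every highly correlated pair
import numpy as np
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    # absolute pairwise Pearson correlation between features
    corr = X.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # drop a feature if it correlates above the threshold w/ any earlier one
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```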
-
Statistical & ranking filter methods
- evaluating each feature individually
- evaluating the variable's power to discriminate the target
- procedure
- ranking the features based on certain criteria or metrics
- selecting the features w/ the highest rankings
- methods
- mutual information
- Chi-square score
- ANOVA univariate test
- Univariate ROC-AUC / RMSE
-
Mutual information
- a measure of the mutual dependence of two variables
- measuring the amount of info obtained about one variable through observing the other variable
- determining how much we learn about one variable by understanding another
- a little bit like correlation, but more general
- measuring how much info the presence/absence of a feature contributes to making the correct prediction on $Y$
- MI values
    - $MI = 0$: $X$ and $Y$ independent
    - $MI = E(X)$: $X$ is deterministic of $Y$, w/ $E(X)$ the entropy of $X$
- Entropy: measuring or quantifying the amount of info within a variable
- mutual information expressed via joint and marginal distributions:
$$ I(X; Y) = \sum_{x, y} P_{XY}(x, y) \log \frac{P_{XY}(x, y)}{P_X(x) P_Y(y)} $$
-
Python:
from sklearn.feature_selection import mutual_info_classif
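A minimal sketch on the iris data (an illustrative choice), ranking features by their estimated mutual information w/ the target:

```python
# sketch: rank features by mutual information with the class label
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True, as_frame=True)

# one MI estimate per feature; higher means more shared information
mi = pd.Series(mutual_info_classif(X, y), index=X.columns)
print(mi.sort_values(ascending=False))
```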
-
Chi-squared score
- commonly used for testing relationships btw categorical variables
- suitable for categorical variables and categorical targets
- constraint: non-negative values, typically booleans, frequencies, or counts
- simply comparing the observed distribution btw various features in the dataset and the target variable
- Python:
from sklearn.feature_selection import chi2
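A minimal sketch pairing `chi2` w/ `SelectKBest` (a common sklearn pattern); the iris data and k=2 are illustrative, and iris qualifies because all measurements are non-negative:

```python
# sketch: keep the k features with the highest chi-squared scores
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)   # one chi-squared score per original feature
```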
-
ANOVA univariate test
- ANOVA = ANalysis Of VAriance
- similar to chi-squared score
- measuring the dependence of two variables
- assumptions
- linear relationship btw variables and the target
- both normally distributed
- suitable for continuous variables and requiring a categorical target
- Python:
from sklearn.feature_selection import f_classif
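The same `SelectKBest` pattern works w/ the ANOVA F-score by swapping in `f_classif` (iris again as an illustrative dataset):

```python
# sketch: univariate ANOVA F-test feature selection
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
```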
-
- using ML models to measure the dependence of two variables
- suitable for all variables
- no assumptions about the distribution
- measuring scenarios
- regression problem: RMSE
- classification problem: ROC-AUC
- procedure
- build a decision tree using a single variable and target
- rank features according to the model RMSE or ROC-AUC
- select the features w/ higher ranking scores
- Python:
from sklearn.tree import DecisionTreeClassifier
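A minimal sketch of the procedure on a binary target; the breast-cancer data, tree depth, and train/test split are illustrative assumptions:

```python
# sketch: rank features by the ROC-AUC of a one-feature decision tree
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for col in X_train.columns:
    # build a decision tree using a single variable and the target
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X_train[[col]], y_train)
    proba = tree.predict_proba(X_test[[col]])[:, 1]
    scores[col] = roc_auc_score(y_test, proba)

# rank features according to their univariate ROC-AUC
print(pd.Series(scores).sort_values(ascending=False).head())
```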
-
Wrapper methods
- disadvantages of filter methods
- tend to ignore the effect of the selected feature subset on the performance of the algorithm
- evaluate features individually
$\to$ some variables useless for prediction in isolation, but quite useful when combined w/ other variables
- Definition
- evaluating a subset of features using a ML algorithm that employs a search strategy to look through the space of possible feature subsets
- evaluating each subset based on the quality of the performance of a given algorithm
- greedy algorithms
- aiming to find the best possible combination of features that result in the best performance model
- computationally expensive
- impractical in the case of exhaustive search
- advantages
- detecting the interaction btw variables
- finding the optimal feature subset for the desired ML algorithm
- usually result in better predictive accuracy than filter methods
- process
- search for a subset of features: using a search method to select a subset of features from the available ones
- build a ML model: choosing a ML algorithm and training it on the previously-selected subset of features
- evaluate model performance: evaluating the newly-trained ML models w/ a chosen metric
- repeat: starting again w/ a new subset of features, a new ML model trained, and so on
- stopping criteria
- defined by ML engineer
- examples of criteria
- model performance increases
- model performance decreases
- predefined number of features reached
- search methods
- forward feature selection: starting w/ no features and adding one at a time
- backward feature selection: starting w/ all features present and removing one feature at a time
- exhaustive feature selection: trying all possible feature combinations
- bidirectional search: both forward and backward feature selection simultaneously to get one unique solution
- library to install: `mlxtend`, containing useful tools for a number of day-to-day data science tasks
-
Forward feature addition method
- a.k.a. step forward selection, sequential forward feature selection (SFS)
- an iterative method
- procedure
- start by evaluating all features individually and then select the one that results in the best performance
- test all possible combinations of the selected feature w/ the remaining features and retain the pair that produces the best algorithmic performance
- a loop continues by adding one feature at a time in each iteration until the pre-set criterion is reached
- `mlxtend.feature_selection.SequentialFeatureSelector()` function
    - `k_features`: the maximum number of features to be reached when starting from 0
    - `floating`: using a variant of the step forward selection called step floating forward selection
    - `scoring`: the scoring function to evaluate model performance
    - `cv`: the number of folds of K-fold cross-validation; no cross-validation if cv=0 or False
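A minimal SFS sketch; the random forest, `k_features=5`, and ROC-AUC scoring are illustrative assumptions:

```python
# sketch: step forward feature selection with mlxtend
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

sfs = SFS(RandomForestClassifier(n_estimators=10, random_state=0),
          k_features=5,      # stop once 5 features are selected
          forward=True,      # add one feature at a time
          floating=False,    # plain SFS, no floating variant
          scoring="roc_auc",
          cv=3)
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)    # indices of the selected features
```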
-
Backward feature elimination method
- a.k.a. step backward selection, sequential backward feature selection (SBS)
- starting w/ all the features in the dataset
- evaluating the performance of the algorithm
- removing one feature at a time
- producing the best performing algorithm using an evaluation metric
- the least significant feature among the remaining available ones
- removing feature after feature until a certain criterion satisfied
- `mlxtend.feature_selection.SequentialFeatureSelector()` function
    - `k_features`: the maximum number of features to be reached when starting from N
    - `forward`: whether the object performs step forward or step backward feature selection
    - `floating`: using a variant of the step backward selection called step floating backward selection
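The same class runs SBS by flipping `forward` to False (estimator and settings again illustrative):

```python
# sketch: step backward feature selection, starting from all features
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
sbs = SFS(RandomForestClassifier(n_estimators=10, random_state=0),
          k_features=5, forward=False, floating=False,
          scoring="roc_auc", cv=3).fit(X, y)
print(sbs.k_feature_idx_)
```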
-
Exhaustive feature selection
- finding the best performing feature subset
- a brute-force evaluation of feature subsets
- creating all the subsets of features from 1 to N
- building a ML algorithm for each subset and selecting the subset w/ the best performance
- parameter 1 and N: the minimum number of features and the maximum number of features
- `mlxtend.feature_selection.ExhaustiveFeatureSelector()` function
    - `min_features`: the lower bound of the number of features to search from
    - `max_features`: the upper bound of the number of features to search from
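A minimal exhaustive-search sketch; logistic regression on iris and the subset-size bounds are illustrative assumptions (w/ 4 features this evaluates all 15 non-empty subsets):

```python
# sketch: brute-force search over every feature subset
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

efs = EFS(LogisticRegression(max_iter=1000),
          min_features=1,    # smallest subset size to try
          max_features=4,    # largest subset size to try
          scoring="accuracy",
          cv=3).fit(X, y)
print(efs.best_idx_, efs.best_score_)
```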
-
- SFS adding features at each iteration
    - a feature useful in the beginning may become non-useful after more are added
    - unable to remove that feature
- SBS removing features at each iteration
    - a feature useless in the beginning may become useful after more are removed
    - unable to add the feature back into the feature subset
- solutions: LRS or sequential floating
-
LRS or Plus-L, Minus-R
- using two parameters L and R (both integers)
- repeatedly adding and removing features from the solution subset
- $L > R$: LRS starting from the empty set of features
    - repeatedly adding L features
    - repeatedly removing R features
- $L < R$: LRS starting from the full set of features
    - repeatedly removing R features
    - repeatedly adding L features
- compensating for the weaknesses of SFS and SBS w/ some backtracking capabilities
- constraint: L and R must be set carefully, w/o a well-known theory to choose the optimal values
-
Sequential floating
- an extension of LRS
- determining the number of features to add or remove from the data directly, rather than fixing L and R
- the size of subset floating up and down by adding and removing features
- context of the floating methods
- step floating forward selection (SFFS): performing backward steps as long as the objective function increases
- step floating backward selection (SFBS): performing forward steps as long as the objective function increases
- algorithm for SFFS
- start from an empty set
- select the best feature and add it to the feature subset, as in SFS
- select the worst feature in the subset
- evaluate whether the objective function improves by deleting that feature
    - delete the feature if improving
    - keep the feature if not improving
- repeat from step 2 until stop criteria reached
- algorithm for SFBS
    - start w/ the full feature set
    - proceed w/ normal SBS
    - add features back when doing so improves the objective function
- Bidirectional Search (BDS)
- applying SFS and SBS concurrently
- SFS: performing from the empty set of features
- SBS: performing from the full set of features
- issue: SFS and SBS may converge to different solutions
- resolved by the constraints
    - features already selected by SFS not removed by SBS
    - features already removed by SBS not added by SFS
-
Embedded methods
- characteristics of wrapper methods:
- a good way to ensure the selected features are best for a specific ML model
- providing better results in terms of performance
- costing a lot of computation time/resources
- definition
- including the feature selection process in ML model training itself
- resulting in even better features for the model in a short amount of time
- complete the feature selection process within the construction of the ML algorithm itself
- performing feature selection during the model training
- advantages
- solving both model training and feature selection simultaneously
- taking into consideration the interaction of features like wrapper methods do
- faster like filter methods
- more accurate than filter methods
- finding the feature subset for the algorithm being trained
- much less prone to overfitting
- process
- train a ML model
- derive feature importance from this model, a measure of how important each feature is when making a prediction
- remove non-important features using the derived feature importance
- method: regularization & tree-based methods
- Regularization
- adding a penalty to the different parameters of a model to reduce its freedom
- penalty applied to the coefficient that multiplies each of the features in the linear model
- advantages
- avoid overfitting
- make the model more robust
- improve its generalization
- main types of regularization for linear models
- Lasso regression / L1 regularization
- shrinking some of the coefficients to zero
- indicating that certain predictors or features are multiplied by zero to estimate the target
- not added to the final prediction of the target
- i.e., features removed due to not contributing to the final prediction
- ridge regression / L2 regularization
- not setting the coefficients to zero, but shrinking them toward zero
- elastic nets / L1/L2 regularization
- a combination of the L1 and L2
- incorporating their penalties and ending up w/ features w/ zero as a coefficient
- Python:
from sklearn.feature_selection import SelectFromModel
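A minimal sketch of L1-based selection w/ `SelectFromModel`; the L1 logistic regression, `C=0.5`, and the breast-cancer data are illustrative assumptions:

```python
# sketch: drop features whose L1-regularized coefficients shrink to zero
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # L1 penalties are scale-sensitive

# Lasso-style logistic regression; zero-coefficient features are removed
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
selector = SelectFromModel(lasso).fit(X, y)
print(selector.get_support().sum(), "features kept")
```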
- Tree-based feature importance
- tree-based algorithms and models
- well-established algorithms
- offering good predictive performance
- able to provide the feature importance as a way to select features
- feature importance
- indicating which variables more important in making accurate predictions on the target variable/class
- identifying which features mostly used by the ML algorithm to predict the target
- random forests:
- providing feature importance w/ straightforward methods
- mean decrease impurity
- mean decrease accuracy
- a group of decision trees
- each decision tree established over a random extraction of samples and features from the dataset
- each individual tree unable to see all the features or access all the observations
- the importance of each feature derived by how "pure" each of the sets is
- impurity: measure based on the optimal condition chosen
- classification: typically either the Gini impurity or information gain/entropy
- regression: variance
- providing feature importance w/ straightforward methods
- training a tree
- feature importance calculated as the decrease in node impurity, weighted by the proportion of samples reaching that node in the tree
- the higher the value, the more important the feature
- alternatives:
- able to use any other tree-based algorithm the same way
- best tree model types: gradient boosting algorithms (like XGBoost, CatBoost, and many more)
- providing accurate feature importance
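A minimal sketch of impurity-based importance from a random forest; the dataset and forest size are illustrative assumptions:

```python
# sketch: rank features by a random forest's impurity-based importance
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# mean decrease in impurity, averaged over all trees in the forest
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())
```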
-
Hybrid methods
- definitions
- combining the different approaches to get the best possible feature subset
- the combination of approaches is up to the engineer
- taking the best advantages from other feature selection methods
- advantages
- high performance and accuracy
- better computational complexity than wrapper methods
- more flexible and robust models against high dimensional data
- process
- depending on what engineers choose to combine
- main priority: select the methods to use, then follow their processes
- Hybrid method: Filter + Wrapper methods
- filter method
- ranking methods, including mutual information and Chi-square score
- ordering features independently of any learning algorithm
- the best features selected from the ranking list
- procedure
- using ranking methods to generate a feature list
- using the top k features from this list to perform wrapper method (SFS or SBS)
- advantages
- reducing the feature space of the dataset using these filter-based rankers
- improving the time complexity of the wrapper methods
- Hybrid method: Embedded & Wrapper methods
- embedded methods
- establishing feature importance
- used to select top features
- procedure:
- embedded methods to get top k features
- performing a wrapper methods search
- methods:
- recursive feature elimination
- recursive feature addition
-
Recursive feature elimination
- procedure
- train a model on all the data features
- possible candidate: tree-based model, lasso, logistic regression, or others offering feature importance
- evaluating its performance on a suitable metric
- derive the feature importance to rank features accordingly
- delete the least important feature and re-train the model on the remaining ones
- use the previous evaluation metric to calculate the performance of the resulting model
- test whether the evaluation metric decreases by more than an arbitrary threshold; if it does, keep the feature, otherwise remove it
- repeat step 3~5 until all features removed
- difference w/ SBS
- SBS: train a model w/ each feature removed in turn to determine which one is the least important
- recursive feature elimination:
- getting this info from the ML model's derived feature importance
- removing one feature per step based on importance, rather than testing the removal of every remaining feature at each step
- faster than pure wrapper methods and better than pure embedded methods
- limitations:
- use an arbitrary threshold value to decide whether to keep a feature or not
- the smaller this threshold value, the more features will be included in the subset and vice versa
- Python:
from sklearn.feature_selection import RFECV
- `RFECV`: recursive feature elimination w/ cross-validation
    - `min_features_to_select`: int (default=1)
      The minimum number of features to be selected. This number of features will always be scored, even if the difference between the original feature count and `min_features_to_select` isn't divisible by `step`.
    - `step`: int or float, optional (default=1)
        - $\ge 1$: corresponds to the (integer) number of features to remove at each iteration
        - $\in (0.0, 1.0)$: corresponds to the percentage (rounded down) of features to remove at each iteration
    - `cv`: int, cross-validation generator or an iterable, optional
      Determines the cross-validation splitting strategy.
        - `None`: using the default 5-fold cross-validation
        - integer: specifying the number of folds
        - CV splitter
            - cross-validation generator: a non-estimator family of classes used to split a dataset into a sequence of train and test portions, by providing `split` and `get_n_splits` methods
            - cross-validation estimator: an estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters
            - scorer: a non-estimator callable object which evaluates an estimator on given test data, returning a number
        - an iterable yielding (train, test) splits as arrays of indices
    - `scoring`: string, callable or None, optional (default=None)
      A string or a scorer callable object / function with signature `scorer(estimator, X, y)`
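A minimal `RFECV` usage sketch; the logistic-regression estimator, ROC-AUC scoring, and dataset are illustrative assumptions:

```python
# sketch: recursive feature elimination with cross-validation
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

rfecv = RFECV(LogisticRegression(max_iter=5000),
              step=1,                    # remove one feature per iteration
              min_features_to_select=5,
              cv=5,
              scoring="roc_auc").fit(X, y)
print(rfecv.n_features_)                 # size of the selected subset
```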
-
Recursive feature addition
- starting w/ no features and adding one feature at the time
- procedure
- train a model on all the data and derive the feature importance to rank it accordingly
- possible models: tree-based model, Lasso, logistic regression, or others offering feature importance
- from the initial model, create another w/ the most important feature and evaluate it w/ an evaluation metric of your choice
- add the next most important feature and use it to re-train the model, along w/ the features kept from the previous step
- use the previous evaluation metric to calculate the performance of the resulting model
- test whether the evaluation metric increases by an arbitrarily-set threshold; keep the added feature if it does
- repeat steps 3~5 until all features have been considered
- Python snippet
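The original snippet is not preserved here, so below is a minimal sketch of the procedure under stated assumptions: a random forest supplies the importance ranking, ROC-AUC is the metric, and `X_train`/`X_test` are pandas DataFrames w/ a binary target:

```python
# sketch: recursive feature addition driven by model-derived importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def recursive_feature_addition(X_train, y_train, X_test, y_test, tol=0.001):
    # 1. rank all features with a model-derived importance
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    ranked = X_train.columns[model.feature_importances_.argsort()[::-1]]

    # 2. start from the single most important feature
    selected = [ranked[0]]
    model.fit(X_train[selected], y_train)
    best_auc = roc_auc_score(y_test, model.predict_proba(X_test[selected])[:, 1])

    # 3~5. add one feature at a time; keep it only if the metric improves
    for feature in ranked[1:]:
        candidate = selected + [feature]
        model.fit(X_train[candidate], y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test[candidate])[:, 1])
        if auc - best_auc > tol:
            selected, best_auc = candidate, auc
    return selected
```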
-
Dimensionality reduction vs. feature selection
- both try to reduce the number of features
- feature selection: select and exclude some features w/o making any transformation
- dimensionality reduction: transform features into a lower dimension
-
Principal component analysis (PCA)
- a dimensionality reduction technique
- using linear algebra to transform a dataset into a compressed form
- starting by calculating the Eigen decomposition (or singular value decomposition, SVD) of the covariance matrix of the features
- procedure
- searching the correlation btw features
- building new features that preserve the same explained variance of the original ones
- resulting in a lower-dimensional projection of the data that preserves the maximal data variance
- measuring the importance of a given variable
- observing how much it is contributing to the reduced feature space that PCA obtains
- feature selection w/ PCA
- calculating the explained variance of each feature
- using it as feature importance to rank variables accordingly (see the sketch below)
- Python:
from sklearn.decomposition import PCA
- alternative: linear discriminant analysis (LDA)
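A minimal sketch of one common way to turn PCA output into a feature ranking: weight each feature's absolute loadings by the explained-variance ratio of the components. This heuristic, the dataset, and `n_components=5` are illustrative assumptions:

```python
# sketch: rank original features by their weight in the principal components
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)    # PCA is scale-sensitive

pca = PCA(n_components=5).fit(X_scaled)

# components_ has shape (n_components, n_features); weight the absolute
# loadings by the variance each component explains
scores = np.abs(pca.components_).T @ pca.explained_variance_ratio_
print(pd.Series(scores, index=X.columns).sort_values(ascending=False).head())
```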
-
Heuristic search algorithms
- attempting to perform optimization tasks in an iterative fashion to find an approximate solution when classical methods fail to find an exact solution
- not always finding the best or even an optimal solution, but finding a good or acceptable solution within a reasonable amount of time and memory space
- performing a heuristic search across feature subsets to find the best one
-
Genetic algorithm (GA)
- global optimization techniques for searching very large spaces
- sort of randomized search
- inspired by the biological mechanisms of natural selection and reproduction
- working throughout populations of possible solutions (generations)
- each solution in the search space represented as a finite length string (chromosome) over some finite set of symbols
- using an objective (or fitness) function to evaluate the suitability of each solution
- feature selection
- each chromosome representing a feature subset
- represented w/ binary encoding: 1 to keep a feature, 0 to eliminate it
- conducting many iterations to create a new generation (new feature subset) of possible solutions from the current generations using a few operators
- operators
- selection:
- probabilistically filtering out solutions that perform poorly
- choosing high-performing solutions to exploit
- cross over:
- the GA way to explore new solutions and exchange info btw strings
- applied to selected pairs of chromosomes randomly
- the probability equal to a given crossover rate
- generating new chromosomes that hopefully will retain good features from the previous generations
- mutation:
- protecting GAs against the irrecoverable loss of good solution features
- changing a symbol of some chromosomes
- changing ratio as a probability equal to a very low given mutation rate to restore lost genetic material
- advantages
- working w/ a population of solutions
- more effective to escape local minima
- procedures
- initializing a population w/ randomly-generated individuals
- different feature subsets
- creating a machine learning algorithm
- evaluating the fitness of each feature subset w/ an evaluation metric of choice depending on the chosen algorithm
- reproducing high-fitness chromosomes (feature subsets) in the new population
- removing poor-fitness chromosomes (selection)
- constructing new chromosomes (crossover)
- recovering lost features (mutation)
- repeating step 2~6 until a stopping criterion met (or the number of iterations)
- open-source implementation: `sklearn-genetic`
- Python:
from genetic_selection import GeneticSelectionCV
- parameters of GeneticSelectionCV
    - estimator: model used to evaluate the suitability of the feature subset, alongside an evaluation metric
    - cv: int, generator, or an iterable used to determine the cross-validation splitting strategy
    - scoring: the evaluation metric
        - the fitness function to evaluate the performance of a chromosome
        - the ML model's performance against a subset of features
    - max_features: determines the maximum number of features selected
    - n_population: the population size for the genetic algorithm, i.e., the number of candidate feature subsets per generation
    - crossover_proba: probability value of a crossover operation for the genetic algorithm
    - mutation_proba: probability value of a mutation operation for the genetic algorithm
    - n_generations: an integer describing the number of generations for the genetic algorithm, i.e., the stopping criterion
    - crossover_independent_proba:
        - the independent probability for each attribute to be exchanged
        - offering much more flexibility for the genetic algorithm to search
    - mutation_independent_proba: the independent probability for each attribute to be mutated by the genetic algorithm
    - n_gen_no_change: the number of generations needed to terminate the search if there is no change in the best individuals
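A minimal usage sketch of `GeneticSelectionCV`; the decision-tree estimator and all numeric settings are illustrative assumptions:

```python
# sketch: GA-driven feature selection with sklearn-genetic
from genetic_selection import GeneticSelectionCV
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

selector = GeneticSelectionCV(
    DecisionTreeClassifier(),   # estimator used to score each chromosome
    cv=5,
    scoring="accuracy",
    max_features=10,
    n_population=50,
    crossover_proba=0.5,
    mutation_proba=0.2,
    n_generations=40,
    n_gen_no_change=10,
)
selector = selector.fit(X, y)
print(selector.support_)        # boolean mask of the selected features
```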
-
Simulated annealing
- a heuristic approach to search for the feature subsets
- a global search algorithm allowing a suboptimal solution to be accepted in the hope that a better solution will show up eventually
- Permutation Importance
- basically randomly shuffling the values of a feature (w/o touching the other variables or the target) to see how this permutation affects the performance metric of the ML model
- the choice of metric is up to the engineer
- measuring the importance of a feature as the increase in the model's prediction error after permuting the feature values
- breaking the relationship btw the feature and the true outcome
- important feature:
- shuffling values increasing the model error
- relying on the feature for the prediction
- non-important feature:
- shuffling values leaving the model error unchanged
- model ignoring the feature for the prediction
- model-agnostic permutation feature importance: a more advanced technique in permutation importance
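A minimal sketch using sklearn's model-agnostic implementation, `sklearn.inspection.permutation_importance`; the dataset, model, and `n_repeats=10` are illustrative assumptions:

```python
# sketch: permutation importance on a held-out test set
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# shuffle each feature n_repeats times and record the drop in the score
result = permutation_importance(model, X_test, y_test,
                                scoring="roc_auc", n_repeats=10,
                                random_state=0)
print(result.importances_mean)   # average importance per feature
```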
-
Deep learning
- involving the use of NN to build high-performing ML models
- NN able to learn nonlinear relationships among features
- most traditional embedded-based methods only exploring linear relationships across features
-
Autoencoders
- able to derive feature importance from NN to help select good feature subsets
- procedure
- learning to compress and encode data
- learning how to reconstruct the data back from the reduced encoded representation
- goal: resulting representation as close as possible to the original input
- taking the feature space and reducing its dimensionality, then reconstructing the inputs from the reduced format
- principle: by reducing the data dimensions
    - learn how to ignore the noise
    - learn which features best help to reconstruct the data
- Autoencoder for Feature Selection (AEFS)
- Autoencoder Inspired Unsupervised Feature Selection
- a solution for feature selection
- uncovering existing nonlinear relationships btw features
- using a single-layer autoencoder to reconstruct data
- after reconstructing the data
- use the first weight matrix of the autoencoder that connects the input feature layer to the reduced layer
- squared weight $w^2 \to 0$: feature contributing little to the representation of others
- significant corresponding weight: important feature in the representation of other features
- limitation: a simple single-layer autoencoder unable to model complex nonlinear feature dependencies
- improvement: Deep Feature Selection using a Teacher-Student Network
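A minimal sketch of the AEFS idea w/ a single-layer Keras autoencoder; the hidden size, training settings, and squared-weight ranking follow the description above, but all concrete values are illustrative assumptions:

```python
# sketch: rank features by squared encoder weights of a 1-layer autoencoder
import numpy as np
import tensorflow as tf
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X).astype("float32")
n_features = X.shape[1]

# single hidden layer compresses the features; the output reconstructs them
inputs = tf.keras.Input(shape=(n_features,))
encoded = tf.keras.layers.Dense(8, activation="relu")(inputs)
decoded = tf.keras.layers.Dense(n_features)(encoded)
autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)

# first weight matrix connects each input feature to the reduced layer;
# features whose squared weights stay near zero contribute little
W = autoencoder.layers[1].get_weights()[0]        # shape (n_features, 8)
importance = (W ** 2).sum(axis=1)
print(np.argsort(importance)[::-1][:5])           # top-5 feature indices
```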