
Feature Rank: ensemble feature ranking for variable selection

Ensemble feature ranking for variable selection in SuperLearner ensembles (Polley et al. 2021), based on Effrosynidis and Arampatzis (2021). Multiple algorithms each estimate a ranking of the strength of the relationship between the predictors and the outcome in the training set, and these rankings are combined into a single ranking via an aggregation method (currently reciprocal ranking). The final ranking can then be cut at a chosen number of variables (e.g. the top 10 predictors, the top 70%, etc.) to create one or more feature selection wrappers for SuperLearner. The result should generally be more robust and stable than feature selection based on a single algorithm. See also Neumann, Genze, and Heider (2017) for a similar method.
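
To make the aggregation step concrete, here is a toy illustration of the reciprocal-rank idea for three features ranked by three hypothetical algorithms. This is only a sketch of the concept; the package's agg_reciprocal_rank() may differ in its exact conventions.

# Toy sketch of reciprocal-rank aggregation (illustration only, not the
# package's internal code). Rows are features, columns are rankings from
# three hypothetical algorithms; 1 = strongest predictor.
ranks = cbind(cor = c(crim = 2, zn = 3, nox = 1),
              rf  = c(crim = 1, zn = 3, nox = 2),
              glm = c(crim = 2, zn = 1, nox = 3))
scores = rowSums(1 / ranks)  # larger = stronger across algorithms
rank(-scores)                # aggregated ranking: crim 1, nox 2, zn 3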

Install

# install.packages("remotes")
remotes::install_github("ck37/featurerank")

Algorithms

Currently implemented algorithms are:

  • Feature ranking: correlation, glm, glmnet, random forest, bart, xgboost + shap, variance (a sketch of a user-supplied ranker follows after this list)
  • Rank aggregation: reciprocal ranking
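
The feature rankers above ship with the package. If you want to supply your own ranker, check the exact fn_rank interface against the package documentation; the sketch below is hypothetical and assumes, based on the example later in this README, that a ranker accepts (Y, X, family, ...) and returns an integer ranking with 1 = strongest predictor.

# Hypothetical custom ranker (illustration only; featrank_abs_t is not part of
# the package, and the assumed (Y, X, family, ...) interface should be verified).
# Ranks predictors by the absolute two-sample t statistic against a binary outcome.
featrank_abs_t = function(Y, X, family, ...) {
  stat = apply(X, 2, function(col) abs(t.test(col ~ Y)$statistic))
  rank(-stat, ties.method = "first")  # 1 = strongest predictor
}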

Example

A minimal example to demonstrate how the package can be used.

Prepare dataset

# TODO: switch to a less problematic demo dataset.
data(Boston, package = "MASS")

# Use "chas" as our outcome variable, which is binary.
y = Boston$chas
x = subset(Boston, select = -chas)
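
As an optional sanity check (not part of the package), note that the outcome is a rare binary indicator, which is one reason this demo dataset is described as problematic above.

dim(x)    # 506 observations, 13 candidate predictors
table(y)  # heavily imbalanced: roughly 35 of 506 tracts border the Charles River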

Create feature ranking library

Specify the feature ranking wrappers for the ensemble library.

library(featurerank)

# Modify RF feature ranker to use 100 trees (faster than default of 500).
featrank_randomForest100 =
  function(...) featrank_randomForest(ntree = 100L, ...)

# Specify the set of feature ranking algorithms.
ensemble_rank_custom =
  function(top_vars, ...)
    ensemble_rank(fn_rank = c(featrank_cor, featrank_randomForest100,
                              featrank_glm, featrank_glmnet),
                              # featrank_shap,   # too verbose currently
                              # featrank_dbarts, # skipped for speed
                  top_vars = top_vars,
                  ...)

# There are 13 predictors in total, so first try dropping just the worst one.
top12 = function(...) ensemble_rank_custom(top_vars = 12, ...)

# Try dropping the worst 2 predictors.
top11 = function(...) ensemble_rank_custom(top_vars = 11, ...)

# Drop the worst 3 predictors.
top10 = function(...) ensemble_rank_custom(top_vars = 10, ...)
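
The wrappers can also be run directly, outside of SuperLearner, to preview which variables would be kept. The call below assumes the screeners follow SuperLearner's convention of returning a logical keep-vector (TRUE = retain the variable).

# Preview a single ensemble ranking at top_vars = 12 (assumes the screener
# returns a logical keep-vector in the usual SuperLearner convention).
set.seed(3)
keep = top12(y, x, family = binomial())
names(x)[!keep]  # the one variable that would be dropped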

Use in SuperLearner

library(SuperLearner)

set.seed(1)
# Takes about 90 seconds with 1 core.
sl = SuperLearner(y, x, family = binomial(),
                  # 10-fold cross-validation stratified on the outcome.
                  cvControl = list(V = 10L, stratifyCV = TRUE),
                  SL.library =
                    list("SL.glm", # Baseline estimator uses all predictors.
                         # Try three ensemble screening options, giving the
                         # screened variable list to logistic regression (SL.glm).
                         c("SL.glm", "top12", "top11", "top10")))

# Review timing.
sl$times$everything
##    user  system elapsed 
##  90.393   0.637  91.407
# We do achieve a modest AUC benefit.
ck37r::auc_table(sl, y = y)[, -6]
##        learner       auc         se  ci_lower  ci_upper
## 1   SL.glm_All 0.7426862 0.02930653 0.6852464 0.8001259
## 2 SL.glm_top12 0.7485151 0.02852544 0.6926062 0.8044239
## 3 SL.glm_top11 0.7535018 0.02760091 0.6994050 0.8075986
## 4 SL.glm_top10 0.7613032 0.02585664 0.7106251 0.8119813
# Which features were dropped? (Dropped features show FALSE below.)
t(sl$whichScreen)
##          All top12 top11 top10
## crim    TRUE  TRUE  TRUE FALSE
## zn      TRUE FALSE FALSE FALSE
## indus   TRUE  TRUE  TRUE  TRUE
## nox     TRUE  TRUE  TRUE  TRUE
## rm      TRUE  TRUE  TRUE  TRUE
## age     TRUE  TRUE  TRUE  TRUE
## dis     TRUE  TRUE  TRUE  TRUE
## rad     TRUE  TRUE  TRUE  TRUE
## tax     TRUE  TRUE  TRUE  TRUE
## ptratio TRUE  TRUE  TRUE  TRUE
## black   TRUE  TRUE FALSE FALSE
## lstat   TRUE  TRUE  TRUE  TRUE
## medv    TRUE  TRUE  TRUE  TRUE
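
Beyond AUC, the fitted SuperLearner object also stores the cross-validated risk and ensemble weight for each learner/screener combination, which is another way to compare the screening options (output omitted; values depend on the seed and folds).

# Cross-validated risk and ensemble weight for each learner/screener pair.
data.frame(cv_risk = sl$cvRisk, weight = sl$coef)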

Assess ranking stability

# Check whether the ranking is stable across multiple runs,
# especially in comparison to individual feature ranking algorithms.
# (See the stability scores in Table 3 of Effrosynidis and Arampatzis 2021.)
set.seed(2)

# Takes about 90 seconds using 1 core.
system.time({
results =
  do.call(rbind.data.frame,
          lapply(1:10,
                 function(i) top12(y, x, family = binomial(),
                                   # The default is 3 replications; more replications increase stability.
                                   replications = 10,
                                   detailed_results = TRUE)$ranking))
})
##    user  system elapsed 
##  90.368   0.648  91.309
names(results) = names(x)
# Stability looks excellent.
results
##    crim zn indus nox rm age dis rad tax ptratio black lstat medv
## 1    11 13     8   5  9  10   6   3   7       4    12     2    1
## 2    11 13     7   4  9  10   8   3   6       5    12     2    1
## 3    11 13     7   4  9  10   6   3   8       5    12     2    1
## 4    11 13     7   4 10   9   6   3   8       5    12     2    1
## 5    11 13    10   4  7   9   6   3   8       5    12     2    1
## 6    11 13     8   4  9   7  10   3   6       5    12     2    1
## 7    11 13     9   5 10   7   6   3   8       4    12     2    1
## 8    11 13     9   4  6  10   7   3   8       5    12     2    1
## 9    11 13    10   4  6   8   7   3   9       5    12     2    1
## 10   11 13     9   4  6   8  10   3   7       5    12     2    1
# What if we treated each iteration as its own ranking and then aggregated?
agg_reciprocal_rank(t(results))
##    crim      zn   indus     nox      rm     age     dis     rad     tax ptratio 
##      11      13       9       4       8      10       6       3       7       5 
##   black   lstat    medv 
##      12       2       1
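
As an informal supplement to eyeballing the table above (the paper uses its own stability scores), the agreement between runs can be summarized with the average pairwise Spearman correlation of the rankings.

# Informal stability summary (not the paper's metric): average Spearman
# correlation between the rankings produced by the 10 runs.
run_cors = cor(t(results), method = "spearman")
mean(run_cors[upper.tri(run_cors)])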

References

Effrosynidis, Dimitrios, and Avi Arampatzis. 2021. “An Evaluation of Feature Selection Methods for Environmental Data.” Ecological Informatics 61: 101224.

Neumann, Ursula, Nikita Genze, and Dominik Heider. 2017. “EFS: An Ensemble Feature Selection Tool Implemented as R-Package and Web-Application.” BioData Mining 10 (1): 1–9.

Polley, Eric, Erin LeDell, Chris J. Kennedy, Sam Lendle, and Mark van der Laan. 2021. “SuperLearner: Super Learner Prediction.” CRAN. https://CRAN.R-project.org/package=SuperLearner.