Models like GradientBoosting and RandomForest currently do not allow any prior feature importance to be encoded, even though doing so could improve model performance.
Use case: learning from panel data augmented by large embeddings
A current best practice is to reduce the embedding dimensionality with transformations like PCA so that the embedding features do not dominate the model (on small datasets they contribute too much noise), but information is lost. Better would be a learned hyperparameter that controls how frequently these features are considered, in the spirit of max_features.
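For concreteness, a minimal sketch of that workaround (column indices and the PCA size are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# Illustrative column indices: a 1024-d embedding block plus 10 panel features.
embedding_cols = list(range(0, 1024))
other_cols = list(range(1024, 1034))

# Shrink the embedding block so it no longer dominates the feature pool.
reduce_embeddings = ColumnTransformer([
    ("emb", PCA(n_components=32), embedding_cols),
    ("other", "passthrough", other_cols),
])

model = make_pipeline(reduce_embeddings, RandomForestRegressor())
```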
Approach
This could be made possible simply by increasing the probability of selecting these features whenever candidate features are drawn and max_features < n_features.
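A minimal sketch of that selection step, assuming a hypothetical per-feature weight vector (feature_weights is not an existing scikit-learn parameter):

```python
import numpy as np

def sample_split_candidates(n_features, max_features, feature_weights, rng):
    """Draw the features considered at one split, weighted by prior importance."""
    p = np.asarray(feature_weights, dtype=float)
    p /= p.sum()  # normalize weights to selection probabilities
    return rng.choice(n_features, size=max_features, replace=False, p=p)

rng = np.random.default_rng(0)
# Four features; the last is twice as likely to be offered to the splitter.
print(sample_split_candidates(4, 2, [1.0, 1.0, 1.0, 2.0], rng))
```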
For linear models, prior importance of feature subsets can be expressed before fitting by scaling the features; for example, ColumnTransformer allows transformer_weights to be specified.
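This already works today; a minimal example (column indices and the 0.1 weight are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Down-weight the embedding block relative to the panel features; a linear
# model fitted downstream then effectively penalizes those coefficients more.
ct = ColumnTransformer(
    transformers=[
        ("emb", StandardScaler(), list(range(0, 1024))),
        ("other", StandardScaler(), list(range(1024, 1034))),
    ],
    transformer_weights={"emb": 0.1, "other": 1.0},
)
```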
How can we similarly make tree-based models pay more attention to certain features, given that tree-based models like RandomForest do not depend on feature scaling? The simplest way appears to be feature-wise subsampling (non-uniform, weighted selection of the features considered for splitting) during fit, as sketched above.
Implementation outline
Option 1: As noted above, we can already scale features using ColumnTransformer. Extend its usefulness to tree-based models by allowing estimators to infer feature importance from the relative average variance of each feature. Add a boolean parameter (default=False) to enable this behavior, as in the sketch below.
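A sketch of what Option 1's inference step could compute (hypothetical behavior, not an existing option): after ColumnTransformer has rescaled the columns, per-feature selection probabilities would follow the relative variance:

```python
import numpy as np

def variance_based_weights(X):
    """Hypothetical inference step: selection probabilities from relative variance."""
    var = np.var(np.asarray(X, dtype=float), axis=0)
    return var / var.sum()
```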
Option 2: Extend the tree models' parameters by allowing feature_weights to be specified explicitly as a list with one entry per feature. The probability of subsampling each feature for each tree would be proportional to the specified relative feature_weights.
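A hypothetical API sketch for Option 2 (feature_weights does not exist in scikit-learn today; feature counts are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor

n_emb, n_other = 1024, 10
# One weight per feature: down-weight each embedding dimension.
weights = [0.1] * n_emb + [1.0] * n_other

# Hypothetical parameter -- not part of the current scikit-learn API:
# model = RandomForestRegressor(max_features="sqrt", feature_weights=weights)
```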
Integration consideration
To benefit from sklearn's hyperparameter search abilities (e.g. GridSearchCV), one approach would be to let the user specify, as a prior, the different categories of features whose relative category-level importance should be learned during fit, each with a reasonable weight prior over its feature-category group. Categories are already naturally partitioned when ColumnTransformer is used, so an elegant implementation could reuse this information. Note the distinction: the user would merely specify a prior over what count as different feature categories, rather than specifying their importance. A natural prior, for instance, would be "embeddings" (e.g. for a 1024-length feature vector) and "other" (for the traditional panel data also included as part of X), as in the sketch below.
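A sketch of how that could look from the user's side, again assuming the hypothetical feature_weights parameter from Option 2 (category names and grid values are illustrative). Only the category-level weight is searched; the per-feature expansion is mechanical:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

n_emb, n_other = 1024, 10

def category_weights(emb_weight):
    # Expand one category-level weight into the per-feature list of Option 2.
    return [emb_weight] * n_emb + [1.0] * n_other

param_grid = {
    "feature_weights": [category_weights(w) for w in (0.01, 0.1, 1.0)],
}
# Hypothetical once feature_weights exists:
# search = GridSearchCV(RandomForestRegressor(), param_grid, cv=3)
```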