Why does sklearn tree model "split" feature importance within identical features? #19569

MaxwellLZH · 2021-02-26T06:38:43Z

MaxwellLZH
Feb 26, 2021
Collaborator

When the dataset have identical features, to me it makes more sense that the tree model just pick one of it, so the rest of them will have a feature importance of 0 (like XGBoost does).

But sklearn tree model seems to be "splitting" feature importance between those identical features, could anyone help explain why sklearn does it this way? Thank you !

example code:

import pandas as pd
import numpy as np 
import random

X = pd.DataFrame({
    'x1': np.arange(1000),
    'x2': np.arange(1000)
})
y = np.array([int(i * random.random() > 250) for i in X['x1']])

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

TREE = DecisionTreeClassifier(max_depth=4, max_features=X.shape[1], random_state=1025)
TREE.fit(X, y)
print('Decision Tree:', TREE.feature_importances_)

RF = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=1025)
RF.fit(X, y)
print('Random Forest:', RF.feature_importances_)

XGB = XGBClassifier(max_depth=None, subsample=1.0, random_state=1024)
XGB.fit(X, y)
print('XGBoost:', XGB.feature_importances_)

# Decision Tree: [0.12664008 0.87335992]
# Random Forest: [0.71095324 0.28904676]
# XGBoost: [1. 0.]

NicolasHug · 2021-02-26T11:54:02Z

NicolasHug
Feb 26, 2021
Maintainer

It probably comes from the fact that when 2 splits have the same gain, the one that gets selected is arbitrary.

The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.

It'd be good have a deterministic rule here, like select the feature with the lowest index.

1 reply

MaxwellLZH Feb 27, 2021
Collaborator Author

Also if we have more identical features (happens sometimes when using automated feature engineering tools), the importance score gets lower and lower for those features even if they're very useful.

I think having a deterministic rule like you said will be very helpful, Shall I open an issue for this?

jnothman · 2021-02-27T11:54:41Z

jnothman
Feb 27, 2021
Maintainer

Does xgboost have specific logic to deal with this? Presumably the same should apply to features that have identical order (as in rank transform), as these too will have equal importance with trees.

1 reply

MaxwellLZH Mar 1, 2021
Collaborator Author

yes, features with same order also have this issue.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why does sklearn tree model "split" feature importance within identical features? #19569

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Why does sklearn tree model "split" feature importance within identical features? #19569

MaxwellLZH Feb 26, 2021 Collaborator

Replies: 2 comments · 2 replies

NicolasHug Feb 26, 2021 Maintainer

MaxwellLZH Feb 27, 2021 Collaborator Author

jnothman Feb 27, 2021 Maintainer

MaxwellLZH Mar 1, 2021 Collaborator Author

MaxwellLZH
Feb 26, 2021
Collaborator

Replies: 2 comments 2 replies

NicolasHug
Feb 26, 2021
Maintainer

MaxwellLZH Feb 27, 2021
Collaborator Author

jnothman
Feb 27, 2021
Maintainer

MaxwellLZH Mar 1, 2021
Collaborator Author