Gini Impurity and Best split #27597

gerryake · 2023-10-16T22:36:34Z

gerryake
Oct 16, 2023

Hi there,

I am currently trying to reimplement a random forest, and my runtime is considerably higher than scikit-learn's. I therefore had a look at the code, but unfortunately I do not understand an important detail: how the Gini impurity is used to find the best split.
Finding the best split means computing the Gini impurity for virtually every possible split, which is time-consuming and therefore costly, and this has to be done at every node of every tree. How exactly does scikit-learn do this so quickly, and where is the trick? Looking forward to an answer.

glemaitre · 2023-10-31T16:52:34Z

glemaitre
Oct 31, 2023
Maintainer

We use Cython and compiled code for this search. If you do it in pure Python, it will be extremely slow. Another trick (not implemented in scikit-learn's random forest) is to bin the features so that only a subset of splits is evaluated, as done in HistGradientBoosting.
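
To make the two ideas above concrete, here is a minimal pure-Python sketch. It is not scikit-learn's code (which is written in Cython), and the function names `gini`, `best_split_one_feature`, and `bin_feature` are hypothetical, chosen only for illustration. It assumes integer class labels in `0..n_classes-1`. The first part shows a common way to avoid recomputing the impurity from scratch for each candidate: sort the feature once, then move samples from the right child to the left child one at a time and update the class counts incrementally. The second part shows a simple quantile binning, so that only bin boundaries remain as candidate thresholds.

```python
from bisect import bisect_right


def gini(counts, n):
    """Gini impurity of a node from its per-class counts and size n."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def best_split_one_feature(x, y, n_classes):
    """Return (threshold, weighted Gini) of the best split on one feature.

    Illustrative sketch: sort once, then update class counts incrementally
    instead of recounting both children for every candidate threshold.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    n = len(xs)

    left = [0] * n_classes   # class counts in the left child (empty at start)
    right = [0] * n_classes  # class counts in the right child (all samples)
    for label in ys:
        right[label] += 1

    best_threshold, best_impurity = None, float("inf")
    for i in range(n - 1):
        # Move sample i from the right child to the left child.
        left[ys[i]] += 1
        right[ys[i]] -= 1
        # A threshold can only sit between two distinct feature values.
        if xs[i] == xs[i + 1]:
            continue
        n_left = i + 1
        n_right = n - n_left
        weighted = (n_left * gini(left, n_left)
                    + n_right * gini(right, n_right)) / n
        if weighted < best_impurity:
            best_impurity = weighted
            best_threshold = (xs[i] + xs[i + 1]) / 2.0
    return best_threshold, best_impurity


def bin_feature(x, n_bins):
    """Map each value to an integer bin index using approximate quantiles."""
    xs = sorted(x)
    n = len(xs)
    edges = [xs[(k * n) // n_bins] for k in range(1, n_bins)]
    return [bisect_right(edges, v) for v in x]


threshold, impurity = best_split_one_feature(
    [10, 1, 11, 2, 12, 3], [1, 0, 1, 0, 1, 0], n_classes=2
)
bins = bin_feature(list(range(100)), n_bins=4)
```

With this scan, one feature costs a sort plus a single pass over the samples. Running the same scan on the output of `bin_feature` leaves at most `n_bins - 1` distinct candidate thresholds per feature, which is the subset-of-splits idea mentioned above. Even so, the Python loop itself remains slow compared with compiled code.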
