
[WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification #437

Open
wants to merge 61 commits into master
Conversation

@EvgeniDubov commented Jul 11, 2018

Reference Issue

[sklearn] Feature Request: Hellinger split criterion for classification trees #9947

What does this implement/fix? Explain your changes.

Hellinger distance as a tree split criterion: a Cython implementation compatible with sklearn's tree-based classification models.
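For context, the binary Hellinger split criterion from the HDDT literature scores a candidate split by the Hellinger distance between the class-conditional distributions it induces. A minimal pure-Python sketch (the function name is illustrative, not part of this PR's Cython code):

```python
import math

def hellinger_binary(tpr, fpr):
    """Hellinger distance between the two class distributions induced
    by a candidate binary split.

    tpr: fraction of positive-class samples routed to the left child
    fpr: fraction of negative-class samples routed to the left child
    """
    return math.sqrt(
        (math.sqrt(tpr) - math.sqrt(fpr)) ** 2
        + (math.sqrt(1.0 - tpr) - math.sqrt(1.0 - fpr)) ** 2
    )

# A split that perfectly separates the classes attains the maximum, sqrt(2):
print(hellinger_binary(1.0, 0.0))  # 1.4142135623730951
# A split that routes both classes identically scores zero:
print(hellinger_binary(0.5, 0.5))  # 0.0
```

Unlike Gini or entropy, this score depends only on the class-conditional routing rates, not on the class priors, which is why it is less sensitive to class imbalance.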

Any other comments?

This is my first submission; apologies in advance for anything I've missed.
Looking forward to your feedback.

@pep8speaks commented Jul 11, 2018

Hello @EvgeniDubov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-23 09:29:09 UTC

@codecov bot commented Jul 11, 2018

Codecov Report

Merging #437 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #437   +/-   ##
=======================================
  Coverage   98.83%   98.83%           
=======================================
  Files          86       86           
  Lines        5317     5317           
=======================================
  Hits         5255     5255           
  Misses         62       62


@glemaitre (Member)

Cool, I am looking forward to this contribution. Could you add a quick todo list to the first summary?

@glemaitre (Member)

From what I see:

  • We did not have any Cython up to now, so you will need to set up the project for it. You can refer to https://github.com/jakevdp/cython_template
  • You have to solve the issues raised by PEP 8.
  • I think that we need to improve the example with some narrative documentation.
  • We need a section in the User Guide so that people can find out about this feature and in which cases to use it. It could probably be embedded in the same section as the BalancedBaggingClassifier.
  • You can add a what's new entry as well.

@EvgeniDubov (Author)

I've pushed the code for Cython build support, and all the automatic checks failed.
LGTM failed because my implementation assumes the existence of sklearn/tree/_criterion.pxd, which is missing.
AppVeyor and Travis failed because the Cython package is not installed.
I assume these will also fail on the missing .pxd file once the Cython dependency is resolved.

Please let me know whether there is a way for me to configure these tools, or whether they are administered by the maintainers only.

@glemaitre (Member)

@EvgeniDubov I am getting some time to look at this PR.
I will probably make a force push to have a different building system for the Cython but this is not the most important thing.

I was looking at the literature and the original paper. I did not find a clear statement on how to compute the distance for a multiclass problem, which the trees in scikit-learn do support.

@EvgeniDubov @DrEhrfurchtgebietend do you have a reference for multiclass?

@Gitman-code
I have not done much multi-class classification; I do not even know how it is implemented with the traditional split criteria. Is it possible to set this up to work only for binary classification? Can we release without solving this?

@EvgeniDubov (Author)

@glemaitre indeed sklearn's 'gini' and 'entropy' support multiclass, but Hellinger requires some modification to support it.
Here is a quote from the abstract of a paper on this subject:

In this paper we study the multi-class imbalance problem as it relates to decision trees (specifically C4.4 and HDDT), and develop a new multi-class splitting criterion. From our experiments we show that multi-class Hellinger distance decision trees, when combined with decomposition techniques, outperform C4.4.

I can contribute it as a separate Cython implementation, preferably in a separate PR.
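As an illustration of the "decomposition techniques" the abstract mentions (this is a sketch, not the paper's actual algorithm), a one-vs-rest approach could compute the binary Hellinger distance for each class against the rest and aggregate, for example by taking the maximum:

```python
import math

def hellinger_binary(p_left, q_left):
    # Distance between two binomial distributions, given the fraction
    # of each class routed to the left child.
    return math.sqrt(
        (math.sqrt(p_left) - math.sqrt(q_left)) ** 2
        + (math.sqrt(1.0 - p_left) - math.sqrt(1.0 - q_left)) ** 2
    )

def hellinger_multiclass_ovr(left_counts, right_counts):
    """One-vs-rest aggregation: treat each class c as 'positive' and all
    other classes as 'negative', then take the maximum distance."""
    total = [l + r for l, r in zip(left_counts, right_counts)]
    n_left = sum(left_counts)
    n_total = sum(total)
    best = 0.0
    for c in range(len(total)):
        pos_total = total[c]
        neg_total = n_total - pos_total
        if pos_total == 0 or neg_total == 0:
            continue  # class absent at this node, or node is pure
        tpr = left_counts[c] / pos_total
        fpr = (n_left - left_counts[c]) / neg_total
        best = max(best, hellinger_binary(tpr, fpr))
    return best

# A split that isolates class 0 attains the binary maximum, sqrt(2):
print(hellinger_multiclass_ovr([10, 0, 0], [0, 5, 5]))  # 1.4142135623730951
```

This reduces to the binary criterion for two classes; whether max, mean, or another aggregation matches the paper's Algorithm 1 would need to be checked against the reference.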

@glemaitre (Member)

Oh nice, I had not looked at this paper, only the 2009 and 2011 ones. It seems that I missed it.

I can contribute it as a separate Cython implementation, preferably in a separate PR.

I think that we should have the multi-class criterion directly. The reason is that we don't have a mechanism for raising an error if the criterion is used for a multiclass problem. However, it seems that it is quite feasible to implement the algorithm 1 in the paper that you attached.

Regarding the Cython file, could you take all the Cython setup from glemaitre@27fffea and just paste your criterion file and tests at the right location (+ documentation)? I prefer to stay close to that Cython setup (basically the major change is about the package naming).

@Gitman-code
I am curious whether we need to do something specific for how feature importance will be calculated after this change. There are two questions here. First, does the standard method, the sum of improvements in the criterion, really generalize to all criteria? I think the answer is yes, but even so, it might not be the definition we want. In an imbalanced case we may have imbalanced features (i.e., nearly all the same value) which, if important, would be used high in the tree but not frequently. This would result in a low weight under the current definition. Would the average gain when used, instead of the total gain across all uses, be a better definition? To limit discussion here I put this into a SO post.

@glemaitre (Member) commented Sep 11, 2018

There are three points to consider:

  • The feature importance at a node is normalized by the weighted number of samples at that node.
  • The information gain at the top of the tree will be more important than at the bottom. So a feature used at the top of the tree might be more important than a feature used several times at the bottom, even if its information gain is actually lower.
  • The feature importance computed in the tree is a biased estimator: http://explained.ai/rf-importance/index.html and there is probably nothing to do about that apart from doing a permutation test.
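The permutation test mentioned in the last point can be sketched in a few lines of pure Python (the toy model and function name here are illustrative, not imbalanced-learn API): shuffle one feature column, and measure how much accuracy drops.

```python
import random

def permutation_importance(predict, X, y, feature, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling one feature column.
    `predict` maps a list of rows to a list of predicted labels."""
    rng = random.Random(seed)

    def accuracy(rows):
        preds = predict(rows)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

    baseline = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        rng.shuffle(col)  # break the feature/target association
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, col)]
        drops.append(baseline - accuracy(X_perm))
    return sum(drops) / n_repeats

# Toy model that only looks at feature 0:
predict = lambda rows: [1 if r[0] > 0 else 0 for r in rows]
X = [[1, 5], [-1, 5], [2, -3], [-2, -3]]
y = [1, 0, 1, 0]
print(permutation_importance(predict, X, y, feature=0))  # positive drop
print(permutation_importance(predict, X, y, feature=1))  # 0.0, feature unused
```

Because the score is measured on predictions rather than on split gains, it is not tied to any particular split criterion.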

@Gitman-code
Thanks for the feedback, @glemaitre.

So you agree that different split criteria could be used to calculate the feature importance in general? Intuitively, if a criterion is used to build the tree, it makes sense to use it to define importance.

The weighting of the importance by the number of samples at the node was what got me thinking down this path. Hellinger distance is designed to be less sensitive to the number of samples, but I think that is only a factor in finding the split.

The permutation feature importance is a great method. I see that there are discussions to move it to sklearn.

The purpose of thinking about feature importance in this way is to make sure one does not eliminate features which are unimportant in general but crucial in a few rare outlier cases. When doing feature selection, it is easy to justify dropping such features when looking at aggregate metrics like RMSE, since changes to only a few predictions will alter it by a tiny amount. Permutation feature importance would not be sensitive to this either, or at least only as sensitive as the evaluation metric is to such outliers. Do you know of any standard metric for identifying features of this type? Sorry, this has gotten a little off topic.

…eniDubov/imbalanced-learn into hellinger_distance_criterion

# Conflicts:
#	doc/over_sampling.rst
#	doc/whats_new/v0.0.4.rst
…e_criterion

# Conflicts:
#	.gitignore
#	.travis.yml
#	appveyor.yml
#	imblearn/tensorflow/_generator.py
#	imblearn/tensorflow/tests/test_generator.py
#	imblearn/utils/_validation.py
@EvgeniDubov (Author)

@glemaitre @chkoar I've synced with master and got lint, Travis, and AppVeyor issues, none of which were caused by my contribution. Can you please take a look?

@giladwa1

@glemaitre @chkoar My DS team is using the Hellinger distance split criterion from @EvgeniDubov's private repo. We would appreciate it being part of scikit-learn-contrib. We're willing to help move this PR forward in any way possible.

@chkoar (Member) commented Feb 17, 2020

@giladwa1 I am not familiar with the Hellinger distance yet, but if people are willing to help get this merged, I am OK with it even if it works only for the binary case.

@glemaitre (Member) commented Feb 17, 2020 via email

@chkoar (Member) commented Feb 17, 2020

The issue in imbalanced-learn is that we would be required to code in Cython, which adds a lot of burden to wheel generation, which I would personally like to avoid if possible.

That is very true.

@giladwa1

@glemaitre @chkoar Thanks for the quick reply, I will continue the discussion in the scikit-learn PR scikit-learn/scikit-learn#16478

@chkoar (Member) commented Jul 29, 2020

@glemaitre since this PR was transferred, could we close it?

@Sandy4321

Please clarify: has this been added to the main code, or not?


8 participants