
add post-pruning for decision trees #6557

Closed
amueller opened this issue Mar 16, 2016 · 64 comments · Fixed by #12887
Labels
help wanted · Moderate (Anything that requires some knowledge of conventions and best practices) · New Feature

Comments

@amueller
Member

I frequently get asked about post-pruning. Often using single trees is important for interpretability, and post-pruning can help both interpretability and generalization performance.
[I'm surprised there is no open issue on this, but I couldn't find one]

@nelson-liu
Contributor

I can give this a shot :)

@lesshaste

The title of this issue is related to #4630.

@maniteja123
Contributor

Issue #941 also seems to propose a prune method for trees, but I suppose the code has changed considerably in the meantime.

@lesshaste

On the topic of interpretability, there has also been published work on creating single decision trees from random forests. As counter-intuitive as this sounds, due to the non-optimality of standard decision tree algorithms, this can apparently give single decision trees that are both interpretable and better classifiers/regressors than you would get otherwise.

@qmaruf

qmaruf commented Apr 5, 2016

@amueller Could you please explain this issue a little bit? I want to work on this.

@amueller
Member Author

actually maybe the first thing to do would be to add a stopping criterion based on mutual information...

@amueller
Member Author

@lesshaste can you give a citation? (I had been thinking about that recently, damn, obviously it has been done ^^)

@lesshaste

@amueller It has been a long time since I looked at this but:

Section 2.2 of
"Rule Extraction from Random Forest: the RF+HC Methods" by Morteza Mashayekhi and Robin Gras has all the relevant citations I know. Please let me know if you don't have access to this paper.

On an only slightly related note, I remember looking at http://blog.datadive.net/interpreting-random-forests/ with some interest.

@glouppe
Contributor

glouppe commented Jun 6, 2016

actually maybe the first thing to do would be to add a stopping criterion based on mutual information...

An easy addition would be to stop the construction when p(t) i(t, s*) < beta, i.e. when the weighted impurity p(t) i(t, s*) for the best split s* becomes less than some user-defined threshold beta.
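
As a rough sketch of what that rule computes (the names gini, should_stop, and beta below are illustrative helpers, not scikit-learn API), the builder would reject the best split s* at a node t and turn t into a leaf whenever the weighted impurity decrease falls below beta:

```python
import numpy as np

def gini(y):
    """Gini impurity i(t) of the labels reaching a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def should_stop(y_node, y_left, y_right, n_total, beta):
    """True if construction should stop at node t, i.e. if the weighted
    impurity decrease p(t) * i(t, s*) of the best split s* is below beta."""
    p_t = len(y_node) / n_total                 # p(t): fraction of samples reaching t
    w_left = len(y_left) / len(y_node)
    w_right = len(y_right) / len(y_node)
    decrease = gini(y_node) - w_left * gini(y_left) - w_right * gini(y_right)
    return p_t * decrease < beta
```

Note that this is essentially the behaviour of a min_impurity_decrease-style threshold, i.e. still pre-pruning rather than post-pruning.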

@ogrisel
Member

ogrisel commented Jun 6, 2016

I frequently get asked about post-pruning. Often using single trees is important for interpretability, and post-pruning can help both interpretability and generalization performance.

I think this can also be beneficial for some ensemble algorithms like (gradient) boosting, although it typically does not increase predictive accuracy for bagging-style ensembles (e.g. random forests).

Pruning might also be useful to save some memory and decrease prediction times a bit.

@nelson-liu
Contributor

@amueller / others: @jmschrei and I met to discuss the issue of post-pruning a few weeks ago, and we were unsure of how it would fit into the current scikit-learn API. Generally, post-pruning needs a validation set, but this doesn't seem to fit nicely with how the library is currently organized (namely, issues like the creation/origin of the validation set and whether this would be an argument to fit or a separate method come to mind). How were you thinking of this being implemented from an API point of view?
For now, I'm working on looking through the splitter.pyx code and adding the early stopping criterion based on weighted impurity for GSoC while thinking about MAE.

@nelson-liu
Contributor

Seems like there was some discussion about API issues in #941, but the tree module has changed quite a bit and perhaps we should have the discussion again. For one, we have parameters like max_leaf_nodes and max_depth to control growth of the tree a bit.
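
For reference, those growth-control ("pre-pruning") knobs are plain constructor parameters today; a post-pruning option would have to coexist with them:

```python
from sklearn.tree import DecisionTreeClassifier

# Existing pre-pruning controls: the tree simply never grows past these limits.
clf = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=16)
```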

@nlgranger

I would be inclined to set up pruning as a separate method. In fact, a separate function that takes a tree and returns the pruned version would be ok to begin with. This is the approach which is used in https://github.com/ajtulloch/sklearn-compiledtrees to generate optimized code for trees.

@nelson-liu With regard to the regularization options, they sometimes lead to subtrees with all nodes belonging to the same class. While it is not pruning in the sense of regularization, it would be nice to have a function to get rid of these extra nodes that only add computational cost during prediction.
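
As a sketch of that standalone-function idea, the snippet below collapses internal nodes whose children are leaves predicting the same majority class as the parent. It mutates the private tree_ arrays in place, which is an unsupported hack that may break across versions, not a proposed implementation:

```python
from sklearn.tree._tree import TREE_LEAF

def is_leaf(inner_tree, index):
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Prune the children first so the test below sees already-collapsed subtrees.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])
    # If both children are leaves with the same decision as this node,
    # turn this node into a leaf: such a split never changes a prediction.
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
            is_leaf(inner_tree, inner_tree.children_right[index]) and
            decisions[inner_tree.children_left[index]] == decisions[index] and
            decisions[inner_tree.children_right[index]] == decisions[index]):
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF

def prune_duplicate_leaves(clf):
    """Remove subtrees of a fitted classifier whose leaves all agree."""
    decisions = clf.tree_.value.argmax(axis=2).flatten().tolist()  # majority class per node
    prune_index(clf.tree_, decisions)
```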

@jmschrei
Member

When would you ever get a subtree with all nodes predicting the same class? You wouldn't make a split if there wasn't a gain in your criterion.

@jmschrei
Member

jmschrei commented Jun 30, 2016

In essence, what I'd like is a 'warm start'-like method for building a tree, like we have a warm start for building a random forest. You should have a method to add a single node to a tree, but still get a valid estimator before and after adding the node. This would allow users to evaluate the tree's performance on a validation set as they build it, just like you can add trees to a random forest, evaluating its performance on the validation set each time. This shouldn't be terribly difficult to add functionally; the biggest issue is just the API for this. @ogrisel @glouppe do you have any thoughts?
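
Until such an API exists, a runnable approximation of the idea is to refit progressively larger trees (one extra leaf at a time) and track validation accuracy; a true warm start would extend the previous tree in place instead of refitting from scratch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

val_scores = []
for n_leaves in range(2, 20):
    # Each iteration allows one more leaf; a warm-start API would instead
    # add a single node to the existing tree between evaluations.
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    val_scores.append((n_leaves, tree.score(X_val, y_val)))
```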

@nlgranger

@jmschrei
I meant that a split of 90:10 into 80:0 & 10:10 won't change the performance if the tree stops there.

@nelson-liu
Contributor

@pixelou another important thing to consider is whether we want this to be usable with GridSearchCV... a separate method would break that, I think.

@nlgranger

@nelson-liu Sorry I wasn't clear: I meant that most of the hard work is writing the pruning procedures themselves. Integration shouldn't be too hard (but I'm not the one doing it, obviously :-) )

Off the top of my head, I see several options for integration:

  1. options given to the tree constructor are then taken into account by .fit
  2. options are given to .fit directly
  3. a separate .prune or .post_prune method has to be called explicitly
  4. a separate prune_tree or post_prune_tree function takes the tree and returns another pruned tree

It's up to you to decide on this, but I think one can write a separate private (class?)method for pruning and make it available to the API as one of the above solutions.

Note that I have just given the points above without further thinking ;-). 2 and 4 are clearly not the sklearn way of doing things, and you just mentioned how 3 can be a problem.

@jmschrei
Member

jmschrei commented Jul 2, 2016

The first option of those seems the most consistent with sklearn.

@nelson-liu
Contributor

@amueller @glouppe @GaelVaroquaux any opinion on the API for post-pruning? I see these three options, the first of which seems the best to me. Just wanted to get some clarification before I start to think about working on this:

  • options given to the tree constructor are then taken into account by .fit
  • a separate .prune or .post_prune method has to be called after fitting
  • a separate prune_tree or post_prune_tree function takes the tree and returns another pruned tree

@glouppe
Contributor

glouppe commented Aug 9, 2016

Yes, the first option is certainly the one that fits best with our API.
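
For context, here is what the constructor-parameter route looks like from the user's side in the cost complexity pruning implementation that eventually closed this issue (#12887, released later as part of scikit-learn); the parameter name ccp_alpha is taken from that later release:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Option 1: the pruning strength is a constructor parameter, and fit()
# both grows the full tree and prunes it back.
clf = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

# The effective alphas for a dataset can be inspected beforehand.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas, path.impurities)
```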

@raghavrv
Member

raghavrv commented Aug 9, 2016

Isn't it pre-pruning if you do it at fit? Aren't we supposed to check the complexity of the tree and use post-pruning to reduce it?

@jnothman
Member

jnothman commented Aug 9, 2016

Start with the third option, then decide whether the others are appropriate...

@raghavrv
Member

raghavrv commented Aug 9, 2016

+1 for that

@jnothman
Member

jnothman commented Aug 9, 2016

(by which I mean, the example code might help decide)

@jmschrei
Member

jmschrei commented Aug 9, 2016

It's not pre-pruning if you do it at fit, if you build the full tree and then go backwards and remove nodes. I agree with @glouppe that the first one is the best option, but I also agree with @jnothman that since the code will rely on a prune_tree method anyway it may be better to create a standalone thing during the development stages.

@raghavrv
Member

raghavrv commented Aug 9, 2016

It's not pre-pruning if you do it at fit, if you build the full tree and then go backwards and remove nodes.

I assumed that the user would want to decide whether to post-prune or not based on the built tree...

@ccmaymay

@amueller oh, I see.

@feng-1985

Look forward to it!

@Gitman-code

For some of the ensemble methods this could be worked in naturally. If there is bagging, then the OOB sample could be used for validation and pruning. This would be similar to how oob_improvement_ is calculated in GradientBoostingClassifier.
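
A minimal sketch of that idea for a single bagged tree: the samples never drawn into the bootstrap form a free validation set whose score could drive the pruning decision (no official API implied; the bootstrap is drawn by hand here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.RandomState(0)

n = len(X)
boot = rng.randint(0, n, n)              # bootstrap indices (drawn with replacement)
oob = np.setdiff1d(np.arange(n), boot)   # samples never drawn: out-of-bag

tree = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])

# The OOB accuracy is an honest estimate that a post-pruning step could
# maximize without holding out a separate validation set.
print("OOB accuracy:", tree.score(X[oob], y[oob]))
```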

@ghost

ghost commented May 4, 2018

Is anyone working on this right now?

@jnothman
Member

jnothman commented May 6, 2018

Not to my knowledge

@wsy1991

wsy1991 commented Jun 12, 2018

Has anybody done it?

@wsy1991

wsy1991 commented Jun 12, 2018

Has anyone done it and could share their code?

@wsy1991

wsy1991 commented Jun 13, 2018

Cost complexity pruning based on sklearn's DecisionTreeClassifier?

@wsy1991

wsy1991 commented Jun 13, 2018

Can someone do cost complexity pruning based on sklearn's DecisionTreeClassifier()? If not, I'll ask again. I'm serious.

@zanderbraam

Found this:
https://github.com/shenwanxiang/sklearn-post-prune-tree

Not sure if it is exactly cost complexity pruning?
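
For reference, cost complexity pruning in the CART sense (Breiman et al.) selects the subtree T of the fully grown tree that minimizes a penalized risk, so any candidate implementation can be checked against this definition:

```latex
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
```

where R(T) is the total training error of the leaves of T, |T̃| is the number of leaves, and α ≥ 0 trades accuracy against tree size (α = 0 keeps the full tree; larger α prunes more aggressively).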

@appleyuchi

@zanderbraam it's not CCP

@appleyuchi

appleyuchi commented Dec 8, 2018

Hi all,
I have implemented:
the CCP (Cost Complexity Pruning) algorithm on the sklearn CART classification model in Python,

and the ECP (Error Complexity Pruning) algorithm on the sklearn CART regression model in Python.

Here's the link:
https://github.com/appleyuchi/Decision_Tree_Prune
You may like it.

@amueller
Member Author

amueller commented Dec 8, 2018

@appleyuchi thanks!
I find it a bit hard to follow the structure of the code, in particular given that file names and comments are in Chinese. There also seems to be a lot of duplicate code.

@nlgranger

I don't work with DT anymore, but has anyone had a look at https://github.com/beedotkiran/randomforestpruning-ismm-2017 ? It seems relevant to this issue.

@nlgranger

A tree is just a very small forest. Can't this implementation scale down to trees?

@amueller
Member Author

@appleyuchi I'm not sure if I follow what you're saying but we will not adopt an implementation based on going to JSON. Any implementation in scikit-learn would have to work directly with the scikit-learn data structures.

@adrinjalali
Member

@appleyuchi thanks for your efforts. Regarding how tricky it may be to implement this feature, we do recognize that touching the tree code base is not necessarily a trivial task. There have been efforts to change that and have a more readable implementation. Besides, this issue isn't labeled as "Easy" for that exact reason.

I hope you find other issues that you are interested in and keep up the good work on them :)

adrinjalali added the Moderate (Anything that requires some knowledge of conventions and best practices) label on Dec 14, 2018
@thomasjpfan
Member

thomasjpfan commented Dec 23, 2018

I am working on this issue with a cost complexity pruning (CCP) algorithm. I see several tests that can be used to check tree pruning:

  1. Increasing alpha (in CCP) should result in a smaller or equal number of nodes.
  2. Make sure the pruned tree is actually a subtree of the original tree.

What other tests would be appropriate for tree pruning?
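
A sketch of test 1 against a ccp_alpha-style parameter (the name follows the implementation proposed for this issue; test 2 would additionally walk the pruned tree and check that every node appears, with the same split, in the original):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

node_counts = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y).tree_.node_count
    for alpha in [0.0, 0.005, 0.01, 0.05, 0.1]
]

# Test 1: a larger alpha must never yield a larger tree.
assert all(a >= b for a, b in zip(node_counts, node_counts[1:]))
```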

@zhenyu-zhou

@appleyuchi Thanks for sharing! My concern is that, even as a Chinese speaker, it's still pretty hard for me to follow the code. I guess it would be better to have the code more modularized so that others can apply your implementation to any dataset.

@appleyuchi

appleyuchi commented Jul 26, 2019

@zhenyu-zhou
Because almost all of you have NOT read the book《Classification and Regression Trees》carefully.

The first author of this book has already passed away, so you cannot contact him with questions.

The defects of this book are discussed in
http://www.dcc.fc.up.pt/~ltorgo/PhD/th4.pdf
or
http://www.doc88.com/p-6445227043649.html

which point out that the cross-validation version of the CCP/ECP algorithm will fail for unbalanced and small datasets; you should understand the above academic material before you implement it.

I analyzed and summarized the defect in:

https://blog.csdn.net/appleyuchi/article/details/84957220

The GitHub link I provide for CCP/ECP is just "application style", NOT built on sklearn's "bottom variable style" (the latter would be much more efficient and faster), so they rejected it.

even as a Chinese speaker, it's still pretty hard for me to follow the code. I guess it would be better to have the code more modularized so that others can apply your implementation to any dataset.

It can be applied to many datasets; I have tested it. I guess you have not even clicked through and read the instructions in the GitHub link.

@zhenyu-zhou

@appleyuchi

But it can be applied to any dataset you want; I'm sure you have NOT even clicked through and read the instructions in the link.

You are ABSOLUTELY WRONG. To make it short, I have just one question for you: do you provide a clean API, like sklearn does?

If you treat your code as a library like sklearn, you shouldn't expect everyone to read every detail of it before using it, and then blame others for not doing so. That's one of the reasons they rejected the code. Consider how it feels when you use other libraries. Take sklearn as an example: imagine it were only a bunch of self-contained experiment code that required a certain input format, with no general framework, so that you had to carefully examine the library code to figure out how to split the main logic out and apply it to your dataset. Would you use it? I acknowledge that the experiment is interesting, but it is not a library. I just wanted to kindly post some suggestions to improve the code, but it is too hard.

@appleyuchi

appleyuchi commented Jul 26, 2019

@zhenyu-zhou

you shouldn't expect everyone to read every detail of it before using it, and then blame others for not doing so

You misunderstand: what I said refers to the material, not the code. Note that it refers to a book, NOT the code I have written.

I mean that《Classification and Regression Trees》is a famous book, not the code I wrote, so that is not blame. And "you" refers to new contributors, NOT the existing members of sklearn.

I just wanted to kindly post some suggestions to improve the code, but it is too hard.

Again, you should read the book carefully before you implement it.

@zhenyu-zhou

imagine it were only a bunch of self-contained experiment code that required a certain input format, with no general framework, so that you had to carefully examine the library code to figure out how to split the main logic out and apply it to your dataset.

The API style was discussed several months ago, before you were here. Again, what I have implemented is "application style", NOT built on "sklearn's bottom data structure".

Good luck.

PS:
Notifications from this issue have been cancelled because I'm busy.
@-mentions will NOT reach me any longer.
If you have any questions, please contact me via email.

@arcadiahero

@amueller @glouppe @GaelVaroquaux any opinion on the API for post-pruning? I see these three options, the first of which seems the best to me. Just wanted to get some clarification before I start to think about working on this:

  • options given to the tree constructor are then taken into account by .fit
  • a separate .prune or .post_prune method has to be called after fitting
  • a separate prune_tree or post_prune_tree function takes the tree and returns another pruned tree

Do you have any code for this?
Thanks
