add post-pruning for decision trees #6557
I can give this a shot :)
The title of this issue is related to #4630.
Issue #941 also seems to propose a prune method for the tree, but the code has changed considerably in the meantime, I suppose.
On the topic of interpretability, there has also been published work on creating single decision trees from random forests. As counter-intuitive as this sounds, due to the non-optimality of standard decision tree algorithms, this can apparently give single decision trees that are both interpretable and better classifiers/regressors than you would otherwise get.
@amueller Could you please explain this issue a little bit? I want to work on this.
Actually, maybe the first thing to do would be to add a stopping criterion based on mutual information...
@lesshaste can you give a citation? (I had been thinking about that recently, damn, obviously it has been done ^^)
@amueller It has been a long time since I looked at this, but: Section 2.2 of On an only slightly related note, I remember looking at http://blog.datadive.net/interpreting-random-forests/ with some interest.
An easy addition would be to stop the construction when
I think this could also benefit some ensemble algorithms like (gradient) boosting, although it typically does not increase predictive accuracy for bagging-style ensembles (e.g. random forests). Pruning might also be useful to save some memory and decrease prediction times a bit.
@amueller / others: @jmschrei and I met to discuss the issue of post-pruning a few weeks ago, and we were unsure of how it would fit into the current scikit-learn API. Generally, post-pruning needs a validation set, but this doesn't seem to fit nicely with how the library is currently organized (namely, issues like the creation / origin of the validation set, and whether this would be an argument to fit or a separate method, come to mind). How were you thinking of this being implemented from an API point of view?
Seems like there was some discussion of API issues in #941, but the tree module has changed quite a bit and perhaps we should have the discussion again. For one, we have parameters like
I would be inclined to set up pruning as a separate method. In fact, a separate function that takes a tree and returns the pruned version would be OK to begin with. This is the approach used in https://github.com/ajtulloch/sklearn-compiledtrees to generate optimized code for trees. @nelson-liu With regard to the regularization options, they sometimes lead to subtrees with all nodes belonging to the same class. While it is not pruning in the sense of regularization, it would be nice to have a function to get rid of these extra nodes, which only add computational cost during prediction.
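That clean-up step can be sketched independently of scikit-learn's internals. The nested-tuple tree representation below is purely illustrative (scikit-learn stores trees as parallel arrays instead):

```python
# Toy tree: a leaf is a class label; an internal node is a tuple
# (feature, threshold, left_subtree, right_subtree).
# collapse() replaces any subtree whose leaves all predict the same
# class with a single leaf, removing splits that add prediction cost
# without changing the decision function.

def leaf_classes(tree):
    """Return the set of class labels reachable from this subtree."""
    if not isinstance(tree, tuple):
        return {tree}
    _, _, left, right = tree
    return leaf_classes(left) | leaf_classes(right)

def collapse(tree):
    """Prune subtrees whose leaves all agree on the predicted class."""
    if not isinstance(tree, tuple):
        return tree
    classes = leaf_classes(tree)
    if len(classes) == 1:
        return next(iter(classes))  # whole subtree predicts one class
    feature, threshold, left, right = tree
    return (feature, threshold, collapse(left), collapse(right))

# Example: the right subtree splits again, but both of its leaves
# predict class 1, so the split is redundant.
tree = (0, 2.5, 0, (1, 7.0, 1, 1))
print(collapse(tree))  # (0, 2.5, 0, 1)
```

A real implementation would do the same walk over the fitted `tree_` arrays, but the recursion above is the whole idea.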
When would you ever get a subtree with all nodes predicting the same class? You wouldn't make a split if there wasn't a gain in your criterion.
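It can happen: a split can reduce Gini impurity even when both children keep the parent's majority class, so there is a criterion gain but no change in prediction. A quick check (the class counts are made up for illustration):

```python
def gini(counts):
    """Gini impurity of a node, given per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Parent: 8 of class A, 2 of class B; split into [5, 0] and [3, 2].
parent = gini([8, 2])                                # 0.32
children = 0.5 * gini([5, 0]) + 0.5 * gini([3, 2])   # 0.24
print(parent, children)
# Impurity drops (0.32 to 0.24), so the split is made, yet both
# children still predict class A: a candidate for post-hoc collapsing.
```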
In essence, what I'd like is a 'warm start' like method for building a tree, like we have a warm start for building a random forest. You should have a method to add a single node to a tree, but still get a valid estimator before and after adding the node. This would allow users to evaluate the tree's performance on the validation set as they build it, just like you can add trees to a random forest, evaluating its performance on the validation set each time. This shouldn't be terribly difficult to add functionally; the biggest issue is the API for this. @ogrisel @glouppe do you have any thoughts?
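No such warm start exists for single trees, but a rough approximation with today's API is to refit with a growing `max_leaf_nodes` budget (best-first growth) and score each size on a held-out set. The dataset and size range below are made up for the sketch:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Refit with an increasing leaf budget; best-first growth means each
# tree is (approximately) the previous tree plus a few more splits.
scores = {}
for n_leaves in range(2, 20):
    clf = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    clf.fit(X_tr, y_tr)
    scores[n_leaves] = clf.score(X_val, y_val)

best = max(scores, key=scores.get)
print(best, scores[best])
```

This refits from scratch each time rather than adding one node to a live estimator, so it is a workaround, not the proposed API.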
@jmschrei
@pixelou Another important thing to consider is whether we want this to be usable with
@nelson-liu Sorry I wasn't clear: I meant that most of the hard work is writing the pruning procedures themselves. Integration shouldn't be too hard (but I'm not the one doing it, obviously :-) ). Off the top of my head, I see several options for integration:
It's up to you to decide, but I think one can write a separate private (class?) method for pruning and make it available in the API as one of the above solutions. Note that I have just given the points above without further thinking ;-). 2 and 4 are clearly not the sklearn way of doing things, and you just mentioned how 3 can be a problem.
The first option of those seems the most consistent with sklearn. |
@amueller @glouppe @GaelVaroquaux any opinion on the API for post-pruning? I see these three options, the first of which seems best to me. Just wanted to get some clarification before I start thinking about working on this:
Yes, the first option is certainly the one that fits best with our API. |
Isn't it pre-pruning if you do it at fit time? Aren't we supposed to check the complexity of the tree and use post-pruning to reduce it?
Start with the third option, then decide whether the others are appropriate... |
+1 for that |
(by which I mean, the example code might help decide) |
It's not pre-pruning if, at fit time, you build the full tree and then go backwards and remove nodes. I agree with @glouppe that the first one is the best option, but I also agree with @jnothman, since the code will rely on a prune_tree method
I assumed that the user would want to choose whether to apply post-pruning based on the built tree...
@amueller oh, I see. |
Look forward to it! |
For some of the ensemble methods this could be worked in naturally. If there is bagging, then the OOB sample could be used for validation and pruning. This would be like how oob_improvement_ is calculated in GradientBoostingClassifier.
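The OOB idea relies on the fact that a bootstrap sample of size n leaves roughly a 1/e fraction (about 37%) of the data out of bag, which is free validation data for pruning. A minimal sketch of where that set comes from (function name is illustrative):

```python
import random

def bootstrap_oob(n_samples, seed=0):
    """Draw a bootstrap sample; return (in-bag indices, OOB indices)."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob

in_bag, oob = bootstrap_oob(1000)
# Roughly 37% of samples are out of bag and could score pruning
# candidates, much like oob_improvement_ scores boosting stages.
print(len(oob) / 1000)
```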
Is anyone working on this right now? |
Not to my knowledge |
Has anybody done it?
Has anyone done it who could share their code?
Cost complexity pruning based on sklearn's DecisionTreeClassifier?
Can someone do cost complexity pruning based on sklearn's DecisionTreeClassifier? If not, I'll ask again; I'm serious.
Found this: Not sure if it is exactly cost complexity pruning? |
@zanderbraam it's not CCP |
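For reference, the core "weakest link" step of minimal cost complexity pruning can be sketched on a toy tree. The dict representation and error numbers below are illustration-only, not scikit-learn's data structures:

```python
# A node is a dict: 'err' is its misclassification error if it were a
# leaf; 'left'/'right' are child nodes (absent for leaves).
# For an internal node t, the effective alpha is
#   g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1),
# i.e. the error increase per leaf removed when collapsing t.

def subtree_stats(node):
    """Return (total leaf error, leaf count) of the subtree."""
    if 'left' not in node:
        return node['err'], 1
    le, ln = subtree_stats(node['left'])
    re, rn = subtree_stats(node['right'])
    return le + re, ln + rn

def weakest_alpha(node):
    """Smallest effective alpha g(t) over internal nodes."""
    if 'left' not in node:
        return float('inf')
    err, leaves = subtree_stats(node)
    g = (node['err'] - err) / (leaves - 1)
    return min(g, weakest_alpha(node['left']), weakest_alpha(node['right']))

def prune(node, alpha):
    """Collapse every internal node whose effective alpha is <= alpha."""
    if 'left' not in node:
        return node
    err, leaves = subtree_stats(node)
    if (node['err'] - err) / (leaves - 1) <= alpha:
        return {'err': node['err']}  # collapse subtree to a leaf
    return {'err': node['err'],
            'left': prune(node['left'], alpha),
            'right': prune(node['right'], alpha)}

leaf = lambda e: {'err': e}
tree = {'err': 10, 'left': leaf(1),
        'right': {'err': 6, 'left': leaf(2), 'right': leaf(3)}}
print(weakest_alpha(tree))         # 1.0: the right split is the weakest link
print(prune(tree, weakest_alpha(tree)))
```

Repeating the prune at successive weakest alphas yields the nested sequence of subtrees that CCP selects among by cross-validation.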
Hi all, an ECP (Error Complexity Pruning) algorithm; here's the link:
@appleyuchi thanks! |
I don't work with DT anymore, but has anyone had a look at https://github.com/beedotkiran/randomforestpruning-ismm-2017 ? It seems relevant to this issue. |
A tree is just a very small forest. Can't this implementation scale down to trees? |
@appleyuchi I'm not sure if I follow what you're saying but we will not adopt an implementation based on going to JSON. Any implementation in scikit-learn would have to work directly with the scikit-learn data structures. |
@appleyuchi thanks for your efforts. Regarding how tricky it may be to implement this feature, we do recognize that touching the tree code base is not necessarily a trivial task. There have been efforts to change that and have a more readable implementation. Besides, this issue isn't labeled as "Easy" for that exact reason. I hope you find other issues that you may be interested in and you keep up the good work on them :)
I am working on this issue with a cost complexity pruning (CCP) algorithm. I see several tests that can be used to check tree pruning:
What other tests would be appropriate for tree pruning? |
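For readers finding this thread later: minimal cost complexity pruning did eventually land in scikit-learn (0.22+) as the `ccp_alpha` parameter and the `cost_complexity_pruning_path` method. A short example (dataset choice is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# cost_complexity_pruning_path returns the effective alphas at which
# subtrees get pruned; the largest alpha collapses the tree to its root.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
trivial = DecisionTreeClassifier(random_state=0,
                                 ccp_alpha=path.ccp_alphas[-1]).fit(X, y)
print(trivial.tree_.node_count)  # 1: pruned all the way to the root
```

In practice one would cross-validate over `path.ccp_alphas` to pick the alpha with the best held-out score.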
@appleyuchi Thanks for sharing! My concern is that, even as a Chinese speaker, it's still pretty hard for me to follow the code. I guess it would be better to have the code more modularized so that others can apply your implementation to any data set.
@zhenyu-zhou The first author of this book has already passed away, so you cannot contact him with questions. The defect of this book is discussed in , which points out that CCP/ECP cross-validation will fail for unbalanced and small datasets; I analyzed and summarized the defect in: https://blog.csdn.net/appleyuchi/article/details/84957220 The GitHub link I provide for CCP/ECP is just "application style", NOT sklearn's "bottom variable style" (the latter would be much more efficient and faster).
It can be applied to the many datasets I have tested; I guess you have not even clicked through and read the instructions in the GitHub link.
You are ABSOLUTELY WRONG. To make it short, I have just one question for you: do you provide a clean API, like sklearn does? If you treat your code as a library like sklearn, you shouldn't expect everyone to read every detail of it before using it, and then blame others for not doing so. That's one of the reasons they rejected the code. Consider it when you are using other libraries. Take sklearn as an example: if it were only a bunch of self-contained experiment code requiring a certain input format, with no general framework, you would need to carefully examine the library code to figure out how to split the main logic out and apply it to your dataset. Would you use it? I acknowledge that the experiment is interesting, but it is not a library. I just wanted to kindly post something I had in mind to improve the code, but it is too hard.
You misunderstand; what I said is about the material, not the code. I mean that Classification and Regression Trees is a famous book, not the code I wrote.
Again, you should read the book carefully before you implement it.
The API style was discussed several months ago, before you were here. PS:
|
I frequently get asked about post-pruning. Often using single trees is important for interpretability, and post-pruning can help both interpretability and generalization performance.
[I'm surprised there is no open issue on this, but I couldn't find one]