Added BestBinaryCategoricalSplit splitter for DecisionTree. #3649
base: master
Conversation
Force-pushed from e622209 to 3a496d8.
The test
Force-pushed from 3a496d8 to d5f4aa5.
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍
Marking it stale just as I am reviewing it... sorry @mlpack-bot 😄
Hey @nikolay-apanasov, thanks so much for the hard work on this. It took a long time to set aside enough time to review this fully. I think overall everything is correct, but algorithmically I have two major questions:
- Do you have somewhere I can find Breiman's proof of correctness for the categorical case? I don't have a copy of 'Classification and Regression Trees', but maybe it exists in a paper somewhere? If not, maybe I can scare up a copy somewhere.
- I don't see how we are restricting the use of `BestBinaryCategoricalSplit` to L2 loss in the regression case. I haven't yet fully read the Fisher paper, but I will try to do that shortly.
I have a number of other comments, mostly concerning style, simplifications, and in a couple of cases optimizations. The only other major comment is that we should update the documentation in doc/user/methods/ too.
I don't think I got all the style issues, but I tried to add suggestions as I went by.
Overall I think this is great, and it would be wonderful to add this to mlpack. Thanks again for investing the time into working on this. 👍
src/mlpack/methods/decision_tree/best_binary_categorical_split.hpp
src/mlpack/methods/decision_tree/best_binary_categorical_split_impl.hpp
```cpp
    categoryResponse[i] / categoryCounts[i];

uvec sortedCategories = sort_index(categoryResponse);
uvec categoryRank(numCategories);
```
I think this could possibly be simplified into a single Armadillo expression too.
Ok, I will see what I can do.
Let me know if the suggestion above doesn't work, I can try again if you like. 👍
@rcurtin The Armadillo one-liners did not work for me (yet): I am getting some arcane template-related output from the GCC compiler. For the preceding one-liner, I get:
```
error: no match for 'operator[]' (operand types are 'arma::uvec' {aka 'arma::Col<long long unsigned int>'} and 'arma::Col<long long unsigned int>')
   67 | arma::uvec transformedData = categoryRank[arma::conv_to<arma::uvec>::from(data)];
```
followed by a long list of failed substitutions. The Armadillo library and its types are still relatively foreign to me: before I dig in, if it's simple for you, I would definitely appreciate a suggestion here. Then I did not find an elegant way to simplify lines 183-186. If you have anything particular in mind, please let me know.
```cpp
for (size_t i = 0; i < numCategories; i++)
  categoryRank[sortedCategories[i]] = i;

uvec transformedData(n);
```
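For reference, the rank transform under discussion can be sketched in plain C++. This is an illustration of the logic only, not the PR's code: it uses `std::vector` instead of Armadillo types, and the names mirror the snippet above. (As an aside, if a one-line Armadillo version is wanted, `operator[]` only takes a scalar index; indexing a vector by a `uvec` of indices is normally done with the `.elem()` member instead.)

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch: order the categories by their (already computed) mean response,
// then remap each categorical value in the data to its rank in that ordering.
std::vector<size_t> RankTransform(const std::vector<double>& categoryResponse,
                                  const std::vector<size_t>& data)
{
  const size_t numCategories = categoryResponse.size();

  // sortedCategories[i] is the category with the i-th smallest response.
  std::vector<size_t> sortedCategories(numCategories);
  std::iota(sortedCategories.begin(), sortedCategories.end(), 0);
  std::sort(sortedCategories.begin(), sortedCategories.end(),
      [&](size_t a, size_t b)
      { return categoryResponse[a] < categoryResponse[b]; });

  // Invert the permutation: categoryRank[c] is the position of category c.
  std::vector<size_t> categoryRank(numCategories);
  for (size_t i = 0; i < numCategories; ++i)
    categoryRank[sortedCategories[i]] = i;

  // Transform the data pointwise.
  std::vector<size_t> transformedData(data.size());
  for (size_t j = 0; j < data.size(); ++j)
    transformedData[j] = categoryRank[data[j]];
  return transformedData;
}
```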
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here (a single Armadillo expression should be possible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
Also, this is worth addressing: shouldn't we be able to get similar or better performance, at least in the two-class case? Can you find a situation where it does? You could try, for instance, the covertype dataset or the pokerhand dataset, but modify the labels to group the classes such that it's a two-class problem.
@rcurtin Thank you for the detailed review.
I'm only aware of the book. However, do you have Ripley's book Pattern Recognition and Neural Networks? If you do, he provides a different proof of the same claim, namely his proof of Proposition 7.1 in Chapter 7. Hastie et al. mention this theorem in passing in Elements of Statistical Learning, but they don't provide a proof, only a reference.
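The claim being discussed (Breiman's two-class result: after ordering the categories by their proportion in class one, only the K − 1 contiguous splits need to be checked, rather than all 2^(K−1) − 1 subsets) can be demonstrated numerically. The following is a self-contained sketch, not code from this PR; it compares an exhaustive search against the ordered search for the Gini criterion and finds the same optimum.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Weighted Gini impurity of splitting the categories into a left group
// (mask bit set) and a right group, given per-category class counts.
double SplitGini(unsigned mask,
                 const std::vector<double>& c0,
                 const std::vector<double>& c1)
{
  double l0 = 0, l1 = 0, r0 = 0, r1 = 0;
  for (size_t i = 0; i < c0.size(); ++i)
  {
    if (mask & (1u << i)) { l0 += c0[i]; l1 += c1[i]; }
    else                  { r0 += c0[i]; r1 += c1[i]; }
  }
  auto gini = [](double a, double b)
  {
    const double n = a + b;
    return (n == 0) ? 0.0 : 1.0 - (a / n) * (a / n) - (b / n) * (b / n);
  };
  const double n = l0 + l1 + r0 + r1;
  return ((l0 + l1) / n) * gini(l0, l1) + ((r0 + r1) / n) * gini(r0, r1);
}

// Exhaustive search over all 2^K - 2 nonempty, non-trivial subsets.
double BruteForceBest(const std::vector<double>& c0,
                      const std::vector<double>& c1)
{
  const unsigned k = (unsigned) c0.size();
  double best = 1e9;
  for (unsigned mask = 1; mask < (1u << k) - 1; ++mask)
    best = std::min(best, SplitGini(mask, c0, c1));
  return best;
}

// Ordered search: sort categories by class-one proportion, then check
// only the k - 1 contiguous prefix splits.
double OrderedBest(const std::vector<double>& c0,
                   const std::vector<double>& c1)
{
  const size_t k = c0.size();
  std::vector<size_t> order(k);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](size_t a, size_t b)
      { return c1[a] / (c0[a] + c1[a]) < c1[b] / (c0[b] + c1[b]); });
  double best = 1e9;
  unsigned mask = 0;
  for (size_t i = 0; i + 1 < k; ++i)
  {
    mask |= (1u << order[i]);
    best = std::min(best, SplitGini(mask, c0, c1));
  }
  return best;
}
```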
There is only one overload for regression, and if you look at the implementation for the overload, the first statement is a
Ok, sounds good. Any particular suggestion as to content?
No, I don't think that's possible. I'm sure that the implementation could be made more efficient. However, the computational complexity of the

I committed the simple style fixes today, and I should be able to get the rest done by the end of next week. Thanks again for the thoughtful review, and let me know if you think of anything else that I should look at.
Thanks---I don't have the Ripley book but I looked at the discussion in ESL. I don't doubt that the theorem is correct, more I just wanted to make sure that it applies to our situation fully. Based on the discussion in ESL I think it does. 👍 ESL also mentions this paper as an approximation for the multiclass case: https://pages.stat.wisc.edu/~loh/treeprogs/fact/LV88.pdf It could be interesting to look into it someday, but definitely another time, no need to do that now. :)
Just a couple bullet points, I think if you take a look at the existing documentation you'll have an idea. The hope is that the website documentation is short and to the point, and it has been hard for me to omit details (I am a detailed person!) but I think overall it provides a better experience for the users if what we list is something simple, and if they want details they can go to the code, which is well-commented anyway.
Yes, I am now wondering what I was thinking when I wrote the comment.
Ok, sounds good. Then I'll try to get this all wrapped up by the end of the week.
You know, I too saw that comment in ESL. I did not read the paper in detail, though. Before I skimmed the paper, I actually had read the scathing commentary to the article (which immediately follows it in the published journal) by Breiman and Friedman, and to be honest, afterwards I was not motivated to read it in great detail, because I have great respect for those two statisticians. However, as I said, I did skim it, and I did not see the optimization they were referencing for categorical splits in the multi-class setting. Their system uses linear combinations (a form of LDA) to perform splits, and therefore I don't see how that would be applicable, since that is a fundamentally different splitting algorithm. For example, the FACT system that uses LDA for splits is not invariant under univariate monotone transformations. Maybe ESL was suggesting LDA as an approximation algorithm, however I don't see how using LDA to split the node could be seen as an approximation to an optimal partition. Moreover, nowhere in the article do they mention approximation.
Sure, there is really no hurry, take your time. 👍
Ah, ok---thanks for the clarification. That was also 1988, so I suspect in the intervening 36 years there have been other efforts in the same direction that are perhaps more applicable to our setting here. 👍
Most definitely. As I mentioned in the proposal, Chou's algorithm, which uses K-means clustering, is one attractive variant for finding an approximation in the multi-class setting. It's a non-trivial algorithm, and will take some work to implement well, but I think it would be a really cool addition.
Right, I had forgotten you mentioned that, sorry about that. Anyway, yeah, cool for later, but definitely later and not now 😄
Force-pushed from 684d723 to cceb6f3.
Co-authored-by: Ryan Curtin <ryan@ratml.org>
Force-pushed from cceb6f3 to d9ad327.
I noticed that the LARS tests were failing here, so I spent a while investigating and just opened up #3701 to fix that. I'll review the rest of this hopefully in the next couple days---sorry that my LARS dive took longer than expected!
You know, I noticed that the LARS tests were failing too, but I did not get around to actually taking a look, because I'm not familiar with that part of the code base yet. It's great that you got that figured out.
No worries at all. I have a lot on my plate right now with my graduate studies, so take your time.
As described in the proposal issue-884.pdf, I have completed a first implementation of `BestBinaryCategoricalSplit`, including an extended test suite for `DecisionTree` and `DecisionTreeRegressor`. Initial results show that the splitter gives good classification performance, and does better than `AllCategoricalSplit`, albeit by a small margin. For example, on the mushroom dataset, the best accuracy that a `DecisionTree<BestBinaryCategoricalSplit>` achieves (with a depth setting of `maximumDepth > 7`) is 1, whereas `DecisionTree<AllCategoricalSplit>` does no better than ~0.99525. Moreover, `BestBinaryCategoricalSplit` does as well or better at all depth constraints. I would appreciate any suggestions on good datasets for additional benchmarking. All the datasets I have used are low in difficulty, and I have not yet been able to find a categorical dataset that is both large and non-trivial.

Benchmarks on the `MockCategorical`, UCI Mushroom, and UCI Nursery datasets indicate that this new splitter causes the runtime performance to decrease by a factor of three to four on the binary classification task, and even more in the multi-class setting. This is expected: in the binary setting we have to transform the categorical data, ordering the categories by their proportion in class one, which requires a sort. Moreover, in the multi-class setting we have to evaluate an exponential number of category partitions. As in the proposal, the documentation makes it clear that this splitter should not be used when there are multiple classes and many categories.