
Added BestBinaryCategoricalSplit splitter for DecisionTree. #3649

Open
wants to merge 3 commits into base: master
Conversation

@nikolay-apanasov commented Mar 4, 2024

As described in the proposal issue-884.pdf, I have completed a first implementation of BestBinaryCategoricalSplit, including an extended test suite for DecisionTree and DecisionTreeRegressor. Initial results show that the splitter achieves good classification performance and does better than AllCategoricalSplit, albeit by a small margin. For example, on the mushroom dataset, the best accuracy that a DecisionTree<BestBinaryCategoricalSplit> achieves (with a depth setting of maximumDepth > 7) is 100%, whereas DecisionTree<AllCategoricalSplit> does no better than ~99.525%. Moreover, BestBinaryCategoricalSplit does as well or better at all depth constraints. I would appreciate any suggestions on good datasets for additional benchmarking: all the datasets I have used are low in difficulty, and I have not yet been able to find a categorical dataset that is both large and non-trivial.

Benchmarks on the MockCategorical, UCI Mushroom and UCI Nursery datasets indicate that this new splitter causes the runtime performance to decrease by a factor of three to four on the binary classification task, and even more in the multi-class setting. This is expected, because in the binary setting we have to transform the categorical data, ordering the categories by their proportion in class one, which requires a sort. Moreover, in the multi-class setting, we have to evaluate an exponential number of category partitions. As in the proposal, the documentation makes it clear that this splitter should not be used when there are multiple classes and many categories.
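For readers unfamiliar with the trick being referenced, here is a minimal, self-contained sketch of the binary-case ordering step (toy data and illustrative names, not the PR's actual code):

```cpp
#include <armadillo>

int main()
{
  // Toy data: categorical feature values in {0, 1, 2} and binary labels.
  arma::uvec data   = {0, 0, 1, 1, 2, 2, 2};
  arma::uvec labels = {0, 1, 1, 1, 0, 0, 1};
  const size_t numCategories = 3;

  // Proportion of class-one samples within each category.
  arma::vec classOneProportion(numCategories, arma::fill::zeros);
  arma::vec categoryCounts(numCategories, arma::fill::zeros);
  for (size_t i = 0; i < data.n_elem; ++i)
  {
    categoryCounts(data(i)) += 1.0;
    classOneProportion(data(i)) += (labels(i) == 1) ? 1.0 : 0.0;
  }
  classOneProportion /= categoryCounts;  // element-wise division

  // Order the categories by that proportion.  By the CART result, the
  // optimal binary partition splits this ordering at one of k - 1 points,
  // so only k - 1 candidate splits need to be evaluated instead of all
  // 2^(k-1) - 1 nontrivial partitions.
  arma::uvec sortedCategories = arma::sort_index(classOneProportion);
  sortedCategories.print("categories ordered by P(class = 1):");
  return 0;
}
```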

@nikolay-apanasov (Author) commented Mar 4, 2024

The test BestBinaryCategoricalSplitRegressionTwoPerfectTest in the DecisionTreeRegressorTest suite was failing on the Linux builds. However, on my (Debian 12.0) system, the tests all passed, using different random seeds over a thousand consecutive runs. I looked at the test, and it seemed right: the split should produce two pure nodes, one with all the samples having response zero and the other with all the samples having response two, in which case the MSE gain should be (approximately) zero. Then I saw the Azure output and noticed that the build was running with Armadillo version 9.8. The remark from the documentation about initializing to zero flashed through my mind: I went and checked, and sure enough, the vector was not being initialized to zero, although it was on my system, which runs Armadillo 11.4. This was a fortunate problem, though, because the code was actually incorrect: I was using an arma::uvec to hold the mean of the response values, rather than an arma::vec.
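For illustration, a minimal standalone sketch of the two pitfalls described above (toy code, not the PR's implementation):

```cpp
#include <armadillo>

int main()
{
  const size_t numCategories = 3;

  // Pitfall 1: an unsigned integer vector silently truncates fractional
  // means; a mean response of 1.9 is stored as 1.
  arma::uvec wrongMeans(numCategories, arma::fill::zeros);
  wrongMeans(0) = 1.9;  // truncated to 1

  // Pitfall 2: whether a plain constructor zero-initializes the elements
  // depends on the Armadillo version (it did not on 9.8, but did on 11.4),
  // so request zeros explicitly before accumulating into the vector.
  arma::vec categoryMeans(numCategories, arma::fill::zeros);
  categoryMeans(0) += 1.9;

  wrongMeans.print("wrongMeans:");        // 1 0 0
  categoryMeans.print("categoryMeans:");  // 1.9 0 0
  return 0;
}
```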

@nikolay-apanasov changed the title from "Completed first implementation of {BestBinaryCategoricalSplit}." to "Added BestBinaryCategoricalSplit splitter for DecisionTree." on Mar 4, 2024
@mlpack-bot (bot) commented Apr 3, 2024

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@rcurtin (Member) commented Apr 3, 2024

Marking it stale just as I am reviewing it... sorry @mlpack-bot 😄

@rcurtin (Member) left a comment:

Hey @nikolay-apanasov, thanks so much for the hard work on this. It took me a while to set aside enough time to review this fully. I think overall everything is correct, but algorithmically I have two major questions:

  • Do you have somewhere I can find Breiman's proof of correctness for the categorical case? I don't have a copy of 'Classification and Regression Trees', but maybe it exists in a paper somewhere? If not, maybe I can scare up a copy somewhere.
  • I don't see how we are restricting the use of BestBinaryCategoricalSplit to L2 loss in the regression case. I haven't yet fully read the Fisher paper, but I will try to do that shortly.

I have a number of other comments mostly concerning style, simplifications, and in a couple of cases optimizations. The only other major comment is that we should update the documentation in doc/user/methods/ too.

I don't think I got all the style issues, but I tried to add suggestions as I went by.

Overall I think this is great, and it would be wonderful to add this to mlpack. Thanks again for investing the time into working on this. 👍

```cpp
categoryResponse[i]/categoryCounts[i];

uvec sortedCategories = sort_index(categoryResponse);
uvec categoryRank(numCategories);
```
@rcurtin (Member):
I think this could possibly be simplified into a single Armadillo expression too.

@nikolay-apanasov (Author):
Ok, I will see what I can do.

@rcurtin (Member):
Let me know if the suggestion above doesn't work, I can try again if you like. 👍

@nikolay-apanasov (Author):
@rcurtin The Armadillo one-liners did not work for me (yet): I am getting some arcane template-related output from the GCC compiler. For the preceding one-liner, I get:

```
error: no match for ‘operator[]’ (operand types are ‘arma::uvec’ {aka ‘arma::Col<long long unsigned int>’} and ‘arma::Col<long long unsigned int>’)
   67 |       arma::uvec transformedData = categoryRank[arma::conv_to<arma::uvec>::from(data)];
```

followed by a long list of failed substitutions. The Armadillo library and its types are still relatively foreign to me: before I dig in, if it's simple for you, I would definitely appreciate a suggestion here. I also did not find an elegant way to simplify lines 183-186; if you have anything particular in mind, please let me know.
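For reference, Armadillo exposes element selection by an index vector through .elem() (or operator() with a uvec) rather than operator[], which is what the error above is complaining about. A self-contained toy sketch, not taken from this PR:

```cpp
#include <armadillo>

int main()
{
  // Toy stand-ins for the PR's variables.
  arma::vec data = {2, 0, 1, 2};        // categorical values stored as doubles
  arma::uvec categoryRank = {1, 2, 0};  // rank assigned to each category

  // operator[] takes a single index; selection by an index vector goes
  // through .elem() (or operator() with a uvec):
  arma::uvec transformedData =
      categoryRank.elem(arma::conv_to<arma::uvec>::from(data));

  transformedData.print("transformedData:");  // 0 1 2 0
  return 0;
}
```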

```cpp
for (size_t i = 0; i < numCategories; i++)
  categoryRank[sortedCategories[i]] = i;

uvec transformedData(n);
```
@rcurtin (Member):
Same here (a single Armadillo expression should be possible).

@nikolay-apanasov (Author):
Ok.
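For reference, one candidate single-expression replacement for the loop above is a scatter-assignment through .elem(); a toy sketch, not taken from this PR:

```cpp
#include <armadillo>

int main()
{
  // Toy permutation standing in for the PR's sortedCategories.
  arma::uvec sortedCategories = {2, 0, 1};
  const size_t numCategories = sortedCategories.n_elem;

  // Scatter-assignment replaces the explicit loop
  //   for (i = 0 .. k-1) categoryRank[sortedCategories[i]] = i;
  arma::uvec categoryRank(numCategories);
  categoryRank.elem(sortedCategories) =
      arma::regspace<arma::uvec>(0, numCategories - 1);

  categoryRank.print("categoryRank:");  // 1 2 0
  return 0;
}
```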

@rcurtin (Member) commented Apr 3, 2024

> Benchmarks on the MockCategorical, UCI Mushroom and UCI Nursery datasets indicate that this new splitter causes the runtime performance to decrease by a factor of three to four on the binary classification task, and even more in the multi-class setting.

Also, this is worth addressing: shouldn't we be able to get similar or better performance at least in the two-class case? Can you find a situation where it does? You could try, for instance, the covertype dataset or the pokerhand dataset, but modify the labels to group the classes such that it's a two-class problem.

@nikolay-apanasov (Author) commented Apr 4, 2024

@rcurtin Thank you for the detailed review.

> Do you have somewhere I can find Breiman's proof of correctness for the categorical case?

I'm only aware of the book. However, do you have Ripley's book Pattern Recognition and Neural Networks? If you do, he provides a different proof of the same claim, namely his proof of Proposition 7.1 in Chapter 7. Hastie et al. mention this theorem in passing in Elements of Statistical Learning, but they don't provide a proof, only a reference.

> I don't see how we are restricting the use of BestBinaryCategoricalSplit to L2 loss in the regression case. I haven't yet fully read the Fisher paper, but I will try to do that shortly.

There is only one overload for regression, and if you look at the implementation for the overload, the first statement is a static_assert that checks that the FitnessFunction is MSEGain.
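To make that concrete, here is a minimal sketch of such a compile-time guard, using stand-in types rather than mlpack's real signatures:

```cpp
#include <type_traits>

struct MSEGain {};  // stand-ins for mlpack's fitness functions
struct MADGain {};

// Simplified stand-in for the regression overload; the real method has a
// different (much longer) signature.
template<typename FitnessFunction>
double SplitIfBetter(/* ... regression arguments ... */)
{
  static_assert(std::is_same<FitnessFunction, MSEGain>::value,
      "BestBinaryCategoricalSplit is only valid with MSEGain (L2 loss) "
      "in the regression setting.");
  return 0.0;  // placeholder for the computed gain
}

int main()
{
  SplitIfBetter<MSEGain>();    // compiles
  // SplitIfBetter<MADGain>(); // would fail to compile with the message above
  return 0;
}
```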

> The only other major comment is that we should update the documentation in doc/user/methods/ too.

Ok, sounds good. Any particular suggestion as to content?

> Also, this is worth addressing: shouldn't we be able to get similar or better performance at least in the two-class case? Can you find a situation where it does? You could try, for instance, the covertype dataset or the pokerhand dataset, but modify the labels to group the classes such that it's a two-class problem.

No, I don't think that's possible. I'm sure that the implementation could be made more efficient. However, the computational complexity of the AllCategoricalSplit algorithm is O(n), and as I mentioned above, in the binary setting we have to transform the categorical data, ordering the categories by their proportion in class one, which requires a sort. That makes BestBinaryCategoricalSplit an O(n lg n) algorithm in the two-class setting, and I don't know how you could order the categories, as the algorithm requires, without a sort.

I committed the simple style fixes today, and I should be able to get the rest done by the end of next week. Thanks again for the thoughtful review, and let me know if you think of anything else that I should look at.

@rcurtin (Member) commented Apr 8, 2024

> I'm only aware of the book. However, do you have Ripley's book Pattern Recognition and Neural Networks? If you do, he provides a different proof of the same claim, namely his proof of Proposition 7.1 in Chapter 7. Hastie et al. mention this theorem in passing in Elements of Statistical Learning, but they don't provide a proof, only a reference.

Thanks! I don't have the Ripley book, but I looked at the discussion in ESL. I don't doubt that the theorem is correct; I just wanted to make sure that it applies fully to our situation. Based on the discussion in ESL, I think it does. 👍

ESL also mentions this paper as an approximation for the multiclass case: https://pages.stat.wisc.edu/~loh/treeprogs/fact/LV88.pdf

It could be interesting to look into it someday, but definitely another time, no need to do that now. :)

> Ok, sounds good. Any particular suggestion as to content?

Just a couple of bullet points; I think if you take a look at the existing documentation you'll get the idea. The hope is that the website documentation is short and to the point. It has been hard for me to omit details (I am a detail-oriented person!), but I think overall it provides a better experience for the users if what we list is something simple; if they want details, they can go to the code, which is well-commented anyway.

> No, I don't think that's possible. I'm sure that the implementation could be made more efficient. However, the computational complexity of the AllCategoricalSplit algorithm is O(n), and as I mentioned above, in the binary setting we have to transform the categorical data, ordering the categories by their proportion in class one, which requires a sort. That makes BestBinaryCategoricalSplit an O(n lg n) algorithm in the two-class setting, and I don't know how you could order the categories, as the algorithm requires, without a sort.

Yes, I am now wondering what I was thinking when I wrote the comment. BestBinaryCategoricalSplit can't reasonably be expected to be anywhere near as fast as AllCategoricalSplit (which doesn't really even need to do any work...). So, all good from my end.

@nikolay-apanasov (Author):

> Just a couple of bullet points; I think if you take a look at the existing documentation you'll get the idea. The hope is that the website documentation is short and to the point. It has been hard for me to omit details (I am a detail-oriented person!), but I think overall it provides a better experience for the users if what we list is something simple; if they want details, they can go to the code, which is well-commented anyway.

Ok, sounds good. Then I'll try to get this all wrapped up by the end of the week.

> ESL also mentions this paper as an approximation for the multiclass case

You know, I too saw that comment in ESL, though I did not read the paper in detail. Before I skimmed the paper, I had read the scathing commentary on the article (which immediately follows it in the published journal) by Breiman and Friedman, and to be honest, afterwards I was not motivated to read it in great detail, because I have great respect for those two statisticians. However, as I said, I did skim it, and I did not see the optimization they were referencing for categorical splits in the multi-class setting. Their system uses linear combinations (a form of LDA) to perform splits, so I don't see how that would be applicable: it is a fundamentally different splitting algorithm. For example, the FACT system that uses LDA for splits is not invariant under univariate monotone transformations. Maybe ESL was suggesting LDA as an approximation algorithm, but I don't see how using LDA to split the node could be seen as an approximation to an optimal partition. Moreover, nowhere in the article do they mention approximation.

@rcurtin (Member) commented Apr 9, 2024

> Ok, sounds good. Then I'll try to get this all wrapped up by the end of the week.

Sure, there is really no hurry, take your time. 👍

> You know, I too saw that comment in ESL, though I did not read the paper in detail. Before I skimmed the paper, I had read the scathing commentary on the article (which immediately follows it in the published journal) by Breiman and Friedman, and to be honest, afterwards I was not motivated to read it in great detail, because I have great respect for those two statisticians. However, as I said, I did skim it, and I did not see the optimization they were referencing for categorical splits in the multi-class setting. Their system uses linear combinations (a form of LDA) to perform splits, so I don't see how that would be applicable: it is a fundamentally different splitting algorithm. For example, the FACT system that uses LDA for splits is not invariant under univariate monotone transformations. Maybe ESL was suggesting LDA as an approximation algorithm, but I don't see how using LDA to split the node could be seen as an approximation to an optimal partition. Moreover, nowhere in the article do they mention approximation.

Ah, ok, thanks for the clarification. That was also 1988, so I suspect in the intervening 36 years there have been other efforts in the same direction that are perhaps more applicable to our setting here. 👍

@nikolay-apanasov (Author) commented Apr 10, 2024

> That was also 1988, so I suspect in the intervening 36 years there have been other efforts in the same direction that are perhaps more applicable to our setting here.

Most definitely. As I mentioned in the proposal, Chou's algorithm, which uses K-means clustering, is one attractive variant for finding an approximation in the multi-class setting. It's a non-trivial algorithm, and will take some work to implement well, but I think it would be a really cool addition.

@rcurtin (Member) commented Apr 10, 2024

Right, I had forgotten you mentioned that, sorry about that. Anyway, yeah, cool for later, but definitely later and not now 😄

@rcurtin (Member) commented May 8, 2024

I noticed that the LARS tests were failing here, so I spent a while investigating and just opened up #3701 to fix that. I'll review the rest of this hopefully in the next couple of days. Sorry that my LARS dive took longer than expected!

@nikolay-apanasov (Author):

> I noticed that the LARS tests were failing here, so I spent a while investigating and just opened up #3701 to fix that.

You know, I noticed that the LARS tests were failing too, but I did not get around to actually taking a look, because I'm not familiar with that part of the code base yet. It's great that you got that figured out.

> I'll review the rest of this hopefully in the next couple of days. Sorry that my LARS dive took longer than expected!

No worries at all. I have a lot on my plate right now with my graduate studies, so take your time.
