
Added BestBinaryCategoricalSplit splitter for DecisionTree. #3649

Open
wants to merge 3 commits into base: master
Conversation

@nikolay-apanasov commented Mar 4, 2024

As described in the proposal issue-884.pdf, I have completed a first implementation of BestBinaryCategoricalSplit, including an extended test suite for DecisionTree and DecisionTreeRegressor. Initial results show that the splitter achieves good classification performance and does better than AllCategoricalSplit, albeit by a small margin. For example, on the mushroom dataset, the best accuracy that a DecisionTree<BestBinaryCategoricalSplit> achieves (with a depth setting of maximumDepth > 7) is 100%, whereas DecisionTree<AllCategoricalSplit> does no better than ~99.525%. Moreover, BestBinaryCategoricalSplit does as well or better at all depth constraints. I would appreciate any suggestions on good datasets for additional benchmarking: all the datasets I have used are low in difficulty, and I have not yet been able to find a categorical dataset that is both large and non-trivial.

Benchmarks on the MockCategorical, UCI Mushroom and UCI Nursery datasets indicate that this new splitter causes the runtime performance to decrease by a factor of three to four on the binary classification task, and even more in the multi-class setting. This is expected, because in the binary setting we have to transform the categorical data, ordering the categories by their proportion in class one, which requires a sort. Moreover, in the multi-class setting, we have to evaluate an exponential number of category partitions. As in the proposal, the documentation makes it clear that this splitter should not be used when there are multiple classes and many categories.
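For readers unfamiliar with the trick being referenced, here is a minimal, self-contained sketch of the binary-case ordering step (toy data and illustrative names, not the PR's actual code):

```cpp
#include <armadillo>

int main()
{
  // Toy data: categorical feature values in {0, 1, 2} and binary labels.
  arma::uvec data   = {0, 0, 1, 1, 2, 2, 2};
  arma::uvec labels = {0, 1, 1, 1, 0, 0, 1};
  const size_t numCategories = 3;

  // Proportion of class-one samples within each category.
  arma::vec classOneProportion(numCategories, arma::fill::zeros);
  arma::vec categoryCounts(numCategories, arma::fill::zeros);
  for (size_t i = 0; i < data.n_elem; ++i)
  {
    categoryCounts(data(i)) += 1.0;
    classOneProportion(data(i)) += (labels(i) == 1) ? 1.0 : 0.0;
  }
  classOneProportion /= categoryCounts;  // element-wise division

  // Order the categories by that proportion.  By the CART result, the
  // optimal binary partition splits this ordering at one of k - 1 points,
  // so only k - 1 candidate splits need to be evaluated instead of all
  // 2^(k-1) - 1 nontrivial partitions.
  arma::uvec sortedCategories = arma::sort_index(classOneProportion);
  sortedCategories.print("categories ordered by P(class = 1):");
  return 0;
}
```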

@nikolay-apanasov (Author) commented Mar 4, 2024

The test BestBinaryCategoricalSplitRegressionTwoPerfectTest in the DecisionTreeRegressorTest suite was failing on the Linux builds. However, on my (Debian 12.0) system, the tests all passed, using different random seeds over a thousand consecutive runs. I looked at the test, and it seemed right: the split should produce two pure nodes, one with all the samples having response zero and the other with all the samples having response two, in which case the MSE gain should be (approximately) zero. Then I saw the Azure output and noticed that the build was running with Armadillo version 9.8. The remark from the documentation about initializing to zero flashed through my mind: I went and checked, and sure enough, the vector was not being initialized to zero, although it was on my system, which runs Armadillo 11.4. This was a fortunate problem, though, because the code was actually incorrect: I was using an arma::uvec to hold the mean of the response values, rather than an arma::vec.
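For illustration, a minimal standalone sketch of the two pitfalls described above (toy code, not the PR's implementation):

```cpp
#include <armadillo>

int main()
{
  const size_t numCategories = 3;

  // Pitfall 1: an unsigned integer vector silently truncates fractional
  // means; a mean response of 1.9 is stored as 1.
  arma::uvec wrongMeans(numCategories, arma::fill::zeros);
  wrongMeans(0) = 1.9;  // truncated to 1

  // Pitfall 2: whether a plain constructor zero-initializes the elements
  // depends on the Armadillo version (it did not on 9.8, but did on 11.4),
  // so request zeros explicitly before accumulating into the vector.
  arma::vec categoryMeans(numCategories, arma::fill::zeros);
  categoryMeans(0) += 1.9;

  wrongMeans.print("wrongMeans:");        // 1 0 0
  categoryMeans.print("categoryMeans:");  // 1.9 0 0
  return 0;
}
```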

@nikolay-apanasov changed the title from "Completed first implementation of {BestBinaryCategoricalSplit}." to "Added BestBinaryCategoricalSplit splitter for DecisionTree." on Mar 4, 2024
@mlpack-bot (bot) commented Apr 3, 2024

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@rcurtin (Member) commented Apr 3, 2024

Marking it stale just as I am reviewing it... sorry @mlpack-bot 😄

@rcurtin (Member) left a comment:

Hey @nikolay-apanasov, thanks so much for the hard work on this. It took me a while to set aside enough time to review this fully. I think overall everything is correct, but algorithmically I have two major questions:

  • Do you have somewhere I can find Breiman's proof of correctness for the categorical case? I don't have a copy of 'Classification and Regression Trees', but maybe it exists in a paper somewhere? If not, maybe I can scare up a copy somewhere.
  • I don't see how we are restricting the use of BestBinaryCategoricalSplit to L2 loss in the regression case. I haven't yet fully read the Fisher paper, but I will try to do that shortly.

I have a number of other comments mostly concerning style, simplifications, and in a couple of cases optimizations. The only other major comment is that we should update the documentation in doc/user/methods/ too.

I don't think I got all the style issues, but I tried to add suggestions as I went by.

Overall I think this is great, and it would be wonderful to add this to mlpack. Thanks again for investing the time into working on this. 👍

```cpp
categoryResponse[i]/categoryCounts[i];

uvec sortedCategories = sort_index(categoryResponse);
uvec categoryRank(numCategories);
```
@rcurtin (Member):
I think this could possibly be simplified into a single Armadillo expression too.

@nikolay-apanasov (Author):
Ok, I will see what I can do.

@rcurtin (Member):
Let me know if the suggestion above doesn't work, I can try again if you like. 👍

@nikolay-apanasov (Author):
@rcurtin The Armadillo one-liners did not work for me (yet): I am getting some arcane template-related output from the GCC compiler. For the preceding one-liner, I get:

```
error: no match for ‘operator[]’ (operand types are ‘arma::uvec’ {aka ‘arma::Col<long long unsigned int>’} and ‘arma::Col<long long unsigned int>’)
   67 |       arma::uvec transformedData = categoryRank[arma::conv_to<arma::uvec>::from(data)];
```

followed by a long list of failed substitutions. The Armadillo library and its types are still relatively foreign to me: before I dig in, if it's simple for you, I would definitely appreciate a suggestion here. I also did not find an elegant way to simplify lines 183-186; if you have anything particular in mind, please let me know.
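For reference, Armadillo exposes element selection by an index vector through .elem() (or operator() with a uvec) rather than operator[], which is what the error above is complaining about. A self-contained toy sketch, not taken from this PR:

```cpp
#include <armadillo>

int main()
{
  // Toy stand-ins for the PR's variables.
  arma::vec data = {2, 0, 1, 2};        // categorical values stored as doubles
  arma::uvec categoryRank = {1, 2, 0};  // rank assigned to each category

  // operator[] takes a single index; selection by an index vector goes
  // through .elem() (or operator() with a uvec):
  arma::uvec transformedData =
      categoryRank.elem(arma::conv_to<arma::uvec>::from(data));

  transformedData.print("transformedData:");  // 0 1 2 0
  return 0;
}
```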

```cpp
for (size_t i = 0; i < numCategories; i++)
  categoryRank[sortedCategories[i]] = i;

uvec transformedData(n);
```
@rcurtin (Member):
Same here (a single Armadillo expression should be possible).

@nikolay-apanasov (Author):
Ok.
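For reference, one candidate single-expression replacement for the loop above is a scatter-assignment through .elem(); a toy sketch, not taken from this PR:

```cpp
#include <armadillo>

int main()
{
  // Toy permutation standing in for the PR's sortedCategories.
  arma::uvec sortedCategories = {2, 0, 1};
  const size_t numCategories = sortedCategories.n_elem;

  // Scatter-assignment replaces the explicit loop
  //   for (i = 0 .. k-1) categoryRank[sortedCategories[i]] = i;
  arma::uvec categoryRank(numCategories);
  categoryRank.elem(sortedCategories) =
      arma::regspace<arma::uvec>(0, numCategories - 1);

  categoryRank.print("categoryRank:");  // 1 2 0
  return 0;
}
```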

@rcurtin (Member) commented Apr 3, 2024

> Benchmarks on the MockCategorical, UCI Mushroom and UCI Nursery datasets indicate that this new splitter causes the runtime performance to decrease by a factor of three to four on the binary classification task, and even more in the multi-class setting.

Also, this is worth addressing: shouldn't we be able to get similar or better performance at least in the two-class case? Can you find a situation where it does? You could try, for instance, the covertype dataset or the pokerhand dataset, but modify the labels to group the classes such that it's a two-class problem.

@nikolay-apanasov (Author) commented Apr 4, 2024

@rcurtin Thank you for the detailed review.

> Do you have somewhere I can find Breiman's proof of correctness for the categorical case?

I'm only aware of the book. However, do you have Ripley's book Pattern Recognition and Neural Networks? If you do, he provides a different proof of the same claim, namely his proof of Proposition 7.1 in Chapter 7. Hastie et al. mention this theorem in passing in Elements of Statistical Learning, but they don't provide a proof, only a reference.

> I don't see how we are restricting the use of BestBinaryCategoricalSplit to L2 loss in the regression case. I haven't yet fully read the Fisher paper, but I will try to do that shortly.

There is only one overload for regression, and if you look at the implementation for the overload, the first statement is a static_assert that checks that the FitnessFunction is MSEGain.
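To make that concrete, here is a minimal sketch of such a compile-time guard, using stand-in types rather than mlpack's real signatures:

```cpp
#include <type_traits>

struct MSEGain {};  // stand-ins for mlpack's fitness functions
struct MADGain {};

// Simplified stand-in for the regression overload; the real method has a
// different (much longer) signature.
template<typename FitnessFunction>
double SplitIfBetter(/* ... regression arguments ... */)
{
  static_assert(std::is_same<FitnessFunction, MSEGain>::value,
      "BestBinaryCategoricalSplit is only valid with MSEGain (L2 loss) "
      "in the regression setting.");
  return 0.0;  // placeholder for the computed gain
}

int main()
{
  SplitIfBetter<MSEGain>();    // compiles
  // SplitIfBetter<MADGain>(); // would fail to compile with the message above
  return 0;
}
```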

> The only other major comment is that we should update the documentation in doc/user/methods/ too.

Ok, sounds good. Any particular suggestion as to content?

> Also, this is worth addressing: shouldn't we be able to get similar or better performance at least in the two-class case? Can you find a situation where it does? You could try, for instance, the covertype dataset or the pokerhand dataset, but modify the labels to group the classes such that it's a two-class problem.

No, I don't think that's possible. I'm sure that the implementation could be made more efficient. However, the computational complexity of the AllCategoricalSplit algorithm is O(n), and as I mentioned above, in the binary setting we have to transform the categorical data, ordering the categories by their proportion in class one, which requires a sort. That makes BestBinaryCategoricalSplit an O(n lg n) algorithm in the two-class setting, and I don't know how you could order the categories, as the algorithm requires, without a sort.

I committed the simple style fixes today, and I should be able to get the rest done by the end of next week. Thanks again for the thoughtful review, and let me know if you think of anything else that I should look at.

@rcurtin (Member) commented Apr 8, 2024

> I'm only aware of the book. However, do you have Ripley's book Pattern Recognition and Neural Networks? If you do, he provides a different proof of the same claim, namely his proof of Proposition 7.1 in Chapter 7. Hastie et al. mention this theorem in passing in Elements of Statistical Learning, but they don't provide a proof, only a reference.

Thanks! I don't have the Ripley book, but I looked at the discussion in ESL. I don't doubt that the theorem is correct; I just wanted to make sure that it applies fully to our situation. Based on the discussion in ESL, I think it does. 👍

ESL also mentions this paper as an approximation for the multiclass case: https://pages.stat.wisc.edu/~loh/treeprogs/fact/LV88.pdf

It could be interesting to look into it someday, but definitely another time, no need to do that now. :)

> Ok, sounds good. Any particular suggestion as to content?

Just a couple of bullet points; I think if you take a look at the existing documentation you'll get the idea. The hope is that the website documentation is short and to the point. It has been hard for me to omit details (I am a detail-oriented person!), but I think overall it provides a better experience for the users if what we list is something simple; if they want details, they can go to the code, which is well-commented anyway.

> No, I don't think that's possible. I'm sure that the implementation could be made more efficient. However, the computational complexity of the AllCategoricalSplit algorithm is O(n), and as I mentioned above, in the binary setting we have to transform the categorical data, ordering the categories by their proportion in class one, which requires a sort. That makes BestBinaryCategoricalSplit an O(n lg n) algorithm in the two-class setting, and I don't know how you could order the categories, as the algorithm requires, without a sort.

Yes, I am now wondering what I was thinking when I wrote the comment. BestBinaryCategoricalSplit can't reasonably be expected to be anywhere near as fast as AllCategoricalSplit (which doesn't really even need to do any work...). So, all good from my end.

@nikolay-apanasov (Author):

> Just a couple of bullet points; I think if you take a look at the existing documentation you'll get the idea. The hope is that the website documentation is short and to the point. It has been hard for me to omit details (I am a detail-oriented person!), but I think overall it provides a better experience for the users if what we list is something simple; if they want details, they can go to the code, which is well-commented anyway.

Ok, sounds good. Then I'll try to get this all wrapped up by the end of the week.

> ESL also mentions this paper as an approximation for the multiclass case

You know, I too saw that comment in ESL, though I did not read the paper in detail. Before I skimmed the paper, I had read the scathing commentary on the article (which immediately follows it in the published journal) by Breiman and Friedman, and to be honest, afterwards I was not motivated to read it in great detail, because I have great respect for those two statisticians. However, as I said, I did skim it, and I did not see the optimization they were referencing for categorical splits in the multi-class setting. Their system uses linear combinations (a form of LDA) to perform splits, so I don't see how that would be applicable: it is a fundamentally different splitting algorithm. For example, the FACT system that uses LDA for splits is not invariant under univariate monotone transformations. Maybe ESL was suggesting LDA as an approximation algorithm, but I don't see how using LDA to split the node could be seen as an approximation to an optimal partition. Moreover, nowhere in the article do they mention approximation.

@rcurtin (Member) commented Apr 9, 2024

> Ok, sounds good. Then I'll try to get this all wrapped up by the end of the week.

Sure, there is really no hurry, take your time. 👍

> You know, I too saw that comment in ESL, though I did not read the paper in detail. Before I skimmed the paper, I had read the scathing commentary on the article (which immediately follows it in the published journal) by Breiman and Friedman, and to be honest, afterwards I was not motivated to read it in great detail, because I have great respect for those two statisticians. However, as I said, I did skim it, and I did not see the optimization they were referencing for categorical splits in the multi-class setting. Their system uses linear combinations (a form of LDA) to perform splits, so I don't see how that would be applicable: it is a fundamentally different splitting algorithm. For example, the FACT system that uses LDA for splits is not invariant under univariate monotone transformations. Maybe ESL was suggesting LDA as an approximation algorithm, but I don't see how using LDA to split the node could be seen as an approximation to an optimal partition. Moreover, nowhere in the article do they mention approximation.

Ah, ok, thanks for the clarification. That was also 1988, so I suspect in the intervening 36 years there have been other efforts in the same direction that are perhaps more applicable to our setting here. 👍

@nikolay-apanasov (Author) commented Apr 10, 2024

> That was also 1988, so I suspect in the intervening 36 years there have been other efforts in the same direction that are perhaps more applicable to our setting here.

Most definitely. As I mentioned in the proposal, Chou's algorithm, which uses K-means clustering, is one attractive variant for finding an approximation in the multi-class setting. It's a non-trivial algorithm, and will take some work to implement well, but I think it would be a really cool addition.

@rcurtin (Member) commented Apr 10, 2024

Right, I had forgotten you mentioned that, sorry about that. Anyway, yeah, cool for later, but definitely later and not now 😄

@rcurtin (Member) commented May 8, 2024

I noticed that the LARS tests were failing here, so I spent a while investigating and just opened up #3701 to fix that. I'll review the rest of this hopefully in the next couple of days. Sorry that my LARS dive took longer than expected!

@nikolay-apanasov (Author):

> I noticed that the LARS tests were failing here, so I spent a while investigating and just opened up #3701 to fix that.

You know, I noticed that the LARS tests were failing too, but I did not get around to actually taking a look, because I'm not familiar with that part of the code base yet. It's great that you got that figured out.

> I'll review the rest of this hopefully in the next couple of days. Sorry that my LARS dive took longer than expected!

No worries at all. I have a lot on my plate right now with my graduate studies, so take your time.
