
Improve Proglearn oblique tree speed #419

Open
parthgvora opened this issue Jan 21, 2021 · 5 comments
parthgvora (Contributor) commented Jan 21, 2021

Cythonize bottlenecks in the proglearn oblique tree splitter to speed up training times. The main bottleneck is finding the next best split.
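
To make the bottleneck concrete, here is a minimal pure-Python sketch of the per-node work an oblique splitter does: project the samples onto a candidate direction, sort the projections, and scan for the threshold that minimizes weighted Gini impurity. All names here are hypothetical illustrations, not proglearn's actual API; the point is that this sort-and-scan loop runs for every candidate projection at every node, which is why it dominates training time.

```python
import numpy as np

def best_oblique_split(X, y, weights, n_classes):
    """Find the best threshold along one oblique projection (sketch).

    X: (n_samples, n_features) float array; y: integer class labels.
    """
    proj = X @ weights                       # oblique projection of all samples
    order = np.argsort(proj)                 # O(n log n) per candidate direction
    proj, y_sorted = proj[order], y[order]

    n = len(y)
    left = np.zeros(n_classes)               # class counts left of the threshold
    right = np.bincount(y_sorted, minlength=n_classes).astype(float)

    best_score, best_thresh = np.inf, None
    for i in range(n - 1):                   # move one sample to the left side at a time
        left[y_sorted[i]] += 1
        right[y_sorted[i]] -= 1
        if proj[i] == proj[i + 1]:           # cannot split between equal projections
            continue
        n_l, n_r = i + 1, n - i - 1
        gini_l = 1.0 - np.sum((left / n_l) ** 2)
        gini_r = 1.0 - np.sum((right / n_r) ** 2)
        score = (n_l * gini_l + n_r * gini_r) / n
        if score < best_score:
            best_score = score
            best_thresh = 0.5 * (proj[i] + proj[i + 1])
    return best_thresh, best_score
```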

parthgvora (Author) commented

Please assign me to this.

parthgvora changed the title from "Improve Proglearn SPORF speed" to "Improve Proglearn oblique tree speed" on Jan 21, 2021
rflperry (Member) commented
I took a look at the sklearn Cython tree code. I think a potential course of action is to copy the sklearn.tree.DecisionTreeClassifier code and all of its Cython backend over to the proglearn repo, then modify the Cython backend to add oblique splits and update the sklearn.tree.DecisionTreeClassifier class as needed.

This would allow for easy, compatible integration with sklearn, and it would reuse their optimized code as much as possible.

Some thoughts on modifying their Cython code: a lot of functionality, such as sparse-data support and feature importances, can be ignored to start. I think an MVP could simply sum together a fixed number of sampled features (equivalent to a dot product with weights of all 1). Once we verify this works, we can iterate by using sampled weights in a true dot product instead of a sum, controlling how many features to sample with a constructor parameter, and so on; a sketch follows.
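
A minimal sketch of that MVP projection and its natural next step, assuming dense NumPy inputs; `mvp_projection` and `weighted_projection` are hypothetical helpers, not existing proglearn or sklearn functions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mvp_projection(X, n_sampled):
    """MVP oblique 'projection': sum a fixed number of sampled features,
    i.e. a dot product with weights of all 1."""
    idx = rng.choice(X.shape[1], size=n_sampled, replace=False)
    return X[:, idx].sum(axis=1)   # == X[:, idx] @ np.ones(n_sampled)

def weighted_projection(X, n_sampled):
    """Next iteration: replace the implicit all-ones weights with sampled
    weights, with n_sampled exposed as a constructor parameter."""
    idx = rng.choice(X.shape[1], size=n_sampled, replace=False)
    w = rng.choice([-1.0, 1.0], size=n_sampled)   # e.g., SPORF-style +/-1 weights
    return X[:, idx] @ w
```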

PSSF23 (Member) commented Jan 22, 2021

@rflperry I have some objections to the copying idea, since copying the classifier code basically means copying over the entire sklearn.tree folder. I found it very hard to locate the code we really need, and the easiest path to sklearn integration would be to work directly on a fork.

That being said, @parthgvora, if you need it, I think I can provide some help with how sklearn currently structures its tree classifiers. And good luck with Cython! It's pretty confusing in my opinion...
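
For orientation, the sklearn.tree folder being discussed is laid out roughly as follows (as of sklearn ~0.24; paraphrased and not exhaustive):

```
sklearn/tree/
├── _classes.py      # Python-level estimators (DecisionTreeClassifier, ...)
├── _criterion.pyx   # impurity computations (Gini, entropy, MSE, ...)
├── _splitter.pyx    # best/random split search over features
├── _tree.pyx        # tree structure plus depth-first/best-first builders
└── _utils.pyx       # stacks, heaps, and random-number helpers
```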

rflperry (Member) commented Jan 22, 2021

I personally am not averse to copying over the entire folder. In some ways it seems easier than trying to write Cython functions for parts of the forest fully from scratch. That Cython code does have a lot of functionality we don't need to worry about modifying for now, but that can simply be deleted and then added back or modified later (e.g., the sparse X case, feature importances, etc.). But that's just my take; it's up to @parthgvora.

parthgvora (Author) commented Jan 22, 2021

To be honest, I think Hao is right here. Having tried this, there's a ton of material in the sklearn.tree folder we'd have to sort through to do what we need, and dealing with the GIL in Cython is a real pain. So I'm averse to copying the folder directly.

That said, we should keep their code in mind when writing our own Cython to make integration easier. We can also take inspiration from the way they structured their code and do smaller copy-pastes as necessary.

So I think we should Cythonize piece by piece, starting with the splitter. We're going to need our own splitter for oblique splits anyway, and it's the slowest part. I'm going to write it in Cython as best I can, using their code for guidance at times; then, if we want, we can keep adding sklearn functionality with small copy-pastes. A rough sketch of the splitter interface is below.

Also, implementing the splitter first should speed up the code enough to let someone implement and test MORF in Python while we keep working on Cythonizing.
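
Here is a pure-Python sketch of what that splitter-first plan could look like, reusing `best_oblique_split` from the earlier sketch. The class and its interface are hypothetical, shaped loosely after sklearn's Splitter so a later Cython version could slot in with minimal churn:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SplitRecord:
    """Result of one node split; loosely modeled on sklearn's split records."""
    threshold: float
    impurity: float
    proj_weights: np.ndarray  # oblique direction instead of a single feature index

class ObliqueSplitter:
    """Pure-Python stand-in for the splitter we'd Cythonize first (hypothetical)."""

    def __init__(self, n_candidates, density, random_state=None):
        self.n_candidates = n_candidates  # candidate projections tried per node
        self.density = density            # fraction of nonzero projection weights
        self.rng = np.random.default_rng(random_state)

    def sample_weights(self, n_features):
        # Sparse +/-1 weights, as in SPORF-style oblique forests.
        w = np.zeros(n_features)
        nnz = max(1, int(self.density * n_features))
        idx = self.rng.choice(n_features, size=nnz, replace=False)
        w[idx] = self.rng.choice([-1.0, 1.0], size=nnz)
        return w

    def node_split(self, X, y, n_classes):
        # Try several sparse projections; keep the lowest-impurity threshold.
        best = None
        for _ in range(self.n_candidates):
            w = self.sample_weights(X.shape[1])
            thresh, score = best_oblique_split(X, y, w, n_classes)
            if thresh is not None and (best is None or score < best.impurity):
                best = SplitRecord(thresh, score, w)
        return best  # None if no valid split was found
```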
