
Improve Proglearn oblique tree speed #419

Open
parthgvora opened this issue Jan 21, 2021 · 5 comments
parthgvora (Contributor) commented Jan 21, 2021

Cythonize bottlenecks in the proglearn oblique tree splitter to speed up training times. The main bottleneck is finding the next best split.
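
To make the bottleneck concrete, here is a minimal pure-Python sketch of the per-node work an oblique splitter does: project the samples onto a candidate direction, sort the projections, and scan for the threshold that minimizes weighted Gini impurity. All names here are hypothetical illustrations, not proglearn's actual API; the point is that this sort-and-scan loop runs for every candidate projection at every node, which is why it dominates training time.

```python
import numpy as np

def best_oblique_split(X, y, weights, n_classes):
    """Find the best threshold along one oblique projection (sketch).

    X: (n_samples, n_features) float array; y: integer class labels.
    """
    proj = X @ weights                       # oblique projection of all samples
    order = np.argsort(proj)                 # O(n log n) per candidate direction
    proj, y_sorted = proj[order], y[order]

    n = len(y)
    left = np.zeros(n_classes)               # class counts left of the threshold
    right = np.bincount(y_sorted, minlength=n_classes).astype(float)

    best_score, best_thresh = np.inf, None
    for i in range(n - 1):                   # move one sample to the left side at a time
        left[y_sorted[i]] += 1
        right[y_sorted[i]] -= 1
        if proj[i] == proj[i + 1]:           # cannot split between equal projections
            continue
        n_l, n_r = i + 1, n - i - 1
        gini_l = 1.0 - np.sum((left / n_l) ** 2)
        gini_r = 1.0 - np.sum((right / n_r) ** 2)
        score = (n_l * gini_l + n_r * gini_r) / n
        if score < best_score:
            best_score = score
            best_thresh = 0.5 * (proj[i] + proj[i + 1])
    return best_thresh, best_score
```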

parthgvora (Author) commented

Please assign me to this.

parthgvora changed the title from "Improve Proglearn SPORF speed" to "Improve Proglearn oblique tree speed" on Jan 21, 2021
rflperry (Member) commented
I took a look at the sklearn Cython tree code. I think a potential course of action is to copy the sklearn.tree.DecisionTreeClassifier code and all of its Cython backend over to the proglearn repo, then modify the Cython backend to add oblique splits and update the sklearn.tree.DecisionTreeClassifier class as needed.

This would allow for easy, compatible integration with sklearn, and it would reuse their optimized code as much as possible.

Some thoughts on modifying their Cython code: a lot of functionality, such as sparse-data support and feature importances, can be ignored to start. I think an MVP could simply sum together a fixed number of sampled features (equivalent to a dot product with weights of all 1). Once we verify this works, we can iterate by using sampled weights in a true dot product instead of a sum, controlling how many features to sample with a constructor parameter, and so on; a sketch follows.
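
A minimal sketch of that MVP projection and its natural next step, assuming dense NumPy inputs; `mvp_projection` and `weighted_projection` are hypothetical helpers, not existing proglearn or sklearn functions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mvp_projection(X, n_sampled):
    """MVP oblique 'projection': sum a fixed number of sampled features,
    i.e. a dot product with weights of all 1."""
    idx = rng.choice(X.shape[1], size=n_sampled, replace=False)
    return X[:, idx].sum(axis=1)   # == X[:, idx] @ np.ones(n_sampled)

def weighted_projection(X, n_sampled):
    """Next iteration: replace the implicit all-ones weights with sampled
    weights, with n_sampled exposed as a constructor parameter."""
    idx = rng.choice(X.shape[1], size=n_sampled, replace=False)
    w = rng.choice([-1.0, 1.0], size=n_sampled)   # e.g., SPORF-style +/-1 weights
    return X[:, idx] @ w
```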

PSSF23 (Member) commented Jan 22, 2021

@rflperry I have some objections to the copying idea, since copying the classifier code basically means copying over the entire sklearn.tree folder. I found it very hard to locate the code we really need, and the easiest path to sklearn integration would be to work directly on a fork.

That being said, @parthgvora, if you need it, I think I can provide some help with how sklearn currently structures its tree classifiers. And good luck with Cython! It's pretty confusing in my opinion...
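
For orientation, the sklearn.tree folder being discussed is laid out roughly as follows (as of sklearn ~0.24; paraphrased and not exhaustive):

```
sklearn/tree/
├── _classes.py      # Python-level estimators (DecisionTreeClassifier, ...)
├── _criterion.pyx   # impurity computations (Gini, entropy, MSE, ...)
├── _splitter.pyx    # best/random split search over features
├── _tree.pyx        # tree structure plus depth-first/best-first builders
└── _utils.pyx       # stacks, heaps, and random-number helpers
```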

rflperry (Member) commented Jan 22, 2021

I personally am not averse to copying over the entire folder. In some ways it seems easier than trying to write Cython functions for parts of the forest fully from scratch. That Cython code does have a lot of functionality we don't need to worry about modifying for now, but that can simply be deleted and then added back or modified later (e.g., the sparse X case, feature importances, etc.). But that's just my take; it's up to @parthgvora.

parthgvora (Author) commented Jan 22, 2021

To be honest, I think Hao is right here. Having tried this, there's a ton of material in the sklearn.tree folder we'd have to sort through to do what we need, and dealing with the GIL in Cython is a real pain. So I'm averse to copying the folder directly.

That said, we should keep their code in mind when writing our own Cython to make integration easier. We can also take inspiration from the way they structured their code and do smaller copy-pastes as necessary.

So I think we should Cythonize piece by piece, starting with the splitter. We're going to need our own splitter for oblique splits anyway, and it's the slowest part. I'm going to write it in Cython as best I can, using their code for guidance at times; then, if we want, we can keep adding sklearn functionality with small copy-pastes. A rough sketch of the splitter interface is below.

Also, implementing the splitter first should speed up the code enough to let someone implement and test MORF in Python while we keep working on Cythonizing.
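
Here is a pure-Python sketch of what that splitter-first plan could look like, reusing `best_oblique_split` from the earlier sketch. The class and its interface are hypothetical, shaped loosely after sklearn's Splitter so a later Cython version could slot in with minimal churn:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SplitRecord:
    """Result of one node split; loosely modeled on sklearn's split records."""
    threshold: float
    impurity: float
    proj_weights: np.ndarray  # oblique direction instead of a single feature index

class ObliqueSplitter:
    """Pure-Python stand-in for the splitter we'd Cythonize first (hypothetical)."""

    def __init__(self, n_candidates, density, random_state=None):
        self.n_candidates = n_candidates  # candidate projections tried per node
        self.density = density            # fraction of nonzero projection weights
        self.rng = np.random.default_rng(random_state)

    def sample_weights(self, n_features):
        # Sparse +/-1 weights, as in SPORF-style oblique forests.
        w = np.zeros(n_features)
        nnz = max(1, int(self.density * n_features))
        idx = self.rng.choice(n_features, size=nnz, replace=False)
        w[idx] = self.rng.choice([-1.0, 1.0], size=nnz)
        return w

    def node_split(self, X, y, n_classes):
        # Try several sparse projections; keep the lowest-impurity threshold.
        best = None
        for _ in range(self.n_candidates):
            w = self.sample_weights(X.shape[1])
            thresh, score = best_oblique_split(X, y, w, n_classes)
            if thresh is not None and (best is None or score < best.impurity):
                best = SplitRecord(thresh, score, w)
        return best  # None if no valid split was found
```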
