Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizations for scikit-tree to improve multi-core performance #245

Open
10 tasks
sampan501 opened this issue Mar 12, 2024 · 7 comments
Open
10 tasks

Optimizations for scikit-tree to improve multi-core performance #245

sampan501 opened this issue Mar 12, 2024 · 7 comments
Labels
bug Something isn't working research Requires experimentation, theory and research.

Comments

@sampan501
Copy link
Member

sampan501 commented Mar 12, 2024

Checklist

  • I have verified that the issue exists against the main branch.
  • I have read the relevant section in the contribution guide on reporting bugs.
  • I have checked the issues list for similar or identical bug reports.
  • I have checked the pull requests list for existing proposed fixes.
  • I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the main branch.
  • I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • I have included in the "Related issues or possible duplicates" section beloew all related issues and possible duplicate issues (If there are none, check this box anyway).
  • I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • I have included in the "Environment" section below the output of pip freeze.
  • I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

There is occasional low CPU usage when using scikit-tree forests in parallel. Running the same code, in machines with many cores, I'm getting roughly 4-5% usage with scikit-tree forests and 60-70% using scikit-learn for the same types of problems. We should look into their Cython code optimizations and see how we can make improvements to our code base.

@sampan501 sampan501 added bug Something isn't working research Requires experimentation, theory and research. labels Mar 12, 2024
@adam2392
Copy link
Collaborator

I think in terms of sequential experiments to run:

  1. RandomForestClassifier in scikit-learn vs RandomForestClassifier in scikit-tree in just n_samples vs time to fit with n_jobs =1 vs n_jobs = -1

If this doesn't look good, it means forsure our compiler is messed up somehow, or we introduce some serious issues in the fork that we're not aware of.

  1. Wrap HonestForestClassifier with DTC from sklearn vs DTC from scikit-tree. To determine if HonestForest introduces this issue somehow

Within each of the above, we would have to investigate CPU/RAM usage in-depth using valgrind, or something...

@sampan501
Copy link
Member Author

image (1)
image

@sampan501
Copy link
Member Author

image (1)

CoMIGHT before changes in #242

image

CoMIGHT after changes in #242

@adam2392
Copy link
Collaborator

To confirm this is not an isolated issue with comight right? Or so far it is?

@sampan501
Copy link
Member Author

it is not

@SUKI-O
Copy link
Member

SUKI-O commented Mar 13, 2024

We ran some tests and after the fix Adam pushed the diff between RF and sktree-RF are:

Fit time for RandomForestClassifier: 3.522181987762451
Fit time for RandomForestClassifier: 3.4983439445495605
Fit time for RandomForestClassifier: 3.518531084060669
Fit time for RandomForestClassifier: 3.5076229572296143
Fit time for RandomForestClassifier: 3.5162460803985596
Fit time for sktreeRandomForestClassifier: 3.697654962539673
Fit time for sktreeRandomForestClassifier: 3.660207986831665
Fit time for sktreeRandomForestClassifier: 3.6615519523620605
Fit time for sktreeRandomForestClassifier: 3.6803948879241943
Fit time for sktreeRandomForestClassifier: 3.653079032897949

Note: the result for sktree-RF was 7sec+ prior to this fix.

The script for this test is found : https://github.com/neurodata/might/blob/cmi/exps/new_submission/Figure6_comight_vs_nsamples_ndims/test_rf_parallel.py

The commit that we tested to get ~3sec on sktree-RF was: 7c75677

@sampan501
Copy link
Member Author

wooot!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working research Requires experimentation, theory and research.
Projects
None yet
Development

No branches or pull requests

3 participants