
MAINT Handle Criterion.samples using a memoryview #25005

Merged
13 commits merged into scikit-learn:main on Dec 1, 2022

Conversation

adam2392
Contributor

Reference Issues/PRs

Fixes: #25004

Putting up a PR to document what the end result of #24678, #24987 and this PR would look like.

What does this implement/fix? Explain your changes.

Converts SIZE_t* samples to const SIZE_t[:] samples.
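As an illustrative sketch only (hypothetical declarations, not the exact scikit-learn diff), the change replaces a raw pointer attribute with a read-only typed memoryview:

```cython
# Hypothetical before/after sketch, assuming the usual scikit-learn
# ctypedef for SIZE_t (an integral index type).
ctypedef Py_ssize_t SIZE_t

cdef class Criterion:
    # Before: a raw pointer into the splitter's index buffer.
    # cdef SIZE_t* samples

    # After: a read-only typed memoryview over the same buffer.
    # Bounds and ownership are tracked via the buffer protocol,
    # while element access still compiles to direct buffer indexing.
    cdef const SIZE_t[:] samples
```

The `const` qualifier makes the view read-only, which matches how `Criterion` uses the indices: it reads them but never reorders them.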

Any other comments?

This is a downstream PR to #24678 and #24987 and should be reviewed/merged AFTER those are reviewed/merged.

I will rebase everything in sequence.

Cross-referencing: #17299 , #24875

@adam2392 adam2392 marked this pull request as ready for review November 23, 2022 02:29
@jjerphan jjerphan changed the title [MAINT] Refactor samples inside Criterion classes from pointer to memoryview MAINT Handle Criterion.samples using a memoryview Nov 23, 2022
Member

@jjerphan jjerphan left a comment


Thanks, @adam2392.

Here are a few comments.

Review threads on sklearn/tree/_criterion.pxd, sklearn/tree/_criterion.pyx, and sklearn/tree/_splitter.pyx (all resolved).
@adam2392
Contributor Author

adam2392 commented Nov 23, 2022

Hmm, running the asv benchmark, I get a regression in the time to fit dense data... Granted, I am using my laptop, which could be running other things in the background. But is there a reason this regression could be real?

cc: @jjerphan

> asv continuous --verbose --split --bench RandomForest upstream/main samples
...

> asv compare main samples
       before           after         ratio
     [3eb00d83]       [d059a306]
     <main>           <samples> 
             228M             244M     1.07  ensemble.RandomForestClassifierBenchmark.peakmem_fit('dense', 1)
             576M             559M     0.97  ensemble.RandomForestClassifierBenchmark.peakmem_fit('sparse', 1)
             216M             216M     1.00  ensemble.RandomForestClassifierBenchmark.peakmem_predict('dense', 1)
             442M             442M     1.00  ensemble.RandomForestClassifierBenchmark.peakmem_predict('sparse', 1)
+         4.88±0s          5.37±0s     1.10  ensemble.RandomForestClassifierBenchmark.time_fit('dense', 1)
       7.16±0.01s          7.16±0s     1.00  ensemble.RandomForestClassifierBenchmark.time_fit('sparse', 1)
          133±5ms          138±6ms     1.03  ensemble.RandomForestClassifierBenchmark.time_predict('dense', 1)
         962±30ms         938±20ms     0.98  ensemble.RandomForestClassifierBenchmark.time_predict('sparse', 1)

The memoryview should be as efficient as its pointer counterpart in all the ways that we use it.

@jjerphan
Member

jjerphan commented Nov 23, 2022

I would recommend running the asv benchmarks on a dedicated machine without any other workload (or with CPU affinity and isolation set).

I think the variations aren't due to this PR's changes, which should be harmless.

@adam2392
Contributor Author

@jshinm can you help run the asv benchmarks here?

@jshinm
Contributor

jshinm commented Nov 30, 2022

@adam2392 It seems that, as @jjerphan mentioned, there is a small, harmless variation between tests; in my case the samples branch was slightly faster. Nothing was running in the background, as @adam2392 suggested to me.

(sklearn-main) jshinm@jshinm-OMEN-by-HP-Laptop-16-b0xxx:~/Desktop/workstation/sklearn-jms/asv_benchmarks$ asv compare 167b2980 9c9c8582 

All benchmarks:

       before           after         ratio
     [167b2980]       [9c9c8582]
     <main>           <samples~20>
  0.7230709112942986  0.7230709112942986     1.00  ensemble.HistGradientBoostingClassifierBenchmark.track_test_score
  0.9812160155622751  0.9812160155622751     1.00  ensemble.HistGradientBoostingClassifierBenchmark.track_train_score
             191M             186M     0.98  ensemble.RandomForestClassifierBenchmark.peakmem_fit('dense', 1)
             427M             422M     0.99  ensemble.RandomForestClassifierBenchmark.peakmem_fit('sparse', 1)
             195M             190M     0.98  ensemble.RandomForestClassifierBenchmark.peakmem_predict('dense', 1)
             411M             406M     0.99  ensemble.RandomForestClassifierBenchmark.peakmem_predict('sparse', 1)
-      4.47±0.01s       4.03±0.01s     0.90  ensemble.RandomForestClassifierBenchmark.time_fit('dense', 1)
       6.33±0.01s       6.38±0.02s     1.01  ensemble.RandomForestClassifierBenchmark.time_fit('sparse', 1)
          132±1ms        130±0.7ms     0.99  ensemble.RandomForestClassifierBenchmark.time_predict('dense', 1)
          847±1ms          871±2ms     1.03  ensemble.RandomForestClassifierBenchmark.time_predict('sparse', 1)
  0.7464271763500541  0.7464271763500541     1.00  ensemble.RandomForestClassifierBenchmark.track_test_score('dense', 1)
  0.8656423941766682  0.8656423941766682     1.00  ensemble.RandomForestClassifierBenchmark.track_test_score('sparse', 1)
  0.9968171694224932  0.9968171694224932     1.00  ensemble.RandomForestClassifierBenchmark.track_train_score('dense', 1)
  0.9996123288718864  0.9996123288718864     1.00  ensemble.RandomForestClassifierBenchmark.track_train_score('sparse', 1)

Test machine info

os: Linux 5.15.0-53-generic
arch: x86_64
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
num_cpu: 16
ram: 65483276

@adam2392
Contributor Author

In that case, this is good to go on my end @jjerphan

@adam2392
Contributor Author

Unless you would like me to rename every i, p, k index pointers where sample_indices are used?

Member

@jjerphan jjerphan left a comment


Thanks for reporting the benchmarks' results, @jshinm.

This PR LGTM modulo the last point @adam2392 mentioned:

Unless you would like me to rename every i, p, k index pointers where sample_indices are used?

In fact, it is probably better to revert this change and keep using i, p and k, because those are used in other places in this file, and performing this renaming would change the scope of this PR.

What do you think, @adam2392 ? Can you revert it back?
Sorry for having changed my mind.

Review thread on sklearn/tree/_criterion.pyx (resolved):

cdef DOUBLE_t y_mean = 0.
cdef DOUBLE_t poisson_loss = 0.
cdef DOUBLE_t w = 1.0
cdef SIZE_t i, k, p
Contributor Author


FYI @jshinm, you forgot to define these in your PR addressing Julien's comments. In Cython, all variables must be typed, meaning the type is declared a priori, before use.
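As a minimal illustration of the point (a hypothetical sketch, not the PR's actual code), C-level locals such as loop indices are declared with `cdef` before they are used:

```cython
# Minimal sketch of a-priori typing in Cython; the ctypedefs mirror
# the usual scikit-learn aliases and are assumptions here.
ctypedef Py_ssize_t SIZE_t
ctypedef double DOUBLE_t

cdef DOUBLE_t weighted_sum(const DOUBLE_t[:] y, const DOUBLE_t[:] w) nogil:
    # Every C-level local must be declared before use; omitting the
    # `cdef SIZE_t i` line would make `i` a Python object and fail
    # to compile inside this nogil function.
    cdef SIZE_t i
    cdef DOUBLE_t total = 0.0
    for i in range(y.shape[0]):
        total += w[i] * y[i]
    return total
```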

Member


Cython will always surprise me. I am not really sure how it previously inferred and created those variables for us.

@adam2392
Contributor Author

adam2392 commented Dec 1, 2022

What do you think, @adam2392 ? Can you revert it back? Sorry for having changed my mind.

Done.

In fact, it is probably better to revert this change and keep using i, p and k, because those are used in other places in this file, and performing this renaming would change the scope of this PR.

Now that it has been brought up, I think that would be a nice parallel PR, though, to improve the readability of the Cython code. I get extreme cognitive load when reading code with single-letter variables :p. Happy to do another PR, or to encourage @jshinm to start a quick one there :).

Member

@jjerphan jjerphan left a comment


LGTM.

Thank you, @adam2392 and @jshinm!

I agree regarding cognitive load. I guess those are standard mathematical notations in algorithms, handy once one is used to them, but they could be changed. Note that we also want to minimize cosmetic changes, because those come with other costs.

@glemaitre glemaitre self-requested a review December 1, 2022 13:27
Member

@glemaitre glemaitre left a comment


LGTM

@glemaitre glemaitre merged commit 599e03c into scikit-learn:main Dec 1, 2022
@glemaitre
Member

Thanks @jshinm @adam2392

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
Co-authored-by: Jong Shin <jshin.m@gmail.com>
@adam2392 adam2392 deleted the samples branch March 2, 2023 18:57
Successfully merging this pull request may close these issues.

MAINT Convert samples parameter in Criterion classes to memory view