
[DRAFT] Engine plugin API and engine entry point for Lloyd's KMeans #24497

Closed (79 commits)

Conversation

@ogrisel (Member) commented Sep 22, 2022

This is a draft pull-request to allow third-party packages such as https://github.com/soda-inria/sklearn-numba-dpex to contribute alternative implementations of core computational routines of CPU-intensive scikit-learn estimators.

This would be particularly useful to experiment with GPU-optimized alternatives to our CPU-optimized Cython code.

This PR will serve as a design experiment to tackle #22438.

@ogrisel (Member, Author) commented Sep 22, 2022

/cc @fcharras @jjerphan @betatim

@ogrisel (Member, Author) commented Sep 22, 2022

Note: this is still a work in progress and the Engine API is not set in stone. In particular, I have broken the MBKMeans tests, but this should be enough to get started.

```python
    max_iter=self.max_iter,
    verbose=self.verbose,
    tol=self._tol,
    n_threads=self._n_threads,
)
```
@ogrisel (Member, Author):

Note that currently we delegate the full Lloyd loop to the engine for the sake of simplicity for this first iteration.

However, my plan would rather be to delegate only one Lloyd iteration at a time and move the loop and convergence check back into the estimator class (probably in a new private method).

Moving this loop back into the estimators class will be required to properly integrate @jeremiedbb's work on callbacks for instance: #22000.
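The per-iteration split described above could look roughly like the following. This is a minimal, self-contained sketch with illustrative names (`NumPyEngine`, `kmeans_iteration`, `lloyd_loop`), not the actual API of this PR: the estimator owns the loop and convergence check, while the engine computes a single Lloyd iteration.

```python
import numpy as np

class NumPyEngine:
    """Toy engine computing a single Lloyd iteration (illustrative only)."""

    def __init__(self, X):
        self.X = X  # the estimator hands the data over once

    def kmeans_iteration(self, centers):
        # Assign each sample to its nearest center.
        dists = np.linalg.norm(self.X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        inertia = (dists[np.arange(len(self.X)), labels] ** 2).sum()
        # Recompute each center as the mean of its assigned samples.
        new_centers = np.stack(
            [self.X[labels == k].mean(axis=0) for k in range(len(centers))]
        )
        return labels, inertia, new_centers

def lloyd_loop(engine, centers_init, max_iter=100, tol=1e-4):
    """Loop and convergence check live estimator-side, so callbacks
    (e.g. the work in #22000) could run between iterations."""
    centers = centers_init
    for n_iter in range(1, max_iter + 1):
        labels, inertia, new_centers = engine.kmeans_iteration(centers)
        shift = ((new_centers - centers) ** 2).sum()
        centers = new_centers
        if shift <= tol:
            break
    return labels, inertia, centers, n_iter
```

A GPU engine would keep `self.X` and the centers in device memory between calls, so only the small per-iteration dispatch crosses the host/device boundary.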

Member:

While looking at the KMeans in cuml and sklearn I noticed that my test data set, which took ~2.5 s with sklearn, took only 25 ms with cuml. I assume the speed-up with numba-dpex will be similar. So the main point is: "something that takes long now might be super fast in a plugin".

On the one hand, max_iter is typically(?) only O(100), so the overhead of making a few hundred additional function calls is not that great. On the other hand, what data would the callbacks need to make decisions (beyond just progress-bar updates, which you could argue you don't need for fast stuff), and how much does it cost to transfer that from device to device?

My question is: do you/someone know what happens to the performance of alternative implementations of kmeans when they have to be implemented as "one iteration at a time"? I don't know enough about how kmeans on something like a GPU works, but I hope the answer is "nothing happens, this is totally doable" because that would make life easy.

@fcharras (Contributor) Sep 28, 2022:

In general it will be implemented "one iteration at a time" (example with the plugin we're working on: https://github.com/soda-inria/sklearn-numba-dpex/blob/main/sklearn_numba_dpex/kmeans/drivers.py#L342, or likewise with the current cython implementation in sklearn: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_kmeans.py#L599). The loop over max_iter is in Python, and its overhead is completely negligible.

So the "simplicity" @ogrisel refers to here is only about the design of the engine api itself.

@ogrisel (Member, Author) Sep 28, 2022:

I expect that for most algorithms in scikit-learn, the actual computation time for one fit iteration would be at least 10x the overhead of a GPU dispatch (assuming we leave the data and temporary data structures in device memory between consecutive iterations).

@betatim (Member) Oct 20, 2022:

What is the plan regarding this change? I think it would make sense to go for the "one call per iteration" approach (what you propose) now instead of waiting. If that works for you, I'll send a PR.

A reason to tackle it now is that the centroids and labels will be arrays of different types for different engines, so the convergence checking needs to deal with this. (unless we decide the estimator attributes should always be numpy arrays?)

Member:

It will also help sharpen the API for when data is passed in to an engine. My proposal would be to settle on:

  • prepare_fit(X, y, sample_weight) will be called every time KMeans.fit() is called. The engine should keep hold of the data, as it won't be provided again.
  • the single-iteration method becomes kmeans_single(current_centroids); X and friends are not passed again.

In particular, when we call kmeans_single once per iteration we need a way to provide the data to the engine once, so that it can convert it or decide that it would like to raise NotImplemented, etc. If X and friends were passed to kmeans_single on each iteration, the engine would in principle have to convert them every time, because there is no promise that X hasn't changed since the last call.
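A concrete (and purely illustrative) engine honoring this contract might look like the sketch below; the method names follow the proposal above, but the bodies are hypothetical, not scikit-learn code:

```python
import numpy as np

class MinimalEngine:
    """Sketch of the proposed once-per-fit data handoff (illustrative)."""

    def prepare_fit(self, X, y=None, sample_weight=None):
        # Called once per KMeans.fit: validate/convert here, or raise
        # NotImplementedError to decline this input type. A GPU engine
        # would move X to device memory at this point.
        self.X = np.asarray(X, dtype=np.float64)

    def kmeans_single(self, current_centroids):
        # One Lloyd assignment step against the retained data; X is
        # deliberately not an argument, so no re-conversion is needed.
        dists = np.linalg.norm(
            self.X[:, None, :] - current_centroids[None, :, :], axis=2
        )
        return dists.argmin(axis=1)
```

The key property is that the (potentially expensive) conversion happens exactly once, in prepare_fit, and every later call can trust the stored representation.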

Something that would be harder to do in that case is re-use things from the Cython engine. For example currently I use:

```python
def init_centroids(self, X):
    return cp.asarray(super().init_centroids(X))
```

because there is not much benefit in re-implementing the initial centroid selection. In a super strict world where the above proposal is adopted, init_centroids should also not be passed X again; it should use what was previously provided in prepare_fit().

The point being, I'd support switching to the per-iteration API so we can work these things out with code (instead of our heads).

@fcharras (Contributor) Oct 20, 2022:

I think we agree on the implementation strategy here, so you're welcome to send a PR 👍. Let's iterate on the details from there if needed; @ogrisel, @jjerphan and I can be available for review.

Regarding:

because there is not much benefit in re-implementing the initial centroid selection.

Actually I can think of two reasons:

  • initial centroid selection includes k-means++, which might itself benefit from engine-specific optimizations

  • it's not likely that super().init_centroids(X) will work if X is not a numpy or scipy array, and it could be a good thing for fit to accept other types of input when used with other engines, e.g. cupy arrays for a CUDA-based engine, to save loading data to and from the targeted device. So super().init_centroids(X) might need to be rewritten for cases where X is a different object.
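One way an engine could avoid the numpy round trip is to pick centroids by index and rely on fancy indexing to keep the result in X's own array type. A hedged sketch (init_centroids_random is a hypothetical helper, not this PR's API; it behaves the same for numpy shown here, and a cupy array would stay on device):

```python
import numpy as np

def init_centroids_random(X, n_clusters, seed=0):
    """Hypothetical random init avoiding a host round trip: sampling row
    indices on the host is cheap, and X[indices] stays in X's array type."""
    rng = np.random.default_rng(seed)
    indices = rng.choice(X.shape[0], size=n_clusters, replace=False)
    # Fancy indexing returns the same array type as X (numpy, cupy, ...),
    # so device-resident data is never copied back to the host.
    return X[indices]
```

k-means++ is harder to generalize this way because its distance computations should themselves run on device, which is the first bullet's point.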

@ogrisel ogrisel changed the title [DRAFT] KMeans plugin API [DRAFT] Engine plugin API and engine entry point for Lloyd's KMeans Sep 23, 2022
@fcharras (Contributor) commented Dec 13, 2022

I adapted the plugin in soda-inria/sklearn-numba-dpex#74

@fcharras (Contributor) commented:

The last round of merges broke the tests 🔴: apparently some issue with the engine getter raising RuntimeError.

@ogrisel (Member, Author) commented Feb 1, 2023

@fcharras suggested that the last merge broke the CI without us realizing it because ogrisel#13 targeted the ogrisel/scikit-learn repo instead of the scikit-learn/scikit-learn repo, and as a result the regular CI config does not apply.

I think I should push the wip-engines branch on ogrisel/scikit-learn to a new long-running feature branch (e.g. named feature-engine-api) on the main scikit-learn/scikit-learn repo. This way, sub-PRs to this new feature branch would benefit from the usual CI setup without the extra maintenance of reconfiguring CI services to work for the ogrisel/scikit-learn repo.

Any opinion or better name suggestion @scikit-learn/core-devs?

/cc @betatim

@ogrisel (Member, Author) commented Feb 1, 2023

When doing so I will probably squash the existing history as most of the original commits have non-informative wip commit messages. Let me know if you are opposed to this plan. Shared co-authorship of the squashed commit should be preserved.

@jjerphan (Member) commented Feb 1, 2023

This looks appropriate to me. 👍

Nit: do you think feature/engine-api is slightly clearer?

@betatim (Member) commented Feb 3, 2023

Sounds like a good idea and it will be nice to have the benefit of robots checking our work :)

No opinion on the branch name.

@ogrisel (Member, Author) commented Feb 3, 2023

Closing in favor of the newly created #25535. Let's stop using the ogrisel/scikit-learn:wip-engines branch from now on.
