
[DRAFT] Engine plugin API and engine entry point for Lloyd's KMeans #24497

Closed (79 commits)

Conversation

@ogrisel (Member) commented Sep 22, 2022

This is a draft pull-request to allow third-party packages such as https://github.com/soda-inria/sklearn-numba-dpex to contribute alternative implementations of core computational routines of CPU-intensive scikit-learn estimators.

This would be particularly useful to experiment with GPU-optimized alternatives to our CPU-optimized Cython code.

This PR will serve as a design experiment to tackle #22438.

@ogrisel (Member, Author) commented Sep 22, 2022

/cc @fcharras @jjerphan @betatim

@ogrisel (Member, Author) commented Sep 22, 2022

Note: this is still a work in progress and the Engine API is not set in stone. In particular, I have broken the MBKMeans tests, but this should be enough to get started.

```python
    max_iter=self.max_iter,
    verbose=self.verbose,
    tol=self._tol,
    n_threads=self._n_threads,
)
```
@ogrisel (Member, Author):

Note that currently we delegate the full Lloyd loop to the engine for the sake of simplicity for this first iteration.

However, my plan would rather be to delegate only one Lloyd iteration at a time and move the loop and convergence check back into the estimator class (probably in a new private method).

Moving this loop back into the estimators class will be required to properly integrate @jeremiedbb's work on callbacks for instance: #22000.
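The per-iteration split described above could look roughly like the following. This is a minimal, self-contained sketch with illustrative names (`NumPyEngine`, `kmeans_iteration`, `lloyd_loop`), not the actual API of this PR: the estimator owns the loop and convergence check, while the engine computes a single Lloyd iteration.

```python
import numpy as np

class NumPyEngine:
    """Toy engine computing a single Lloyd iteration (illustrative only)."""

    def __init__(self, X):
        self.X = X  # the estimator hands the data over once

    def kmeans_iteration(self, centers):
        # Assign each sample to its nearest center.
        dists = np.linalg.norm(self.X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        inertia = (dists[np.arange(len(self.X)), labels] ** 2).sum()
        # Recompute each center as the mean of its assigned samples.
        new_centers = np.stack(
            [self.X[labels == k].mean(axis=0) for k in range(len(centers))]
        )
        return labels, inertia, new_centers

def lloyd_loop(engine, centers_init, max_iter=100, tol=1e-4):
    """Loop and convergence check live estimator-side, so callbacks
    (e.g. the work in #22000) could run between iterations."""
    centers = centers_init
    for n_iter in range(1, max_iter + 1):
        labels, inertia, new_centers = engine.kmeans_iteration(centers)
        shift = ((new_centers - centers) ** 2).sum()
        centers = new_centers
        if shift <= tol:
            break
    return labels, inertia, centers, n_iter
```

A GPU engine would keep `self.X` and the centers in device memory between calls, so only the small per-iteration dispatch crosses the host/device boundary.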

Member:

While looking at the KMeans in cuml and sklearn I noticed that my test data set, which took ~2.5 s with sklearn, took only 25 ms with cuml. I assume the speed-up with numba-dpex will be similar. So the main point is: "something that takes long now might be super fast in a plugin".

On the one hand, max_iter is typically(?) only O(100), so the overhead of making a few hundred additional function calls is not that great. On the other hand, what data would the callbacks need to make decisions (beyond just progress-bar updates, which you could argue you don't need for fast stuff), and how much does it cost to transfer that from device to device?

My question is: do you/someone know what happens to the performance of alternative implementations of kmeans when they have to be implemented as "one iteration at a time"? I don't know enough about how kmeans on something like a GPU works, but I hope the answer is "nothing happens, this is totally doable" because that would make life easy.

@fcharras (Contributor) Sep 28, 2022:

In general it will be implemented "one iteration at a time" (example with the plugin we're working on: https://github.com/soda-inria/sklearn-numba-dpex/blob/main/sklearn_numba_dpex/kmeans/drivers.py#L342, or likewise with the current cython implementation in sklearn: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_kmeans.py#L599). The loop over max_iter is in Python, and its overhead is completely negligible.

So the "simplicity" @ogrisel refers to here is only about the design of the engine api itself.

@ogrisel (Member, Author) Sep 28, 2022:

I expect that for most algorithms in scikit-learn, the actual computation time for one fit iteration would be at least 10x the overhead of a GPU dispatch (assuming we leave the data and temporary data structures in device memory between consecutive iterations).

@betatim (Member) Oct 20, 2022:

What is the plan regarding this change? I think it would make sense to go for the "one call per iteration" approach (what you propose) now instead of waiting. If that works for you, I'll send a PR.

A reason to tackle it now is that the centroids and labels will be arrays of different types for different engines, so the convergence checking needs to deal with this. (unless we decide the estimator attributes should always be numpy arrays?)

Member:

It will also help sharpen the API for when data is passed in to an engine. My proposal would be to settle on:

  • prepare_fit(X, y, sample_weight) will be called every time KMeans.fit() is called. The engine should keep hold of the data, as it won't be provided again.
  • the single-iteration method becomes kmeans_single(current_centroids); X and friends are not passed again.

In particular, when we call kmeans_single once per iteration we need a way to provide the data to the engine once, so that it can convert it or decide that it would like to raise NotImplemented, etc. If X and friends were passed to kmeans_single on each iteration, the engine would in principle have to convert them every time, because there is no promise that X hasn't changed since the last call.
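A concrete (and purely illustrative) engine honoring this contract might look like the sketch below; the method names follow the proposal above, but the bodies are hypothetical, not scikit-learn code:

```python
import numpy as np

class MinimalEngine:
    """Sketch of the proposed once-per-fit data handoff (illustrative)."""

    def prepare_fit(self, X, y=None, sample_weight=None):
        # Called once per KMeans.fit: validate/convert here, or raise
        # NotImplementedError to decline this input type. A GPU engine
        # would move X to device memory at this point.
        self.X = np.asarray(X, dtype=np.float64)

    def kmeans_single(self, current_centroids):
        # One Lloyd assignment step against the retained data; X is
        # deliberately not an argument, so no re-conversion is needed.
        dists = np.linalg.norm(
            self.X[:, None, :] - current_centroids[None, :, :], axis=2
        )
        return dists.argmin(axis=1)
```

The key property is that the (potentially expensive) conversion happens exactly once, in prepare_fit, and every later call can trust the stored representation.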

Something that would be harder to do in that case is re-use things from the Cython engine. For example currently I use:

```python
def init_centroids(self, X):
    return cp.asarray(super().init_centroids(X))
```

because there is not much benefit in re-implementing the initial centroid selection. In a super strict world where the above proposal is adopted, init_centroids should also not be passed X again; it should use what was previously provided in prepare_fit().

The point being, I'd support switching to the per-iteration API so we can work these things out with code (instead of our heads).

@fcharras (Contributor) Oct 20, 2022:

I think we agree on the implementation strategy here, so you're welcome to send a PR 👍. Let's iterate on the details from there if needed; @ogrisel, @jjerphan and I can be available for review.

Regarding:

because there is not much benefit in re-implementing the initial centroid selection.

Actually I can think of two reasons:

  • initial centroid selection includes k-means++, which might itself benefit from engine-specific optimizations

  • it's not likely that super().init_centroids(X) will work if X is not a numpy or scipy array, and it could be a good thing for fit to accept other types of input when used with other engines, e.g. cupy arrays for a CUDA-based engine, to save loading data to and from the targeted device. So super().init_centroids(X) might need to be rewritten for cases where X is a different object.
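One way an engine could avoid the numpy round trip is to pick centroids by index and rely on fancy indexing to keep the result in X's own array type. A hedged sketch (init_centroids_random is a hypothetical helper, not this PR's API; it behaves the same for numpy shown here, and a cupy array would stay on device):

```python
import numpy as np

def init_centroids_random(X, n_clusters, seed=0):
    """Hypothetical random init avoiding a host round trip: sampling row
    indices on the host is cheap, and X[indices] stays in X's array type."""
    rng = np.random.default_rng(seed)
    indices = rng.choice(X.shape[0], size=n_clusters, replace=False)
    # Fancy indexing returns the same array type as X (numpy, cupy, ...),
    # so device-resident data is never copied back to the host.
    return X[indices]
```

k-means++ is harder to generalize this way because its distance computations should themselves run on device, which is the first bullet's point.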

@ogrisel ogrisel changed the title [DRAFT] KMeans plugin API [DRAFT] Engine plugin API and engine entry point for Lloyd's KMeans Sep 23, 2022
@fcharras (Contributor) commented Dec 13, 2022

I adapted the plugin in soda-inria/sklearn-numba-dpex#74

@fcharras (Contributor) commented:

The last round of merges broke the tests 🔴: apparently some issue with the engine getter raising RuntimeError.

@ogrisel (Member, Author) commented Feb 1, 2023

@fcharras suggested that the last merge broke the CI without us realizing it because ogrisel#13 targeted the ogrisel/scikit-learn repo instead of the scikit-learn/scikit-learn repo, and as a result the regular CI config does not apply.

I think I should push the wip-engines branch on ogrisel/scikit-learn to a new long-running feature branch (e.g. named feature-engine-api) on the main scikit-learn/scikit-learn repo. This way, sub-PRs to this new feature branch would benefit from the usual CI setup without the extra maintenance of reconfiguring CI services to work for the ogrisel/scikit-learn repo.

Any opinion or better name suggestion @scikit-learn/core-devs?

/cc @betatim

@ogrisel (Member, Author) commented Feb 1, 2023

When doing so I will probably squash the existing history as most of the original commits have non-informative wip commit messages. Let me know if you are opposed to this plan. Shared co-authorship of the squashed commit should be preserved.

@jjerphan (Member) commented Feb 1, 2023

This looks appropriate to me. 👍

Nit: do you think feature/engine-api is slightly clearer?

@betatim (Member) commented Feb 3, 2023

Sounds like a good idea and it will be nice to have the benefit of robots checking our work :)

No opinion on the branch name.

@ogrisel (Member, Author) commented Feb 3, 2023

Closing in favor of the newly created #25535. Let's stop using the ogrisel/scikit-learn:wip-engines branch from now on.
