[DRAFT] Engine plugin API and engine entry point for Lloyd's KMeans #25535

Draft: ogrisel wants to merge 3 commits into main

Conversation

ogrisel (Member) commented Feb 3, 2023

This is a draft pull-request to allow third-party packages such as https://github.com/soda-inria/sklearn-numba-dpex to contribute alternative implementations of core computational routines of CPU-intensive scikit-learn estimators.

This would be particularly useful to experiment with GPU-optimized alternatives to our CPU-optimized Cython code.

This Draft PR serves as a long running design experiment (feature branch) to tackle #22438.

This is a takeover of the old PR (#24497) from the ogrisel/scikit-learn:wip-engines branch so as to make it easier to get the CI work when doing sub PRs targeting scikit-learn/scikit-learn:feature/engine-api.
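To give a rough idea of the intended usage pattern (the exact engine interface is precisely what this draft is meant to explore, so the snippet below is only a sketch, not a settled API): a third-party engine would be activated through the configuration, and estimators that have a matching engine would dispatch their computation to it.

```python
import numpy as np
import sklearn
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(1000, 10)

# Sketch: activate an engine contributed by a third-party plugin
# (e.g. sklearn_numba_dpex) for the duration of the context. Estimators
# without a matching engine keep using the default Cython implementation.
with sklearn.config_context(engine_provider="sklearn_numba_dpex"):
    km = KMeans(n_clusters=10, n_init=2).fit(X)
    labels = km.predict(X)
```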

ogrisel (Member, Author) commented Feb 3, 2023

I forgot to squash, let me do that now. Actually, with the previous merges it seems complicated to do without making a mistake. Let's keep that for later; I will just update the branch with the current main instead.

ogrisel (Member, Author) commented Feb 3, 2023

I fixed the linter-related failures on the CI, but there remain problems when running the common tests for KMeans:

pytest -v -k "test_estimators[KMeans"
[...]
15 failed, 27 passed, 2 skipped, 27664 deselected, 2 xfailed, 78 warnings in 6.31s

ogrisel (Member, Author) commented Feb 7, 2023

@fcharras @betatim the tests are now green. The last remaining point is codecov complaining we did not test enough.

fcharras (Contributor) commented Feb 7, 2023

Thanks for the follow-up and the new branch on scikit-learn. I will bump sklearn_numba_dpex to this commit and report here.

fcharras (Contributor) commented Feb 8, 2023

I have a few questions about the intended compatibility with the framework that enables a plugin to easily re-use the scikit-learn unit tests.

Currently we use this where the option --sklearn-engine-provider sklearn_numba_dpex is exposed in this branch here.

To ensure compatibility with the scikit-learn unit tests, the following requirements must be verified:

  • the estimators must expose attributes with the sklearn native types (i.e. engine_attributes set to "sklearn_types" in the configuration)
  • the computation must call the engine exposed by the plugin that has been set
  • the outputs of prediction methods must also have the sklearn native types

I think the first two requirements can be forced by setting engine_attributes to sklearn_types and engine_provider to the appropriate engine. Can one reasonably expect the engine to work if attributes are stored with sklearn types? In the current state, this can work if the engine implements the detection and conversion for such a case (which I think is reasonable to expect). Or should the conversion from sklearn types to native types be managed by sklearn directly? (e.g. with an additional convert_from_sklearn_types method)

Now, about the output type: in the case of config_context(engine_attributes="sklearn_types", engine_provider="my_plugin"), are predictions expected to work? And if so, what output type is expected? For the third requirement to hold, the output needs to have sklearn native types.

If this behavior is OK we can directly add engine_attributes="sklearn_types" here and maybe add a bit of documentation. WDYT?
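Concretely, the test-time configuration I have in mind would look something like this (assuming engine_attributes and engine_provider are indeed the retained option names):

```python
import sklearn
from sklearn.cluster import KMeans
from sklearn.utils.estimator_checks import check_estimator

# Assumed configuration for running the common tests against a plugin engine:
# compute with the plugin, but store fitted attributes (and, if the third
# requirement holds, prediction outputs) as sklearn native types so that the
# test assertions keep working.
with sklearn.config_context(
    engine_provider="sklearn_numba_dpex",
    engine_attributes="sklearn_types",
):
    check_estimator(KMeans())
```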

Also, for a given plugin we want to skip the tests for parameters and input types that the engine doesn't accept. We had designed an exception for it (this one). What do you think about it? An alternative could be that EngineAwareMixin._get_engine detects when it's run within a pytest session and then raises this exception, so that engines do not need to manage that at their level.

fcharras (Contributor):

The latest discussions suggested removing the NotSupportedByEngineError exception and just running all unit tests in the engine config context, relying on the fallback mechanism; see the relevant PR, #25605.

But some tests can't pass unless the config context also applies engine_attributes="sklearn_types". However, we don't want to apply this parameter everywhere, since it would reduce the coverage for the plugin being tested (because it would activate the default engine for predict methods everywhere).

See soda-inria/sklearn-numba-dpex#91 for an example of what goes wrong currently.

betatim (Member) commented Feb 14, 2023

Real life is tricky :-/ I'm not sure what the "right thing" is here. The test that fails checks that you can fit with sparse data and then predict on dense, and vice-versa. For "native" scikit-learn that seems like a reasonable thing to check.

In a plugin situation where we allow an engine to reject sparse input, which seems like a reasonable thing to allow, it isn't surprising that the test fails. In some sense it is a feature that the test fails, not a bug. From that point of view, the test is "not relevant", so it shouldn't be run for the engine.


Looking at the error message the user sees ("TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.") makes me wonder if we could provide a better one by re-running the acceptance selection. Right now the error isn't bad, but it comes from a place quite far away from the source of the problem. If we re-run the acceptance selection, the engine can raise a more specific error since it knows that sparse input is not supported for this engine.
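Something along these lines, purely as a sketch (the accepts hook exists in the draft, but its exact signature and where this check would live are assumptions on my side):

```python
# Sketch: re-run the engine's acceptance check at predict time so that the
# error points at the engine rather than at low-level validation code.
def _check_engine_accepts(engine, X, sample_weight=None):
    if not engine.accepts(X, sample_weight):
        raise TypeError(
            f"The selected engine ({engine.__class__.__name__}) does not "
            "support this kind of input (for instance sparse matrices). "
            "Convert the input or unset the engine via the configuration."
        )
```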

# when they set the config (ad-hoc class or string naming
# a provider).
engine_class = provider_name
if getattr(engine_class, "engine_name", None) != engine_name:
fcharras (Contributor):

WDYT of raising an error in this case? Or making the "engine_name" attribute optional? (I just realized my configuration wasn't working properly because the class I was passing to the config context was missing the "engine_name" attribute.)

betatim (Member):

I think all classes that want to be an engine should have an engine_name attribute. Otherwise it is unclear what engine the class implements. Imagine you wrap a pipeline that has several steps that each have a plugin.

So along those lines, maybe we should use if engine_class.engine_name != engine_name: here, which would lead to an exception if the class doesn't define an engine_name.

fcharras (Contributor) commented Feb 15, 2023:

+1 for letting it fail if the attribute is not defined

The default Cython engine for KMeans doesn't define this attribute. Even if it's not strictly needed there, would it be helpful to define it for the same reasons accepts is defined?

betatim (Member) commented Feb 15, 2023:

> The default cython engine for kmeans doesn't define this attribute.

That is true. I forgot that for the default engine and all engines that aren't "ad-hoc" engines (is this a good name for classes directly passed in, compared to those registered via an entrypoint?) the lookup has an additional level of indirection via an engine spec. The engine spec has a name attribute that contains the name of the engine. Just name seemed too generic as a name for the attribute of the "ad-hoc" classes, but maybe we should unify it?

I'm unsure about defining it also for the non-ad-hoc engine classes. It adds an additional location where this information is stored (attribute of the class and name of the entrypoint) and they could become out of sync. However, I also like having symmetry/no differences between the entrypoint and ad-hoc engines :-/

ogrisel (Member, Author):

Maybe the engine name could only be defined on the class itself and we could update the engine specs in the importlib metadata to only register the class?
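For illustration, the plugin side could then be as simple as registering the class itself (the entry-point group and module paths below are made up for the example, not the actual names used in this branch):

```python
# Hypothetical plugin setup.py: the entry point only points at the engine
# class, whose engine_name attribute would become the single source of truth.
from setuptools import setup

setup(
    name="sklearn_numba_dpex",
    entry_points={
        "sklearn_engines": [
            "kmeans = sklearn_numba_dpex.kmeans.engine:KMeansEngine",
        ],
    },
)
```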

But that means that we will need to import the class to discover its name, which is a bit sad, because it could trigger import overhead from plugins unrelated to the specific algorithm of interest.

ogrisel (Member, Author):

But maybe this is not a big deal.

betatim (Member):

We could add an engine_name attribute to the engine class when it is loaded via EngineSpec.get_engine_class?
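i.e. something like this (only a sketch; the actual EngineSpec fields in the branch may differ):

```python
import importlib


class EngineSpec:
    # Assumed fields: the entry-point name plus where to import the class from.
    def __init__(self, name, module_name, attribute_name):
        self.name = name
        self.module_name = module_name
        self.attribute_name = attribute_name

    def get_engine_class(self):
        module = importlib.import_module(self.module_name)
        engine_class = getattr(module, self.attribute_name)
        # Attach the entry-point name so that ad-hoc classes and registered
        # engines can be checked uniformly via engine_class.engine_name.
        engine_class.engine_name = self.name
        return engine_class
```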

fcharras (Contributor) commented Feb 22, 2023

> For a plugin situation where we allow an engine to reject sparse input, which seems like a reasonable thing to allow, then it isn't surprising that the test fails. In some sense it is a feature that the test fails, not a bug

Then I'd think that we either need:

  • to add as many new keywords in the config context as necessary so that we can define a context that makes all tests pass with the engine option in pytest (like enabling a fallback to the default engine in predict when the input type is not supported, which would solve our current issues),
  • or to keep the FeatureNotSupportedByEngine exception for use in the unit tests, so the engine can trigger a skip or xfail for those tests by raising it (see the sketch after this list),
  • or to have the engine provide a list of tests that are expected to fail, identified by their full pytest string, and expose it in a way that can be fetched by the sklearn conftest.

I think the third point is what we had discussed initially. I'm a bit concerned that it could be harder to maintain: maintainers would have to be notified whenever a test that was registered as expected to fail is edited upstream, so we would need pytest to provide a hash of each test, which sounds complicated. I'd be more inclined to explore one of the first two solutions first.
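For the second option, the mechanism could look roughly like this (the exception name comes from the existing draft; the skip helper is only a sketch of how the test side could consume it):

```python
import contextlib

import pytest


class FeatureNotSupportedByEngine(Exception):
    """Raised by an engine for parameters / input types it does not support."""


@contextlib.contextmanager
def skip_if_unsupported_by_engine():
    # Wrap a test body (e.g. from the sklearn conftest) so that plugin
    # limitations surface as pytest skips instead of failures.
    try:
        yield
    except FeatureNotSupportedByEngine as exc:
        pytest.skip(f"not supported by the selected engine: {exc}")
```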

ogrisel (Member, Author) commented Feb 23, 2023

At the last call, I seem to recall that we discussed the possibility of using the engine native types in the tests and checking whether the scikit-learn tests (assert_allclose and friends) would accept them naturally, assuming that all plugins would use numpy-like containers that follow the Array API and/or implement the __array__ method to allow numpy to convert those attributes to host-allocated numpy arrays as needed.

It might be worth experimenting to see how many tests would pass, and how many would need to be adapted to be less strict about type checks and instead rely more on duck typing.
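As a minimal illustration of what we would be relying on (a stand-in container, not the real dpctl.tensor type):

```python
import numpy as np
from numpy.testing import assert_allclose


class FakeDeviceArray:
    # Stand-in for a plugin's device array: as long as it implements
    # __array__, numpy converts it to a host-allocated array and helpers
    # like assert_allclose work unchanged.
    def __init__(self, data):
        self._data = np.asarray(data, dtype=np.float64)

    def __array__(self, dtype=None):
        return self._data if dtype is None else self._data.astype(dtype)


assert_allclose(FakeDeviceArray([1.0, 2.0]), [1.0, 2.0])
```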

fcharras (Contributor):

I looked at it some more:

  • assert_allclose currently fails because of the isfinite call when using sklearn_numba_dpex
  • taking dpctl.tensor as an example, making the sklearn unit tests compliant with the Array API would actually not be enough, since some of its features are still WIP.

github-actions bot commented Aug 4, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7d52073. Link to the linter CI: here

Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Franck Charras <29153872+fcharras@users.noreply.github.com>
fcharras (Contributor) commented Sep 1, 2023

@betatim @ogrisel, the feature/engine-api branch is in a bad state: conflicts are all over the place on random files, and I'm not sure I understand why (did I use a starting commit that was out of sync locally the last time I did the pull --no-rebase PR?). We could squash all commits and force-push to get back to a cleaner state. The resulting branch is here: https://github.com/fcharras/scikit-learn/tree/feature/engine-api, the single squashed commit is 2ccfc8c, and the diff with the current latest commit on scikit-learn main can be seen here: main...fcharras:scikit-learn:feature/engine-api . WDYT about force-pushing that onto scikit-learn:feature/engine-api? (I haven't altered the initial code from here; it should conform to the latest state we had settled on, except that it's rebased.)

To ensure that nothing was lost in the process, and assuming I might have made a mistake with this unfortunate 40K-line PR last time, I re-did the merge of the current main onto the latest commit before the last merge, which was a897a34, after double-checking that everything was well synced in my local dev environment. Then I squashed everything (using git diff / git apply patch) into a single commit.

I've tested the commit 2ccfc8c using this new plugin that is about to be merged.

Also, I'm about to start implementing an API for KNN (plus a corresponding PyTorch implementation in this same plugin).

ogrisel (Member, Author) commented Sep 13, 2023

Let me force-push the new commit.

ogrisel (Member, Author) commented Sep 13, 2023

@fcharras I force-pushed the result of your squash (2ccfc8c) under scikit-learn:feature/engine-api (this branch).
