[DRAFT] Engine plugin API and engine entry point for Lloyd's KMeans #25535

Draft: ogrisel wants to merge 3 commits into main

Conversation

ogrisel (Member) commented Feb 3, 2023

This is a draft pull-request to allow third-party packages such as https://github.com/soda-inria/sklearn-numba-dpex to contribute alternative implementations of core computational routines of CPU-intensive scikit-learn estimators.

This would be particularly useful to experiment with GPU-optimized alternatives to our CPU-optimized Cython code.

This Draft PR serves as a long running design experiment (feature branch) to tackle #22438.

This is a takeover of the old PR (#24497) from the ogrisel/scikit-learn:wip-engines branch so as to make it easier to get the CI work when doing sub PRs targeting scikit-learn/scikit-learn:feature/engine-api.
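To give a rough idea of the intended usage pattern (the exact engine interface is precisely what this draft is meant to explore, so the snippet below is only a sketch, not a settled API): a third-party engine would be activated through the configuration, and estimators that have a matching engine would dispatch their computation to it.

```python
import numpy as np
import sklearn
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(1000, 10)

# Sketch: activate an engine contributed by a third-party plugin
# (e.g. sklearn_numba_dpex) for the duration of the context. Estimators
# without a matching engine keep using the default Cython implementation.
with sklearn.config_context(engine_provider="sklearn_numba_dpex"):
    km = KMeans(n_clusters=10, n_init=2).fit(X)
    labels = km.predict(X)
```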

ogrisel (Member, Author) commented Feb 3, 2023

I forgot to squash, let me do that now. Actually, with the previous merges it seems complicated to do without making a mistake. Let's keep that for later; I will just update the branch with the current main instead.

ogrisel (Member, Author) commented Feb 3, 2023

I fixed the linter-related failures on the CI, but there remain problems when running the common tests for KMeans:

pytest -v -k "test_estimators[KMeans"
[...]
15 failed, 27 passed, 2 skipped, 27664 deselected, 2 xfailed, 78 warnings in 6.31s

ogrisel (Member, Author) commented Feb 7, 2023

@fcharras @betatim the tests are now green. The last remaining point is codecov complaining we did not test enough.

fcharras (Contributor) commented Feb 7, 2023

Thanks for the follow-up and the new branch on scikit-learn. I will bump sklearn_numba_dpex to this commit and report here.

fcharras (Contributor) commented Feb 8, 2023

I have a few questions about the intended compatibility with the framework that enables a plugin to easily re-use the scikit-learn unit tests.

Currently we use this where the option --sklearn-engine-provider sklearn_numba_dpex is exposed in this branch here.

To ensure compatibility with the scikit-learn unit tests, the following requirements must be verified:

  • the estimators must expose attributes with the sklearn native types (i.e. engine_attributes set to "sklearn_types" in the configuration)
  • the computation must call the engine exposed by the plugin that has been set
  • the outputs of prediction methods must also have the sklearn native types

I think the first two requirements can be forced by setting engine_attributes to sklearn_types and engine_provider to the appropriate engine. Can one reasonably expect the engine to work if attributes are stored with sklearn types? In the current state, this can work if the engine implements the detection and conversion for such a case (which I think is reasonable to expect). Or should the conversion from sklearn types to native types be managed by sklearn directly? (e.g. with an additional convert_from_sklearn_types method)

Now, about the output type: in the case of config_context(engine_attributes="sklearn_types", engine_provider="my_plugin"), are predictions expected to work? And if so, what output type is expected? For the third requirement to hold, the output needs to have sklearn native types.

If this behavior is OK we can directly add engine_attributes="sklearn_types" here and maybe add a bit of documentation. WDYT?
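Concretely, the test-time configuration I have in mind would look something like this (assuming engine_attributes and engine_provider are indeed the retained option names):

```python
import sklearn
from sklearn.cluster import KMeans
from sklearn.utils.estimator_checks import check_estimator

# Assumed configuration for running the common tests against a plugin engine:
# compute with the plugin, but store fitted attributes (and, if the third
# requirement holds, prediction outputs) as sklearn native types so that the
# test assertions keep working.
with sklearn.config_context(
    engine_provider="sklearn_numba_dpex",
    engine_attributes="sklearn_types",
):
    check_estimator(KMeans())
```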

Also, for a given plugin we want to skip the tests for parameters and input types that the engine doesn't accept. We had designed an exception for it (this one). What do you think about it? An alternative could be that EngineAwareMixin._get_engine detects when it's run within a pytest session and then raises this exception, so that engines do not need to manage that at their level.

fcharras (Contributor):

The latest discussions suggested removing the NotSupportedByEngineError exception and just running all unit tests in the engine config context, relying on the fallback mechanism; see the relevant PR, #25605.

But some tests can't pass unless the config context also applies engine_attributes="sklearn_types". However, we don't want to apply this parameter everywhere, since it would reduce the coverage for the plugin being tested (because it would activate the default engine for predict methods everywhere).

See soda-inria/sklearn-numba-dpex#91 for an example of what goes wrong currently.

betatim (Member) commented Feb 14, 2023

Real life is tricky :-/ I'm not sure what the "right thing" is here. The test that fails checks that you can fit with sparse data and then predict on dense, and vice-versa. For "native" scikit-learn that seems like a reasonable thing to check.

In a plugin situation where we allow an engine to reject sparse input, which seems like a reasonable thing to allow, it isn't surprising that the test fails. In some sense it is a feature that the test fails, not a bug. From that point of view, the test is "not relevant", so it shouldn't be run for the engine.


Looking at the error message the user sees ("TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.") makes me wonder if we could provide a better one by re-running the acceptance selection. Right now the error isn't bad, but it comes from a place quite far away from the source of the problem. If we re-run the acceptance selection, the engine can raise a more specific error since it knows that sparse input is not supported for this engine.
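Something along these lines, purely as a sketch (the accepts hook exists in the draft, but its exact signature and where this check would live are assumptions on my side):

```python
# Sketch: re-run the engine's acceptance check at predict time so that the
# error points at the engine rather than at low-level validation code.
def _check_engine_accepts(engine, X, sample_weight=None):
    if not engine.accepts(X, sample_weight):
        raise TypeError(
            f"The selected engine ({engine.__class__.__name__}) does not "
            "support this kind of input (for instance sparse matrices). "
            "Convert the input or unset the engine via the configuration."
        )
```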

# when they set the config (ad-hoc class or string naming
# a provider).
engine_class = provider_name
if getattr(engine_class, "engine_name", None) != engine_name:
fcharras (Contributor):

WDYT of raising an error in this case? Or making the "engine_name" attribute optional? (I just realized my configuration wasn't working properly because the class I was passing to the config context was missing the "engine_name" attribute.)

betatim (Member):

I think all classes that want to be an engine should have an engine_name attribute. Otherwise it is unclear what engine the class implements. Imagine you wrap a pipeline that has several steps that each have a plugin.

So along those lines, maybe we should use if engine_class.engine_name != engine_name: here, which would lead to an exception if the class doesn't define an engine_name.

fcharras (Contributor) commented Feb 15, 2023:

+1 for letting it fail if the attribute is not defined

The default Cython engine for KMeans doesn't define this attribute. Even if it's not strictly needed there, would it be helpful to define it for the same reasons accepts is defined?

betatim (Member) commented Feb 15, 2023:

> The default cython engine for kmeans doesn't define this attribute.

That is true. I forgot that for the default engine and all engines that aren't "ad-hoc" engines (is this a good name for classes directly passed in, compared to those registered via an entrypoint?) the lookup has an additional level of indirection via an engine spec. The engine spec has a name attribute that contains the name of the engine. Just name seemed too generic as a name for the attribute of the "ad-hoc" classes, but maybe we should unify it?

I'm unsure about defining it also for the non-ad-hoc engine classes. It adds an additional location where this information is stored (attribute of the class and name of the entrypoint) and they could become out of sync. However, I also like having symmetry/no differences between the entrypoint and ad-hoc engines :-/

ogrisel (Member, Author):

Maybe the engine name could only be defined on the class itself and we could update the engine specs in the importlib metadata to only register the class?
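For illustration, the plugin side could then be as simple as registering the class itself (the entry-point group and module paths below are made up for the example, not the actual names used in this branch):

```python
# Hypothetical plugin setup.py: the entry point only points at the engine
# class, whose engine_name attribute would become the single source of truth.
from setuptools import setup

setup(
    name="sklearn_numba_dpex",
    entry_points={
        "sklearn_engines": [
            "kmeans = sklearn_numba_dpex.kmeans.engine:KMeansEngine",
        ],
    },
)
```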

But that means that we will need to import the class to discover its name, which is a bit sad, because it could trigger import overhead from plugins unrelated to the specific algorithm of interest.

ogrisel (Member, Author):

But maybe this is not a big deal.

betatim (Member):

We could add an engine_name attribute to the engine class when it is loaded via EngineSpec.get_engine_class?
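i.e. something like this (only a sketch; the actual EngineSpec fields in the branch may differ):

```python
import importlib


class EngineSpec:
    # Assumed fields: the entry-point name plus where to import the class from.
    def __init__(self, name, module_name, attribute_name):
        self.name = name
        self.module_name = module_name
        self.attribute_name = attribute_name

    def get_engine_class(self):
        module = importlib.import_module(self.module_name)
        engine_class = getattr(module, self.attribute_name)
        # Attach the entry-point name so that ad-hoc classes and registered
        # engines can be checked uniformly via engine_class.engine_name.
        engine_class.engine_name = self.name
        return engine_class
```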

fcharras (Contributor) commented Feb 22, 2023

> For a plugin situation where we allow an engine to reject sparse input, which seems like a reasonable thing to allow, then it isn't surprising that the test fails. In some sense it is a feature that the test fails, not a bug

Then I'd think that we either need:

  • to add as many new keywords in the config context as necessary so that we can define a context that makes all tests pass with the engine option in pytest (like enabling a fallback to the default engine in predict when the input type is not supported, which would solve our current issues),
  • or to keep the FeatureNotSupportedByEngine exception for use in the unit tests, so the engine can trigger a skip or xfail for those tests by raising it (see the sketch after this list),
  • or to have the engine provide a list of tests that are expected to fail, identified by their full pytest string, and expose it in a way that can be fetched by the sklearn conftest.

I think the third point is what we had discussed initially. I'm a bit concerned that it could be harder to maintain: maintainers would have to be notified whenever a test that was registered as expected to fail is edited upstream, so we would need pytest to provide a hash of each test, which sounds complicated. I'd be more inclined to explore one of the first two solutions first.
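For the second option, the mechanism could look roughly like this (the exception name comes from the existing draft; the skip helper is only a sketch of how the test side could consume it):

```python
import contextlib

import pytest


class FeatureNotSupportedByEngine(Exception):
    """Raised by an engine for parameters / input types it does not support."""


@contextlib.contextmanager
def skip_if_unsupported_by_engine():
    # Wrap a test body (e.g. from the sklearn conftest) so that plugin
    # limitations surface as pytest skips instead of failures.
    try:
        yield
    except FeatureNotSupportedByEngine as exc:
        pytest.skip(f"not supported by the selected engine: {exc}")
```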

ogrisel (Member, Author) commented Feb 23, 2023

At the last call, I seem to recall that we discussed the possibility of using the engine native types in the tests and checking whether the scikit-learn tests (assert_allclose and friends) would accept them naturally, assuming that all plugins would use numpy-like containers that follow the Array API and/or implement the __array__ method to allow numpy to convert those attributes to host-allocated numpy arrays as needed.

It might be worth experimenting to see how many tests would pass, and how many would need to be adapted to be less strict about type checks and instead rely more on duck typing.
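As a minimal illustration of what we would be relying on (a stand-in container, not the real dpctl.tensor type):

```python
import numpy as np
from numpy.testing import assert_allclose


class FakeDeviceArray:
    # Stand-in for a plugin's device array: as long as it implements
    # __array__, numpy converts it to a host-allocated array and helpers
    # like assert_allclose work unchanged.
    def __init__(self, data):
        self._data = np.asarray(data, dtype=np.float64)

    def __array__(self, dtype=None):
        return self._data if dtype is None else self._data.astype(dtype)


assert_allclose(FakeDeviceArray([1.0, 2.0]), [1.0, 2.0])
```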

fcharras (Contributor):

I looked at it some more:

  • assert_allclose currently fails because of the isfinite call when using sklearn_numba_dpex
  • taking dpctl.tensor as an example, making the sklearn unit tests compliant with the Array API would actually not be enough, since some of its features are still WIP.

github-actions bot commented Aug 4, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7d52073. Link to the linter CI: here

Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Franck Charras <29153872+fcharras@users.noreply.github.com>
fcharras (Contributor) commented Sep 1, 2023

@betatim @ogrisel, the feature/engine-api branch is in a bad state: conflicts are all over the place on random files, and I'm not sure I understand why (did I use a starting commit that was out of sync locally the last time I did the pull --no-rebase PR?). We could squash all commits and force-push to get back to a cleaner state. The resulting branch is here: https://github.com/fcharras/scikit-learn/tree/feature/engine-api, the single squashed commit is 2ccfc8c, and the diff with the current latest commit on scikit-learn main can be seen here: main...fcharras:scikit-learn:feature/engine-api . WDYT about force-pushing that onto scikit-learn:feature/engine-api? (I haven't altered the initial code from here; it should conform to the latest state we had settled on, except that it's rebased.)

To ensure that nothing was lost in the process, and assuming I might have made a mistake with this unfortunate 40K-line PR last time, I re-did the merge of the current main onto the latest commit before the last merge, which was a897a34, after double-checking that everything was well synced in my local dev environment. Then I squashed everything (using git diff / git apply patch) into a single commit.

I've tested the commit 2ccfc8c using this new plugin that is about to be merged.

Also, I'm about to start implementing an API for KNN (plus a corresponding PyTorch implementation in this same plugin).

ogrisel (Member, Author) commented Sep 13, 2023

Let me force-push the new commit.

ogrisel (Member, Author) commented Sep 13, 2023

@fcharras I force-pushed the result of your squash (2ccfc8c) under scikit-learn:feature/engine-api (this branch).
