
Path for pluggable low-level computational routines #22438

Open
ogrisel opened this issue Feb 10, 2022 · 30 comments

Comments

@ogrisel
Member

ogrisel commented Feb 10, 2022

The goal of this issue is to discuss the design and prototype a way to register alternative implementations for core low level routines in scikit-learn, in particular to benefit from hardware optimized implementations (e.g. using GPUs efficiently).

Motivation

scikit-learn aims to provide reasonably easy to maintain and portable implementations of standard machine learning algorithms. Those implementations are typically written in Python (with the help of NumPy and SciPy) or in Cython when the overhead of the Python interpreter prevents us from efficiently implementing algorithms with (nested) tight loops. This allows us to ship reasonably fast implementations as binary packages (installable with pip/PyPI, conda/conda-forge, conda/anaconda or Linux distros) for a variety of platforms (Linux / macOS / Windows) x (x86_64, i686, arm64, ppc64le) from a single code-base with no external runtime dependencies beyond Python, NumPy and SciPy.

Recently, GPU hardware has proven very competitive for many machine learning workloads, either from a pure latency standpoint, or from the standpoint of a better computation/energy trade-off (irrespective of raw speed considerations). However, hardware-optimized implementations are typically not portable and mandate additional dependencies.

We therefore propose to design a way for our users to register alternative implementations of low-level computational routines in scikit-learn, provided they have installed the required extension package(s) that match their specific hardware.

Relationship to adopting the Array API spec

This proposal is related and complementary to another effort, namely:

The Array API spec makes it possible for some scikit-learn estimators written using pure NumPy syntax to delegate their computation to alternative Array API compatible libraries such as CuPy.

However, some algorithms in scikit-learn cannot be efficiently written using NumPy operations only. For instance, the main K-Means loop is written in Cython to process chunks of samples in parallel (using prange and OpenMP), compute distances to the centroids and reduce those distances on-the-fly to assign each sample to its closest centroid, while preventing unnecessary memory transfers between CPU cache and RAM.

If we want to run this algorithm efficiently on GPU hardware, we would need to dispatch the computation of this low-level function to an alternative implementation that can work on GPU, either written in C/C++ with GPU-specific supporting runtime libraries and compilers (e.g. OpenCL, NVIDIA CUDA, Intel oneAPI DPC++, AMD ROCm...) or using a Python syntax with the help of the GPU support provided in numba, for instance.
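For illustration, here is a minimal NumPy-level sketch of the fused "pairwise distances + argmin" pattern described above. It is only a simplified stand-in for the actual Cython/OpenMP implementation, and the function name is made up:

# Illustrative sketch only: the real implementation is written in Cython with
# prange/OpenMP; this only shows the fused "distances + argmin" pattern.
import numpy as np

def chunked_assign_to_centroids(X, centroids, chunk_size=256):
    """Assign each sample to its closest centroid, one chunk at a time."""
    n_samples = X.shape[0]
    labels = np.empty(n_samples, dtype=np.intp)
    for start in range(0, n_samples, chunk_size):
        stop = min(start + chunk_size, n_samples)
        chunk = X[start:stop]
        # (chunk_size, n_centroids) distance block, reduced right away so the
        # full (n_samples, n_centroids) matrix is never materialized.
        distances = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels[start:stop] = distances.argmin(axis=1)
    return labels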

List of candidate routines

  • the main k-means loop (pairwise distances between samples and centroids with on-the-fly argmin reduction)
  • the core k-nearest neighbors computation loop (pairwise distances with on-the-fly arg-k-min reduction)
  • pairwise distances (without arg-k-min reduction)
  • pairwise kernels computation (e.g. for use in Nystroem)
  • ...

Explicit registration API design ideas

I started to draft some API ideas in:

Feel free to comment here, or there.

This design is expected to evolve, in particular to make it possible to register both Array API extensions and non-Array API extensions with the same registration API.

Next steps

@thomasjpfan
Member

thomasjpfan commented Mar 7, 2022

Here is a proposal for the developer API for multi-dispatching:

scikit-learn dispatching

This registers pairwise_distance as a function that can be dispatched:

from sklearn.utils.dispatch import multidispatch

# Dispatch on "X" and "metrics"
@multidispatch(domain="sklearn.metrics", keys=["X", "metric"])
def pairwise_distance(...):
    ...

Backend Provider

Using Python types

The following registers faster_pairwise, with accepted inputs specified by the Python type annotations. This backend only supports the "euclidean" and "manhattan" metrics:

from typing import Literal

from sklearn.utils.dispatch import Backend
import numpy as np

be = Backend(domain="sklearn.metrics")

@be.register(name="pairwise_distance")
def faster_pairwise(X: np.ndarray, ..., metric: Literal["euclidean", "manhattan"]):
    ...

Specifying a CuPy array:

Specifying cp.ndarray in the typing annotation dispatches for CuPy arrays:

import cupy as cp

@be.register(name="pairwise_distance")
def faster_pairwise(X: cp.ndarray, ...):
    ...

Backend only supports float32

There can be a callable that is passed the dispatch keys and returns True if the backend supports the input.

# Use **kwargs in case we want to dispatch on more keys in the future
@be.register(
    name="pairwise_distance",
    supports=lambda **kwargs: kwargs["X"].dtype == np.float32
)
def faster_pairwise(X: np.ndarray, ...):
    ...

Note that this callable can be used instead of Python typing.
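To make the dispatching side concrete, here is a minimal sketch of how the dispatcher could consult such supports callables (all names here are hypothetical, not part of the proposal above):

# Hypothetical resolution logic: try registered backends in order and fall
# back to the default scikit-learn implementation if none accepts the input.
_registered_backends = []  # list of (supports_callable, implementation) pairs

def dispatch(default_impl, **kwargs):
    for supports, impl in _registered_backends:
        if supports(**kwargs):
            return impl(**kwargs)
    return default_impl(**kwargs)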

@ogrisel
Member Author

ogrisel commented Mar 18, 2022

It might be more complex than this because we will need to:

  • do the backend selection depending on the input data when we call _validate_data on an estimator, which is well before we actually reach the concrete accelerated code in fit
  • when we do an iterative computation in fit, we might want to pre-allocate backend-specific temporary data structures before entering the loop and then call the accelerated function with those temporary data structures as arguments. But we might still want to keep the loop itself in Python at the vanilla scikit-learn level because we might need to hook in the callbacks ([WIP] Callback API continued #22000).

@thomasjpfan
Member

do the backend selection depending on the input data when we call _validate_data on an estimator, which is well before we actually reach the concrete accelerated code in fit

If dispatching is decided on the estimator level with _validate_data, then the user API for registering a backend could be:

# Set computational backend for training
clf = KNeighborsClassifier(n_neighbors=5).set_computational_engine("sklearn_dppy")
clf.fit(X_train, y_train)

# changing computational backend for prediction
clf.set_computational_engine("sklearn_numba")
y_pred = clf.predict(X_test)

This has the benefit of not needing a global state when using the context manager. What do you think?
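For comparison, here is a minimal sketch of the context-manager alternative being alluded to (entirely hypothetical; nothing like this exists in scikit-learn today):

from contextlib import contextmanager

# Hypothetical global (or thread-local) engine setting consulted at fit time.
_engine_stack = ["sklearn"]

@contextmanager
def computational_engine(name):
    _engine_stack.append(name)
    try:
        yield
    finally:
        _engine_stack.pop()

# with computational_engine("sklearn_numba"):
#     clf.fit(X_train, y_train)  # fit would look up _engine_stack[-1]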

@ogrisel
Member Author

ogrisel commented Mar 21, 2022

This has the benefit of not needing a global state when using the context manager. What do you think?

Indeed, but it might become tedious to set the engine for each step in a pipeline once we start supporting many estimators.

@thomasjpfan
Member

thomasjpfan commented Mar 28, 2022

Do we want to support using different dispatchers for each estimator in a pipeline?

pipe = make_pipeline(
    StandardScaler().set_computational_engine("sklearn_faster_scalar"),
    PCA().set_computational_engine("sklearn_numba"),
    LogisticRegression()  # use default scikit-learn
)

On another note, what would the API look like for device transfers? For example:

pipe = make_pipeline(
    PCA().set_computational_engine("sklearn_cuml"),
    LogisticRegression()  # use default scikit-learn
)

# X and y are NumPy arrays
pipe.fit(X, y)

If we want the above to "just work", the backend sklearn_cuml would need to transfer the data onto the GPU and then logistic regression would need to transfer it back to CPU.
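A rough sketch of what such a transfer shim could look like inside a hypothetical GPU backend (cp.asarray / cp.asnumpy are real CuPy functions; the wrapper itself is made up):

import cupy as cp

def fit_on_gpu(backend_fit, X, y):
    """Hypothetical wrapper: move NumPy inputs to the GPU, run the backend
    there, and hand the result back as a NumPy array so the next pipeline
    step can keep working on CPU."""
    result_gpu = backend_fit(cp.asarray(X), cp.asarray(y))
    return cp.asnumpy(result_gpu)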

@ogrisel
Member Author

ogrisel commented Mar 29, 2022

Indeed, maybe we could allow for both APIs (context manager + estimator-level engine registration). We will make all the API private in the short term (in the experimental phase) so that we can get a better feeling for both options.

I am working on a draft branch and will try to publish it somewhere by the end of the week.

@lorentzenchr
Member

IIUC, this would mean that we define a plugin API. Is this correct?
What is the advantage of such a plugin mechanism vs libraries implementing their own estimators?

@ogrisel
Member Author

ogrisel commented Mar 29, 2022

More code/documentation reuse and public API consistency.

@dantegd

dantegd commented Apr 15, 2022

From the developers of cuML, we think this is a great proposal to improve users' experience and extend Scikit-learn without impacting its ease of use, and we'd love to collaborate and contribute towards making it happen. The main advantages of this approach as we see it, compared to just having libraries following the Scikit-learn APIs but being separate, mirror what @ogrisel says: ensuring a more consistent user experience, reducing the barrier to entry (still using scikit-learn proper with an option, as opposed to a new library), and discoverability/documentation.

There are quite a few elements where we would like to give our feedback based on the past few years of developing a scikit-learn-like library for GPUs. First, I think the API that probably would have the least need for maintenance from Scikit-learn itself is indeed:

# Set computational backend for training
clf = KNeighborsClassifier(n_neighbors=5).set_computational_engine("sklearn_dppy")
clf.fit(X_train, y_train)

using _validate_data as @thomasjpfan mentioned. This is due to a number of things:

  1. It builds on top of the Scikit-learn API design and how many projects (like us, and others in scikit-contrib) already have built towards those APIs.
  2. It moves the maintenance complexity towards the backends, as long as there is clarity on the API contracts (which Scikit-learn has always been fantastic at establishing already).

I think that second point is particularly important to make the effort easily adoptable by future libraries that might use different types of hardware. Today, for cuML for example, that means it's on us to accept NumPy/CPU objects and we do the transfers to device and back, which is something we've learnt we already had to support due to users' expectations anyway.

That said, the mechanism could be even more powerful if the pipeline machinery in Scikit-learn could relax some validations so that memory transfers could be minimized in pipelines like:

pipe = make_pipeline(
    PCA().set_computational_engine("sklearn_cuml"),
    LogisticRegression()  # use default scikit-learn
)

Perhaps a mechanism that registers the "preferred" device/format of a computational engine, so that if there are multiple consecutive algorithms on the same device, the data doesn't need to be transferred back and forth. One problem we've had to address in cuML is how to minimize data transfers and conversions (for example, row-major to column-major). Generally, computational engines may have preferred memory formats for particular algorithms (e.g. row-major vs column-major), so one thing we might want to think about is mechanisms to allow an engine to maintain data in its preferred location and format through several chained calls to that engine. Being able to register this "preference" allows backends to take advantage of it if desired, or just default to using NumPy arrays, so it is opt-in, which means it wouldn't complicate engine development unless the engine needs it. It would also keep maintenance on the scikit-learn codebase side low, by keeping the bulk of that responsibility (and flexibility) on the engine side.
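A sketch of what registering such a preference could look like (a purely hypothetical API, only to make the idea concrete):

# Hypothetical engine metadata: the pipeline machinery could consult these
# attributes to decide whether data can stay on the device between two
# consecutive steps that use the same engine.
class CumlEngineInfo:
    name = "sklearn_cuml"
    preferred_device = "gpu"
    preferred_memory_layout = "F"  # column-major

    def accepts(self, X):
        # Opt-in: an engine that does not care simply accepts NumPy input.
        return True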

@ogrisel
Member Author

ogrisel commented May 16, 2022

Indeed. Having the ability to register the memory layout for the output data structure of a transformer is also linked to the pandas-in / pandas-out discussion:

@ogrisel
Member Author

ogrisel commented May 16, 2022

I also want to see if relying on uarray (a pluggable type-based dispatching mechanism that is progressively being leveraged in SciPy for similar purposes) would be useful, or if we should rather handle the backend dispatching manually. I am thinking that non-trivial dispatching decisions might happen during _validate_data in the fit / predict / transform call of a given estimator and that this decision should then consistently be reused throughout the rest of the fit / predict / transform call.

In particular, before entering an iterative loop in a fit method, some temporary data structures would probably need to be allocated consistently with that dispatching decision and then passed as arguments to the backend-specific implementation called repeatedly inside the fit loop, while the fit loop itself would stay in the scikit-learn code-base, for instance to call per-iteration callbacks (#22000) or to share the stopping criterion logic across all backends.
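A schematic sketch of that split of responsibilities (all engine method names are hypothetical):

# Hypothetical pattern: the engine allocates its own temporary data
# structures once, while scikit-learn keeps ownership of the loop, the
# stopping criterion and the per-iteration callbacks.
def fit_loop(estimator, X, y, engine, callbacks=(), max_iter=300, tol=1e-4):
    state = engine.prepare_fit(X, y)              # backend-specific buffers
    for n_iter in range(max_iter):
        change = engine.run_iteration(state)      # accelerated kernel call
        for callback in callbacks:
            callback(estimator, n_iter, state)
        if change < tol:                          # shared stopping criterion
            break
    return engine.finalize(state)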

@adrinjalali
Member

cc @scikit-learn/core-devs

@betatim
Member

betatim commented Sep 6, 2022

IIUC, this would mean that we define a plugin API. Is this correct?

I like this way of thinking about this. In particular because you want to replace more than just one method (fit()) but might also want to replace preprocessing stuff. So it feels more like "install a plugin sklearn-XYZ-foobar to get speed ups for estimator XYZ if you have a FooBar accelerator". The plugin then uses predefined hooks to replace/augment(?) the code that is already in scikit-learn. Or maybe there will just be a plugin per accelerator.

Something like https://pluggy.readthedocs.io/en/stable/ looks like a nice way to define plugins, might be a good place to learn from/get inspiration (I'm not sure using pluggy directly is the right thing to do).
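For reference, the basic pluggy machinery looks roughly like this (the hook name and plugin are invented for the sake of the example):

import pluggy

hookspec = pluggy.HookspecMarker("sklearn")
hookimpl = pluggy.HookimplMarker("sklearn")

class KMeansSpec:
    @hookspec(firstresult=True)
    def kmeans_single_lloyd(self, X, centers_init):
        """Run a single Lloyd pass; the first non-None result wins."""

class FooBarPlugin:
    @hookimpl
    def kmeans_single_lloyd(self, X, centers_init):
        # Returning None means "pass" and lets the next plugin (or the
        # default scikit-learn implementation) handle the call.
        return None

pm = pluggy.PluginManager("sklearn")
pm.add_hookspecs(KMeansSpec)
pm.register(FooBarPlugin(), name="foobar")
result = pm.hook.kmeans_single_lloyd(X=None, centers_init=None)  # placeholders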

Some questions that come up if you use a plugin system:

  • What if more than one plugin hooks into the same thing? Which plugin wins?
  • What if a user doesn't want to use the plugin for some of the estimators that it supports?

I think there must be prior art for this kind of stuff out there that we could learn from. And maybe it is very unlikely that someone has more than one accelerator in the same computer/would want to disable acceleration for some estimators but not others?

@jjerphan
Member

jjerphan commented Sep 7, 2022

Thanks for participating in this conversation, @betatim.

I like this way of thinking about this. In particular because you want to replace more than just one method (fit()) but might also want to replace preprocessing stuff. So it feels more like "install a plugin sklearn-XYZ-foobar to get speed ups for estimator XYZ if you have a FooBar accelerator". The plugin then uses predefined hooks to replace/augment(?) the code that is already in scikit-learn. Or maybe there will just be a plugin per accelerator.

This probably is not (yet) explicit, but I think we somewhat plan to have a coarser granularity, i.e. have a plugin per accelerator/vendor (foobar) that would ship implementations for several methods/functions (XYZ).

I think there must be prior art for this kind of stuff out there that we could learn from. And maybe it is very unlikely that someone has more than one accelerator in the same computer/would want to disable acceleration for some estimators but not others?

I also think it is appropriate to consider some UX regarding plugins (beforehand). Do you know of projects that are using pluggy, so as to learn from their experience? I will try to search for such projects.

@betatim
Member

betatim commented Sep 7, 2022

I also think it is appropriate to consider some UX regarding plugins (beforehand). Do you know of projects that are using pluggy, so as to learn from their experience? I will try to search for such projects.

pytest (pluggy was "invented" for/by pytest) and datasette are two projects I know of that use pluggy.

@betatim
Member

betatim commented Sep 19, 2022

I started writing some code to get a feeling for what a "plugin system" based on pluggy would look like (code over architecture docs). I already learnt a lot and made more sense of how pluggy works.

My plan is to get the basic infrastructure working and to see if it's possible to do things like having multiple plugins that implement the same function, disabling plugins dynamically, etc.

I'll post a link to my branch here instead of opening a PR. WDYT?

@jjerphan
Member

I started writing some code to get a feeling for what a "plugin system" based on pluggy would look like (code over architecture docs). I already learnt a lot and made more sense of how pluggy works.

My plan is to get the basic infrastructure working and to see if it's possible to do things like having multiple plugins that implement the same function, disabling plugins dynamically, etc.

I'll post a link to my branch here instead of opening a PR. WDYT?

Thanks for having explored this. I am interested in your branch, and I don't think a PR is required if it's experimental.

cc @fcharras who might also be interested.

@betatim
Member

betatim commented Sep 20, 2022

I picked the kmeans example/function linked in the top comment. The only reason for that is that it was easy; I didn't think about whether it is the best example or not. This means if you have a preference/opinion, let's switch to that.

My goal was to write some code based on pluggy to be able to see it in action. I wanted to have a "base plugin" with the default sklearn implementation and then one "other plugin" (the fake cupy plugin in this case). I wanted two plugins in order to see what would happen if one of them wanted to pass on the computation or if it only implements a subset of the hooks. I haven't tried to find an optimal place for all the code to live, I think it feels a bit clunky right now. I haven't thought about performance either.

Diff of the first commit (diff of the branch, might change as I work on it).

I used the bisecting kmeans example to run the code.

I haven't really thought about how kmeans works, but it feels like the hook is in the wrong place now. Maybe it should be "higher level", maybe somewhere in BisectingKMeans.fit()? I'll leave it as it is for now though.

Related to "where should the hook be?": I think we should have the hooks "high enough" so that the overhead of picking a hook and calling it doesn't matter. This means we shouldn't hook a function that only takes 50ns to execute but is called 1000 times. Instead the hook should be "above" the loop that is making the 1000 calls. Ideally you call a hook with enough work so that it would take a few seconds or more to execute.

I like that this example allows the hook implementations to decide if they want to "take the call or not". For example a GPU plugin could decide that it will take action if the input data is already in the GPU memory or if the input data is not on the GPU memory yet but it is "big enough" so that transferring it to GPU memory is worth it. Basically, I think the decision of which hook implementation to use is more complicated than just the datatype of the input data.

Unlike in my hack above, all the "go faster" plugins should be their own Python package that people install.

I like that a plugin is "activated" by pip installing it. It seems like as a user you'd install the plugin to get "faster computations with the magic hardware I have" and want to use it all the time. Even nicer is that I think we can use set_/get_config() together with plugins.manager.subset_hook_caller("kmeans_single_lloyd", list_of_active_plugins) to temporarily disable a plugin. Instead of calling plugins.manager.hook.kmeans_single_lloyd() directly, you use subset_hook_caller() together with a list of active plugins based on get_config().
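A sketch of that toggle follows; note that pluggy's subset_hook_caller() actually takes the plugins to exclude, so the configuration would list the disabled plugins (the get_config() key below is hypothetical):

from sklearn import get_config

# Hypothetical glue between sklearn.get_config() and pluggy: build a hook
# caller that skips whichever plugins the user has disabled.
def call_kmeans_single_lloyd(pm, **kwargs):
    disabled = get_config().get("disabled_engine_plugins", [])  # hypothetical key
    to_remove = [pm.get_plugin(name) for name in disabled]
    hook = pm.subset_hook_caller("kmeans_single_lloyd", remove_plugins=to_remove)
    return hook(**kwargs)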

This is a super long comment already, so I'll stop now and wait to hear what you think 😃

@fcharras
Contributor

It's cool that you picked KMeans, @betatim. We are in the process of implementing a GPU KMeans plugin here; the plugin interface is not done yet, but it was planned to work with this branch of scikit-learn that has been drafted by @ogrisel. We can also work on an interface with your branch.

@ogrisel
Member Author

ogrisel commented Sep 20, 2022

Sorry, indeed I had not seen the notifications for this discussion. I will try to resume my work on this branch later today to get it in a minimally functional state and hope we can get a constructive discussion on the design from there.

@ogrisel
Member Author

ogrisel commented Sep 22, 2022

I will try to resume my work on this branch later today to get it in a minimally functional state and hope we can get a constructive discussion on the design from there.

@betatim here it is: #24497

@betatim
Member

betatim commented Sep 23, 2022

Thank you!

How do you think we should try and decide things? Should we start with figuring out the features we need/care about/don't care about to see if we can already filter out some approaches?

From reading all the comments above it isn't clear to me what people think is important, was important but not anymore, etc. For me it would be good to figure out "what is the actual problem we want to solve, and what are problems we don't want to solve". Maybe updating (together) https://hackmd.io/@oliviergrisel/S1RDJ3HCF?

@ogrisel
Member Author

ogrisel commented Sep 23, 2022

For information: we had a quick chat this afternoon with @betatim and @fcharras where we discussed the current design proposed in #24497 and the choice to make engine activation explicit and manual (at least for now), and not dependent on the type of the input container.

figure out "what is the actual problem we want to solve, and what are problems we don't want to solve"

This is not 100% clear yet to me either. Any current API choice is subject to change as we start implementing engines and get some practical experience on their usability when we try to use them for "real-life"-ish datascience tasks.

We plan to organize a public online meeting dedicated to this topic on the scikit-learn Discord server in the coming weeks for those interested. We will announce it on the mailing list and maybe Twitter.

@betatim
Member

betatim commented Sep 27, 2022

I'll put some thoughts "for the future"/to keep at the back of the mind here for now. I don't have answers or expect answers to them, but I need to write them down somewhere otherwise I'll forget:

  • can we make it so that it is "not impossible" to have a future where plugins are enabled by default? In a world where all this has become a bit more stable it seems weird to ask users to explicitly activate something after they already explicitly activated it by installing the plugin (I think a method to opt out when needed is nicer)
  • what should happen if multiple plugins that implement the same engine are activated at the same time? Error? Some kind of priority?
  • What if multiple plugins are activated but only one of them works for the particular input dtype/hyper-parameters?
  • What is a good way for users to see "which plugin did something"? Thinking of bug reports where it has to be easy for users to report which plugins were involved. I think just listing what plugins they've got installed has a high chance of leaving open questions because plugins can be activated/deactivated/set for particular estimators/raise NotImplemented/etc (and it probably also depends on the versions of the plugins...)

@lorentzenchr
Member

If I may, I'd like to raise some stupid questions:

  1. Do we need to change our statement on GPUs in the FAQ? GPUs are one of the main motivations if I understand the top post correctly.
  2. If we provide a low-level plugin API, how will that influence or prohibit our own improvements? We will lose a certain amount of flexibility and have more maintenance to do.
  3. What's the main reason for a plugin API vs someone else inheriting from BaseEstimator and writing their own classes?

@betatim
Member

betatim commented Oct 28, 2022

Good questions to be asking, some thoughts on possible answers:

  1. not sure the statement needs changing. scikit-learn itself will continue to not support GPUs. Maybe worth extending the statement to mention "others might have created a plugin that takes advantage of your GPU"?
  2. I think for now the engine API (to stick with the naming in this PR) comes with no backwards compatibility promises because it is experimental. Once it isn't experimental anymore I'd expect it would come with some promises which make big changes harder. I think a way to minimise the chances for a need to change the engine API of an estimator to come up is to implement some engines as part of designing the engine API. My assumption being that if you can implement the current implementation and one or two others with the proposed API, it "may be a good idea".
  3. IMHO the main reason to have an engine API is that you can configure which backend is used, without having to change code. If you interview users they prefer (and have higher trust in) a switch, compared to using implementations from two different packages.

Related to (3), but a new thought: you could imagine having an engine/plugin that provides a different implementation of sklearn.metrics.pairwise.pairwise_distance and sklearn.metrics.pairwise.pairwise_distance_argmin, thereby speeding up all the parts of scikit-learn that use these functions. You could technically do this in the "separate package" approach by reaching into the innards of sklearn and patching things. But I think mostly the lesson from such attempts is that you quickly end up in a hell of breakage and weird behaviour.

@fcharras
Contributor

I can provide some insights for 3. from the experimental plugin we're developing:

  • for users:

    • a plugin api can act like a standard that enforces consistency across different implementations of the same algorithm
    • it can unlock interoperability with a larger scope of hardware and underlying runtimes, e.g. an estimator that I train on a GPU from any manufacturer with a specialized tensor library could be ported to a CPU implementation that uses numpy and vice versa, in a similar fashion to what is being tested with the array api
  • in return, developers of said plugins benefit from assets in sklearn that can be reused to improve reliability and publishing, and tooling that helps assert that the implementation respects the contract offered by the standard api:

    • sklearn documentation
    • sklearn unit tests, see this example

@betatim
Member

betatim commented Nov 2, 2022

Does anyone have thoughts on what API the different engines (for estimators) should have? Right now it looks like we are leaning towards a "bespoke engine API per estimator", based on the naming of the methods in the KMeans proof-of-concept.

Is this a conscious choice? Something in flux? The more I've thought about having engines for many estimators, the more I am thinking it would make sense to have one (or a few) engine APIs. For example "prepare to fit", "fit", "post fit", "prepare to predict", "predict", "post predict", etc.
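To make that alternative concrete, a single shared engine interface could look something like the following sketch (this is not an existing or proposed API, just an illustration):

from abc import ABC, abstractmethod

class EstimatorEngine(ABC):
    """Hypothetical shared engine API reused by every pluggable estimator."""

    @abstractmethod
    def prepare_fit(self, estimator, X, y):
        """Validate/convert the inputs and allocate backend-specific state."""

    @abstractmethod
    def fit(self, state):
        """Run the accelerated training computation."""

    @abstractmethod
    def post_fit(self, estimator, state):
        """Copy the fitted attributes back onto the scikit-learn estimator."""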

WDYT?

@lorentzenchr
Member

Given the discussion in #26010, a plugin mechanism seems a desirable feature. As it is a considerable amount of API change/addition, I would appreciate having a SLEP for it.

We already have public APIs with fit, predict, etc. It is quite easy for a 3rd party library to just derive from one of our estimators and re-implement the fit method, for example. The point is that fit often does quite a lot of things, in particular a lot of param validation and setting. A plugin mechanism would ideally circumvent this. Then there are 2 paths that I can see (but my sight is very limited here):

  1. An estimator specific plugin API, case by case.
  2. A general lower level API, e.g. the_real_work_behind_fit and the_real_work_behind_predict, in most/every estimator.
    The actual work done in this routine is also case by case.

Summary: Case by case then?

@ogrisel
Member Author

ogrisel commented Jun 16, 2023

I agree there is still a tension between estimator-level and lower-level APIs, and also on whether the dispatch should automatically move data from host to device without extra verbose configuration when it is likely to enable GPU-related speed-ups.

Note that scipy is concurrently making progress by vendoring uarray as a lower-level plugin API (function-level, with dispatch logic based on the type of the arguments).

I am still not sure which approach would work best, both from a code-complexity point of view (for maintainers, contributors and users who want to read the code to understand what's going on) and for users trying to understand which library is actually used when fitting a non-trivial machine learning pipeline on a host equipped with GPUs.
