
Python bindings #220

Open
YuhanLiin opened this issue May 11, 2022 · 18 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@YuhanLiin
Collaborator

We should add Python bindings to the public API of linfa crates. This will allow us to benchmark linfa fairly against scikit-learn through a common Python API, as well as make linfa easier to use, encouraging wider adoption. This process can be done piece by piece. I suggest we start with linfa-clustering, since that's the most-used linfa crate and also has prior art behind it.

Questions

  • How close do we make the API to scikit-learn? Do we want exact parity?
  • How do we support numpy in our API without lots of data copying?
  • linfa makes heavy use of generics, but for Python bindings we need to pick one monomorphization to build. For type params like F we can just pick f64, but for others it's less clear-cut. We may also need to choose between different params at runtime instead of compile time. Do we use an enum? A trait object?
  • Similarly, which features do we want to build linfa with?
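
One way the runtime-dispatch question could look from the Python side is sketched below. Everything here is hypothetical: `fit_f32` and `fit_f64` stand in for two separately monomorphized, PyO3-exported entry points, and the wrapper picks one based on the NumPy dtype, converting once up front when neither matches.

```python
import numpy as np

def fit_f32(data):                 # placeholder for an f32 monomorphization
    return ("f32", float(data.mean()))

def fit_f64(data):                 # placeholder for an f64 monomorphization
    return ("f64", float(data.mean()))

_DISPATCH = {
    np.dtype(np.float32): fit_f32,
    np.dtype(np.float64): fit_f64,
}

def fit(data):
    handler = _DISPATCH.get(data.dtype)
    if handler is None:
        # Unsupported dtype: fall back to f64 with a single conversion copy.
        return fit_f64(data.astype(np.float64))
    return handler(data)
```

This keeps the number of compiled monomorphizations small (one per supported dtype) while still accepting arbitrary NumPy input at the cost of at most one conversion.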

We'll likely put all the bindings into one Python package, so that we don't build multiple copies of linfa across multiple packages.

Prior Art
When @LukeMathWalker first released linfa, he also released Python bindings here, for benchmarking against scikit-learn. AFAIK these bindings only support KMeans, and they are also 3 years old, but they should provide a good starting point.

@YuhanLiin YuhanLiin added enhancement New feature or request help wanted Extra attention is needed labels May 11, 2022
@relf
Member

relf commented May 11, 2022

Regarding the first point, unless there is a compelling reason not to, it would be better to follow scikit-learn conventions to ease comparison and reach wider adoption.

@jovany-wang

How close do we make the API to scikit-learn? Do we want exact parity?

I'm not sure whether the current linfa API is easy to adapt for Python bindings.

How do we support numpy in our API without lots of data copying?

I think there are two approaches:
(1) Hook in a hypothetical my_numpy package as a zero-copy NumPy-compatible implementation.

import my_numpy as np  # hypothetical zero-copy NumPy replacement
x = np.xxx
# Now we can pass x to the linfa API, which could avoid unnecessary copies under the hood.

(2) NumPy is implemented in a native language (C), so it's possible to pass a NumPy object to linfa and then use the native handles inside linfa.

arr = np.array([[1, 2, 3], [1, 2, 3]], dtype=np.float32)
arr_ptr = arr.ctypes.data
# We can use arr_ptr from native code.

Either way, linfa should be able to operate on a NumPy object's buffer from native code.

@YuhanLiin
Collaborator Author

Ideally I want to break the numpy array down into its data pointer and metadata components, then assemble it back into the equivalent ndarray, without copying data. Directly adding support for native numpy arrays to linfa requires lots of rewriting and heavy use of generics, which I'd like to avoid.
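
The pointer-plus-metadata idea can be sketched in plain Python. Hedged: in real bindings the reassembly would happen on the Rust side (e.g. via the rust-numpy crate); here `np.frombuffer` stands in to show that no copy is required.

```python
import ctypes
import numpy as np

# Break a NumPy array into its data pointer plus metadata, then rebuild an
# equivalent view over the same buffer, with no copying.
src = np.arange(6, dtype=np.float64).reshape(2, 3)

# The pieces a binding layer needs (strides also matter for non-contiguous
# arrays; this sketch assumes the C-contiguous default).
ptr, shape, dtype = src.ctypes.data, src.shape, src.dtype

# Reassemble a view over the original buffer.
buf = (ctypes.c_double * src.size).from_address(ptr)
view = np.frombuffer(buf, dtype=dtype).reshape(shape)

assert np.shares_memory(src, view)  # same memory, not a copy
view[0, 0] = 42.0
assert src[0, 0] == 42.0            # writes are visible through the original
```

The caveat, as with any borrowed buffer, is lifetime: the rebuilt view is only valid while the original array keeps its buffer alive.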

@jovany-wang

I see, but I'm not sure whether the functionality we need from ndarray is fully equivalent to NumPy's.

@quietlychris
Member

Just as a counterpoint to @relf: I've started on simple efforts for porting Linfa to Python a couple of times (PyO3 can be oddly frustrating sometimes), and I'm not convinced that matching scikit-learn's API exactly is necessary, or even desirable. In addition to the work that Luca did a few years back, I think it's worth considering the polars library, a DataFrame library written in Rust but exposed to Python as the py-polars package. Python already has pandas, among others, and polars is very much in this same vein (both are even bears!).

However, the APIs, while similar, are not exactly the same, and for good reason: idiomatic Rust and idiomatic Python (which, granted, is somewhat looser) are not identical, and porting APIs between the languages reflects this. Both pandas and scikit-learn also "suffer" (for lack of a better term) from their own historical development; the choices they've made are not necessarily the exactly correct ones. I know I've made design choices for programs in the past that, given the chance, I would change, and I expect the developers of scikit-learn probably feel the same in at least some areas. There's certainly room for variation, and I think that deciding to match the exact API of scikit-learn could leave behind opportunities for improvement or experimentation where they might actually be desirable. While the point about easy adoption is valid, and I certainly support looking to that API for inspiration, I think that in the trade-off between building something a little different but improved and getting quicker adoption, I'd personally lean towards the former.

@quietlychris
Member

I also didn't see this linked yet, so I just wanted to add it to the thread: the PyO3 project already has a NumPy interop project called rust-numpy, which could help solve this problem without us doing it ourselves. Since Luca's linfa-python also uses that project (and, anecdotally, I've had good experiences with it too, despite occasional frustrations), it may be a good option.

@YuhanLiin YuhanLiin mentioned this issue Jun 15, 2022
@DataPsycho

DataPsycho commented Dec 18, 2022

I think it would be good to have support for Arrow as the in-memory data storage for linfa. Then any DataFrame library built on Arrow, such as Polars or DataFusion, would be supported by default. That is how we could support multiple DataFrame APIs.

@YuhanLiin
Collaborator Author

Feel free to open a separate issue for that.

@jjerphan

Hi all,

tl;dr: I am one of the maintainers of scikit-learn. If you want to use scikit-learn as a front-end for linfa, you might be interested in, and want to get involved in, scikit-learn/scikit-learn#22438.


I agree with most of @quietlychris's remarks:

Both pandas and scikit-learn also "suffer" (for lack of a better term) from their own historical development; the choices that they've made are not necessarily the exactly correct choices. I know that I've made design choices for programs in the past where, if given the chance, I would change, and I expect that the developers for scikit-learn probably feel the same in at least some areas.

I do not know about pandas, but I think scikit-learn made (close to, if not) the best design decisions back then. The situation has changed now, and extending scikit-learn to make it compatible with projects written in other languages comes with unforeseen constraints and challenges (e.g. different idiomatic constructions, harder interface adaptations, adherence to dependencies' design choices and concepts, vendoring or depending on shared libraries for OpenMP and BLAS implementations, packaging, etc.), making it harder, but not impossible, to extend.

I think scikit-learn's initial design decisions put the emphasis on UX, documentation, and compatibility and composability with other projects in the Scientific Python ecosystem, rather than on performance, portability to other contexts (e.g. embedded systems), and interfaces to other languages. To me, this explains scikit-learn's adoption.

I think other projects like mlpack or SHOGUN took different design decisions (based on different use-cases), and it might be valuable to learn from those projects' experiences and challenges.

I think one of the most notable areas for improvement that scikit-learn's maintainers have identified for the library is (native) performance.

When it comes to native performance, we are putting effort into optimizing costly patterns of computation in algorithms using Cython (see scikit-learn/scikit-learn#22587). While Cython is convenient because it manages a lot of complexity for us, we are facing the limits of its constructs and concepts (mainly regarding the cost of polymorphism and dynamic method dispatch, and the lack of alternatives within Cython) for some of the lowest-level implementations. More generally, being tied to CPython, we also face the intrinsic performance limitations of the interpreter as a whole: even if dependencies like NumPy and SciPy have efficient low-level implementations, the full execution of users' pipelines remains costly and is generally single-threaded.

One of the alternative pathways we are currently working on to improve performance is a plugin system allowing third-party package developers to extend scikit-learn with their own custom (GPU) implementations. scikit-learn/scikit-learn#22438 drives the discussion and design, and https://github.com/soda-inria/sklearn-numba-dpex is, for instance, a package actively being developed to back some of the algorithm implementations with GPU kernels.

I think the plugin system discussed in scikit-learn/scikit-learn#22438 might be a suitable route for linfa to get Python bindings.

@DataPsycho

DataPsycho commented Jan 16, 2023

Hi all,

tl;dr: I am one of the maintainers of scikit-learn. If you want to use scikit-learn as a front-end for linfa, you might be interested in, and want to get involved in, scikit-learn/scikit-learn#22438.

...
I think the plugin system discussed in scikit-learn/scikit-learn#22438 might be a suitable route for linfa to get Python bindings.

Commenting on your tl;dr: providing Python support requires a good implementation of data communication between linfa and Python. It is easy to rewrite a scikit-learn algorithm in Rust with ndarray, and it will be performant, but the main overhead will come from data allocation and communication between the Python and Rust APIs when passing a big chunk of data or the trained model's metadata. Examples already exist: for instance, the way data type/memory conversion happens between PySpark and the JVM, or from a NumPy array to a Rust Polars DataFrame. I think it is better to avoid any kind of NumPy-based communication for linfa and to go with Arrow, or a better option if one exists.

This data communication layer would be generic over all algorithms and could make porting to Python easy. But at the moment, probably not enough people are involved in the project to implement this critical feature.

@jjerphan

Isn't it possible to have minimal overhead by reusing data allocated by CPython or Rust (via PyO3), similarly to what is possible with CPython and C/C++ (via Cython, PyBind11 or nanobind)?

@YuhanLiin
Collaborator Author

Depends on whether that allocated data is compatible with the ndarray format.
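
As a hedged illustration of that compatibility check (the function name here is made up for the sketch): ndarray's default layout is C-contiguous (row-major), so a binding layer could test the layout flags and fall back to a single conversion copy only when zero-copy handoff isn't possible.

```python
import numpy as np

def as_linfa_compatible(arr, dtype=np.float64):
    """Return (array, copied): the array unchanged when zero-copy is possible,
    otherwise a C-contiguous copy in the expected dtype."""
    if arr.dtype == dtype and arr.flags["C_CONTIGUOUS"]:
        return arr, False                                   # zero-copy path
    return np.ascontiguousarray(arr, dtype=dtype), True     # one copy

_, copied = as_linfa_compatible(np.ones((3, 3)))            # already compatible
assert copied is False
_, copied = as_linfa_compatible(np.ones((3, 3), order="F"))  # Fortran layout
assert copied is True
```

This keeps the common case (C-contiguous float64 from NumPy) allocation-free while still accepting Fortran-ordered, sliced, or differently-typed input.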

@DataPsycho

DataPsycho commented Jan 17, 2023

I am not a low-level geek. It would be good to see how the Python API of Polars or DataFusion communicates with the Rust backend. Polars supports building a DataFrame from a NumPy array, and they have a very high-performance implementation so far. The same mechanism could be implemented here if someone is interested.

@abstractqqq

abstractqqq commented Nov 7, 2023

Sorry in advance for offering a somewhat pessimistic point of view. I am working on scientific computing in Rust, so I really want to write down some of my reflections on these efforts.

  1. I do not want this package to follow scikit-learn's API. We can provide a very thin compatibility layer and that is it. Scikit-learn's API is unnecessarily complicated due to backward-compatibility concerns and unnecessary abstractions, e.g. heavy multiple inheritance (and the way base methods are called in the background), a huge number of arguments to every function that most people never change, and inconsistent outputs: some transformers return NumPy arrays, some return DataFrames, some return a DataFrame with column names changed, and you need extra configuration to force transformer output to always be a DataFrame (which does extra copies)! What a mess! From a practitioner's point of view, I almost never want to leave DataFrame land until the very last step, right before feeding data into the model object.

  2. That brings up the question: how do we stay in DataFrame land? First, linfa, being a Rust package, should work with Polars. Doing preprocessing in ndarray is not ideal, as you have to manage parallelization yourself, and datasets come with non-numerical and mixed-type data. That makes DataFrames, not ndarrays, the starting point of any data science project.

  3. I have written almost all the transformers, the Imputer, and other preprocessing utilities from scikit-learn using Python Polars alone. The speed gain vs. scikit-learn is insane, but honestly (sadly) most people don't care, as the time spent on these steps is insignificant compared to model training time. So if linfa's models run at a similar speed, or are just slightly faster than the ones in scikit-learn, I think it will be very hard to find adoption. Let's face some reality: efficiency isn't always the priority in ML/DS. I often find myself spending hours optimizing stuff that no one asks for. The "data science" community is known for burning money, and they won't hesitate to spin up a more expensive machine (I do this all the time), or orchestrate some complicated (actually simple) tasks using inefficient software like pandas and scikit-learn on Kubernetes and burn more money. My previous company moved to Databricks (super expensive), when in reality they just needed Polars plus some downsampling strategy for their ML tasks. And half of the "data scientists" there started running pandas on Databricks. What can I say? I left the company immediately.

  4. Looking at Polars's success, I am more confident in my judgement from point (1) that we don't need to follow scikit-learn's API. People want fast and efficient software, not inherited classes, base transformers, or a FunctionTransformer that doesn't serialize.

  5. But I still don't think Linfa can find success in the Python data science world, because (a) unlike Polars, Linfa isn't offering more than scikit-learn. Polars offers parallelization + lazy optimization + SQL + built-in window functions + a better column type system, which are vital features missing in pandas. (b) Inertia is real, especially when the focus of data scientists is not efficiency, but rather quality models.

  6. My friend and I are bringing simple models directly into Polars DataFrames, so that one can do a rolling regression using one simple expression. You can take a look here: https://github.com/abstractqqq/polars_ds_extension. We might use Linfa in the future to bring some fast and furious clustering into DataFrames as well!

  7. The Rust scientific computing ecosystem is, to put it politely, stagnant. Ndarray is almost dead. Statrs is not being updated. Even Linfa is slowing down. I feel like the Rust community's focus has completely shifted to crypto, web, and WebAssembly. Polars is on fire, but it is more about data engineering. Nalgebra is still going, but is dwarfed by NumPy + SciPy in terms of features... Faer is coming up, but is still in its infancy, imo.

  8. Do the models work with Arrow data? Many models are moving in this direction.

Best of luck.

Side note: for linear-algebra-related tasks, faer-rs seems to be a great alternative.

@jjerphan

jjerphan commented Nov 7, 2023

Thank you for this comprehensive comment, @abstractqqq.

As a maintainer, I welcome and value this criticism. Would you like to share details about your experience with scikit-learn and the pain points you have faced on scikit-learn's issue tracker?

@abstractqqq

Thank you for this comprehensive comment, @abstractqqq.

As a maintainer, I welcome and value this criticism. Would you like to share details about your experience with scikit-learn and the pain points you have faced on scikit-learn's issue tracker?

Thank you for following up. I don't feel like going to the issue tracker because it's too crowded and because it's really hard to pin the criticism down to a single "issue". As @quietlychris mentioned, a lot of it is backward compatibility, because you need to support NumPy or some other sparse data structure. I can give you more examples: SimpleImputer on pandas DataFrames has terrible performance, likely because it uses NumPy to do the imputing, which isn't the right tool given the mixed data types in DataFrames. And what's the deal with f_classif and f_regression being the same function with different names? Why not turn on multithreading for kd-trees in mutual information score? To make the f-scores and mutual information score useful in a data science pipeline, the methods must handle nulls (because of real-world data quality issues), but right now they just fail, and no mention of the null issue can be found in the docs. I ended up rewriting the two algorithms in Polars + KDTree (from SciPy) and got an insane speed boost. Then again, I really think that except for the models, other functionality in scikit-learn is largely forgotten. That's why the docs are minimal and issues like null handling are not brought up or given enough attention... And finally, what's the deal with transformers? Say I am doing feature engineering and I want to do a simple log transform. In Polars I just write pl.col("a").log(), and this can be serialized and used in pipelines. In scikit-learn, I need all that boilerplate for a transformer, and FunctionTransformer doesn't serialize...
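
For what it's worth, the serialization complaint above seems to reduce to a general Python limitation rather than anything scikit-learn-specific: pickle serializes functions by importable name, so a named module-level function round-trips while a lambda (which FunctionTransformer is often handed) has no importable name and cannot be pickled. A small stdlib-only sketch:

```python
import math
import pickle

# A named function from an importable module pickles by reference and
# round-trips cleanly.
restored = pickle.loads(pickle.dumps(math.log))
assert restored(math.e) == 1.0

# A lambda has no importable name, so pickle rejects it.
try:
    pickle.dumps(lambda x: math.log(x))
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
assert lambda_picklable is False
```

This is also part of why expression DSLs (like Polars expressions) serialize well: the expression is data describing the transform, not an opaque Python callable.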

@quietlychris
Member

quietlychris commented Nov 8, 2023

I believe that this discussion has digressed, and I'd like to get things back on track. As a project, Linfa doesn't really follow grand plans. If anyone feels strongly enough about the way that a Python API "should" look, they are welcome to open a draft PR with a prototype for one or two algorithms demonstrating the layout, and what they feel are pros/cons of their approach. If that happens, I'm sure that other stakeholders (read, maintainers and developers who have contributed to Linfa and the associated ecosystem) will be happy to engage in a good-faith discussion around that initial implementation and other approaches (which I also imagine would include lots of code).

There's no single right way to do software design, so until someone puts a concrete proposal forward, I'd like to avoid muddying the waters of this issue with what-ifs and sweeping generalizations about the scientific computing ecosystem.

@abstractqqq

I believe that this discussion has digressed, and I'd like to get things back on track. As a project, Linfa doesn't really follow grand plans. If anyone feels strongly enough about the way that a Python API "should" look, they are welcome to open a draft PR with a prototype for one or two algorithms demonstrating the layout, and what they feel are pros/cons of their approach. If that happens, I'm sure that other stakeholders (read, maintainers and developers who have contributed to Linfa and the associated ecosystem) will be happy to engage in a good-faith discussion around that initial implementation and other approaches (which I also imagine would include lots of code).

There's no single right way to do software design, so until someone puts a concrete proposal forward, I'd like to avoid muddying the waters of this issue with what-ifs and sweeping generalizations about the scientific computing ecosystem.

You are right, I apologize. Feel free to remove my comments if you see fit. I, too, am struggling with designing a good API and have been experimenting with things. It would be great if some of us could put out a plan, or hold a meeting if there is enough momentum.
