
Add arrow data storage support for linfa pre-processing and training module #285

Open
DataPsycho opened this issue Dec 25, 2022 · 3 comments

Comments

@DataPsycho

Preprocessing and transformation with data frames are heavily used for ML model training in scikit-learn. The two most popular DataFrame libraries written in Rust (Polars, DataFusion) are built on the Apache Arrow in-memory format rather than on ndarray, and that also looks like the trend for any new data frame players in Rust. It seems unlikely that a Rust data frame will wrap ndarray under the hood the way pandas wraps numpy.

By adding Arrow support to linfa, any Arrow-based data frame would be supported by default, meaning it could be passed directly to linfa's preprocessing or training modules without dealing with ndarray, the same way a pandas DataFrame can be passed to scikit-learn. Hence I propose adding direct Arrow support to linfa to make it a more generalized framework. With that, Polars/DataFusion users could use Rust for ML training out of the box.
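For context, here is a minimal sketch of the bridging users currently have to write themselves: copying the Arrow-backed columns into ndarray before building a linfa `Dataset`. It assumes Polars' `DataFrame::to_ndarray` helper (behind its `ndarray` feature flag) and a single-column target frame; exact signatures vary by Polars version.

```rust
use linfa::Dataset;
use ndarray::Array1;
use polars::prelude::*;

/// Copy Arrow-backed Polars columns into dense ndarray storage for linfa.
/// (Illustrative only; assumes Polars is compiled with its `ndarray` feature.)
fn dataset_from_polars(
    features: &DataFrame,
    targets: &DataFrame,
) -> PolarsResult<Dataset<f64, f64>> {
    // Materialize the feature columns as a (samples x features) Array2.
    let x = features.to_ndarray::<Float64Type>()?;
    // Take the first (and only) target column as an Array1.
    let y: Array1<f64> = targets.to_ndarray::<Float64Type>()?.column(0).to_owned();
    Ok(Dataset::new(x, y))
}
```

Direct Arrow support would remove this copy-and-convert step from user code.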

@YuhanLiin
Collaborator

I don't want to abandon ndarray completely, since all the downstream code relies on it. If we add Arrow support then we should support both, preferably via a generic trait that abstracts over the underlying data format. Unfortunately, direct integration between ndarray and Arrow doesn't exist yet, but they did talk about it here
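To make the suggestion concrete, here is a rough sketch (all names invented here, not existing linfa API) of what such an abstraction could look like: a trait over 2-D records that both an ndarray backend and an Arrow backend could implement, while downstream algorithms keep consuming dense ndarray data.

```rust
use ndarray::Array2;

/// Hypothetical abstraction over 2-D numeric records, independent of the
/// backing storage (ndarray, Arrow record batches, ...).
trait RecordStore {
    type Elem;

    fn num_rows(&self) -> usize;
    fn num_cols(&self) -> usize;

    /// Materialize the records as an ndarray, possibly copying when the
    /// backing Arrow buffers are chunked or non-contiguous.
    fn to_dense(&self) -> Array2<Self::Elem>;
}

/// The ndarray backend is the trivial case: the data is already dense.
impl RecordStore for Array2<f64> {
    type Elem = f64;

    fn num_rows(&self) -> usize {
        self.nrows()
    }

    fn num_cols(&self) -> usize {
        self.ncols()
    }

    fn to_dense(&self) -> Array2<f64> {
        self.clone()
    }
}
```

An Arrow-backed implementation would provide the same methods over its record batches, so existing algorithms written against ndarray would not need to change.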

@DataPsycho
Author

That totally makes sense: just add support for Arrow without abandoning ndarray. Thanks

@DataPsycho
Author

I have done some further investigation, and it looks like ndarray (or some array-like data structure) support must stay alongside Arrow. There are file types, such as images and text, that cannot be loaded as data frames; they need to be loaded one by one or as a batch into an array/tensor for training, the way the datasets in `torch.utils.data` do.
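As a rough illustration (names invented for this sketch), such non-tabular data could be fed through a batch loader in the spirit of `torch.utils.data`, with each decoded sample stacked into an ndarray batch:

```rust
use ndarray::{stack, Array1, Array2, Axis};

/// Hypothetical source of individual samples that don't fit a data frame
/// (e.g. decoded image pixels or tokenized text), flattened to 1-D.
trait SampleSource {
    fn len(&self) -> usize;
    fn get(&self, idx: usize) -> Array1<f64>;
}

/// Stack `batch_size` samples starting at `start` into a (batch, features) array.
fn next_batch<S: SampleSource>(source: &S, start: usize, batch_size: usize) -> Array2<f64> {
    let end = (start + batch_size).min(source.len());
    let samples: Vec<Array1<f64>> = (start..end).map(|i| source.get(i)).collect();
    let views: Vec<_> = samples.iter().map(|s| s.view()).collect();
    stack(Axis(0), &views).expect("all samples must have the same length")
}
```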
