
Add arrow data storage support for linfa pre-processing and training module #285

Open
DataPsycho opened this issue Dec 25, 2022 · 3 comments

Comments

@DataPsycho

Preprocessing and transformation with data frames are heavily used for ML model training in scikit-learn. The two most popular DataFrame libraries written in Rust (Polars, DataFusion) are built on the Apache Arrow in-memory format rather than on ndarray, and that also looks like the trend for any new data frame players in Rust. It seems unlikely that a Rust data frame will wrap ndarray under the hood the way pandas wraps numpy.

By adding Arrow support to linfa, any Arrow-based data frame would be supported by default, meaning it could be passed directly to linfa's preprocessing or training modules without dealing with ndarray, the same way a pandas DataFrame can be passed to scikit-learn. Hence I propose adding direct Arrow support to linfa to make it a more generalized framework. With that, Polars/DataFusion users could use Rust for ML training out of the box.
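For context, here is a minimal sketch of the bridging users currently have to write themselves: copying the Arrow-backed columns into ndarray before building a linfa `Dataset`. It assumes Polars' `DataFrame::to_ndarray` helper (behind its `ndarray` feature flag) and a single-column target frame; exact signatures vary by Polars version.

```rust
use linfa::Dataset;
use ndarray::Array1;
use polars::prelude::*;

/// Copy Arrow-backed Polars columns into dense ndarray storage for linfa.
/// (Illustrative only; assumes Polars is compiled with its `ndarray` feature.)
fn dataset_from_polars(
    features: &DataFrame,
    targets: &DataFrame,
) -> PolarsResult<Dataset<f64, f64>> {
    // Materialize the feature columns as a (samples x features) Array2.
    let x = features.to_ndarray::<Float64Type>()?;
    // Take the first (and only) target column as an Array1.
    let y: Array1<f64> = targets.to_ndarray::<Float64Type>()?.column(0).to_owned();
    Ok(Dataset::new(x, y))
}
```

Direct Arrow support would remove this copy-and-convert step from user code.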

@YuhanLiin
Collaborator

I don't want to abandon ndarray completely, since all the downstream code relies on it. If we add Arrow support then we should support both, preferably via a generic trait that abstracts over the underlying data format. Unfortunately, direct integration between ndarray and Arrow doesn't exist yet, but they did talk about it here
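To make the suggestion concrete, here is a rough sketch (all names invented here, not existing linfa API) of what such an abstraction could look like: a trait over 2-D records that both an ndarray backend and an Arrow backend could implement, while downstream algorithms keep consuming dense ndarray data.

```rust
use ndarray::Array2;

/// Hypothetical abstraction over 2-D numeric records, independent of the
/// backing storage (ndarray, Arrow record batches, ...).
trait RecordStore {
    type Elem;

    fn num_rows(&self) -> usize;
    fn num_cols(&self) -> usize;

    /// Materialize the records as an ndarray, possibly copying when the
    /// backing Arrow buffers are chunked or non-contiguous.
    fn to_dense(&self) -> Array2<Self::Elem>;
}

/// The ndarray backend is the trivial case: the data is already dense.
impl RecordStore for Array2<f64> {
    type Elem = f64;

    fn num_rows(&self) -> usize {
        self.nrows()
    }

    fn num_cols(&self) -> usize {
        self.ncols()
    }

    fn to_dense(&self) -> Array2<f64> {
        self.clone()
    }
}
```

An Arrow-backed implementation would provide the same methods over its record batches, so existing algorithms written against ndarray would not need to change.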

@DataPsycho
Author

That totally makes sense: just add support for Arrow without abandoning ndarray. Thanks

@DataPsycho
Author

I have done some further investigation, and it looks like ndarray (or some array-like data structure) support must stay alongside Arrow. There are file types, such as images and text, that cannot be loaded as data frames; they need to be loaded one by one or as a batch into an array/tensor for training, the way the datasets in `torch.utils.data` do.
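As a rough illustration (names invented for this sketch), such non-tabular data could be fed through a batch loader in the spirit of `torch.utils.data`, with each decoded sample stacked into an ndarray batch:

```rust
use ndarray::{stack, Array1, Array2, Axis};

/// Hypothetical source of individual samples that don't fit a data frame
/// (e.g. decoded image pixels or tokenized text), flattened to 1-D.
trait SampleSource {
    fn len(&self) -> usize;
    fn get(&self, idx: usize) -> Array1<f64>;
}

/// Stack `batch_size` samples starting at `start` into a (batch, features) array.
fn next_batch<S: SampleSource>(source: &S, start: usize, batch_size: usize) -> Array2<f64> {
    let end = (start + batch_size).min(source.len());
    let samples: Vec<Array1<f64>> = (start..end).map(|i| source.get(i)).collect();
    let views: Vec<_> = samples.iter().map(|s| s.view()).collect();
    stack(Axis(0), &views).expect("all samples must have the same length")
}
```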
