Support for Modin and PyMars #461

Open
RAbraham opened this issue Nov 30, 2022 · 3 comments

Comments

@RAbraham

Hi,
I was wondering whether siuba could support Modin (https://modin.readthedocs.io/en/stable/index.html#modin-is-a-dataframe-for-datasets-from-1mb-to-1tb) and PyMars (https://docs.pymars.org/en/latest/). Both are touted as API replacements for pandas.

Caveats

  • PyMars is lazy: although the code looks similar to pandas, an explicit execute has to be run at the end (https://docs.pymars.org/en/latest/#mars-dataframe).
  • I just tried running Modin with siuba and it did not work (I wasn't expecting it to, but I was curious). Interestingly, it doesn't fail, but the result is not the same data:
from siuba.data import cars
from siuba import _, filter
import modin.pandas as pd
modin_df = pd.DataFrame(cars)
result_df = filter(cars, _.mpg == _.mpg.max())
print('------------------------')
print(result_df)

print('---------- Modin --------------')
result_modin = filter(modin_df, _.mpg == _.mpg.max())
print(result_modin)

I'm interested in the Ray ML platform (both Modin and PyMars are DataFrame APIs over the distributed Ray platform), so if you are interested, it would be great to make this work for:

pip install "modin[ray]"

@machow
Owner

machow commented Dec 1, 2022

Hey! It looks like modin DataFrames are not a subclass of the pandas DataFrame, so siuba verbs like mutate, filter, etc. do not know that they should operate on them exactly as they do for pandas.

It looks like explicitly registering the modin DataFrame class does allow the verbs to dispatch correctly:

import modin.pandas as pd
import pandas as pd2

from siuba import _, mutate

df = pd.DataFrame({'x': [1,2,3]})

mutate.register(df.__class__, mutate.dispatch(pd2.DataFrame))
mutate(df, res = _.x + 1)

It seems like there are two challenges with implementing this:

  1. We don't want to import modin every time we import siuba. So we'll either need to register an abstract base class, or put the modin implementations in a submodule. It seems like a DataFrame abstract base class would be useful, since people could also register new DataFrames to dispatch on with it.
  2. modin's DataFrameGroupBy is also not a pandas subclass, so we'll need to register it also.
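The abstract base class idea from point 1 can be sketched with the standard library alone. This is a minimal illustration, not siuba's implementation: it assumes the verbs behave like `functools.singledispatch` functions (as the `mutate.register` / `mutate.dispatch` calls above suggest), and `DataFrameLike`, `PandasFrame`, `ModinFrame`, and this toy `mutate` are hypothetical stand-ins, so neither pandas nor modin is imported.

```python
from abc import ABC
from functools import singledispatch


class DataFrameLike(ABC):
    """Hypothetical abstract base class for pandas-compatible DataFrames."""


class PandasFrame:
    """Stand-in for pandas.DataFrame (assumption, not the real class)."""

    def __init__(self, data):
        self.data = dict(data)


class ModinFrame:
    """Stand-in for modin.pandas.DataFrame; deliberately NOT a PandasFrame subclass."""

    def __init__(self, data):
        self.data = dict(data)


@singledispatch
def mutate(df, **kwargs):
    # Fallback for types that nobody has registered.
    raise NotImplementedError(f"mutate is not implemented for {type(df).__name__}")


@mutate.register(DataFrameLike)
def _mutate_frame(df, **kwargs):
    # One implementation, written once against the abstract class.
    new_data = dict(df.data)
    for name, func in kwargs.items():
        new_data[name] = func(new_data)
    return type(df)(new_data)


# Each backend opts in as a virtual subclass. A library (or a user) can do
# this from its own module, so the verb module never has to import modin.
DataFrameLike.register(PandasFrame)
DataFrameLike.register(ModinFrame)

out = mutate(ModinFrame({"x": [1, 2, 3]}), res=lambda d: [v + 1 for v in d["x"]])
print(type(out).__name__, out.data)
```

The same pattern would presumably apply to point 2: a second abstract class for grouped DataFrames, with the pandas and modin DataFrameGroupBy types registered against it.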

(Maybe a last, future piece: siuba has a system to speed up its pandas grouped operations, and it also relies on pandas types :/. It would be quick to adjust, but again likely requires more abstract base classes, unless there's a way to connect a modin DataFrame back to pandas that I'm missing 😓)

@RAbraham
Author

RAbraham commented Dec 7, 2022

Thanks for looking into it in depth.

I can't find it right now, but I think there is a way to convert from modin to pandas. That said, it would only work if the conversion happens after all the aggregations have produced a dataframe that fits into memory. If we convert very early in the pipeline, it may error out when the original modin dataframe is very big.
Re: abstract classes, sounds good. Whenever you get time :)

If you have time, what about PyMars? I think it may fall more in the LazyTbl camp along with SQL, but I think PyMars allows one to run Python UDFs over the dataframe, which I guess we can't do with a SQL backend?

@RAbraham
Author

Just curious about this idea

  • Re: moving to abstract classes etc., is this a complex change requiring a rewrite, or something simple to do?
  • Can someone else (maybe me) do it, or would you prefer to do it yourself?
  • Given your personal roadmap for this project, where would modin support fall?

I ask because I started writing a library with Modin as a backend and I felt I was merely duplicating a lot of ideas that you have so beautifully executed on. Siuba is one of the finest library designs that I have come across.
