Add slice_rows to interchange protocol #349

Open
wants to merge 3 commits into
base: main

Conversation

MarcoGorelli
Contributor

closes #204

@kkraus14
Collaborator

kkraus14 commented Feb 13, 2024

For pandas, or any other dataframe library that doesn't use the Arrow memory layout under the hood, the producer would presumably materialize Arrow memory on the __dataframe__ call and then have to slice it. That isn't free if the data contains strings or the slice has a step size. Selecting columns already has the same potential problem, so I guess this inefficiency is nothing new?

Additionally, it makes it a bit hard to reason about when the producer and when the consumer should do the row selection. E.g., if Polars is consuming data from, say, PyArrow, I imagine Polars would rather handle row slicing itself (assuming you hit a case where it's not pure pointer arithmetic). If pandas is consuming data from, say, Polars, you'd probably want Polars to handle the row slicing.

The Arrow interchange protocols handle the slicing case (ignoring step size) by letting you specify an offset and a size. Maybe we can do something similar here?
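
For illustration, here is a minimal sketch of the offset-and-size idea on an Arrow-style string column (an offsets buffer plus a data buffer). The `StringColumn` class and its `slice` and `take_every` methods are hypothetical names made up for this example; they are not part of the interchange protocol or of PyArrow. The point is only that a contiguous slice changes two integers and shares the buffers, while a strided selection has to build new ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringColumn:
    """Hypothetical Arrow-style string column.

    Stored value i is data[offsets[i]:offsets[i + 1]].
    """

    offsets: tuple   # one more entry than the number of stored values
    data: bytes      # all strings concatenated, UTF-8 encoded
    offset: int = 0  # index of the first logical row
    length: int = 0  # number of logical rows

    def slice(self, offset, length):
        # Contiguous slice: share both buffers, only adjust offset and length.
        if offset < 0 or length < 0 or offset + length > self.length:
            raise IndexError("slice out of bounds")
        return StringColumn(self.offsets, self.data, self.offset + offset, length)

    def take_every(self, step):
        # Strided selection: new offsets and data buffers must be materialised.
        values = self.to_list()[::step]
        offsets = [0]
        for value in values:
            offsets.append(offsets[-1] + len(value.encode()))
        return StringColumn(tuple(offsets), "".join(values).encode(), 0, len(values))

    def to_list(self):
        rows = range(self.offset, self.offset + self.length)
        return [self.data[self.offsets[i]:self.offsets[i + 1]].decode() for i in rows]


col = StringColumn(offsets=(0, 1, 3, 6, 10), data=b"abbcccdddd", length=4)
sliced = col.slice(1, 2)      # shares col.data, no copy
strided = col.take_every(2)   # allocates new buffers
```

Here `sliced.data is col.data` holds, whereas `strided` carries its own, smaller buffers.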

@MarcoGorelli
Contributor Author

sounds good, thanks

@kkraus14
Collaborator

Do we expect / want to encourage developers using dataframe libraries to explicitly call __dataframe__ themselves as opposed to using libraryx.from_dataframe(...)? It feels a bit funky to me currently that we go from say:

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df)

to:

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df.__dataframe__().select_columns(...).slice_rows(...))

My 2c is that this is just highlighting the lack of a standard API here, and that the experience should be something along the lines of (ignoring API names for column selection and row slicing):

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df.cols(...).slice_rows(...))

@kkraus14
Collaborator

It would be good to have others chime in here. Given that this interchange protocol is already being adopted, we probably don't want to introduce something and later decide to change or remove it.

@MarcoGorelli
Contributor Author

It's what plotly already does to not have to convert the entire dataframe

@MarcoGorelli
Contributor Author

Any updates here please?

This is the only thing I plan to try adding to the interchange protocol, promised

I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:

  • select columns (currently possible)
  • select rows (not possible)
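
As a rough sketch of what the two kinds of preselection could look like side by side, here is a toy stand-in for an interchange object. `ToyInterchangeFrame` and `to_dict` are invented for this example, and the `slice_rows(start, stop)` signature is an assumption rather than what the PR necessarily specifies; `select_columns_by_name` mirrors the method the protocol already has.

```python
class ToyInterchangeFrame:
    """Toy stand-in for an interchange object; not a real library class."""

    def __init__(self, columns):
        # Map of column name -> list of values, all the same length.
        self._columns = dict(columns)

    def select_columns_by_name(self, names):
        # Column preselection, which the protocol already offers.
        return ToyInterchangeFrame({name: self._columns[name] for name in names})

    def slice_rows(self, start, stop):
        # Row preselection; this (start, stop) signature is an assumption.
        return ToyInterchangeFrame(
            {name: values[start:stop] for name, values in self._columns.items()}
        )

    def to_dict(self):
        # Helper for inspecting the result in this example only.
        return {name: list(values) for name, values in self._columns.items()}


frame = ToyInterchangeFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]})
# Only two columns and the first two rows would need to be converted by a consumer.
preview = frame.select_columns_by_name(["a", "c"]).slice_rows(0, 2)
```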

cc @rgommers @jorisvandenbossche

@MarcoGorelli
Contributor Author

MarcoGorelli commented Feb 27, 2024

gentle ping

(would really like to get this in for pandas 3.0 tbh, and this topic actually has a real-world use case: microsoft/vscode-jupyter#13951)


this is just highlighting the lack of a standard API here

the "standard api" solution would be:

pandas.from_dataframe(pl_df.__dataframe_consortium_standard__().select(...).take(...))

does that really look any less clunky?

@anmyachev
Contributor

I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:

  • select columns (currently possible)
  • select rows (not possible)

The ability to select a subset of rows in addition to selecting columns seems harmonious.

Implementation in Modin should not be a problem.

+1

Successfully merging this pull request may close these issues.

How to slice rows? Can it fit into the interchange, or is the standard required?
4 participants