Add `slice_rows` to interchange protocol #349

MarcoGorelli · 2024-02-13T16:16:46Z

closes #204

kkraus14 · 2024-02-13T16:30:31Z

For pandas, or any other dataframe library that doesn't use the Arrow memory layout under the hood, they'd presumably materialize Arrow data on the `__dataframe__` call and then have to slice the Arrow-format memory, which isn't free if it contains strings or the slice has a step size. This is already potentially a problem when selecting columns as well, so I guess this inefficiency is nothing new?
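To make that cost concrete, here is a minimal sketch in plain Python. It is not how pandas, Polars or PyArrow implement slicing; it only assumes an Arrow-style string layout (one contiguous `data` buffer plus `n + 1` offsets), and the helper name `take_strided` is hypothetical. Once a step is involved, the selected strings are no longer contiguous, so both buffers have to be rebuilt:

```python
def take_strided(data: bytes, offsets: list[int], start: int, stop: int, step: int):
    """Return (data, offsets) for rows start:stop:step of a toy string column.

    Hypothetical helper for illustration only. It assumes step > 0 and an
    in-bounds range. Every selected string is copied and the offsets are
    recomputed, because the rows are not contiguous in `data`.
    """
    new_data = bytearray()
    new_offsets = [0]
    for row in range(start, stop, step):
        new_data += data[offsets[row]:offsets[row + 1]]  # copy one string
        new_offsets.append(len(new_data))  # end position of that string
    return bytes(new_data), new_offsets


# Toy column holding ["a", "bb", "ccc", "dddd", "eeeee"]
data = b"abbcccddddeeeee"
offsets = [0, 1, 3, 6, 10, 15]
sliced_data, sliced_offsets = take_strided(data, offsets, 0, 5, 2)
```

The work here grows with the number of selected rows and bytes, which is the "isn't free" part.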

Additionally, it makes it a bit hard to reason about when the producer versus the consumer should do row selection. I.e. if Polars is consuming data from, say, PyArrow, I imagine Polars would rather handle row slicing itself (assuming you'll hit a situation where it's not pure pointer arithmetic). Whereas if pandas is consuming data from, say, Polars, you'd probably want Polars to handle the row slicing.

Arrow interchange protocols handle the slicing case (ignoring step size) by allowing an offset and a size to be specified. Maybe we can do something similar here?
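As a rough illustration of the offset-and-size idea, the sketch below models a contiguous slice as a window over shared buffers. The class `ToyStringColumn` and its fields are hypothetical and only mimic the general shape of an Arrow-style string column; this is not the Arrow API or the protocol's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStringColumn:
    """Hypothetical column: shared buffers plus a logical (offset, size) window."""

    data: bytes      # all string bytes, back to back
    offsets: tuple   # n + 1 positions into `data`
    offset: int      # first logical row of this view
    size: int        # number of logical rows in this view

    def slice(self, offset: int, size: int) -> "ToyStringColumn":
        # No bytes are copied: the new view reuses both buffers and only
        # moves the window.
        if offset < 0 or size < 0 or offset + size > self.size:
            raise IndexError("slice out of bounds")
        return ToyStringColumn(self.data, self.offsets, self.offset + offset, size)

    def to_list(self) -> list:
        # Materialize the rows inside the window (this step does copy).
        first = self.offset
        return [
            self.data[self.offsets[i]:self.offsets[i + 1]].decode()
            for i in range(first, first + self.size)
        ]


# Toy column holding ["a", "bb", "ccc", "dddd"]
column = ToyStringColumn(b"abbcccdddd", (0, 1, 3, 6, 10), offset=0, size=4)
window = column.slice(1, 2)
```

Because only two integers change, a contiguous slice stays cheap regardless of the column's contents. A step size does not fit this model, which is presumably why it is ignored here.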

MarcoGorelli · 2024-02-13T16:38:18Z

sounds good, thanks

protocol/dataframe_protocol.py

kkraus14 · 2024-02-13T16:59:38Z

Do we expect / want to encourage developers using dataframe libraries to explicitly call __dataframe__ themselves as opposed to using libraryx.from_dataframe(...)? It feels a bit funky to me currently that we go from say:

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df)

to:

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df.__dataframe__().select_columns(...).slice_rows(...))

My 2c is that this is just highlighting the lack of standard API here and that the experience should be something along the lines of (ignoring API names for column selection and row slicing):

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df.cols(...).slice_rows(...))
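For readers following along, the second form above can be sketched with toy classes. Everything here is a stand-in: `ToyProducerFrame`, `ToyInterchangeFrame`, `to_dict` and this `from_dataframe` are hypothetical, and the `slice_rows(offset, size)` signature is an assumption rather than the one proposed in this PR. Only `__dataframe__` and `select_columns_by_name` are names taken from the existing interchange protocol.

```python
class ToyInterchangeFrame:
    """Stand-in for the object returned by `__dataframe__()`."""

    def __init__(self, columns: dict):
        self._columns = columns  # column name -> list of values

    def __dataframe__(self, allow_copy: bool = True):
        # An interchange object can itself be handed to a consumer.
        return self

    def select_columns_by_name(self, names):
        return ToyInterchangeFrame({name: self._columns[name] for name in names})

    def slice_rows(self, offset: int, size: int):
        # Hypothetical signature, assumed for this sketch only.
        return ToyInterchangeFrame(
            {name: values[offset:offset + size] for name, values in self._columns.items()}
        )

    def to_dict(self) -> dict:
        # Hypothetical helper so the toy consumer has something to read.
        return dict(self._columns)


class ToyProducerFrame:
    """Stand-in for a producer library's dataframe (e.g. the `pl_df` above)."""

    def __init__(self, columns: dict):
        self._columns = columns

    def __dataframe__(self, allow_copy: bool = True):
        return ToyInterchangeFrame(self._columns)


def from_dataframe(obj) -> dict:
    """Toy consumer: converts whatever the producer chose to expose."""
    return obj.__dataframe__().to_dict()


producer = ToyProducerFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]})
preselected = producer.__dataframe__().select_columns_by_name(["a", "c"]).slice_rows(1, 2)
result = from_dataframe(preselected)
```

In this sketch the consumer only ever sees the preselected columns and rows, so it never converts the rest of the data.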

kkraus14 · 2024-02-13T17:01:09Z

It would be good to have others chime in here: this interchange protocol is already being adopted, so we probably don't want to introduce something and later decide to change or remove it.

MarcoGorelli · 2024-02-13T17:39:32Z

It's what plotly already does, to avoid converting the entire dataframe

MarcoGorelli · 2024-02-22T12:27:25Z

Any updates here please?

This is the only thing I plan to try adding to the interchange protocol, promised

I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:

- select columns (currently possible)
- select rows (not possible)

cc @rgommers @jorisvandenbossche

MarcoGorelli · 2024-02-27T13:50:29Z

gentle ping

(would really like to get this in for pandas 3.0 tbh, and this topic actually has a real-world use case: microsoft/vscode-jupyter#13951)

> this is just highlighting the lack of standard API here

the "standard api" solution would be:

pandas.from_dataframe(pl_df.__dataframe_consortium_standard__().select(...).take(...))

does that really look any less clunky?

anmyachev · 2024-04-03T13:35:18Z

> I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:
>
> - select columns (currently possible)
> - select rows (not possible)

The ability to select a subset of rows in addition to selecting columns seems harmonious.

Implementation in Modin should not be a problem.

+1

add slice_rows to interchange protocol

94167b1

MarcoGorelli force-pushed the slice-rows branch from fc46bfd to 069e45b Compare February 13, 2024 16:17

use offset and length

1674828

MarcoGorelli force-pushed the slice-rows branch from 069e45b to 94167b1 Compare February 13, 2024 16:30

offset => size

5b09b58

kkraus14 reviewed Feb 13, 2024

MarcoGorelli requested review from rgommers and jorisvandenbossche February 13, 2024 21:19

rgommers added the interchange-protocol label Mar 28, 2024
