
Add methods for the Arrow PyCapsule Protocol to DataFrame/Column interchange protocol objects #342

Open · wants to merge 1 commit into main

Conversation

jorisvandenbossche (Member)

Addresses #279

Currently, I added schema and array support to Column, and schema and stream support to DataFrame. I think that's the typical way those objects are used (chunked dataframe, gets chunks+columns, and then single-chunk column), but for completeness (and because the spec doesn't distinguish between single vs multi-chunk objects) we can probably add both the array and stream method to both the DataFrame and Column object.

I also added the schema method to both DataFrame and Column, although those are strictly speaking not "schema" objects. But we don't have a dedicated schema object in the interchange spec (DataFrame has no attribute, and the Column.dtype attribute is a plain tuple)
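To make the layout concrete, here is a condensed sketch of the methods this PR adds (docstring text paraphrased; the exact wording is in the diff below):

from typing import Optional

class Column:
    def __arrow_c_schema__(self) -> object:
        """Export the column's type as an Arrow C schema PyCapsule."""
        ...

    def __arrow_c_array__(self, requested_schema: Optional[object] = None) -> object:
        """Export the (single-chunk) column as Arrow C array and schema PyCapsules."""
        ...

class DataFrame:
    def __arrow_c_schema__(self) -> object:
        """Export the dataframe's schema as an Arrow C schema PyCapsule."""
        ...

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        """Export the (possibly chunked) dataframe as an Arrow C stream PyCapsule."""
        ...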

@rgommers added the interchange-protocol and enhancement labels on Dec 21, 2023
@kkraus14 (Collaborator)

What should the behavior be if the library has its data on, say, a GPU instead of CPU?

Throw? Copy to CPU?

If we support this CPU-only protocol, why don't we support other CPU-only protocols as well?

@jorisvandenbossche (Member, Author) commented on Jan 9, 2024

What should the behavior be if the library has its data on, say, a GPU instead of CPU? Throw? Copy to CPU?

That's one thing I still needed to add: if you do not support this protocol, I would propose to indicate this by not having the methods (so you can do hasattr(obj, "...") as a proxy for checking whether the object supports this protocol).

So if we decide that GPU libraries shouldn't support this, then they should simply not add those methods.
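For example, the consumer-side check could look roughly like this (a sketch with hypothetical function names, not something defined by the spec):

def supports_arrow_pycapsule(obj) -> bool:
    # Absence of the dunder methods signals "protocol not supported"
    # (e.g. a GPU-backed object that doesn't want to silently copy to CPU
    # would simply not define them).
    return hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__")

def export_stream_capsule(df):
    # Hypothetical consumer-side dispatch, for illustration only.
    if hasattr(df, "__arrow_c_stream__"):
        return df.__arrow_c_stream__()  # PyCapsule, handed straight to an Arrow consumer
    raise TypeError("object does not support the Arrow PyCapsule protocol")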

Personally I don't have much experience with GPUs, but based on following earlier discussions, I think the general preference is that there is no silent copy?

Certainly given the fact we want to expand this protocol to support GPU data as well (apache/arrow#38325), I think not implementing this right now is the best option for objects backed by GPU data.

If we support this CPU-only protocol, why don't we support other CPU-only protocols as well?

Which other CPU-only protocol are you thinking about? I am not aware of any other dataframe protocol.
(There are of course array protocols, which you might be hinting at, but we already have DLPack at the buffer/array level, and it supports devices. So adding another array protocol at that level is much less needed, given that most array libraries nowadays support DLPack. Personally I think adding general Python buffer protocol support to Buffer would be nice, though.)

@MarcoGorelli (Contributor) left a comment

Looks good!

If these methods aren't strictly required, maybe this could be noted in the docstring? I think that would address Keith's concern

Sorry for taking a while to respond here; it took me some time to fill gaps in my knowledge (thanks Joris for patiently explaining and answering questions!)

@rgommers (Member)

So if we decide that GPU libraries shouldn't support this, then they should simply not add those methods.

I agree - if they'd start copying data to host memory now to support the CPU-only protocol, that would make it much harder to later add the device-agnostic version.

If there was movement and a timeline on apache/arrow#38325, it'd be easier to say "let's not do the CPU-only version". However, it looks like that may take quite a while. Somehow work on these protocols is always slow going (DLPack was/is no exception). So it seems reasonable to me to start supporting the CPU protocol now.

There is indeed the fragmentation on the array side, with too many different CPU-only protocols. However, on the dataframe side things are cleaner. If things could stay limited to:

  • __dataframe__ and from_dataframe at the Python API level
  • the Arrow CPU-only protocol as dunder methods
  • the Arrow device-agnostic protocol as dunder methods

then it looks like things are in pretty good shape.

@rgommers (Member) left a comment

Thanks @jorisvandenbossche! I don't have practical experience with this protocol, but it makes sense to me conceptually to add, and this PR looks like a good start to me. It's pretty self-contained, which is nice.

I have two main questions:

(1) Should these methods say to only implement them if the data is natively stored in Arrow format under the hood? If not, it may turn into yet more work for implementers - and require copying on the producer side.

(2) Exporting PyCapsule objects always comes with the "who owns this" question. For DLPack we worked hard at eliminating that problem by specifying a public function for it and making clear that the PyCapsule was supposed to be consumed once after creation and then immediately discarded, so it wasn't possible for PyCapsules to hang around in user land. I think the intent here is similar, but it's not mentioned at all. Should there be a requirement on this?
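For reference, the usage pattern on the consumer side is intended to match DLPack: the capsule is created, consumed, and released within a single call, so it never lingers in user land. A minimal sketch, assuming a pyarrow consumer (pyarrow added support for such objects in version 14, to the best of my knowledge):

import pyarrow as pa

def consume(interchange_df) -> pa.Table:
    # pa.table() calls __arrow_c_stream__ internally, imports the
    # ArrowArrayStream from the capsule, and releases it, all inside this
    # one call; the PyCapsule itself never becomes visible to user code.
    return pa.table(interchange_df)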

Export the Column as an Arrow C array and schema PyCapsule.

If the Column consists of multiple chunks, this method should raise
an error.
Review comment (Member):

Is this not supported at all? Or if this is supported only at the dataframe level (since the same restriction isn't mentioned there), should this say why and/or refer to that support at dataframe level?

Review comment (Member):

Would be useful to add a section:

Raises
------

Parameters
----------
requested_schema : PyCapsule, default None
The schema to which the dataframe should be casted, passed as a
Review comment (Member):

nit: "casted" -> "cast"

(also further down)


def __arrow_c_schema__(self) -> object:
"""
Export the schema of the DataFrae to a Arrow C schema PyCapsule.
Review comment (Member):

typo in DataFrame

"""
pass

def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
Review comment (Member):

Yet more conflicting terminology? Is "stream" supposed to mean "dataframe" here, rather than a CUDA stream? If so, won't that conflict with device support later, and/or be confused with DLPack stream support?

"""
pass

def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
Review comment (Collaborator):

How does one differentiate between a DataFrame and a struct column here (assuming those will be supported in the future)?

@kkraus14 (Collaborator)

I am still a strong -1 on this proposal.

The array API didn't implement __array__ or __array_interface__, despite those being more widely adopted and used than the equivalent Arrow protocols being added here. They weren't implemented because they didn't want to fragment implementations or cause downstream pain in leveraging non-CPU, non-NumPy implementations.

We have already defined a new interchange protocol that can handle Arrow-formatted data zero-copy, in addition to handling non-CPU memory and non-Arrow memory layouts.

@MarcoGorelli (Contributor) commented on Jan 22, 2024

That's OK

@jorisvandenbossche we can still use this in pandas/polars/pyarrow interchange.from_dataframe, e.g.

def from_dataframe(df: SupportsInterchange, *, allow_copy: bool = True) -> DataFrame:
    # fast path: the producer exposes Arrow data via the PyCapsule interface
    if hasattr(df, '__arrow_c_schema__'):
        return fastpath_constructor(df, allow_copy=allow_copy)
    # otherwise fall back to the existing interchange-protocol path
    return _from_dataframe(
        df.__dataframe__(allow_copy=allow_copy),  # type: ignore[arg-type]
        allow_copy=allow_copy,
    )

even if it's not officially part of the interchange protocol

@jorisvandenbossche (Member, Author)

The array API didn't implement __array__ or __array_interface__, despite those being more widely adopted and used than the equivalent Arrow protocols being added here. They weren't implemented because they didn't want to fragment implementations or cause downstream pain in leveraging non-CPU, non-NumPy implementations.

I understand why the array API didn't implement those, but I'm not sure how that is relevant for this discussion. I am not proposing to add this protocol based on some popularity metric, but because it adds missing features and convenience. The array protocols you mention also wouldn't add much additional value for the Array API, because it already has DLPack covering the same ground.
For the DataFrame interchange protocol, though, that is not the case. What I am proposing here has additional value on top of what we already have: it allows accessing the data at the DataFrame/Column level instead of only at the Buffer level, and it exposes the data through a stable C ABI interface, similarly to what DLPack does for arrays.

@kkraus14 what is not clear to me is whether your objection is mainly based on the lack of non-CPU device support, or more general. Assume for a moment that the Arrow community had already resolved the discussion around adding device support for the Arrow PyCapsule protocol, and we could directly add it here as well. Would you then be OK with it?

we can still use this in pandas/polars/pyarrow interchange.from_dataframe, ... even if it's not officially part of the interchange protocol

@MarcoGorelli that's certainly possible and would already improve that part of the interchange protocol's use cases, but it still doesn't enable usage of the interchange protocol in places that require Arrow memory (using duckdb as the typical example: they support Arrow memory, and I doubt they would ever support the interchange protocol directly. They could of course already support it by depending on pyarrow, but that dependency isn't needed with this PR).

@kkraus14 (Collaborator) commented on Feb 1, 2024

@kkraus14 what is not clear to me is whether your objection is mainly based on the lack of non-CPU device support, or more general. Assume for a moment that the Arrow community had already resolved the discussion around adding device support for the Arrow PyCapsule protocol, and we could directly add it here as well. Would you then be OK with it?

I would be more okay with it, but I still feel funny about giving the Arrow memory layout arguably preferential treatment over non-Arrow memory layouts, though we already do that for strings more or less.

I.e., if someone uses a sentinel value or a byte mask for representing nulls, should they not support this protocol, or should they support it via a copy + conversion? Neither is ideal from the perspective of a library that doesn't use the Arrow format for representing nulls but wants to support a standard.

@jorisvandenbossche (Member, Author)

I still feel funny about giving the Arrow memory layout arguably preferential treatment over non-Arrow memory layouts, though we already do that for strings more or less.

We indeed already use Arrow for several things. But I would also say that Arrow deserves this special treatment because it is the most widely adopted standard for representing tabular data?

I.e., if someone uses a sentinel value or a byte mask for representing nulls, should they not support this protocol, or should they support it via a copy + conversion? Neither is ideal from the perspective of a library that doesn't use the Arrow format for representing nulls but wants to support a standard.

If the consuming library doesn't use Arrow-compatible memory, it doesn't need to consume the dataframe object through this protocol. The existing Column Buffers access is still there, allowing you to get the data exactly as it is.

For the producing side, if the dataframe library doesn't use Arrow-compatible memory, it indeed needs to make a copy if it wants to support this protocol. But that copy would need to be made anyhow if the consumer wants Arrow data.
(And this is what we did for pandas when implementing this protocol directly on pd.DataFrame: depending on the data (type), it will not always be zero-copy. But note that the current interchange protocol is also not always zero-copy for pandas, in the case of string columns or of columns backed by a 2D block with row-major layout.)
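To illustrate that producer-side trade-off, a hypothetical column wrapper could delegate to pyarrow: zero-copy when the data is already Arrow-backed, and an explicit copy + conversion otherwise. The wrapper class and helper below are invented for the example, and pyarrow >= 14 is assumed for the PyCapsule dunders on Array/DataType:

import pyarrow as pa

class MyColumn:
    """Hypothetical column wrapper, for illustration only."""

    def __init__(self, data):
        # either a pyarrow.Array (Arrow-backed) or another native representation
        self._data = data

    def __arrow_c_schema__(self):
        # delegate to pyarrow, which implements the PyCapsule dunders itself
        return self._to_arrow().type.__arrow_c_schema__()

    def __arrow_c_array__(self, requested_schema=None):
        return self._to_arrow().__arrow_c_array__(requested_schema=requested_schema)

    def _to_arrow(self) -> pa.Array:
        if isinstance(self._data, pa.Array):
            return self._data           # already Arrow: zero-copy
        return pa.array(self._data)     # otherwise: copy + conversion, as noted above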
