
Add methods for the Arrow PyCapsule Protocol to DataFrame/Column interchange protocol objects #342

Open · wants to merge 1 commit into main

Conversation

jorisvandenbossche (Member)

Addresses #279

Currently, I added schema and array support to Column, and schema and stream support to DataFrame. I think that's the typical way those objects are used (chunked dataframe, gets chunks+columns, and then single-chunk column), but for completeness (and because the spec doesn't distinguish between single vs multi-chunk objects) we can probably add both the array and stream method to both the DataFrame and Column object.

I also added the schema method to both DataFrame and Column, although those are strictly speaking not "schema" objects. But we don't have a dedicated schema object in the interchange spec (DataFrame has no attribute, and the Column.dtype attribute is a plain tuple)
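To make the layout concrete, here is a condensed sketch of the methods this PR adds (docstring text paraphrased; the exact wording is in the diff below):

from typing import Optional

class Column:
    def __arrow_c_schema__(self) -> object:
        """Export the column's type as an Arrow C schema PyCapsule."""
        ...

    def __arrow_c_array__(self, requested_schema: Optional[object] = None) -> object:
        """Export the (single-chunk) column as Arrow C array and schema PyCapsules."""
        ...

class DataFrame:
    def __arrow_c_schema__(self) -> object:
        """Export the dataframe's schema as an Arrow C schema PyCapsule."""
        ...

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        """Export the (possibly chunked) dataframe as an Arrow C stream PyCapsule."""
        ...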

@rgommers added the interchange-protocol and enhancement labels on Dec 21, 2023
@kkraus14 (Collaborator)

What should the behavior be if the library has its data on, say, a GPU instead of CPU?

Throw? Copy to CPU?

If we support this CPU-only protocol, why don't we support other CPU-only protocols as well?

@jorisvandenbossche (Member, Author) commented on Jan 9, 2024

What should the behavior be if the library has its data on, say, a GPU instead of CPU? Throw? Copy to CPU?

That's one thing I still needed to add: if you do not support this protocol, I would propose to indicate this by not having the methods (so you can do hasattr(obj, "...") as a proxy for checking whether the object supports this protocol).

So if we decide that GPU libraries shouldn't support this, then they should simply not add those methods.
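For example, the consumer-side check could look roughly like this (a sketch with hypothetical function names, not something defined by the spec):

def supports_arrow_pycapsule(obj) -> bool:
    # Absence of the dunder methods signals "protocol not supported"
    # (e.g. a GPU-backed object that doesn't want to silently copy to CPU
    # would simply not define them).
    return hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__")

def export_stream_capsule(df):
    # Hypothetical consumer-side dispatch, for illustration only.
    if hasattr(df, "__arrow_c_stream__"):
        return df.__arrow_c_stream__()  # PyCapsule, handed straight to an Arrow consumer
    raise TypeError("object does not support the Arrow PyCapsule protocol")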

Personally I don't have much experience with GPUs, but based on following earlier discussions, I think the general preference is that there is no silent copy?

Certainly given the fact we want to expand this protocol to support GPU data as well (apache/arrow#38325), I think not implementing this right now is the best option for objects backed by GPU data.

If we support this CPU-only protocol, why don't we support other CPU-only protocols as well?

Which other CPU-only protocol are you thinking about? I am not aware of any other dataframe protocol.
(There are of course array protocols, which you might be hinting at, but we already have DLPack at the buffer/array level, and it supports devices. So adding another array protocol at that level is much less needed, given that most array libraries nowadays support DLPack. Personally I think adding general Python buffer protocol support to Buffer would be nice, though.)

@MarcoGorelli (Contributor) left a comment

Looks good!

If these methods aren't strictly required, maybe this could be noted in the docstring? I think that would address Keith's concern

Sorry for taking a while to respond here; it took me some time to fill gaps in my knowledge (thanks Joris for patiently explaining and answering questions!)

@rgommers (Member)

So if we decide that GPU libraries shouldn't support this, then they should simply not add those methods.

I agree - if they'd start copying data to host memory now to support the CPU-only protocol, that would make it much harder to later add the device-agnostic version.

If there was movement and a timeline on apache/arrow#38325, it'd be easier to say "let's not do the CPU-only version". However, it looks like that may take quite a while. Somehow work on these protocols is always slow going (DLPack was/is no exception). So it seems reasonable to me to start supporting the CPU protocol now.

There is indeed the fragmentation on the array side, with too many different CPU-only protocols. However, on the dataframe side things are cleaner. If things could stay limited to:

  • __dataframe__ and from_dataframe at the Python API level
  • the Arrow CPU-only protocol as dunder methods
  • the Arrow device-agnostic protocol as dunder methods

then it looks like things are in pretty good shape.

@rgommers (Member) left a comment

Thanks @jorisvandenbossche! I don't have practical experience with this protocol, but it makes sense to me conceptually to add, and this PR looks like a good start to me. It's pretty self-contained, which is nice.

I have two main questions:

(1) Should these methods say to only implement them if the data is natively stored in Arrow format under the hood? If not, it may turn into yet more work for implementers - and require copying on the producer side.

(2) Exporting PyCapsule objects always comes with the "who owns this" question. For DLPack we worked hard at eliminating that problem by specifying a public function for it and making clear that the PyCapsule was supposed to be consumed once after creation and then immediately discarded, so it wasn't possible for PyCapsules to hang around in user land. I think the intent here is similar, but it's not mentioned at all. Should there be a requirement on this?
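For reference, the usage pattern on the consumer side is intended to match DLPack: the capsule is created, consumed, and released within a single call, so it never lingers in user land. A minimal sketch, assuming a pyarrow consumer (pyarrow added support for such objects in version 14, to the best of my knowledge):

import pyarrow as pa

def consume(interchange_df) -> pa.Table:
    # pa.table() calls __arrow_c_stream__ internally, imports the
    # ArrowArrayStream from the capsule, and releases it, all inside this
    # one call; the PyCapsule itself never becomes visible to user code.
    return pa.table(interchange_df)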

Export the Column as an Arrow C array and schema PyCapsule.

If the Column consists of multiple chunks, this method should raise
an error.
Review comment (Member):

Is this not supported at all? Or if this is supported only at the dataframe level (since the same restriction isn't mentioned there), should this say why and/or refer to that support at dataframe level?

Review comment (Member):

Would be useful to add a section:

Raises
------

Parameters
----------
requested_schema : PyCapsule, default None
The schema to which the dataframe should be casted, passed as a
Review comment (Member):

nit: "casted" -> "cast"

(also further down)


def __arrow_c_schema__(self) -> object:
"""
Export the schema of the DataFrae to a Arrow C schema PyCapsule.
Review comment (Member):

typo in DataFrame

"""
pass

def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
Review comment (Member):

Yet more conflicting terminology? Is "stream" supposed to mean "dataframe" here, rather than a CUDA stream? If so, won't that conflict with device support later, and/or be confused with DLPack stream support?

"""
pass

def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
Review comment (Collaborator):

How does one differentiate between a DataFrame and a struct column here (assuming those will be supported in the future)?

@kkraus14 (Collaborator)

I am still a strong -1 on this proposal.

The array API didn't implement __array__ or __array_interface__, despite those being more widely adopted and used than the equivalent Arrow protocols being added here. They weren't implemented because they didn't want to fragment implementations or cause downstream pain in leveraging non-CPU, non-NumPy implementations.

We have already defined a new interchange protocol that can handle Arrow-formatted data zero-copy, in addition to handling non-CPU memory and non-Arrow memory layouts.

@MarcoGorelli (Contributor) commented on Jan 22, 2024

That's OK

@jorisvandenbossche we can still use this in pandas/polars/pyarrow interchange.from_dataframe, e.g.

def from_dataframe(df: SupportsInterchange, *, allow_copy: bool = True) -> DataFrame:
    # fast path: the producer exposes Arrow data via the PyCapsule interface
    if hasattr(df, '__arrow_c_schema__'):
        return fastpath_constructor(df, allow_copy=allow_copy)
    # otherwise fall back to the existing interchange-protocol path
    return _from_dataframe(
        df.__dataframe__(allow_copy=allow_copy),  # type: ignore[arg-type]
        allow_copy=allow_copy,
    )

even if it's not officially part of the interchange protocol

@jorisvandenbossche (Member, Author)

The array API didn't implement __array__ or __array_interface__, despite those being more widely adopted and used than the equivalent Arrow protocols being added here. They weren't implemented because they didn't want to fragment implementations or cause downstream pain in leveraging non-CPU, non-NumPy implementations.

I understand why the array API didn't implement those, but I'm not sure how that is relevant for this discussion. I am not proposing to add this protocol based on some popularity metric, but because it adds missing features and convenience. The array protocols you mention also wouldn't add much additional value for the Array API, because it already has DLPack covering the same ground.
For the DataFrame interchange protocol, though, that is not the case. What I am proposing here has additional value on top of what we already have: it allows accessing the data at the DataFrame/Column level instead of only at the Buffer level, and it exposes the data through a stable C ABI interface, similarly to what DLPack does for arrays.

@kkraus14 what is not clear to me is whether your objection is mainly based on the lack of non-CPU device support, or more general. Assume for a moment that the Arrow community had already resolved the discussion around adding device support for the Arrow PyCapsule protocol, and we could directly add it here as well. Would you then be OK with it?

we can still use this in pandas/polars/pyarrow interchange.from_dataframe, ... even if it's not officially part of the interchange protocol

@MarcoGorelli that's certainly possible and would already improve that part of the interchange protocol's use cases, but it still doesn't enable usage of the interchange protocol in places that require Arrow memory (using duckdb as the typical example: they support Arrow memory, and I doubt they would ever support the interchange protocol directly. They could of course already support it by depending on pyarrow, but that dependency isn't needed with this PR).

@kkraus14 (Collaborator) commented on Feb 1, 2024

@kkraus14 what is not clear to me is whether your objection is mainly based on the lack of non-CPU device support, or more general. Assume for a moment that the Arrow community had already resolved the discussion around adding device support for the Arrow PyCapsule protocol, and we could directly add it here as well. Would you then be OK with it?

I would be more okay with it, but I still feel funny about giving the Arrow memory layout arguably preferential treatment over non-Arrow memory layouts, though we already do that for strings more or less.

I.e., if someone uses a sentinel value or a byte mask for representing nulls, should they not support this protocol, or should they support it via a copy + conversion? Neither is ideal from the perspective of a library that doesn't use the Arrow format for representing nulls but wants to support a standard.

@jorisvandenbossche (Member, Author)

I still feel funny about giving the Arrow memory layout arguably preferential treatment over non-Arrow memory layouts, though we already do that for strings more or less.

We indeed already use Arrow for several things. But I would also say that Arrow deserves this special treatment because it is the most widely adopted standard for representing tabular data?

I.e., if someone uses a sentinel value or a byte mask for representing nulls, should they not support this protocol, or should they support it via a copy + conversion? Neither is ideal from the perspective of a library that doesn't use the Arrow format for representing nulls but wants to support a standard.

If the consuming library doesn't use Arrow-compatible memory, it doesn't need to consume the dataframe object through this protocol. The existing Column Buffers access is still there, allowing you to get the data exactly as it is.

For the producing side, if the dataframe library doesn't use Arrow-compatible memory, it indeed needs to make a copy if it wants to support this protocol. But that copy would need to be made anyhow if the consumer wants Arrow data.
(And this is what we did for pandas when implementing this protocol directly on pd.DataFrame: depending on the data (type), it will not always be zero-copy. But note that the current interchange protocol is also not always zero-copy for pandas, in the case of string columns or of columns backed by a 2D block with row-major layout.)
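To illustrate that producer-side trade-off, a hypothetical column wrapper could delegate to pyarrow: zero-copy when the data is already Arrow-backed, and an explicit copy + conversion otherwise. The wrapper class and helper below are invented for the example, and pyarrow >= 14 is assumed for the PyCapsule dunders on Array/DataType:

import pyarrow as pa

class MyColumn:
    """Hypothetical column wrapper, for illustration only."""

    def __init__(self, data):
        # either a pyarrow.Array (Arrow-backed) or another native representation
        self._data = data

    def __arrow_c_schema__(self):
        # delegate to pyarrow, which implements the PyCapsule dunders itself
        return self._to_arrow().type.__arrow_c_schema__()

    def __arrow_c_array__(self, requested_schema=None):
        return self._to_arrow().__arrow_c_array__(requested_schema=requested_schema)

    def _to_arrow(self) -> pa.Array:
        if isinstance(self._data, pa.Array):
            return self._data           # already Arrow: zero-copy
        return pa.array(self._data)     # otherwise: copy + conversion, as noted above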
