
[Protocol] Make Column.get_buffers() docstring more explicit #272

Open: wants to merge 1 commit into `main`
Conversation

@pitrou (Author) commented Oct 3, 2023

Several implementations got `Column.get_buffers()` wrong by assuming the buffers' dtypes would be the same as the column dtype. Clarify the docstring to eliminate any ambiguity.

See apache/arrow#37598 for example.

Closes #273

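To illustrate the clarification, here is a minimal sketch of what `get_buffers()` returns for a STRING column. The `DtypeKind` values and the `(kind, bit-width, format string, byte order)` tuple shape are modeled on the protocol but assumed here, not taken from this PR; the point is that none of the per-buffer dtypes equals the column dtype.

```python
from enum import IntEnum

class DtypeKind(IntEnum):
    # Values modeled on the protocol's DtypeKind enum (assumed here).
    INT = 0
    UINT = 1
    BOOL = 20
    STRING = 21

# Dtype tuples follow the protocol's (kind, bit-width, format string, byte order) shape.
column_dtype = (DtypeKind.STRING, 8, "u", "=")

buffers = {
    # data buffer: raw UTF-8 bytes -> 8-bit UINT, not STRING
    "data": (b"helloworld", (DtypeKind.UINT, 8, "C", "=")),
    # offsets buffer: indices into the data buffer -> INT64 (or INT32 for 32-bit offsets)
    "offsets": ([0, 5, 10], (DtypeKind.INT, 64, "l", "=")),
    # validity buffer: null mask -> BOOL
    "validity": ([1, 1], (DtypeKind.BOOL, 8, "b", "=")),
}

# None of the per-buffer dtypes matches the column dtype:
assert all(dtype[0] != column_dtype[0] for _, dtype in buffers.values())
```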
@pitrou (Author) commented Oct 3, 2023

@MarcoGorelli (Contributor)

I do agree, but some care needs to be taken. Currently, `from_dataframe` in pandas assumes the buffer dtype is the same as the column dtype, so if the buffer dtype were fixed immediately, this would break libraries that use the interchange protocol (e.g. Plotly).

Thoughts on the approach suggested in pandas-dev/pandas#54781 (comment)? I think we should probably wait 1–2 years before updating implementations' buffer dtypes.

@jorisvandenbossche (Member)

Thanks @pitrou! I was also just opening #273 about this, but this PR can close that issue then ;)

@jorisvandenbossche (Member)

I agree we should be careful in fixing this. But I assume a start that is needed anyway is 1) agreeing that this is the correct interpretation, and 2) if so, updating all libraries' `from_dataframe` functions to handle both ways of specifying the buffers' dtypes?

And once that is done, we can see how to go about updating the return value of the buffer's dtype (and how long that should take, etc.).

@stinodego (Contributor) commented Oct 3, 2023

> updating all libraries' `from_dataframe` functions to handle both ways of specifying the buffers' dtypes?

Implementations of `from_dataframe` should just disregard the data buffer dtype entirely. `column.dtype` already tells you what to expect in the data buffer (e.g. a column dtype of STRING means an 8-bit UINT data buffer).

If it's implemented like this (everywhere), the data buffer dtype can be changed without breaking `from_dataframe`.
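The rule described above can be sketched as a tiny lookup (hypothetical helper names, not pandas' or Polars' actual code): the consumer derives the data buffer's element type from the *column* dtype alone, ignoring whatever dtype the producer attached to the data buffer.

```python
# Column-dtype kinds, represented as plain strings for this sketch.
STRING, UINT8, INT64, FLOAT64 = "STRING", "UINT8", "INT64", "FLOAT64"

def data_buffer_dtype(column_dtype_kind: str) -> str:
    """Element type of the data buffer, derived from the column dtype only.

    For string columns the data buffer is always raw UTF-8 bytes (UINT8);
    for fixed-width columns it matches the column dtype.
    """
    if column_dtype_kind == STRING:
        return UINT8
    return column_dtype_kind

assert data_buffer_dtype(STRING) == UINT8
assert data_buffer_dtype(INT64) == INT64
```

A consumer written this way keeps working whether a producer reports the data buffer's dtype correctly (UINT8) or incorrectly (the column dtype).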

@pitrou (Author) commented Oct 3, 2023

> Implementations of `from_dataframe` should just disregard the data buffer dtype entirely. `column.dtype` already tells you what to expect in the data buffer (e.g. dtype STRING will mean an 8-bit UINT data buffer).

Where does the spec spell out the expected per-buffer dtypes for each column dtype?

Arrow, for example, has different string-like types: one with 32-bit offsets and another with 64-bit offsets. If the dataframe consumer disregards the "offsets" buffer dtype, it will misread at least some of the columns exported by Arrow.
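A `struct`-based sketch (not actual interchange-protocol code) of why the offsets buffer's declared dtype matters: the same raw offset bytes decode differently depending on whether they are read as 32-bit or 64-bit integers, so a consumer that guesses the width will misread one of Arrow's two string layouts.

```python
import struct

data = b"helloworld"

def decode_strings(data: bytes, offsets_bytes: bytes, offset_bits: int):
    """Decode a string column's values given its data and offsets buffers."""
    fmt = {32: "i", 64: "q"}[offset_bits]          # int32 vs int64
    n = len(offsets_bytes) // (offset_bits // 8)
    offsets = struct.unpack(f"<{n}{fmt}", offsets_bytes)
    return [data[a:b].decode() for a, b in zip(offsets, offsets[1:])]

offsets32 = struct.pack("<3i", 0, 5, 10)  # like Arrow "string" (32-bit offsets)
offsets64 = struct.pack("<3q", 0, 5, 10)  # like Arrow "large_string" (64-bit offsets)

# Reading each buffer with its declared width works:
assert decode_strings(data, offsets32, 32) == ["hello", "world"]
assert decode_strings(data, offsets64, 64) == ["hello", "world"]
# Misreading 64-bit offsets as 32-bit produces garbage:
assert decode_strings(data, offsets64, 32) != ["hello", "world"]
```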

@stinodego (Contributor)

> Where does the spec spell out the expected per-buffer dtypes for each column dtype?
>
> Arrow, for example, has different string-like types: one with 32-bit offsets and another with 64-bit offsets. If the dataframe consumer disregards the "offsets" buffer dtype, it will misread at least some of the columns exported by Arrow.

Yes, I was talking specifically about the data buffer dtype. The offsets buffer and validity buffer dtypes are very relevant.

@MarcoGorelli (Contributor)

> If it's implemented like this (everywhere), the data buffer dtype can be changed without breaking `from_dataframe`.

I think we should still take some care though. For example, if Polars 1.0.0 were to fix its buffer dtype and pandas 2.2.0 were to fix `from_dataframe` to not use the buffer dtype, then anyone with pandas 2.1.2 and Polars 1.0.0 installed who tried plotting with Plotly would run into issues.

@stinodego (Contributor) commented Oct 3, 2023

> I think we should still take some care though

For sure! Let's first get the `from_dataframe` implementations fixed; then we can update the data buffer dtype whenever we feel comfortable (no real rush).

I have a PR for Polars ready (pola-rs/polars#10787); as you can see, the change can be very small. I was planning on giving it about 6 months after the `from_dataframe` fixes from pyarrow and pandas come in. But I can be persuaded to hold off on that a bit longer 😬

@pitrou (Author) commented Oct 3, 2023

Is the spec supposed to be stable? There are a bunch of "TODO" and "TBD" statements, and implementations are generally very recent.

@jorisvandenbossche (Member)

> Implementations of `from_dataframe` should just disregard the data buffer dtype entirely. `column.dtype` already tells you what to expect in the data buffer

Thinking further along this path: if we update all implementations to disregard that information (because indeed you don't need it), shouldn't we then rather remove that information from the protocol?
It's probably a much more difficult change to make (and for implementations to correctly update for). But at the same time, if all our main (and cross-compat tested) implementations disregard it, we are also not really testing that we return the correct information there.
(Although we could of course quite easily add tests for exactly those dtypes to https://github.com/data-apis/dataframe-interchange-tests, to ensure consistent behaviour.)

@pitrou (Author) commented Oct 5, 2023

Well, if you have a DATETIME column, for example, what is the implied dtype for the data buffer? Is it INT64, perhaps (though it might also be INT32 for a Unix timestamp)? It might be spelled out in the spec, but I'm certainly missing where.

Sidenote: it would be good to list all potential buffer dtypes for each column dtype somewhere, for reference and so that consumers can be sure they handle all cases. Likewise all potential bit widths and format strings for each `DtypeKind`.

@stinodego (Contributor)

> Well, if you have a DATETIME column, for example, what is the implied dtype for the data buffer? It might be spelled out in the spec, but I'm certainly missing where.

This is defined by the Arrow C types.
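Concretely, the Arrow columnar format (whose format strings the protocol reuses) stores temporal types as integers; e.g. a timestamp column's data buffer holds 64-bit epoch values. A small lookup sketch of that mapping, assumed from the Arrow C data interface rather than stated in this thread:

```python
# Arrow C data interface format strings -> physical storage of the data buffer
# (timezone suffixes after "ts?:" omitted for brevity).
ARROW_STORAGE = {
    "tss:": "int64",  # timestamp[s]  -> 64-bit integer
    "tsm:": "int64",  # timestamp[ms] -> 64-bit integer
    "tsu:": "int64",  # timestamp[us] -> 64-bit integer
    "tdD": "int32",   # date32 (days since epoch)     -> 32-bit integer
    "tdm": "int64",   # date64 (milliseconds since epoch) -> 64-bit integer
}

# A DATETIME column's data buffer dtype follows from its Arrow format string:
assert ARROW_STORAGE["tsu:"] == "int64"
assert ARROW_STORAGE["tdD"] == "int32"
```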

@pitrou (Author) commented Oct 5, 2023

> > Well, if you have a DATETIME column, for example, what is the implied dtype for the data buffer? It might be spelled out in the spec, but I'm certainly missing where.
>
> This is defined by the Arrow C types.

Hmm, I see. Really, this spec is hard to understand, as very important details are hidden in the docstrings of various properties and methods.

@jorisvandenbossche (Member)

In addition, it is also rather under-tested (all those details about expected buffer dtypes should ideally have shared tests in https://github.com/data-apis/dataframe-interchange-tests IMO).

@stinodego (Contributor)

> Thinking further along this path: if we update all implementations to disregard that information (because indeed you don't need it), shouldn't we then rather remove that information from the protocol?

`Column.get_buffers` should still return the dtype of the data buffer. This is valid information. Just because a `from_dataframe` implementation does not rely on this information does not mean it should be discarded.

In fact, I would like to use the data buffer dtype when implementing `from_dataframe` for Polars (first create a Series out of each buffer, then construct a new Series out of the data/validity/offsets Series). And I would have, if it weren't for the fact that all existing implementations have the wrong dtype there. This is what led me to discover the issue in the first place.
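The buffer-first construction pattern described above can be sketched as follows (hypothetical helper names, not Polars' actual implementation): each buffer becomes its own typed array using the dtype reported *for that buffer*, and the column is then assembled from those arrays.

```python
def series_from_buffer(values, dtype):
    # Stand-in for constructing a typed Series from raw buffer data,
    # trusting the dtype reported for this buffer.
    return {"dtype": dtype, "values": list(values)}

def string_column_from_buffers(data, offsets, validity):
    # Assemble a string column from its three per-buffer "Series".
    out = []
    pairs = zip(offsets["values"], offsets["values"][1:])
    for i, (a, b) in enumerate(pairs):
        if validity["values"][i]:
            out.append(bytes(data["values"][a:b]).decode())
        else:
            out.append(None)
    return out

data = series_from_buffer(b"helloworld", "uint8")     # raw UTF-8 bytes
offsets = series_from_buffer([0, 5, 10], "int64")     # value boundaries
validity = series_from_buffer([True, False], "bool")  # null mask

assert string_column_from_buffers(data, offsets, validity) == ["hello", None]
```

This only works if producers report per-buffer dtypes correctly, which is exactly why the docstring clarification in this PR matters.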

Merging this pull request may close: [protocol] Clarify the exact meaning of the buffer's dtype (`Dtype` tuple in `get_buffers()`)