PyArrow doesn't permit selective reading with ExtensionArray #2772

Open
agoose77 opened this issue Oct 25, 2023 · 2 comments · May be fixed by #3127
Labels
bug The problem described is something that must be fixed

Comments

@agoose77
Collaborator

Version of Awkward Array

main

Description and code to reproduce

We should stash the metadata in the Parquet schema rather than relying on ExtensionArray.
We still need ExtensionArray for to_arrow, because pa.array has no other mechanism for stashing metadata.

@agoose77 added the bug (unverified) and bug labels, then removed the bug (unverified) label, on Oct 25, 2023
@agoose77
Collaborator Author

It looks like this error message was added in apache/arrow#20385, which was a response to our own issue apache/arrow#33634

@jpivarski I take this as an indication that we should pursue a solution in Awkward rather than trying to get upstream support for partial reads. What do you think? (I haven't taken much time to read the whole discussion.)

@jpivarski
Member

Yeah, we need to do this ourselves. A good work-around will be:

  • ak.to_arrow retains the associated ExtensionArray logic. We still need to round-trip Awkward Arrays through pyarrow.array and these are not storage types that will need per-column selection. Also, with pyarrow.array, there is no Table metadata, so ExtensionArray remains the only way to do this.
  • ak.to_arrow_table changes in two ways: (1) it fills the table with pyarrow.arrays made with extensionarray=False and (2) it puts the Form and other Awkward information into the Table metadata.
  • ak.from_arrow applied to pyarrow.array uses the existing ExtensionArray logic (it has no choice), and ak.from_arrow applied to Table uses the Table metadata to losslessly reconstruct the Awkward Array.
  • The implementation of ak.from_arrow on Tables might proceed by reading the non-ExtensionArray columns, constructing the ExtensionArray type, applying it to the columns, and then using the existing ExtensionArray infrastructure to ensure that the Awkward Array is properly built. Alternatively, it might be an entirely different code path. On the one hand, we'd like to reuse code and treat pyarrow.array and pyarrow.Table in similar ways, but on the other hand, introducing the ExtensionArray could be more complicated than a straight conversion.
  • If the implementation still goes through ExtensionArray, we may want to keep the argument name extensionarray: bool in both ak.to_arrow and ak.to_arrow_table. If not, we might want to deprecate it in both, or only in ak.to_arrow_table, in favor of lossless: bool.

@jpivarski jpivarski added this to Unprioritized in Finalization Jan 19, 2024
@jpivarski jpivarski moved this from Unprioritized to P1 (highest) in Finalization Jan 19, 2024