ENH: Add support for writing more ExtensionArray/PyArrow types #257

himikof · 2023-07-08T10:05:07Z

This PR improves the ExtensionArray writing support from #232 to handle other array types (including PyArrow ones).
Before it, such fields were often silently and unexpectedly casted to object dtype and then to OFTString.

Instead of casting the data array to a basic numpy dtype and then using the data array dtype to infer the OGR type in the Cython code, the high-level code now infers the equivalent numpy dtype and passes it into Cython, leaving the ExtensionArray with data as-is. As a side benefit, field data arrays are now never copied during writing.

Addition of DTYPE_OGR_FIELD_TYPES["string"] allows the infer_field_types function to handle the non-ambiguous string data case, where no additional type inference is needed.

jorisvandenbossche

Thanks for the contribution!

I think it might be worth checking the performance of passing down the ArrowExtensionArray as is to the cython code, versus passing down the equivalent numpy array (at least for simple dtypes like int and float). Given that we repeatedly index into that array while iterating over the records, it might be that indexing into a numpy array is faster and it could be worth it to actually do the conversion.

One difficulty is actually getting that numpy array. Pandas has a pyarrow_array_to_numpy_and_mask utility function. Maybe that could be used through __from_arrow__, so one option could be something like:

In [33]: s = pd.Series([1, 2, None], dtype="int64[pyarrow]")

In [34]: pa_arr = s.array._pa_array

In [35]: dtype_mapping = {
    ...:         pa.int8(): pd.Int8Dtype(),
    ...:         pa.int16(): pd.Int16Dtype(),
    ...:         pa.int32(): pd.Int32Dtype(),
    ...:         pa.int64(): pd.Int64Dtype(),
    ...:         pa.uint8(): pd.UInt8Dtype(),
    ...:         pa.uint16(): pd.UInt16Dtype(),
    ...:         pa.uint32(): pd.UInt32Dtype(),
    ...:         pa.uint64(): pd.UInt64Dtype(),
    ...:         pa.bool_(): pd.BooleanDtype(),
    ...:         pa.string(): pd.StringDtype(),
    ...:         pa.float32(): pd.Float32Dtype(),
    ...:         pa.float64(): pd.Float64Dtype(),
    ...:     }

In [36]: masked_arr = dtype_mapping[pa_arr.type].__from_arrow__(pa_arr)

In [37]: masked_arr
Out[37]: 
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64

In [38]: masked_arr._data
Out[38]: array([1, 2, 0])

In [39]: masked_arr._mask
Out[39]: array([False, False,  True])

(I am not sure there is an official public method to get the pyarrow Array from the ArrowExtensionArray)

Also for the StringArray (which is currently already working), I expect a single conversion to numpy object array (i.e. converting to Python strings in one go) will be more efficient than iteratively getting a Python string one at a time.

jorisvandenbossche · 2023-07-11T08:24:15Z

(I am not sure there is an official public method to get the pyarrow Array from the ArrowExtensionArray)

Probably just pa.array(s.array), that should be zero copy give you the underlying pyarrow Array

himikof · 2023-07-11T18:50:33Z

Thank you for telling me about pyarrow_array_to_numpy_and_mask and __from_arrow__, I totally missed them!
I must admit I did not think much about the Python code overhead while trying to ensure correct behavior.

Now that I though more about how to implement this efficiently, I wonder if just making the Cython code directly support a pyarrow.Array/pyarrow.ChunkedArray as field_data would be better. It should be just as efficient as ndarray for simple integer/float types (if done carefully), and could be made much more efficient for strings (no need to allocate+encode a str, just access the underlying UTF-8 data directly). It would also allow to implement OFTFooList/OFTBinary writing support in a straightforward way.

So I think the following top-level logic would be needed:

If the array is one of the Pandas masked arrays, use
field_data=col._data, field_dtype=col.dtype.numpy_dtype, field_mask=col.mask
If the array is an ArrowExtensionArray, use
field_data=col.__arrow_array__(), field_dtype=col.dtype, field_mask=None.
The masking will be handled internally in this case.
If the array is some other unknown kind of ExtensionArray, use a slower fallback logic of passing it directly as field_data (accepting the overhead as unavoidable).
If the array is a regular ndarray, use field_data=col, field_dtype=col.dtype, field_mask=None as before.

The Cython code can do one of 3 things to access the values:

Optionally use the Cython typed memoryview (with fused type generics) to access a value of a primitive type from ndarray/pyarrow.Array. This would be even faster than the current __getitem__ call.
Use ndarray.__getitem__ as before (or ExtensionArray.__getitem__ in the slower fallback case).
Use pyarrow.Array.__getitem__/pyarrow.ChunkedArray.__getitem__ and process the returned Scalar. This will allow to easily support types like OFTIntegerList, OFTBinary and do a one-copy UTF-8 OFTString optimized implementation.

If this sounds reasonable I can try to implement this alternative instead.

ENH: Add support for writing more ExtensionArray types

7769bfd

himikof force-pushed the fix-arrow-types-write branch from d682e3d to 7769bfd Compare July 8, 2023 10:12

jorisvandenbossche reviewed Jul 11, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add support for writing more ExtensionArray/PyArrow types #257

ENH: Add support for writing more ExtensionArray/PyArrow types #257

himikof commented Jul 8, 2023

jorisvandenbossche left a comment

jorisvandenbossche commented Jul 11, 2023

himikof commented Jul 11, 2023

ENH: Add support for writing more ExtensionArray/PyArrow types #257

Are you sure you want to change the base?

ENH: Add support for writing more ExtensionArray/PyArrow types #257

Conversation

himikof commented Jul 8, 2023

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Jul 11, 2023

himikof commented Jul 11, 2023