
use_arrow=True vs False: different handling of date columns from shapefiles #262

Open
theroggy opened this issue Aug 10, 2023 · 6 comments


theroggy commented Aug 10, 2023

With use_arrow=False (the default), a date column is returned as dtype "datetime64".

With use_arrow=True, a date column is returned as dtype "object" with datetime.date objects as the data values.

I'm not sure which behaviour is preferable, but the two code paths should probably return the same result.

An additional complication: pyogrio writes "datetime64" columns to a column of Date type (tested with a shapefile), while "object" columns containing datetime.date values appear to be written as a "String" column.


theroggy commented Aug 10, 2023

A script that reproduces both aspects, plus how fiona handles this case for comparison (fiona formats dates as strings):

from pathlib import Path
import tempfile

import geopandas as gpd
import pyogrio

tmp_dir = Path(tempfile.gettempdir())
url = "https://github.com/theroggy/pysnippets/raw/main/pysnippets/pyogrio/polygon-parcel_31370.zip"
for use_arrow in [True, False]:
    gdf = pyogrio.read_dataframe(url, use_arrow=use_arrow)
    print(
        f"input_file, use_arrow: {use_arrow}, column dtype: {gdf['DATUM'].dtype}, "
        f"value: {gdf['DATUM'][0]}, value type: {type(gdf['DATUM'][0])}"
    )
    written_path = tmp_dir / f"output_{use_arrow}.shp"
    pyogrio.write_dataframe(gdf, written_path)
    gdf = pyogrio.read_dataframe(written_path, use_arrow=False)
    print(
        f"written_file, use_arrow: {use_arrow}, column dtype: {gdf['DATUM'].dtype}, "
        f"value: {gdf['DATUM'][0]}, value type: {type(gdf['DATUM'][0])}"
    )

# Do a round-trip using fiona for comparison
gdf = gpd.read_file(url, engine="fiona")
print(
    f"input_file, fiona, column dtype: {gdf['DATUM'].dtype}, "
    f"value: {gdf['DATUM'][0]}, value type: {type(gdf['DATUM'][0])}"
)
written_path = tmp_dir / "output_fiona.shp"
gdf.to_file(written_path)
gdf = gpd.read_file(written_path, engine="fiona")
print(
    f"written_file, fiona, column dtype: {gdf['DATUM'].dtype}, "
    f"value: {gdf['DATUM'][0]}, value type: {type(gdf['DATUM'][0])}"
)

(relevant) output:

input_file, use_arrow: True, column dtype: object, value: 2020-05-01, value type: <class 'datetime.date'>
written_file, use_arrow: True, column dtype: object, value: 2020-05-01, value type: <class 'str'>
input_file, use_arrow: False, column dtype: datetime64[s], value: 2020-05-01 00:00:00, value type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
written_file, use_arrow: False, column dtype: datetime64[s], value: 2020-05-01 00:00:00, value type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
input_file, fiona, column dtype: object, value: 2020-05-01, value type: <class 'str'>
written_file, fiona, column dtype: object, value: 2020-05-01, value type: <class 'str'>


theroggy commented Aug 10, 2023

In geofileops I'm thinking about dealing with this as follows, i.e. converting these columns to datetime64:

    import datetime

    import pandas as pd

    # Cast columns that are of object dtype but contain datetime.date or
    # datetime.datetime values to proper datetime64 columns.
    if len(result_gdf) > 0:
        for column in result_gdf.select_dtypes(include=["object"]):
            if isinstance(result_gdf[column].iloc[0], (datetime.date, datetime.datetime)):
                result_gdf[column] = pd.to_datetime(result_gdf[column])
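A minimal, self-contained demo of that workaround, using a plain pandas DataFrame instead of a GeoDataFrame (the DATUM column name is just illustrative):

```python
import datetime

import pandas as pd

# An "object" column holding datetime.date values, as returned by
# read_dataframe(..., use_arrow=True).
df = pd.DataFrame({"DATUM": [datetime.date(2020, 5, 1), datetime.date(2021, 1, 2)]})
assert df["DATUM"].dtype == object

# Cast object columns whose first value is a date/datetime to datetime64.
for column in df.select_dtypes(include=["object"]):
    if isinstance(df[column].iloc[0], (datetime.date, datetime.datetime)):
        df[column] = pd.to_datetime(df[column])

print(df["DATUM"].dtype)  # a datetime64 dtype
```

Note that checking only `.iloc[0]` is a heuristic: it assumes the column is homogeneous and non-empty.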


kylebarron commented Aug 10, 2023

My personal view in all of these data type issues is that in the long term it makes sense to adopt use_arrow=True as the standard and deprecate use_arrow=False. For one, much of the data type handling that is currently done manually in the use_arrow=False path is handled automatically by GDAL via its RFC 86 Arrow bindings. The Arrow bindings should also be potentially faster, have much lower maintenance requirements, and offer zero-copy interoperability with a larger ecosystem.

jorisvandenbossche commented

> If use_arrow=True, a date column is returned as dtype "object" with datetime.date objects as data values.

What I suppose is happening here is that we get an Arrow table with an actual date column (Arrow has date32 and date64 data types), and then in read_dataframe we convert this pyarrow table to pandas using table.to_pandas().
It is this Arrow->pandas conversion where pyarrow chooses to create datetime.date objects by default. This can be controlled by a keyword, though: if we want the same behaviour as use_arrow=False, we can specify date_as_object=False in the to_pandas call (and then there is no need to call pd.to_datetime afterwards).

I don't think I have a strong opinion on what option is best (datetime.date vs datetime64[ns]). Ideally in the near future, pandas will actually natively support a "date" data type, and then that would also solve this question.

(I do agree that longer term, we hopefully can just start relying on the ArrowStream-based interface of GDAL)

kylebarron commented

> But this can actually be controlled by a keyword

Ah right. That reminded me of #241 (comment), so in theory you could use types_mapper here for the pyarrow -> pandas conversion to get the date type you want?

jorisvandenbossche commented

You could indeed use types_mapper to preserve the Arrow date type in an Arrow-backed pandas extension dtype. But you can't use this keyword to choose between object and datetime64, because those are not extension dtypes (and the types_mapper keyword only works with extension dtypes).
