
use_arrow=True vs False: different handling of date columns from shapefiles #262

Open
theroggy opened this issue Aug 10, 2023 · 6 comments


theroggy commented Aug 10, 2023

With use_arrow=False (the default), a date column is returned as dtype "datetime64".

With use_arrow=True, a date column is returned as dtype "object" with datetime.date objects as the data values.

I'm not sure which behaviour is preferable, but the two code paths should probably return the same result.

An additional complication: pyogrio writes "datetime64" columns to a column of Date type (tested with a shapefile), while "object" columns containing datetime.date values appear to be written as a "String" column.


theroggy commented Aug 10, 2023

A script that reproduces both aspects, plus how fiona handles this case for comparison (fiona formats dates as strings):

from pathlib import Path
import tempfile

import geopandas as gpd
import pyogrio

tmp_dir = Path(tempfile.gettempdir())
url = "https://github.com/theroggy/pysnippets/raw/main/pysnippets/pyogrio/polygon-parcel_31370.zip"
for use_arrow in [True, False]:
    gdf = pyogrio.read_dataframe(url, use_arrow=use_arrow)
    print(
        f"input_file, use_arrow: {use_arrow}, column dtype: {gdf['DATUM'].dtype}, "
        f"value: {gdf['DATUM'][0]}, value type: {type(gdf['DATUM'][0])}"
    )
    written_path = tmp_dir / f"output_{use_arrow}.shp"
    pyogrio.write_dataframe(gdf, written_path)
    gdf = pyogrio.read_dataframe(written_path, use_arrow=False)
    print(
        f"written_file, use_arrow: {use_arrow}, column dtype: {gdf['DATUM'].dtype}, "
        f"value: {gdf['DATUM'][0]}, value type: {type(gdf['DATUM'][0])}"
    )

# Do a round-trip using fiona for comparison
gdf = gpd.read_file(url, engine="fiona")
print(
    f"input_file, fiona, column dtype: {gdf['DATUM'].dtype}, "
    f"value: {gdf['DATUM'][0]}, value type: {type(gdf['DATUM'][0])}"
)
written_path = tmp_dir / "output_fiona.shp"
gdf.to_file(written_path)
gdf = gpd.read_file(written_path, engine="fiona")
print(
    f"written_file, fiona, column dtype: {gdf['DATUM'].dtype}, "
    f"value: {gdf['DATUM'][0]}, value type: {type(gdf['DATUM'][0])}"
)

(relevant) output:

input_file, use_arrow: True, column dtype: object, value: 2020-05-01, value type: <class 'datetime.date'>
written_file, use_arrow: True, column dtype: object, value: 2020-05-01, value type: <class 'str'>
input_file, use_arrow: False, column dtype: datetime64[s], value: 2020-05-01 00:00:00, value type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
written_file, use_arrow: False, column dtype: datetime64[s], value: 2020-05-01 00:00:00, value type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
input_file, fiona, column dtype: object, value: 2020-05-01, value type: <class 'str'>
written_file, fiona, column dtype: object, value: 2020-05-01, value type: <class 'str'>


theroggy commented Aug 10, 2023

In geofileops I'm thinking about dealing with this as follows, i.e. converting these columns to datetime64:

    import datetime

    import pandas as pd

    # Cast columns that are of object dtype but contain datetime.date or
    # datetime.datetime values to proper datetime64 columns.
    if len(result_gdf) > 0:
        for column in result_gdf.select_dtypes(include=["object"]):
            if isinstance(result_gdf[column].iloc[0], (datetime.date, datetime.datetime)):
                result_gdf[column] = pd.to_datetime(result_gdf[column])
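A minimal, self-contained demo of that workaround, using a plain pandas DataFrame instead of a GeoDataFrame (the DATUM column name is just illustrative):

```python
import datetime

import pandas as pd

# An "object" column holding datetime.date values, as returned by
# read_dataframe(..., use_arrow=True).
df = pd.DataFrame({"DATUM": [datetime.date(2020, 5, 1), datetime.date(2021, 1, 2)]})
assert df["DATUM"].dtype == object

# Cast object columns whose first value is a date/datetime to datetime64.
for column in df.select_dtypes(include=["object"]):
    if isinstance(df[column].iloc[0], (datetime.date, datetime.datetime)):
        df[column] = pd.to_datetime(df[column])

print(df["DATUM"].dtype)  # a datetime64 dtype
```

Note that checking only `.iloc[0]` is a heuristic: it assumes the column is homogeneous and non-empty.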


kylebarron commented Aug 10, 2023

My personal view in all of these data type issues is that in the long term it makes sense to adopt use_arrow=True as the standard and deprecate use_arrow=False. For one, much of the data type handling that is currently done manually in the use_arrow=False path is handled automatically by GDAL via its RFC 86 Arrow bindings. The Arrow bindings should also be potentially faster, have much lower maintenance requirements, and offer zero-copy interoperability with a larger ecosystem.

jorisvandenbossche commented

> If use_arrow=True, a date column is returned as dtype "object" with datetime.date objects as data values.

What I suppose is happening here is that we get an Arrow table with an actual date column (Arrow has date32 and date64 data types), and then in read_dataframe we convert this pyarrow table to pandas using table.to_pandas().
It is this Arrow->pandas conversion where pyarrow chooses to create datetime.date objects by default. This can be controlled by a keyword, though: if we want the same behaviour as use_arrow=False, we can specify date_as_object=False in the to_pandas call (and then there is no need to call pd.to_datetime afterwards).

I don't think I have a strong opinion on what option is best (datetime.date vs datetime64[ns]). Ideally in the near future, pandas will actually natively support a "date" data type, and then that would also solve this question.

(I do agree that longer term, we hopefully can just start relying on the ArrowStream-based interface of GDAL)

kylebarron commented

> But this can actually be controlled by a keyword

Ah right. That reminded me of #241 (comment), so in theory you could use types_mapper here for the pyarrow -> pandas conversion to get the date type you want?

jorisvandenbossche commented

You could indeed use types_mapper to preserve the Arrow date type in an Arrow-backed pandas extension dtype. But you can't use this keyword to choose between object and datetime64, because those are not extension dtypes (and the types_mapper keyword only works with extension dtypes).
