The Parquet data format is increasingly popular; the existing GDAL-OGR code[0] relies on the Apache Arrow libraries to ingest Parquet.
There is a pure-Python alternative, fastparquet[1], also known as python-parquet. The only unusual library dependency of fastparquet is cramjam[2].
Enhancement: consider adding fastparquet as an alternative Parquet reader implementation in GDAL-OGR.
Other implementations of Parquet readers include Polars[3] and DuckDB[4][build].
What would be the purpose of switching to an alternative implementation for Parquet reading? Is it related to the discussion on the lack of libarrow/libparquet packaging in the official Debian repositories? libarrow/libparquet is still packaged in an Apache APT repository, so it is not that bad.
Regarding the listed alternatives:
fastparquet: using a Python package to implement an OGR driver is not realistic, at least not as an official prime-time driver, as it would cause serious integration issues and potentially performance issues. RFC 76 (OGR Python drivers) exists, but it was designed for quick development of drivers rather than for performance.
DuckDB: AFAIK, the DuckDB packaging story is pretty much nonexistent, with a static linking strategy that is miles away from traditional packaging habits, and with Parquet support probably hidden and not available as a library that others can use.
Polars: AFAICS, there is no advertised C/C++ API. It seems the Python binding is done directly with Rust, and not through C, so it is not something that could be used natively by an OGR C++ driver.
All in all, I see nothing obvious that would justify the effort of developing a new implementation of the OGR Parquet driver. libarrow/libparquet is, in my perception, the reference implementation, is actively developed and maintained, and is feature-complete.
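For anyone following along who is unsure whether their GDAL build includes the existing libarrow-based driver, here is a minimal sketch using the GDAL Python bindings. It assumes the `osgeo` package is installed. The helper names `parquet_driver_available` and `count_features` are made up for illustration and are not part of GDAL.

```python
def parquet_driver_available() -> bool:
    """Return True if this GDAL build registers the OGR "Parquet" driver."""
    try:
        # GDAL Python bindings; assumed to be installed, hence the guard.
        from osgeo import gdal
    except ImportError:
        return False
    # GetDriverByName returns None when the driver was not compiled in.
    return gdal.GetDriverByName("Parquet") is not None


def count_features(path: str) -> int:
    """Open a Parquet file through OGR and count features in its first layer.

    `path` is a hypothetical file supplied by the caller.
    """
    from osgeo import gdal

    gdal.UseExceptions()  # raise Python exceptions instead of returning None
    ds = gdal.OpenEx(path, gdal.OF_VECTOR, allowed_drivers=["Parquet"])
    layer = ds.GetLayer(0)
    return layer.GetFeatureCount()


if __name__ == "__main__":
    print("Parquet driver available:", parquet_driver_available())
```

If the driver is reported as unavailable, the build was most likely configured without libarrow/libparquet, which is the packaging situation discussed above.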
[0] https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/parquet/CMakeLists.txt
[1] https://pypi.org/project/fastparquet/
[2] https://github.com/milesgranger/cramjam
[3] https://pola.rs/
[4] https://github.com/duckdb