consider fastparquet #9908

Open
darkblue-b opened this issue May 12, 2024 · 1 comment
@darkblue-b
Member

darkblue-b commented May 12, 2024

Feature description

The Parquet data format is increasingly popular; the existing GDAL-OGR code[0] relies on the Apache Arrow libraries to ingest Parquet.

There is a pure-Python alternative, fastparquet[1], also known as python-parquet. The only unusual library dependency of fastparquet is cramjam[2].

Enhancement -- consider adding fastparquet as an alternate parquet reader implementation in GDAL-OGR.

Other implementations of Parquet readers include Polars[3] and DuckDB[4][build].

[0] https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/parquet/CMakeLists.txt

[1] https://pypi.org/project/fastparquet/
[2] https://github.com/milesgranger/cramjam
[3] https://pola.rs/
[4] https://github.com/duckdb

Additional context

No response

@rouault
Member

rouault commented May 12, 2024

What would be the purpose of switching to an alternative implementation for Parquet reading? Is it related to the discussion on the lack of libarrow/libparquet packaging in the official Debian repositories? libarrow/libparquet is still packaged in an Apache APT repository, so the situation is not that bad.

Regarding the listed alternatives:

  • fastparquet: implementing an OGR driver on top of a Python package is not realistic, at least for an official prime-time driver, as it would cause serious integration issues and potentially performance issues. RFC 76 (OGR Python drivers) exists, but it was designed for quick development of drivers rather than for performance.
  • DuckDB: AFAIK, the DuckDB packaging story is pretty much non-existent, with a static linking strategy that is miles away from traditional packaging habits. Its Parquet support is also probably hidden and not available as a library that others can use.
  • Polars: AFAICS, there is no advertised C/C++ API. It seems the Python binding is done directly in Rust, not through C, so it is not something an OGR C++ driver could use natively.

All in all, nothing obvious to me would justify the effort of developing a new implementation of the OGR Parquet driver. In my perception, libarrow/libparquet is the reference implementation, is actively developed and maintained, and is full-featured.
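
As a practical aside to the packaging question, here is a minimal sketch for checking whether a local GDAL build includes the libarrow/libparquet-based Parquet driver. It assumes the GDAL Python bindings (`osgeo`) may or may not be installed; the helper name `parquet_driver_available` is hypothetical and not part of GDAL.

```python
def parquet_driver_available():
    """Report whether this GDAL build exposes the Parquet driver.

    Returns None when the GDAL Python bindings are not installed,
    otherwise True or False.
    """
    try:
        from osgeo import gdal
    except ImportError:
        # GDAL Python bindings are missing, so nothing can be queried.
        return None

    # Make sure the built-in drivers are registered before the lookup.
    gdal.AllRegister()

    # "Parquet" is the short name of the OGR (Geo)Parquet driver;
    # GetDriverByName returns None when the driver is not available.
    return gdal.GetDriverByName("Parquet") is not None


if __name__ == "__main__":
    status = parquet_driver_available()
    if status is None:
        print("GDAL Python bindings are not installed")
    elif status:
        print("Parquet driver is available")
    else:
        print("Parquet driver is not available in this build")
```

A `False` result would typically mean that GDAL was built without libarrow/libparquet, which is the situation the Debian packaging discussion is about.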
