Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: Add option to export to DuckDB #94

Open
mwip opened this issue Apr 29, 2024 · 3 comments
Open

cli: Add option to export to DuckDB #94

mwip opened this issue Apr 29, 2024 · 3 comments

Comments

@mwip
Copy link

mwip commented Apr 29, 2024

I love your project enabling next-gen OSM analyses with the power of DuckDB. Wouldn't it be cool, to enable specifying a DuckDB file (and optionally table) to store the data in? I know DuckDB has a native OSM reader. But you abstracted to the convenience of Points, Lines and Polygons.

What I think would be a cool possible outcome is, e.g.

quackosm monaco-latest.osm.pbf --duckdb monaco_osm.duckdb 'osm'

instead of what's currently possible:

quackosm monaco-latest.osm.pbf
⚫◗ create table osm as
      select feature_id, tags, st_geomfromwkb(geometry) as geometry
      from read_parquet('files/monaco-latest_nofilter_noclip_compact.geoparquet');

If you think this is an interesting idea, I'd love to contribute. Maybe we could discuss possible entry points for where to start.

@RaczeQ
Copy link
Collaborator

RaczeQ commented Apr 29, 2024

Hello @mwip 👋🏻
I think that this might be an interesting idea. As you've mentioned, the current and probably the easiest option would be to just run a CREATE TABLE query after the whole process and automate it.

If I were to add such functionality, I'd add new functions internally next to the existing ones:

  • convert_pbf_to_gpq - takes downloaded pbf file and returns geoparquet (or parquet in wkt form)
  • convert_geometry_to_gpq - takes geometry, downloads pbf file(s) and returns geoparquet (or parquet in wkt form)
  • get_features_gdf - takes downloaded pbf file and returns geodataframe
  • get_features_gdf_from_geometry - takes geometry, downloads pbf file(s) and returns geodataframe

And new ones:

  • convert_pbf_to_duckdb - takes downloaded pbf file and saves it in a given duckdb file and table name
  • convert_geometry_to_duckdb - takes geometry, downloads pbf file(s) and saves it in a given duckdb file and table name

For this use case, the duckdb file could have a generated name, if not provided, just like with the geoparquet file and have the default table name like osm_features.

The easiest way to do this would be to copy get_features_gdf and get_features_gdf_from_geometry functions:

def get_features_gdf(

def get_features_gdf_from_geometry(

And just replace Geoparquet to GeoDataFrame transform with the CREATE TABLE query to save it in the DuckDB table (and maybe automatically remove the generated geoparquet file after successful table creation).

The direct streaming of data to DuckDB could be achieved but would require more sophisticated changes in the process (mainly because of the duplicated rows removal when multiple PBF files are parsed at once).

Things to keep in mind:

  • DuckDB file could not exist, so it should be created (will be created by Python API if not exist, but only final file, not the directories)
  • If file and table exist, the query shouldn't run (overridden with ignore_cache flag parameter)
  • The changes in the API would be required in 3 places: PbfFileReader, functions.py file and cli.py file.

For the CLI changes, I have few ideas:

  • Plain --duckdb flag - data saved in the file with a generated name in the default table name osm_features
quackosm monaco-latest.osm.pbf --duckdb
Finished operation in 0:00:00
files/monaco-latest_nofilter_noclip_compact.duckdb::osm_features
  • Passed result file with .duckdb or .db extension (--duckdb flag would be optional, inferred from the output file extension)
quackosm monaco-latest.osm.pbf --output monaco.duckdb
Finished operation in 0:00:00
monaco.duckdb::osm_features
  • Passed table name - saved in a given table name
quackosm monaco-latest.osm.pbf --output monaco.duckdb --output-table-name osm
Finished operation in 0:00:00
monaco.duckdb::osm

Some additional CLI errors should probably be thrown if there are illogical combinations of parameters (such as the output file name with the geoparquet extension and the passed --duckdb or --output-table-name flags.

You are welcome to try and implement it, I'll try to give you more hints and help with docs (examples) and tests if needed.

@mwip
Copy link
Author

mwip commented May 8, 2024

Thanks considering this and providing the relevant starting points. I am a little busy, but plan on working on this eventually and provide a PR. If anyone beat me to this, that is fine with me. 😄

@mwip
Copy link
Author

mwip commented May 20, 2024

I was finally able to start working on this and hope I'm able to provide a PR soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants