Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large OSM files require different approach to reading features #272

Open
brendan-ward opened this issue Aug 17, 2023 · 10 comments
Open

Large OSM files require different approach to reading features #272

brendan-ward opened this issue Aug 17, 2023 · 10 comments

Comments

@brendan-ward
Copy link
Member

#271 tries to count records directly from an OSM data source, which appears to work for small files, but produces inconsistent results for larger datasets (e.g., Alabama), which appears to be an artifact of layers being an artificial construct within OSM files rather than a meaningful structure within the dataset.

According to the GDAL docs, by default too many features may accumulate for a given layer while attempting to read a different layer, and produce an error. If you then enable interleaved reading, this avoids the error, but iterating over features may produce a count of 0 instead of the true amount.

From the example for interleaved reading in the GDAL docs, it is not immediately obvious how to alter our reading workflow. Perhaps we need to determine the index of the requested layer, and read all features of all previous layers first? ogr2ogr appears to do the right thing here, so we should look at that for reference.

@brendan-ward
Copy link
Member Author

Looks like the trick is to detect that the data source is OSM with interleaved reading, and then execute an SQL statement to set the specific layer of interest.

We can use OGR_DS_ExecuteSQL to do this after opening the data source.

@theroggy
Copy link
Member

theroggy commented Aug 17, 2023

I suppose use_arrow by default doesn't treat this any better, or does it?

-> based on my own test in the context of #269, use_arrow has the same problems except that it is 2 x as fast, most likely because it probably doesn't count the features first before fetching them...

@tfardet
Copy link

tfardet commented Aug 24, 2023

I can confirm that either #271 or some other recent commit solves the issue for me (main is working while 0.6.0 gave me empty dataframes).

As mentioned in #269, use_arrow=True is still necessary, though.

@brendan-ward
Copy link
Member Author

#271 has been merged into main (should have resolved this as part of that); it should no longer be necessary for use_arrow=True; per manual testing it has been verified to work for larger OSM files on MacOS and Windows.

If it isn't working yet for you, please share more details about what part isn't working.

@tfardet
Copy link

tfardet commented Aug 25, 2023

Ah right, the issue without arrow is associated to a DataLayerError: Could not iterate over features: Non closed ring detected. because OGR_GEOMETRY_ACCEPT_UNCLOSED_RING was set to "NO", so unrelated, sorry for the noise.
You can probably close this issue then :)

EDIT: on second thought, is it normal that the behavior differs with and without arrow? Should I open an issue about this?

@jorisvandenbossche
Copy link
Member

It might still be an issue to report to GDAL to also honor that setting when exporting to Arrow.

@theroggy
Copy link
Member

theroggy commented Aug 25, 2023

Ah right, the issue without arrow is associated to a DataLayerError: Could not iterate over features: Non closed ring detected. because OGR_GEOMETRY_ACCEPT_UNCLOSED_RING was set to "NO", so unrelated, sorry for the noise. You can probably close this issue then :)

EDIT: on second thought, is it normal that the behavior differs with and without arrow? Should I open an issue about this?

@tfardet I wonder... OGR_GEOMETRY_ACCEPT_UNCLOSED_RING is not supposed to give an error by default... it should only give a warning. Is it possible you set OGR_GEOMETRY_ACCEPT_UNCLOSED_RING="NO" somewhere? Maybe in a environment variable?

import pyogrio

print(
    "OGR_GEOMETRY_ACCEPT_UNCLOSED_RING: "
    f"{pyogrio.get_gdal_config_option('OGR_GEOMETRY_ACCEPT_UNCLOSED_RING')}"
)

Result:

OGR_GEOMETRY_ACCEPT_UNCLOSED_RING: None

@tfardet
Copy link

tfardet commented Aug 25, 2023

@tfardet I wonder... OGR_GEOMETRY_ACCEPT_UNCLOSED_RING is not supposed to give an error by default... it should only give a warning. Is it possible you set OGR_GEOMETRY_ACCEPT_UNCLOSED_RING="NO" somewhere? Maybe in a environment variable?

Yes, I set it manually to get rid of the warning about the non closed ring with Arrow, where it seems to simply discard the related geometry.
But it raises an error without Arrow.

@theroggy
Copy link
Member

theroggy commented Aug 25, 2023

Ok, I understand now.

The variable name implies that you specify if you want to Accept unclosed rings or not. So if you set it to "NO", this means you don't Accept them and hence an error is raised. If you want to get rid of the warning without error you should set it to "YES", not "NO".

The inconsistent behaviour/bug is in the arrow implementation: if you set OGR_GEOMETRY_ACCEPT_UNCLOSED_RING="NO" it should also give and error if consistency is wanted.

@theroggy
Copy link
Member

theroggy commented Sep 3, 2023

@tfardet do you open an issue in https://github.com/OSGeo/gdal/issues for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants