Multiple PBF input discouraged for the time being #3925

Open
nilsnolde opened this issue Jan 20, 2023 · 10 comments

@nilsnolde
Member

nilsnolde commented Jan 20, 2023

There is likely a bug in ingesting multiple PBFs. Until we've had the chance to look into it in detail, we'd encourage people to merge their PBFs before passing them in, ref #3908 (comment).

It's easy and fast using osmium:
osmium merge PBF1 PBF2 PBF3 -o merged.pbf

@nilsnolde nilsnolde pinned this issue Jan 20, 2023
@ImreSamu
Contributor

osmium merge PBF1 PBF2 PBF3 -o merged.pbf

Based on my experience:

Sometimes osmium merge can also cause hidden data errors!

1. Gaps

Check that the two areas (PBFs) connect accurately and that there is no gap between them.
If there is a gap, there may be connectivity problems in the road network (or the ferry network).

For Geofabrik extracts, you can check this using the extracts' .poly files.

Example:

Route finding through the ferry network is unlikely to be perfect here.
image

And the Geofabrik .poly files may also change as they are updated, so it doesn't hurt to check regularly!
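A rough way to automate the .poly check is sketched below. It is only an illustration under stated assumptions: the function names are hypothetical, the parser assumes the usual Osmosis-style .poly layout (a name line, ring sections closed by END, holes prefixed with "!"), and the geometry step assumes the third-party shapely package is installed. Extracts that are separated by sea will also be flagged, so a flagged result means "look at the map", not "the data is broken".

```python
from typing import List, Tuple

Ring = List[Tuple[float, float]]


def parse_poly(text: str) -> Tuple[List[Ring], List[Ring]]:
    """Parse .poly text into (outer_rings, hole_rings) of (lon, lat) tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    outers: List[Ring] = []
    holes: List[Ring] = []
    i = 1  # line 0 is the polygon name
    while i < len(lines) and lines[i] != "END":
        is_hole = lines[i].startswith("!")  # "!" marks a hole section
        i += 1
        ring: Ring = []
        while i < len(lines) and lines[i] != "END":
            lon, lat = lines[i].split()[:2]
            ring.append((float(lon), float(lat)))
            i += 1
        i += 1  # skip the END that closes this ring
        (holes if is_hole else outers).append(ring)
    return outers, holes


def merged_outline_needs_review(poly_texts: List[str]) -> bool:
    """True if the union of all outer rings is not one hole-free polygon."""
    # Assumption: shapely is available; imported lazily so the parser
    # above stays usable without it.
    from shapely.geometry import Polygon
    from shapely.ops import unary_union

    outer_polygons = []
    for text in poly_texts:
        outers, _declared_holes = parse_poly(text)
        # Declared holes are ignored on purpose, so any interior ring in
        # the union comes from a sliver between extracts.
        outer_polygons.extend(Polygon(ring) for ring in outers)
    merged = unary_union(outer_polygons)
    return merged.geom_type != "Polygon" or len(merged.interiors) > 0
```

Since the .poly files can change between updates, a check like this would have to be rerun whenever the extracts are downloaded again.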

2. Take care when merging -latest.osm.pbf extracts

In the Geofabrik "raw directory index" you will find files with the exact date version, and these are so much better!

Be careful:

  • "gcc-states-latest.osm.pbf" is not stable; it changes every day!
  • never merge 2 different date versions! ( pbf1-230110.osm.pbf and pbf2-230117.osm.pbf are not compatible! )

And even if they were downloaded at the same time, there is a tiny chance that the two extracts are not from the same OSM base, since we may download the data while the Geofabrik extracts are being updated.

If you are using -latest files, the osmosis_replication_timestamp should be the same!

$ osmium fileinfo andorra-latest.osm.pbf
File:
  Name: andorra-latest.osm.pbf
  Format: PBF
  Compression: none
  Size: 2426572
Header:
  Bounding boxes:
    (1.412368,42.4276,1.787481,42.65717)
  With history: no
  Options:
    generator=osmium/1.14.0
    osmosis_replication_base_url=http://download.geofabrik.de/europe/andorra-updates
    osmosis_replication_sequence_number=3581
    osmosis_replication_timestamp=2023-01-19T21:21:54Z                  <------------------------------  !!!!! 
    pbf_dense_nodes=true
    pbf_optional_feature_0=Sort.Type_then_ID
    sorting=Type_then_ID
    timestamp=2023-01-19T21:21:54Z
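The timestamp comparison can be scripted. The sketch below is one possible approach, not an official tool: it shells out to `osmium fileinfo <file>` and parses the plain-text output shown above. The function names are made up for illustration, and it assumes osmium is on the PATH.

```python
import subprocess
from typing import Dict, List, Optional

KEY = "osmosis_replication_timestamp"


def parse_replication_timestamp(fileinfo_text: str) -> Optional[str]:
    """Extract the replication timestamp from `osmium fileinfo` text output."""
    for line in fileinfo_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == KEY:
            parts = value.split()
            return parts[0] if parts else None
    return None


def replication_timestamps(paths: List[str]) -> Dict[str, Optional[str]]:
    """Run `osmium fileinfo` on each file and collect its timestamp."""
    result: Dict[str, Optional[str]] = {}
    for path in paths:
        completed = subprocess.run(
            ["osmium", "fileinfo", path],
            check=True, capture_output=True, text=True,
        )
        result[path] = parse_replication_timestamp(completed.stdout)
    return result


def same_osm_base(timestamps: Dict[str, Optional[str]]) -> bool:
    """True only if every file reports the same, non-missing timestamp."""
    values = set(timestamps.values())
    return len(values) == 1 and None not in values
```

A merge script could call `same_osm_base(replication_timestamps([...]))` first and refuse to run `osmium merge` when it returns False.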

Best practice for paranoids :

@nilsnolde
Member Author

Right, you can get unlucky with -latest.

Can the gaps in Geofabrik's geometries be fixed? Is it in a repo one can PR to? It'd be good if the community could help maintain that, I think. AFAIK they use the default complete-ways strategy, so I'd hope it's still very unlikely to miss ferry edges, which tend to be exactly as long as the whole ferry trip. Have you run into that before? Just curious.

@ImreSamu
Contributor

... that it'd miss ferry edges ...

The probability is small, but not zero.

The bigger danger is not looking at the map.
( ~ There may be extreme cases that escape the attention of users. )

For example, in a France + Great Britain merge,
many people forget about Jersey and Guernsey. 😄
It is possible to imagine a route that passes through these two islands, but as this is a very rare case, it is difficult to spot.

image

Can the gaps in Geofabrik's geometries be fixed?

Maybe.

I'm used to not merging in the first place, so I don't have to compare the fit and gaps between the map data in detail.

And just checking the country polygons is not enough; you also need to check the continent polygon.

@kevinkreiser
Member

Excellent explanations with pictures here, much appreciated! I feel like these are probably the best reasons to stop supporting it, just to save others from having to discover all the possible pitfalls: Geofabrik is where people typically get extracts, and they usually aren't considering what could go wrong when combining them. Apologies for the stubbornness on my part @nilsnolde @dnesbitt61

@nilsnolde
Member Author

Hm tbh @kevinkreiser, I think you were right, we should at least try to investigate. Actually, while I agree it’s a great summary of pitfalls, I don’t think any of that is relevant to allowing single or multiple files.

If we only allow one file, we’d force people to merge. That doesn’t eliminate the pitfalls, it rather masks them even more. I think the only thing that’s important for us (or me anyway) is the de-duplication of data. I don’t know the code there, but it seems troublesome, so my first thought was “why not remove it”. And deduplication is the only thing osmium would do for us. All the other problems arising from merging separate regions (via osmium or Valhalla) are still there.

So I guess my conclusion is that we have to become more sensitive to the issues @ImreSamu points out, and possibly have a paragraph in some doc about this. But it seems to me that, no matter what we do, users have to be just as aware of what they’re doing.

@nilsnolde
Member Author

nilsnolde commented Jan 21, 2023

@ImreSamu what would be great to have is a detailed tutorial about merging/working with OSM extracts. It’s not very specific to routing engines, so it could be general.

Are you aware of something like that? Would love to link it.

@ImreSamu
Contributor

@nilsnolde :

what would be great to have is like a detailed tutorial about merging/working with OSM extracts.

Agreed.

On the other hand, it is difficult to write a good tutorial.
For example, the more I know about the subject, the more I tend to be cautious.
( ~ "There are unknown unknowns" )

Maybe we have to add this:

My "paranoid" tutorial

image

image

Similar problem: for perfect routing in the USA, you also need Canada.

image

@kevinkreiser
Member

kevinkreiser commented Jan 21, 2023

When people use multiple PBFs, we should log a warning with a link to this issue 😄 That might be a pretty good way to raise awareness of the hell that is splicing OSM together.
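As a rough illustration of that idea (not Valhalla's actual code; the logger name, function name, and message wording are all made up), such a warning could look like this:

```python
import logging
from typing import Sequence

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("pbf_input")

ISSUE_REF = "issue #3925 (Multiple PBF input discouraged for the time being)"


def warn_if_multiple_pbfs(pbf_paths: Sequence[str]) -> bool:
    """Log a warning when more than one PBF is passed; return True if warned."""
    if len(pbf_paths) <= 1:
        return False
    logger.warning(
        "Got %d PBF inputs; combining extracts has known pitfalls, see %s",
        len(pbf_paths),
        ISSUE_REF,
    )
    return True
```

The warning does not block the build. It only points users at the pitfalls collected in this thread.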

@nilsnolde
Member Author

True that. A bit more formatting/cleanup and it’s kind of a tutorial:) thanks @ImreSamu

@kevinkreiser
Member

So apart from the general issues with combining extracts mentioned above, there are also technical issues involving duplicated data from multiple PBFs. As mentioned in #3908, the code is currently not robust to duplication, which normally just ends up adding extra edges to the graph. That isn't in itself the end of the world (nice pun). But we do limit the total number of edges that can originate from a node, and in some more extreme cases this ends up erroring out once they are duplicated by overlapping data.

The trick to being able to deduplicate the edges is recognizing when two sequences of waynodes in the waynodes file are the same edge. The problem is that we only know which way they belong to by their way_index, which is simply an index into the way attribute data, and that is not sorted (I don't think) by way id, so duplicates happen over there. Further annoying is that waynodes that are duplicates aren't near each other in their container. So we need to tie duplicates to each other in a simple way. We can't use a map or whatever because it's too huge, so we need to somehow recognize duplicates based on sorting. So how can we do that?

after this bit of code:

way_nodes.sort(

We have the waynodes sorted by OSM node id, which means we will have duplicates of the same node next to each other in the list. The other thing we have is the way_index, so we can tell that they are duplicates because they don't have the same way index. If they did have the same way index, they are still valid even if duplicated, because you can have ways coming back on themselves. Anyway, if we see adjacent waynodes with node matching way_index then we can simply remove them from the way nodes (probably just mark them as ignored rather than remove them, because that is expensive). Then, when we run over the full set of them to build the graph, we just skip them and no dups show up.
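One possible reading of this idea is sketched below in Python. It is not Valhalla's implementation. `WayNode`, `way_ids`, and `mark_duplicates` are hypothetical names, and the sketch assumes a lookup from way_index to OSM way id is available. Under that assumption, a waynode counts as a duplicate when an adjacent entry with the same node id refers to the same OSM way through a different way_index. Only a small per-node dictionary is kept, never a global map.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WayNode:
    osm_node_id: int
    way_index: int         # index into the way attribute data
    ignored: bool = False  # marked instead of erased, since erasing is expensive


def mark_duplicates(way_nodes: List[WayNode], way_ids: List[int]) -> int:
    """Sort by node id and mark waynodes duplicated via another way_index.

    way_ids[way_index] is assumed to give the OSM way id (hypothetical lookup).
    Returns the number of waynodes marked as ignored.
    """
    way_nodes.sort(key=lambda wn: wn.osm_node_id)  # stable sort
    marked = 0
    i = 0
    while i < len(way_nodes):
        node_id = way_nodes[i].osm_node_id
        kept_index_for_way: Dict[int, int] = {}  # lives for one node run only
        while i < len(way_nodes) and way_nodes[i].osm_node_id == node_id:
            wn = way_nodes[i]
            kept = kept_index_for_way.setdefault(way_ids[wn.way_index], wn.way_index)
            if kept != wn.way_index:
                # Same node, same OSM way, different way_index: a copy that
                # came in from an overlapping PBF.
                wn.ignored = True
                marked += 1
            i += 1
    return marked
```

Entries that share both node id and way_index are left alone, which keeps ways that come back on themselves intact. Graph building would then skip every waynode with `ignored` set.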
