Ensuring all nodes are handled before ways #1

missinglink opened this issue Mar 24, 2018 · 6 comments

@missinglink

Heya,

Is there any mechanism that would allow me to ensure that all calls to ReadNode have been completed before the first time ReadWay is called?

I would like to denormalize ways, so I need to ensure that all the nodes are in memory before processing the ways.

@thomersch
Owner

Hi, thanks for opening an issue!

I really can't know when node parsing is complete, as there is no guarantee that OSM PBF files are sorted. Single-threaded processing would ensure that all nodes are processed first, as long as the input file is sorted.

The only strategy that makes you completely sure all nodes have been processed is to read the file twice: parse only the nodes on the first pass, then reparse the file and ignore the nodes.
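The two-pass strategy described above can be sketched roughly as follows. This is a minimal illustration, not gosmparse code: `Element`, `scan` and `countMissingRefs` are hypothetical stand-ins, with `scan` playing the role of one full parse of the file. With the real library this would presumably mean two parses of the same file, each with its own handler.

```go
package main

import "fmt"

// Element is a hypothetical stand-in for one decoded OSM feature.
// Kind is "node" or "way"; Refs holds the node IDs of a way.
type Element struct {
	Kind string
	ID   int64
	Refs []int64
}

// scan is a hypothetical stand-in for one full parse of the file. It calls
// visit once per element, in whatever order the file stores them.
func scan(file []Element, visit func(Element)) {
	for _, e := range file {
		visit(e)
	}
}

// countMissingRefs makes two passes: the first collects node IDs and ignores
// everything else, the second looks only at ways and checks their references.
func countMissingRefs(file []Element) int {
	nodes := make(map[int64]struct{})
	scan(file, func(e Element) {
		if e.Kind == "node" {
			nodes[e.ID] = struct{}{}
		}
	})

	missing := 0
	scan(file, func(e Element) {
		if e.Kind != "way" {
			return
		}
		for _, ref := range e.Refs {
			if _, ok := nodes[ref]; !ok {
				missing++
			}
		}
	})
	return missing
}

func main() {
	// Deliberately unsorted: the way precedes the nodes it references.
	file := []Element{
		{Kind: "way", ID: 10, Refs: []int64{1, 2}},
		{Kind: "node", ID: 1},
		{Kind: "node", ID: 2},
	}
	fmt.Println("missing refs:", countMissingRefs(file))
}
```

Because the second pass starts only after the first has returned, the result does not depend on the order of elements in the file. The cost is reading and decompressing the file twice.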

@thomersch
Owner

One question: Have you encountered this problem with any files? If so, I would like to investigate if you can provide me with an example file.

BTW, in my own applications I process without any additional safeguards (https://github.com/thomersch/grandine/blob/master/cmd/spatialize/spatialize.go#L44) and haven't had any issues so far.

@missinglink
Author

missinglink commented Mar 27, 2018

Yes, unfortunately I have run into it a few times. I have a library based on this parser, https://github.com/missinglink/pbf, which has a bunch of different commands available.

In that repo I also link to my fork of the parser, where I added some extra features such as a PBF indexer which can be used for random access on the PBF file. It's pretty neat but not really fast enough for production use.

I also added another feature called 'breakpoints', which was my attempt at knowing when the nodes are complete so I can start on the ways. It's a difficult thing to write, because it's not possible to know the contents of a block until after it's been decompressed (which is done in parallel).

I also noticed that some files have blocks which contain nodes, ways and relations together (more common in Geofabrik extracts), while other files have blocks which only ever contain one type (as in the ex-Mapzen metro extracts).

Doing multiple passes over the file is a good workaround, but it's not very convenient for the planet file, so I was hoping to find a solution that would keep ReadWay from being triggered until the last call to ReadNode had returned.

I'll write up an example and post it below.

@missinglink
Author

Something like this:

package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"sync"

	"github.com/thomersch/gosmparse"
)

type handler struct {
	nodes map[int64]gosmparse.Node
	mutex *sync.Mutex
}

func (d *handler) ReadNode(n gosmparse.Node) {
	d.mutex.Lock()
	d.nodes[n.ID] = n
	d.mutex.Unlock()
}

func (d *handler) ReadWay(w gosmparse.Way) {
	for _, ref := range w.NodeIDs {
		if _, ok := d.nodes[ref]; !ok {
			fmt.Println("could not find node", ref)
		}
	}
}

func (d *handler) ReadRelation(r gosmparse.Relation) {
	/* no-op */
}

func main() {
	source := flag.String("in", "osm.pbf", "")
	flag.Parse()

	f, err := os.Open(*source)
	if err != nil {
		log.Fatal(err)
	}
	dec := gosmparse.NewDecoder(f)

	dh := handler{
		nodes: make(map[int64]gosmparse.Node),
		mutex: &sync.Mutex{},
	}

	err = dec.Parse(&dh)
	if err != nil {
		log.Fatal(err)
	}
}

$ go run example.go --in /media/flash/berlin.osm.pbf
fatal error: concurrent map read and map write

goroutine 20 [running]:
runtime.throw(0x544d78, 0x21)
	/usr/local/go/src/runtime/panic.go:619 +0x81 fp=0xc42502bda8 sp=0xc42502bd88 pc=0x428291
runtime.mapaccess2_fast64(0x519220, 0xc420082450, 0x7632e, 0xc42502be68, 0x92546cce)
....

If I fix the map access error with a mutex, then the map may or may not contain all the nodes I need, depending on the size of the extract (smaller extracts are more prone to this).
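For reference, the mutex fix mentioned here could look roughly like the sketch below. It is an assumption-laden illustration rather than the author's actual change: `Node` and `Way` are hypothetical local stand-ins for the gosmparse types, and `ReadWay` returns the missing IDs only so the behaviour can be checked (the real handler method returns nothing). A `sync.RWMutex` lets several ways be checked at the same time while node writes stay exclusive.

```go
package main

import (
	"fmt"
	"sync"
)

// Node and Way are hypothetical stand-ins for gosmparse.Node and
// gosmparse.Way, reduced to the fields used in the example above.
type Node struct{ ID int64 }

type Way struct {
	ID      int64
	NodeIDs []int64
}

type handler struct {
	mu    sync.RWMutex
	nodes map[int64]Node
}

// ReadNode stores the node under the write lock.
func (h *handler) ReadNode(n Node) {
	h.mu.Lock()
	h.nodes[n.ID] = n
	h.mu.Unlock()
}

// ReadWay takes the read lock, so it never races with ReadNode. It returns
// the referenced node IDs that are not in the map at the time of the call.
func (h *handler) ReadWay(w Way) []int64 {
	h.mu.RLock()
	defer h.mu.RUnlock()
	var missing []int64
	for _, ref := range w.NodeIDs {
		if _, ok := h.nodes[ref]; !ok {
			missing = append(missing, ref)
		}
	}
	return missing
}

func main() {
	h := &handler{nodes: make(map[int64]Node)}
	var wg sync.WaitGroup
	// Concurrent writers and readers, as a parallel decoder would produce.
	for i := int64(1); i <= 100; i++ {
		wg.Add(2)
		go func(id int64) {
			defer wg.Done()
			h.ReadNode(Node{ID: id})
		}(i)
		go func(id int64) {
			defer wg.Done()
			h.ReadWay(Way{ID: id, NodeIDs: []int64{id}})
		}(i)
	}
	wg.Wait()
	fmt.Println("missing after all nodes are stored:",
		h.ReadWay(Way{ID: 1, NodeIDs: []int64{1, 2, 101}}))
}
```

The lock removes the "concurrent map read and map write" crash, but it does nothing about ordering: a way that is handled before its nodes have been stored will still report them as missing.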

@missinglink
Author

I really like this library and I switched to it from another one. Unfortunately, having to do multiple passes over the file negates its speed benefits over the others.

Do you have any ideas how I might add an option which prevents the ways/rels from being processed until all their dependencies are finished?
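One possible single-pass approach, not proposed anywhere in this thread and shown only as a sketch, is to buffer ways in the handler and resolve them after parsing has returned. `Node`, `Way`, `bufferingHandler` and `Resolve` are hypothetical names, and the types are local stand-ins for the gosmparse ones.

```go
package main

import (
	"fmt"
	"sync"
)

// Node and Way are hypothetical stand-ins for the gosmparse types.
type Node struct{ ID int64 }

type Way struct {
	ID      int64
	NodeIDs []int64
}

// bufferingHandler stores nodes and ways as they arrive and defers all way
// processing until the caller knows that parsing has finished.
type bufferingHandler struct {
	mu    sync.Mutex
	nodes map[int64]Node
	ways  []Way
}

func (h *bufferingHandler) ReadNode(n Node) {
	h.mu.Lock()
	h.nodes[n.ID] = n
	h.mu.Unlock()
}

// ReadWay only queues the way; it does not look at the node map yet.
func (h *bufferingHandler) ReadWay(w Way) {
	h.mu.Lock()
	h.ways = append(h.ways, w)
	h.mu.Unlock()
}

// Resolve must be called only after parsing has returned. It counts how many
// node references of the buffered ways can and cannot be found.
func (h *bufferingHandler) Resolve() (resolved, missing int) {
	for _, w := range h.ways {
		for _, ref := range w.NodeIDs {
			if _, ok := h.nodes[ref]; ok {
				resolved++
			} else {
				missing++
			}
		}
	}
	return resolved, missing
}

func main() {
	h := &bufferingHandler{nodes: make(map[int64]Node)}
	var wg sync.WaitGroup

	// Ways are delivered concurrently with nodes, in no particular order.
	for _, w := range []Way{{ID: 10, NodeIDs: []int64{1, 2}}, {ID: 11, NodeIDs: []int64{2, 3}}} {
		wg.Add(1)
		go func(w Way) { defer wg.Done(); h.ReadWay(w) }(w)
	}
	for id := int64(1); id <= 3; id++ {
		wg.Add(1)
		go func(id int64) { defer wg.Done(); h.ReadNode(Node{ID: id}) }(id)
	}
	wg.Wait() // stands in for the parse call returning

	resolved, missing := h.Resolve()
	fmt.Println("resolved:", resolved, "missing:", missing)
}
```

This avoids the second pass but trades it for memory: every way is held until the end of the file, which may be impractical for a planet-sized input.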

@thomersch
Owner

Sorry, I totally forgot this issue existed.

Unfortunately it is kind of hard to resolve this issue because of the aforementioned lack of guarantees. Collecting the blocks from a file is single-threaded in gosmparse, but the number of processing workers depends on GOMAXPROCS. I have now added the possibility to configure this independently in 81c340c. If you wish, you can set decoder.Workers to one, which ensures that features are returned strictly in file order.
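To illustrate why a single worker preserves file order, here is a small stand-alone model of the reader/worker layout described above. It is not gosmparse's implementation: `decodeAll` and the block names are hypothetical, and the "decoding" is just passing a string through.

```go
package main

import (
	"fmt"
	"sync"
)

// decodeAll is a hypothetical model of a decoder: the caller's goroutine
// hands out blocks in file order, and `workers` goroutines emit each block
// as soon as they have finished with it.
func decodeAll(blocks []string, workers int) []string {
	if workers < 1 {
		workers = 1 // avoid blocking forever with no consumers
	}
	queue := make(chan string)
	var (
		mu  sync.Mutex
		out []string
		wg  sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range queue {
				mu.Lock()
				out = append(out, b)
				mu.Unlock()
			}
		}()
	}
	for _, b := range blocks {
		queue <- b
	}
	close(queue)
	wg.Wait()
	return out
}

func main() {
	blocks := []string{"nodes-1", "nodes-2", "ways-1", "relations-1"}
	fmt.Println(decodeAll(blocks, 1)) // one worker: emitted in file order
	fmt.Println(decodeAll(blocks, 4)) // several workers: order may vary between runs
}
```

With one worker, each block is fully handled before the next one is taken from the queue, so output order matches input order. That only helps if the file itself stores all nodes before the ways that reference them.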
