Used generator to parse/stream *.osm.pbf files for increased memory efficiency #3
Apologies for the slow response; I have been busy with my own work.
Yes, you're right that the original parse_osm_pbf() function takes ages to process large .pbf files. I think using yield and generators is an excellent idea and can be useful in various cases.
I tested your function stream_osm_pbf() on my machine and found a problem: it yields data only when layer_name='points'. Could you have a look at the test_stream_osm_pbf() function I added to the docstring? Below are the results I got:
>>> test_stream_osm_pbf()
First row of 'lines':
First row of 'multilinestrings':
First row of 'points':
{'type': 'Feature', 'geometry_type': 'Point', 'geometry_coordinates_0': -0.5134241, 'geometry_coordinates_1': 52.6555853, 'properties_osm_id': '488432', 'properties_name': None, 'properties_barrier': None, 'properties_highway': None, 'properties_ref': None, 'properties_address': None, 'properties_is_in': None, 'properties_place': None, 'properties_man_made': None, 'properties_other_tags': '"odbl"=>"clean"', 'id': 488432}
First row of 'multipolygons':
First row of 'other_relations':
However, in the same working environment, the function parse_osm_pbf() still works fine.
Could you please have a look at the comments I added to each part of the changes? Let's find out whether that problem exists on your machine too.
@@ -12,10 +12,12 @@
import ogr
import pandas as pd
import rapidjson
import json
I suppose rapidjson should do the work that json does here. Could you please remove import json?
import shapefile
import shapely.geometry
from pyhelpers.dir import cd, regulate_input_data_dir
from pyhelpers.store import load_pickle
from flatten_json import flatten
To conform with the import style used here, i.e. import module.name (except for "pyhelpers", which is used quite frequently), could you please change this line to import flatten_json?
@@ -23,3 +23,4 @@ SQLAlchemy-Utils>=0.34.1
tqdm>=4.32.2
xlrd>=1.2.0
XlsxWriter>=1.1.8
flatten_json>=0.1.7
I believe "flatten-json" (rather than "flatten_json") is the formal name on PyPI. Could we use "flatten-json>=0.1.7" instead?
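On the naming question: as far as I know, both spellings should resolve to the same project for pip, because the name normalization rule in PEP 503 treats runs of "-", "_" and "." as equivalent and ignores case. A minimal sketch of that rule follows; the function name normalize_project_name is only illustrative.

```python
import re


def normalize_project_name(name: str) -> str:
    """Normalize a project name using the rule given in PEP 503."""
    # Runs of '-', '_' and '.' collapse to a single '-', and case is ignored.
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize_project_name("flatten_json"))
print(normalize_project_name("flatten-json"))
```

Both calls give the same normalized name, so the hyphenated spelling is mainly a matter of matching the name the reviewer cites from PyPI.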
'''
Generator that returns an iterator that can be used to stream parsed OSM data.

:type path_to_osm_pbf: str
:param path_to_osm_pbf: Path to the *.osm.pbf file you wish to parse and stream

:type layer_data: str
:param layer_data: The layer of the *.osm.pbf file that you wish to parse and stream.
Options include: points, lines, multilinestrings, multipolygons, & other_relations

:rtype: Iterator
:return: An iterator object that can be used to stream the parsed data from the corresponding layer in the *.osm.pbf file

-------
Example
-------

stream_data = stream_osm_pbf('/algeria.osm.pbf', 'lines')

for row in stream_data:
    print(row)
'''
"""
:param path_to_osm_pbf: [str] path to the *.osm.pbf file you wish to parse and stream
:param layer_name: [str] 'points', 'lines', 'multilinestrings', 'multipolygons', & 'other_relations'
:return: [generator] allowing for streaming the *.osm.pbf data of the given layer
Testing e.g.
import pydriosm as dri
subregion_name = 'Rutland'
dri.download_subregion_osm_file(subregion_name, osm_file_format='.osm.pbf', update=False)
_, path_to_osm_pbf = dri.get_default_path_to_osm_file(subregion_name, osm_file_format='.osm.pbf')
def test_stream_osm_pbf():
    for lyr_name in ['lines', 'multilinestrings', 'points', 'multipolygons', 'other_relations']:
        stream_data = stream_osm_pbf(path_to_osm_pbf, lyr_name)
        print("First row of '{}': ".format(lyr_name))
        for row in stream_data:
            print(row)
            break
>>> test_stream_osm_pbf()
First row of 'lines':
First row of 'multilinestrings':
First row of 'points':
{'type': 'Feature', 'geometry_type': 'Point', 'geometry_coordinates_0': -0.5134241, 'geometry_coordinates_1': 52.6555853, 'properties_osm_id': '488432', 'properties_name': None, 'properties_barrier': None, 'properties_highway': None, 'properties_ref': None, 'properties_address': None, 'properties_is_in': None, 'properties_place': None, 'properties_man_made': None, 'properties_other_tags': '"odbl"=>"clean"', 'id': 488432}
First row of 'multipolygons':
First row of 'other_relations':
"""
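The test loop above prints nothing when a layer yields no rows, which makes an empty stream easy to overlook in the output. A small sketch that makes the empty case explicit is given here, assuming only that the stream is an ordinary Python iterator; first_row and fake_stream are hypothetical names, not part of pydriosm.

```python
from typing import Any, Iterable, Optional


def first_row(stream: Iterable[Any]) -> Optional[Any]:
    """Return the first item of an iterable, or None if it yields nothing."""
    return next(iter(stream), None)


def fake_stream(rows):
    """Hypothetical stand-in for stream_osm_pbf(); yields the given rows lazily."""
    for row in rows:
        yield row


for lyr_name, rows in [("lines", []), ("points", [{"id": 488432}])]:
    row = first_row(fake_stream(rows))
    status = "<no rows yielded>" if row is None else row
    print("First row of '{}': {}".format(lyr_name, status))
```

Because only the first item is requested, the rest of the stream is never consumed, which keeps the check cheap even for large layers.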
raw_data = ogr.Open(path_to_osm_pbf)
layer = raw_data.GetLayer(layer_name)
for feature in layer:
    yield flatten(json.loads(feat.ExportToJson()))
I would suggest changing some variable names:
raw_osm_pbf = ogr.Open(path_to_osm_pbf)
layer = raw_osm_pbf.GetLayer(layer_name)
for feat in layer:
    yield flatten_json.flatten(rapidjson.loads(feat.ExportToJson()))
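For readers unfamiliar with the flattening step, the sketch below reproduces the kind of key naming visible in the 'points' row shown earlier (for example geometry_coordinates_0). It is a hand-written stand-in and not flatten_json's implementation, and it uses the standard-library json module only to stay self-contained; flatten_nested and the sample feature string are assumptions for illustration.

```python
import json


def flatten_nested(obj, parent_key="", sep="_"):
    """Flatten nested dicts and lists into one dict with joined keys.

    Note: in this sketch, empty dicts and lists simply produce no keys.
    """
    if isinstance(obj, dict):
        children = obj.items()
    elif isinstance(obj, list):
        children = enumerate(obj)
    else:
        # Scalar value: store it under the key built so far.
        return {parent_key: obj}
    items = {}
    for key, value in children:
        new_key = "{}{}{}".format(parent_key, sep, key) if parent_key else str(key)
        items.update(flatten_nested(value, new_key, sep))
    return items


# Hypothetical GeoJSON-like feature, shaped like a point feature.
feature_json = (
    '{"type": "Feature", '
    '"geometry": {"type": "Point", "coordinates": [-0.5134241, 52.6555853]}, '
    '"properties": {"osm_id": "488432", "name": null}, '
    '"id": 488432}'
)
row = flatten_nested(json.loads(feature_json))
print(row)
```

Each nested key is joined to its parent with an underscore and list positions become numeric suffixes, which is why every row comes out as a single-level dict that maps directly onto table columns.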
I apologize for the delay as well; I will get to this as soon as I can. You're right, it only seems to be parsing the
Hi @manesioz, you might have noticed that I have made some changes to the package and here is an updated version, but I didn't change anything in the functions relevant to this pull request. I just wanted to know if you're still interested in working on this pull request. If so, I'd be happy to work together and help if needed. Otherwise, I might want to close this pull request and move on. Feel free to let me know what you think.
Hey Mike, I am still interested; I have just been busy lately, so I apologize for the delay. I could not easily identify why only the
I also found it very strange. Probably there was something wrong with the GDAL/OGR installation and/or settings. I couldn't figure out where exactly the problem is.
Yes, possibly. It's harder to debug GDAL/OGR since I'm not familiar with how it works under the hood. If you want, you can close this for the time being, since I currently don't have much time to work on it. Thanks for your work, it's a great package.
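One possible starting point for narrowing this down is to ask the datasource itself which layers it exposes and whether each one returns a first feature. The sketch below is only a diagnostic idea under that assumption and may not reveal the cause; summarize_layers is a hypothetical helper that takes an already opened OGR datasource and relies on GetLayerCount(), GetLayerByIndex(), GetName(), ResetReading() and GetNextFeature().

```python
def summarize_layers(datasource):
    """Return (layer name, has_first_feature) pairs for an OGR datasource."""
    summary = []
    for index in range(datasource.GetLayerCount()):
        layer = datasource.GetLayerByIndex(index)
        layer.ResetReading()  # start from the beginning of the layer
        feature = layer.GetNextFeature()  # None when nothing is returned
        summary.append((layer.GetName(), feature is not None))
    return summary


# Usage, assuming GDAL's Python bindings are installed:
# from osgeo import ogr
# datasource = ogr.Open(path_to_osm_pbf)  # ogr.Open returns None on failure
# if datasource is not None:
#     print(summarize_layers(datasource))
```

Comparing this summary with what stream_osm_pbf() yields for each layer would at least show whether the empty results come from the datasource or from the generator code.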
fixes #2
- Added stream_osm_pbf() generator that returns an iterator with parsed data
- Updated requirements.txt accordingly
- Updated README.md accordingly

Note: This function streams layer-specific data. You may often want to stream data to a database, and each layer has a different schema, so for consistency's sake I made it layer-specific.
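To illustrate the database use case mentioned in the note, here is a minimal sketch of loading one layer's stream into its own table in batches, so that only a small number of rows is held in memory at a time. It uses an in-memory SQLite database and a fake row generator; batched, load_layer, fake_points and the column list are all hypothetical and not part of pydriosm.

```python
import itertools
import sqlite3


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def load_layer(conn, table, columns, rows, batch_size=1000):
    """Insert flattened rows into a per-layer table, one batch at a time.

    Table and column names are interpolated into the SQL, so they must
    come from trusted code, never from user input.
    """
    col_sql = ", ".join('"{}"'.format(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, col_sql))
    insert_sql = 'INSERT INTO "{}" ({}) VALUES ({})'.format(
        table, col_sql, placeholders
    )
    total = 0
    for batch in batched(rows, batch_size):
        # Missing keys become NULL so every row fits the layer's schema.
        conn.executemany(
            insert_sql, [tuple(row.get(c) for c in columns) for row in batch]
        )
        total += len(batch)
    conn.commit()
    return total


def fake_points():
    """Hypothetical stand-in for stream_osm_pbf(path_to_osm_pbf, 'points')."""
    for i in range(5):
        yield {"id": i, "geometry_type": "Point", "properties_name": None}


conn = sqlite3.connect(":memory:")
columns = ["id", "geometry_type", "properties_name"]
inserted = load_layer(conn, "points", columns, fake_points(), batch_size=2)
print(inserted)
```

Keeping one table per layer follows the reasoning in the note: each layer has its own set of flattened keys, so a fixed column list per layer keeps the inserts consistent.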