Used generator to parse/stream *.osm.pbf files for increased memory efficiency #3

Closed
manesioz wants to merge 2 commits

Conversation

@manesioz manesioz commented Oct 12, 2019

fixes #2

  • Added a stream_osm_pbf() generator that returns an iterator over the parsed data
  • Updated requirements.txt accordingly
  • Updated README.md accordingly

Note: This function streams layer-specific data. The reason is that you will often want to stream data into a database, and each layer has a different schema, so for consistency's sake I made it layer-specific.
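
To sketch that use case (the connection string, chunk size, and table name below are placeholders, and stream_osm_pbf() is the generator added in this PR):

import pandas as pd
import sqlalchemy

# Stream one layer into a database table in fixed-size chunks, so the whole
# *.osm.pbf file never has to be held in memory at once.
engine = sqlalchemy.create_engine('postgresql://user:password@localhost/osm')  # placeholder DSN

chunk = []
for row in stream_osm_pbf('/algeria.osm.pbf', 'points'):
    chunk.append(row)
    if len(chunk) >= 10000:
        pd.DataFrame(chunk).to_sql('osm_points', engine, if_exists='append', index=False)
        chunk = []
if chunk:  # write any remaining rows
    pd.DataFrame(chunk).to_sql('osm_points', engine, if_exists='append', index=False)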

@mikeqfu mikeqfu left a comment

Apologies for the slow response; I have been too busy with my own work.

Yes, you're right that the original parse_osm_pbf() function takes ages to process large .pbf files. I think using yield and generators is an excellent idea and can be useful in various cases.

I tested your function stream_osm_pbf() on my machine and found a problem with it: it works only when layer_name='points'. Could you have a look at the test_stream_osm_pbf() function I added to the docstring? Below are the results I got:

>>> test_stream_osm_pbf()
First row of 'lines': 
First row of 'multilinestrings': 
First row of 'points': 
{'type': 'Feature', 'geometry_type': 'Point', 'geometry_coordinates_0': -0.5134241, 'geometry_coordinates_1': 52.6555853, 'properties_osm_id': '488432', 'properties_name': None, 'properties_barrier': None, 'properties_highway': None, 'properties_ref': None, 'properties_address': None, 'properties_is_in': None, 'properties_place': None, 'properties_man_made': None, 'properties_other_tags': '"odbl"=>"clean"', 'id': 488432}
First row of 'multipolygons': 
First row of 'other_relations': 

However, in the same working environment, the original parse_osm_pbf() function still works fine.

Could you have a look at the comments I added to each part of the changes? Let's see if we can track the problem down.

@@ -12,10 +12,12 @@
import ogr
import pandas as pd
import rapidjson
import json
Owner

I suppose rapidjson should do the work that json does here. Can you please remove import json here?
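
A minimal sketch of the substitution (rapidjson here is the python-rapidjson package already imported above; it provides the same loads()/dumps() interface as the standard json module):

import rapidjson

# rapidjson.loads parses a JSON string into a dict, just like json.loads,
# so the extra `import json` is redundant.
feature_json = '{"type": "Feature", "id": 488432}'
feature = rapidjson.loads(feature_json)
print(feature['id'])  # 488432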

import shapefile
import shapely.geometry
from pyhelpers.dir import cd, regulate_input_data_dir
from pyhelpers.store import load_pickle
from flatten_json import flatten
Owner

To conform with the import style, i.e. import module.name (except "pyhelpers", which is used quite frequently), could you please change this line to import flatten_json?
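
For instance, a minimal sketch of what the call site would then look like (the nested dict is just an illustrative fragment of a parsed feature):

import flatten_json

# With the whole module imported, the call uses the qualified name:
nested = {'geometry': {'type': 'Point', 'coordinates': [-0.5134241, 52.6555853]}}
print(flatten_json.flatten(nested))
# {'geometry_type': 'Point', 'geometry_coordinates_0': -0.5134241, 'geometry_coordinates_1': 52.6555853}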

@@ -23,3 +23,4 @@ SQLAlchemy-Utils>=0.34.1
tqdm>=4.32.2
xlrd>=1.2.0
XlsxWriter>=1.1.8
flatten_json>=0.1.7
Owner

I guess "flatten-json" (rather than "flatten_json") is the formal name in the PyPI repository. Could you use "flatten-json>=0.1.7" instead?

Comment on lines +560 to +581
'''
Generator that returns an iterator that can be used to stream parsed OSM data.

:type path_to_osm_pbf: str
:param path_to_osm_pbf: Path to the *.osm.pbf file you wish to parse and stream

:type layer_name: str
:param layer_name: The layer of the *.osm.pbf file that you wish to parse and stream.
Options include: points, lines, multilinestrings, multipolygons, & other_relations

:rtype: Iterator
:return: An iterator object that can be used to stream the parsed data from the corresponding layer in the *.osm.pbf file

-------
Example
-------

stream_data = stream_osm_pbf('/algeria.osm.pbf', 'lines')

for row in stream_data:
    print(row)
'''
Owner

"""
:param path_to_osm_pbf: [str] path to the *.osm.pbf file you wish to parse and stream
:param layer_name: [str] 'points', 'lines', 'multilinestrings', 'multipolygons', & 'other_relations'
:return: [generator] allowing for streaming the *.osm.pbf data of the given layer

Testing e.g.
    import pydriosm as dri
    subregion_name = 'Rutland'
    dri.download_subregion_osm_file(subregion_name, osm_file_format='.osm.pbf', update=False)
    _, path_to_osm_pbf = dri.get_default_path_to_osm_file(subregion_name, osm_file_format='.osm.pbf')

    def test_stream_osm_pbf():
        for lyr_name in ['lines', 'multilinestrings', 'points', 'multipolygons', 'other_relations']:
            stream_data = stream_osm_pbf(path_to_osm_pbf, lyr_name)
            print("First row of '{}': ".format(lyr_name))
            for row in stream_data:
                print(row)
                break


>>> test_stream_osm_pbf()
First row of 'lines': 
First row of 'multilinestrings': 
First row of 'points': 
{'type': 'Feature', 'geometry_type': 'Point', 'geometry_coordinates_0': -0.5134241, 'geometry_coordinates_1': 52.6555853, 'properties_osm_id': '488432', 'properties_name': None, 'properties_barrier': None, 'properties_highway': None, 'properties_ref': None, 'properties_address': None, 'properties_is_in': None, 'properties_place': None, 'properties_man_made': None, 'properties_other_tags': '"odbl"=>"clean"', 'id': 488432}
First row of 'multipolygons': 
First row of 'other_relations': 
"""

Comment on lines +583 to +586
raw_data = ogr.Open(path_to_osm_pbf)
layer = raw_data.GetLayer(layer_name)
for feature in layer:
    yield flatten(json.loads(feature.ExportToJson()))
Owner

I would suggest changing some of the variable names, for example:

raw_osm_pbf = ogr.Open(path_to_osm_pbf)
layer = raw_osm_pbf.GetLayer(layer_name)
for feat in layer:
    yield flatten_json.flatten(rapidjson.loads(feat.ExportToJson()))

@manesioz
Author

I apologize for the delay as well; I will get to this as soon as I can. You're right, it only seems to be parsing the points layer, which is odd. I will look into that before making the other style adjustments. Thank you for your feedback!

mikeqfu commented Dec 3, 2019

> I apologize for the delay as well; I will get to this as soon as I can. You're right, it only seems to be parsing the points layer, which is odd. I will look into that before making the other style adjustments. Thank you for your feedback!

Hi @manesioz, you might have noticed that I have made some changes to the package and here is an updated version, but I didn't change anything in the functions relevant to this pull request. I just wanted to know whether you're still interested in working on it. If so, I'd be happy to work together and help where needed; otherwise, I might close this pull request and move on. Feel free to let me know what you think.

manesioz commented Dec 4, 2019

Hey Mike, I am still interested; I have just been busy lately, so I apologize for the delay. I could not easily identify why only the points layer was parsed. Do you have any ideas?

mikeqfu commented Dec 5, 2019

> Hey Mike, I am still interested; I have just been busy lately, so I apologize for the delay. I could not easily identify why only the points layer was parsed. Do you have any ideas?

I also found it very strange. There was probably something wrong with the GDAL/OGR installation and/or settings, but I couldn't figure out exactly where the problem lies.
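
If it helps, here is a minimal diagnostic sketch (the file path is a placeholder) that lists the layers and feature counts OGR itself reports in a given environment; if only 'points' reports any features, the problem would seem to lie in the local GDAL/OGR build or its OSM driver rather than in stream_osm_pbf() itself:

import ogr  # or `from osgeo import ogr`, depending on the GDAL installation

osm_pbf = ogr.Open('rutland-latest.osm.pbf')  # placeholder path
for i in range(osm_pbf.GetLayerCount()):
    lyr = osm_pbf.GetLayerByIndex(i)
    # Note: GetFeatureCount() may have to scan the whole file for the OSM driver
    print(lyr.GetName(), lyr.GetFeatureCount())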

manesioz commented Dec 6, 2019

Yes, possibly. It's harder to debug GDAL/OGR since I'm not familiar with how it works under the hood. If you want, you can close this for the time being, since I currently don't have much time to work on it. Thanks for your work; it's a great package.

@manesioz manesioz closed this Dec 21, 2019

Successfully merging this pull request may close these issues.

Make generator to parse data for performance improvement with large files
2 participants