Used generator to parse/stream *.osm.pbf files for increased memory efficiency #3

Closed
manesioz wants to merge 2 commits

Conversation

@manesioz manesioz commented Oct 12, 2019

fixes #2

  • Added a stream_osm_pbf() generator that returns an iterator over the parsed data
  • Updated requirements.txt accordingly
  • Updated README.md accordingly

Note: This function streams layer-specific data. The reason is that you will often want to stream data into a database, and each layer has a different schema, so for consistency's sake I made it layer-specific.
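
To sketch that use case (the connection string, chunk size, and table name below are placeholders, and stream_osm_pbf() is the generator added in this PR):

import pandas as pd
import sqlalchemy

# Stream one layer into a database table in fixed-size chunks, so the whole
# *.osm.pbf file never has to be held in memory at once.
engine = sqlalchemy.create_engine('postgresql://user:password@localhost/osm')  # placeholder DSN

chunk = []
for row in stream_osm_pbf('/algeria.osm.pbf', 'points'):
    chunk.append(row)
    if len(chunk) >= 10000:
        pd.DataFrame(chunk).to_sql('osm_points', engine, if_exists='append', index=False)
        chunk = []
if chunk:  # write any remaining rows
    pd.DataFrame(chunk).to_sql('osm_points', engine, if_exists='append', index=False)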

@mikeqfu mikeqfu left a comment

Apologies for the slow response; I have been too busy with my own work.

Yes, you're right that the original parse_osm_pbf() function takes ages to process large .pbf files. I think using yield and generators is an excellent idea and can be useful in various cases.

I tested your function stream_osm_pbf() on my machine and found a problem with it: it works only when layer_name='points'. Could you have a look at the test_stream_osm_pbf() function I added to the docstring? Below are the results I got:

>>> test_stream_osm_pbf()
First row of 'lines': 
First row of 'multilinestrings': 
First row of 'points': 
{'type': 'Feature', 'geometry_type': 'Point', 'geometry_coordinates_0': -0.5134241, 'geometry_coordinates_1': 52.6555853, 'properties_osm_id': '488432', 'properties_name': None, 'properties_barrier': None, 'properties_highway': None, 'properties_ref': None, 'properties_address': None, 'properties_is_in': None, 'properties_place': None, 'properties_man_made': None, 'properties_other_tags': '"odbl"=>"clean"', 'id': 488432}
First row of 'multipolygons': 
First row of 'other_relations': 

However, in the same working environment, the original parse_osm_pbf() function still works fine.

Could you have a look at the comments I added to each part of the changes? Let's see if we can track the problem down.

@@ -12,10 +12,12 @@
import ogr
import pandas as pd
import rapidjson
import json
Owner

I suppose rapidjson should do the work that json does here. Can you please remove import json here?
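
A minimal sketch of the substitution (rapidjson here is the python-rapidjson package already imported above; it provides the same loads()/dumps() interface as the standard json module):

import rapidjson

# rapidjson.loads parses a JSON string into a dict, just like json.loads,
# so the extra `import json` is redundant.
feature_json = '{"type": "Feature", "id": 488432}'
feature = rapidjson.loads(feature_json)
print(feature['id'])  # 488432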

import shapefile
import shapely.geometry
from pyhelpers.dir import cd, regulate_input_data_dir
from pyhelpers.store import load_pickle
from flatten_json import flatten
Owner

To conform with the import style, i.e. import module.name (except "pyhelpers", which is used quite frequently), could you please change this line to import flatten_json?
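
For instance, a minimal sketch of what the call site would then look like (the nested dict is just an illustrative fragment of a parsed feature):

import flatten_json

# With the whole module imported, the call uses the qualified name:
nested = {'geometry': {'type': 'Point', 'coordinates': [-0.5134241, 52.6555853]}}
print(flatten_json.flatten(nested))
# {'geometry_type': 'Point', 'geometry_coordinates_0': -0.5134241, 'geometry_coordinates_1': 52.6555853}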

@@ -23,3 +23,4 @@ SQLAlchemy-Utils>=0.34.1
tqdm>=4.32.2
xlrd>=1.2.0
XlsxWriter>=1.1.8
flatten_json>=0.1.7
Owner

I guess "flatten-json" (rather than "flatten_json") is the formal name in the PyPI repository. Could you use "flatten-json>=0.1.7" instead?

Comment on lines +560 to +581
'''
Generator that returns an iterator that can be used to stream parsed OSM data.

:type path_to_osm_pbf: str
:param path_to_osm_pbf: Path to the *.osm.pbf file you wish to parse and stream

:type layer_name: str
:param layer_name: The layer of the *.osm.pbf file that you wish to parse and stream.
Options include: points, lines, multilinestrings, multipolygons, & other_relations

:rtype: Iterator
:return: An iterator object that can be used to stream the parsed data from the corresponding layer in the *.osm.pbf file

-------
Example
-------

stream_data = stream_osm_pbf('/algeria.osm.pbf', 'lines')

for row in stream_data:
    print(row)
'''
Owner

"""
:param path_to_osm_pbf: [str] path to the *.osm.pbf file you wish to parse and stream
:param layer_name: [str] 'points', 'lines', 'multilinestrings', 'multipolygons', & 'other_relations'
:return: [generator] allowing for streaming the *.osm.pbf data of the given layer

Testing e.g.
    import pydriosm as dri
    subregion_name = 'Rutland'
    dri.download_subregion_osm_file(subregion_name, osm_file_format='.osm.pbf', update=False)
    _, path_to_osm_pbf = dri.get_default_path_to_osm_file(subregion_name, osm_file_format='.osm.pbf')

    def test_stream_osm_pbf():
        for lyr_name in ['lines', 'multilinestrings', 'points', 'multipolygons', 'other_relations']:
            stream_data = stream_osm_pbf(path_to_osm_pbf, lyr_name)
            print("First row of '{}': ".format(lyr_name))
            for row in stream_data:
                print(row)
                break


>>> test_stream_osm_pbf()
First row of 'lines': 
First row of 'multilinestrings': 
First row of 'points': 
{'type': 'Feature', 'geometry_type': 'Point', 'geometry_coordinates_0': -0.5134241, 'geometry_coordinates_1': 52.6555853, 'properties_osm_id': '488432', 'properties_name': None, 'properties_barrier': None, 'properties_highway': None, 'properties_ref': None, 'properties_address': None, 'properties_is_in': None, 'properties_place': None, 'properties_man_made': None, 'properties_other_tags': '"odbl"=>"clean"', 'id': 488432}
First row of 'multipolygons': 
First row of 'other_relations': 
"""

Comment on lines +583 to +586
raw_data = ogr.Open(path_to_osm_pbf)
layer = raw_data.GetLayer(layer_name)
for feature in layer:
    yield flatten(json.loads(feature.ExportToJson()))
Owner

I would suggest changing some of the variable names, for example:

raw_osm_pbf = ogr.Open(path_to_osm_pbf)
layer = raw_osm_pbf.GetLayer(layer_name)
for feat in layer:
    yield flatten_json.flatten(rapidjson.loads(feat.ExportToJson()))

@manesioz
Author

I apologize for the delay as well; I will get to this as soon as I can. You're right, it only seems to be parsing the points layer, which is odd. I will look into that before making the other style adjustments. Thank you for your feedback!

mikeqfu commented Dec 3, 2019

> I apologize for the delay as well; I will get to this as soon as I can. You're right, it only seems to be parsing the points layer, which is odd. I will look into that before making the other style adjustments. Thank you for your feedback!

Hi @manesioz, you might have noticed that I have made some changes to the package and here is an updated version, but I didn't change anything in the functions relevant to this pull request. I just wanted to know whether you're still interested in working on it. If so, I'd be happy to work together and help where needed; otherwise, I might close this pull request and move on. Feel free to let me know what you think.

manesioz commented Dec 4, 2019

Hey Mike, I am still interested; I have just been busy lately, so I apologize for the delay. I could not easily identify why only the points layer was parsed. Do you have any ideas?

mikeqfu commented Dec 5, 2019

> Hey Mike, I am still interested; I have just been busy lately, so I apologize for the delay. I could not easily identify why only the points layer was parsed. Do you have any ideas?

I also found it very strange. There was probably something wrong with the GDAL/OGR installation and/or settings, but I couldn't figure out exactly where the problem lies.
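
If it helps, here is a minimal diagnostic sketch (the file path is a placeholder) that lists the layers and feature counts OGR itself reports in a given environment; if only 'points' reports any features, the problem would seem to lie in the local GDAL/OGR build or its OSM driver rather than in stream_osm_pbf() itself:

import ogr  # or `from osgeo import ogr`, depending on the GDAL installation

osm_pbf = ogr.Open('rutland-latest.osm.pbf')  # placeholder path
for i in range(osm_pbf.GetLayerCount()):
    lyr = osm_pbf.GetLayerByIndex(i)
    # Note: GetFeatureCount() may have to scan the whole file for the OSM driver
    print(lyr.GetName(), lyr.GetFeatureCount())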

manesioz commented Dec 6, 2019

Yes, possibly. It's harder to debug GDAL/OGR since I'm not familiar with how it works under the hood. If you want, you can close this for the time being, since I currently don't have much time to work on it. Thanks for your work; it's a great package.

@manesioz manesioz closed this Dec 21, 2019

Successfully merging this pull request may close these issues.

Make generator to parse data for performance improvement with large files
2 participants