The exporting of resulting OSM files can potentially be sped up #37

Open
Vectorial1024 opened this issue Jul 23, 2023 · 4 comments

@Vectorial1024

This requires confirmation later, but I noticed this StackOverflow discussion:

https://stackoverflow.com/questions/44560655/python-writelines-and-write-huge-time-difference

Why is there such a large difference in the file writing time for write() and writelines() even though it is the same data?


I have used ogr2osm for a while, and I have noticed that it can be quite slow on larger files. Like, unusually slow.

It seems the exporting can be sped up. Will investigate later.
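
To check whether the write strategy matters at all, a rough comparison like the one below could be a starting point. This is only a sketch: the line contents, the line count, and the function names are made up for illustration and are not taken from ogr2osm, and the timings will vary a lot between machines and disks.

```python
import os
import tempfile
import time

# Hypothetical payload: many short XML-like lines, similar in shape to OSM output.
lines = [f"<node id='{i}'/>\n" for i in range(200_000)]


def write_each(f):
    # One write() call per line.
    for line in lines:
        f.write(line)


def write_lines(f):
    # A single writelines() call over the whole list.
    f.writelines(lines)


def write_joined(f):
    # Join everything in memory first, then a single write() call.
    f.write("".join(lines))


def timed(strategy):
    """Run one strategy against a temp file; return (seconds, file content)."""
    fd, path = tempfile.mkstemp(suffix=".osm")
    os.close(fd)
    try:
        start = time.perf_counter()
        with open(path, "w", encoding="utf-8") as f:
            strategy(f)
        elapsed = time.perf_counter() - start
        with open(path, encoding="utf-8") as f:
            content = f.read()
    finally:
        os.remove(path)
    return elapsed, content


results = {fn.__name__: timed(fn) for fn in (write_each, write_lines, write_joined)}
for name, (elapsed, _) in results.items():
    print(f"{name}: {elapsed:.3f} s")
```

All three strategies produce the same file, so any difference is purely in how the data reaches the file object.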

@Vectorial1024
Author

Benchmarking the existing method

First, I must admit my current PC is mid-to-high tier, so things might run faster than average. The point should still stand for slower computers, though. Extra care must also be taken because the files to be processed can be very large.


Some details:

  • Command run: python -m ogr2osm -t test_translate -o target.osm source.geojson
  • Size of data source: about 430 MB
  • Measuring the duration: basic timing added at DataWriterContextManager.output using time.time()
  • All I/O is on SSD

I ran the command 5 times.

Measured time (average): 19.455 seconds
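
For reference, the kind of measurement described above can be sketched roughly as follows. Note the assumptions: `fake_export` is a hypothetical stand-in for the real export step, `time_runs` is a helper name invented here, and `time.perf_counter()` is swapped in for `time.time()` because it is better suited to measuring short intervals.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def fake_export():
    # Hypothetical stand-in for the real export; it only builds a string in memory.
    "".join(f"<node id='{i}'/>\n" for i in range(10_000))


durations = time_runs(fake_export)
print(f"average: {statistics.mean(durations):.3f} s over {len(durations)} runs")
```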

Vectorial1024 changed the title from "The exporting of resulting files can potentially be sped up" to "The exporting of resulting OSM files can potentially be sped up" on Jul 23, 2023

@Vectorial1024
Author

One thing that sticks out when doing some detailed profiling:

Beginning to time the export
Writing file header
Writing nodes
Writing took (to_xml, write): 7.521965265274048, 0.6387271881103516
Writing ways
Writing took (to_xml, write): 10.0064537525177, 0.5809998512268066
Writing relations
Writing took (to_xml, write): 0, 0
Writing file footer
Time elapsed was 18.999 seconds

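It is actually the to_xml part which is slow, not the I/O: to_xml accounts for about 17.5 of the 19 seconds, while the actual writes take about 1.2 seconds in total.

The next thing to try could be some sort of multi-threading.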
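
A split measurement like the one in the log above can be sketched as below. This is not the actual ogr2osm code: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra dependencies, and `export_with_split_timing` and the sample nodes are made up for illustration.

```python
import io
import time
import xml.etree.ElementTree as ET


def export_with_split_timing(elements, out):
    """Serialize and write each element, timing the two steps separately."""
    to_xml_time = 0.0
    write_time = 0.0
    for element in elements:
        t0 = time.perf_counter()
        text = ET.tostring(element, encoding="unicode")  # the "to_xml" step
        t1 = time.perf_counter()
        out.write(text + "\n")  # the "write" step
        t2 = time.perf_counter()
        to_xml_time += t1 - t0
        write_time += t2 - t1
    return to_xml_time, write_time


# Hypothetical sample data: 1000 nodes with negative ids.
nodes = [ET.Element("node", id=str(-i), lat="0.0", lon="0.0") for i in range(1, 1001)]
buffer = io.StringIO()
to_xml_time, write_time = export_with_split_timing(nodes, buffer)
print(f"Writing took (to_xml, write): {to_xml_time}, {write_time}")
```

Writing into an in-memory buffer keeps the sketch self-contained; with a real file the write share would include disk I/O.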

@Vectorial1024
Author

Hmmm. We are already using lxml for fast export.

Spawning new threads does not help here because of Python's GIL, which lets only one thread execute Python bytecode at a time, so CPU-bound code effectively runs single-threaded.

Playing around with multiprocessing did not bring many immediate results, because extra work is needed to pass values into the subprocesses. This might be viable in the long term, but it is not something that can be done in a single day.

If we are able to somehow utilize multi-processing effectively, then perhaps there will be a significant speedup.
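
One possible shape for this, as a rough sketch only: reduce each feature to plain, picklable values first, then let worker processes turn chunks of them into XML text. Everything here is an assumption for illustration (the `(id, lat, lon)` tuple layout, the function names, the chunk size); it builds strings by hand instead of using lxml, and it ignores tags, ways, and relations entirely.

```python
from concurrent.futures import ProcessPoolExecutor
from xml.sax.saxutils import quoteattr


def serialize_chunk(rows):
    """Turn a list of (id, lat, lon) tuples into OSM-style <node/> lines."""
    return "".join(
        f"<node id={quoteattr(str(node_id))} "
        f"lat={quoteattr(str(lat))} lon={quoteattr(str(lon))}/>\n"
        for node_id, lat, lon in rows
    )


def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def parallel_to_xml(rows, chunk_size=10_000, executor_cls=ProcessPoolExecutor, workers=None):
    """Serialize chunks in parallel; map() returns results in input order."""
    with executor_cls(max_workers=workers) as pool:
        return "".join(pool.map(serialize_chunk, chunked(rows, chunk_size)))
```

Only plain tuples cross the process boundary, which keeps the pickling cost low. On platforms that start workers with spawn, the call to `parallel_to_xml` needs to sit under an `if __name__ == "__main__":` guard. Whether the speedup outweighs the cost of copying data between processes would have to be measured.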

@Vectorial1024
Author

This just dropped a few days ago:

https://www.bitecode.dev/p/whats-up-python-the-gil-removed-a

The removal of the GIL in Python could be very useful for this speedup: instead of spawning difficult-to-control subprocesses to parallelize XML-to-string, we may finally get an easy-to-control multi-threaded XML-to-string process to speed up exporting.
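
If this lands, the exporter could pick its parallelism strategy at runtime. A minimal sketch, assuming CPython 3.13 or later for the `sys._is_gil_enabled()` check (on older versions the attribute does not exist and the GIL is always on); `gil_enabled` and `pick_executor_cls` are hypothetical helper names, not part of ogr2osm.

```python
import sys
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def gil_enabled():
    """Return True if this interpreter is running with the GIL."""
    # sys._is_gil_enabled() was added in CPython 3.13; fall back to True when it is absent.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()


def pick_executor_cls():
    """Use threads when the GIL is off, otherwise fall back to processes."""
    return ProcessPoolExecutor if gil_enabled() else ThreadPoolExecutor


print(f"GIL enabled: {gil_enabled()}, executor: {pick_executor_cls().__name__}")
```

Threads would only pay off if the XML library in use is itself safe and efficient under free threading, which would need to be verified separately.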
