New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make ds9 region file parser fast #48
Comments
@cdeil - can you do some basic profiling (%prun) to see what the bottleneck is? |
All the time is spent in
I have no idea how to optimise that ... this should probably be rewritten with |
Thanks for checking! |
The slowness is likely due to my lack of knowledge instead pyparsing is inherently slow. My recollection is that I wanted to keep passing the relevant part of the original string during the parsing, which somehow complicated the code. But not sure if this is the main reason for slow performance. I note that matplotlib depends on pyparsing. So the pyparsing may not be an extra dependency. |
This speed issue in the DS9 parser in pyregion is still of interest. Long-term it would be nice if we could have one easy to understand parser that's complete and reasonably fast (speed is not the most important thing, it doesn't have to be super fast). So pyregion uses http://pyparsing.wikispaces.com/ and one could try to understand the parser here and why it's slow (profile and see which part of the parser is slow and if it can be restructured to be faster). In https://github.com/astropy/regions/blob/master/regions/io/read_ds9.py there's a handwritten parser. I don't know how complete it is or could be made with some work. To me, from a quick look, the parser in pyregion and regions looks about the same complexity. In either case I think I would have to read and think for an hour or two to grok them. (cc @keflavich and I think @astrofrog and @joleroi as authors) If someone becomes really interested in this topic and wants to prototype and benchmark parsers, other options worth looking at are http://www.dabeaz.com/ply/ and https://github.com/erikrose/parsimonious . |
One more thing I forgot to mention: the nice thing about https://github.com/erikrose/parsimonious is that (from the looks of it, I haven't tried) one just writes down the grammar of the text to be parsed in BNF form. Independent of whether parsimonious is used or not (probably not) in the end, having the ds9 grammar written down would be nice, e.g. in the docstring of the parser I don't presume the BNF for DS9 region files exists, but it could probably be extracted from the pyparsing parser code in pyregion, which looks a bit similar. |
We'll talk this week, but my recollection from last year was either no one understood how to write a grammar for ds9, or we tried and found it incompatible. With a functional parser like the one we've started on, we can guarantee 100% coverage given enough programmer time. I don't know if the same is true for a grammar-based approach; it's not obvious to me whether ds9 regions follow a single self-consistent grammar. Probably the existing code could be made more clear by breaking it down into individual region types; right now, there are extra |
What about the pyparsing based parser in pyregion? Is there a problem apart from the speed issue? |
This is the table with the current status
I agree, we should work until the table linked above shows 100% for coverage everywhere for the regions package. Speed is already much better than pyregion, so we probably don't need to worry about that (as @cdeil said). We discussed this already a year ago, and couldn't come up with a complete DS9 grammar.
See the script to generate the table, there are many warnings like this
Also, most of the files that cannot be parsed with the regions package, show only a 75% coverage with pyregion. So I would be reluctant to try going down the grammar avenue again. |
I didn't know that pyregion is not a complete DS9 region parser and I think wasn't part of the DS9 grammar / parse discussions last year (or forgot). OK, fine with me to go with a hand-written parser if you think it's a good (or even the only possible) solution. The
The @joleroi or @keflavich - Can you do such a docs update this week or should I put it on my TODO list? Note that @olebole as Debian Astro lead asked me a week ago if he should include |
Agreed that both pyregion and region need to coexist for at least a while. Could you make the docs thing into a separate issue? Maybe we'll be able to address it this week, but it should be an Issue. |
@keflavich - OK. I made a separate issue suggesting the docs improvements here: astropy/regions#124 |
I have made a pull request (#111) that should improve the coverage. As far as I can tell, it now can parse all the regions in the test. The current version of the regions_pyregion_comparison.py reports coverage around 82%, but this is because the script overestimates the number of defined regions in the file.
I think that the reason that pyregion is slow is likely not due to the inherent performance of the pyparsing, but more likely due to the inefficient grammar I created (and things inbetween). By no means I am an expert on this grammar thing and also the pyparsing. |
The docs contain this note: "pyregion is rather slow, likely due to a inefficient parser."
( see http://pyregion.readthedocs.org/en/latest/index.html#documentation )
It's really slow ... parsing 1000 simple circle regions
takes 7 seconds:
Astropy bundles ply, so if someone takes this on it might be worth considering using
ply
(currently pyparsing is used).I'm using
pyregion
in a data analysis / image plotting pipeline and region file parsing is actually the bottleneck, but it's acceptable and I'm not very familiar with parsing and region files, so I don't plan to take this on ... just wanted to make a GH issue so this known issue is not forgotten.The text was updated successfully, but these errors were encountered: