Make ds9 region file parser fast #48

cdeil · 2014-10-30T20:38:28Z

The docs contain this note: "pyregion is rather slow, likely due to a inefficient parser."
( see http://pyregion.readthedocs.org/en/latest/index.html#documentation )

It's really slow ... parsing 1000 simple circle regions

with open('ds9.reg', 'w') as f:
    for _ in range(1000):
        f.write('circle(0,0,1)\n')

takes 7 seconds:

%time pyregion.open('ds9.reg')
CPU times: user 7.59 s, sys: 20 ms, total: 7.61 s

Astropy bundles ply, so if someone takes this on it might be worth considering using ply (currently pyparsing is used).

I'm using pyregion in a data analysis / image plotting pipeline and region file parsing is actually the bottleneck, but it's acceptable and I'm not very familiar with parsing and region files, so I don't plan to take this on ... just wanted to make a GH issue so this known issue is not forgotten.

The text was updated successfully, but these errors were encountered:

astrofrog · 2014-10-31T09:34:14Z

@cdeil - can you do some basic profiling (%prun) to see what the bottleneck is?

cdeil · 2014-11-01T11:14:33Z

All the time is spent in pyparsing.

RegionParser.parseLine calls self.parser.parseString(l) where self.parser is a pyparsing object created in RegionParser.__init__ (see here).

I have no idea how to optimise that ... this should probably be rewritten with ply (because it's bundled in astropy, i.e. the extra pyparsing dependency is avoided) and only then one should start profiling ... but it's some effort and requires someone that knows (or has time to learn) how ply works.

astrofrog · 2014-11-01T12:10:32Z

Thanks for checking!

leejjoon · 2014-11-02T16:40:44Z

The slowness is likely due to my lack of knowledge instead pyparsing is inherently slow.

My recollection is that I wanted to keep passing the relevant part of the original string during the parsing, which somehow complicated the code. But not sure if this is the main reason for slow performance.

I note that matplotlib depends on pyparsing. So the pyparsing may not be an extra dependency.

cdeil · 2017-05-06T22:50:10Z

This speed issue in the DS9 parser in pyregion is still of interest.
Next week @joleroi will give a presentation on regions (but also mentioning pyregion) at PyAstro17 and has updated a comparison: astropy/regions#123

Long-term it would be nice if we could have one easy to understand parser that's complete and reasonably fast (speed is not the most important thing, it doesn't have to be super fast).

So pyregion uses http://pyparsing.wikispaces.com/ and one could try to understand the parser here and why it's slow (profile and see which part of the parser is slow and if it can be restructured to be faster).

In https://github.com/astropy/regions/blob/master/regions/io/read_ds9.py there's a handwritten parser. I don't know how complete it is or could be made with some work.

To me, from a quick look, the parser in pyregion and regions looks about the same complexity. In either case I think I would have to read and think for an hour or two to grok them. (cc @keflavich and I think @astrofrog and @joleroi as authors)

If someone becomes really interested in this topic and wants to prototype and benchmark parsers, other options worth looking at are http://www.dabeaz.com/ply/ and https://github.com/erikrose/parsimonious .
I don't have time now or in the coming weeks, but I might in the summer if no-one figures out and finishes up this DS9 parsing part.

cdeil · 2017-05-06T23:04:48Z

One more thing I forgot to mention: the nice thing about https://github.com/erikrose/parsimonious is that (from the looks of it, I haven't tried) one just writes down the grammar of the text to be parsed in BNF form. Independent of whether parsimonious is used or not (probably not) in the end, having the ds9 grammar written down would be nice, e.g. in the docstring of the parser .py file, or the developer docs section.

I don't presume the BNF for DS9 region files exists, but it could probably be extracted from the pyparsing parser code in pyregion, which looks a bit similar.

keflavich · 2017-05-06T23:59:09Z

We'll talk this week, but my recollection from last year was either no one understood how to write a grammar for ds9, or we tried and found it incompatible.

With a functional parser like the one we've started on, we can guarantee 100% coverage given enough programmer time. I don't know if the same is true for a grammar-based approach; it's not obvious to me whether ds9 regions follow a single self-consistent grammar.

Probably the existing code could be made more clear by breaking it down into individual region types; right now, there are extra if statements around for the cases that have indefinite numbers of parameters (e.g., polygons, annuli). I won't object to a BNF-based approach if it can be demonstrated, but I have to see it to believe it.

cdeil · 2017-05-07T08:04:42Z

how to write a grammar for ds9,

What about the pyparsing based parser in pyregion? Is there a problem apart from the speed issue?
Can you point out a (ideally small) example that can't be parsed at all or isn't parsed correctly?
If yes, maybe make a new issue in the pyregion tracker (to keep this one focused on the speed problem)?

joleroi · 2017-05-08T13:13:17Z

This is the table with the current status

With a functional parser like the one we've started on, we can guarantee 100% coverage given enough programmer time.

I agree, we should work until the table linked above shows 100% for coverage everywhere for the regions package. Speed is already much better than pyregion, so we probably don't need to worry about that (as @cdeil said). We discussed this already a year ago, and couldn't come up with a complete DS9 grammar.

Can you point out a (ideally small) example that can't be parsed at all or isn't parsed correctly?

See the script to generate the table, there are many warnings like this

UserWarning: Failed to parse : box(18:13:27.558,+31:22:30.76,13.32",6.66",19.98",9.99",45.2623) # font="helvetica 10 bold roman" text={Box Annulus}
  warnings.warn("Failed to parse : " + l)

Also, most of the files that cannot be parsed with the regions package, show only a 75% coverage with pyregion. So I would be reluctant to try going down the grammar avenue again.

cdeil · 2017-05-08T15:45:57Z

I didn't know that pyregion is not a complete DS9 region parser and I think wasn't part of the DS9 grammar / parse discussions last year (or forgot).

OK, fine with me to go with a hand-written parser if you think it's a good (or even the only possible) solution.

The pyregions docs are a bit all over the place.

On the front page there's a note "pyregion is rather slow, likely due to a inefficient parser. Any contribution will be welcome." and another one pointing to the regions package without saying anything about the relation of the two.
At the top of http://pyregion.readthedocs.io/en/latest/getting_started.html there's a statement containing "It reads most of the region files created by ds9.". I think it would be good to give an example of such a case and say what happens in the docs.

The regions docs also don't really say anything about the relation or regions and pyregion.
I think it would be good to add an About section to both, briefly describing what they are with a few bullet points.
Given the discussion here I think it's roughly: pyregion is a legacy package, will likely remain as-is. (Would still be nice to increase test coverage a bit, improve docs a bit and make one more release).
regions is supposed to be a full replacement and contain a full DS9 region parser. I think the "Shape" and "ShapeList" abstractions from pyregion might be good ones and should become part of the regions.io.ds9 API as an intermediate layer?

@joleroi or @keflavich - Can you do such a docs update this week or should I put it on my TODO list?

Note that @olebole as Debian Astro lead asked me a week ago if he should include pyregion and / or regions in Debian. My answer was that "both" would be good. I think there's many pyregion users and it would be good to have it also, at least for the next 5 years, then maybe it can be dropped at some point (like pywcs and pyfits were recently now that there's Astropy).
For regions I don't think we can make it into Astropy 2.0. It's far from finished / polished / user-tested. It'll be under development for another year and we don't want to commit to stable API like putting as astropy.regions implies, no?

keflavich · 2017-05-08T16:47:14Z

Agreed that both pyregion and region need to coexist for at least a while.

Could you make the docs thing into a separate issue? Maybe we'll be able to address it this week, but it should be an Issue.

cdeil · 2017-05-08T16:55:17Z

@keflavich - OK. I made a separate issue suggesting the docs improvements here: astropy/regions#124

leejjoon · 2017-05-08T21:33:05Z

coverage of pyregion

I have made a pull request (#111) that should improve the coverage. As far as I can tell, it now can parse all the regions in the test. The current version of the regions_pyregion_comparison.py reports coverage around 82%, but this is because the script overestimates the number of defined regions in the file.

speed of pyregion

I think that the reason that pyregion is slow is likely not due to the inherent performance of the pyparsing, but more likely due to the inefficient grammar I created (and things inbetween). By no means I am an expert on this grammar thing and also the pyparsing.

cdeil added this to the wishlist milestone Aug 1, 2016

cdeil added the enhancement label Aug 1, 2016

cdeil mentioned this issue Aug 10, 2016

Pixel Representations astropy/regions#66

Open

cdeil mentioned this issue May 6, 2017

Correctness and speed comparison to pyregion astropy/regions#45

Closed

2 tasks

cdeil mentioned this issue May 8, 2017

Add "about" page describing regions and pyregion astropy/regions#124

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make ds9 region file parser fast #48

Make ds9 region file parser fast #48

cdeil commented Oct 30, 2014

astrofrog commented Oct 31, 2014

cdeil commented Nov 1, 2014

astrofrog commented Nov 1, 2014

leejjoon commented Nov 2, 2014

cdeil commented May 6, 2017

cdeil commented May 6, 2017

keflavich commented May 6, 2017

cdeil commented May 7, 2017

joleroi commented May 8, 2017

cdeil commented May 8, 2017

keflavich commented May 8, 2017

cdeil commented May 8, 2017

leejjoon commented May 8, 2017

Make ds9 region file parser fast #48

Make ds9 region file parser fast #48

Comments

cdeil commented Oct 30, 2014

astrofrog commented Oct 31, 2014

cdeil commented Nov 1, 2014

astrofrog commented Nov 1, 2014

leejjoon commented Nov 2, 2014

cdeil commented May 6, 2017

cdeil commented May 6, 2017

keflavich commented May 6, 2017

cdeil commented May 7, 2017

joleroi commented May 8, 2017

cdeil commented May 8, 2017

keflavich commented May 8, 2017

cdeil commented May 8, 2017

leejjoon commented May 8, 2017