Earley parser produces wrong parse Tree #1283

MegaIng · 2023-06-01T11:24:03Z

Over on SO someone asked this question: https://stackoverflow.com/questions/76366280/parsing-formulas-using-lark-ebnf/76381256

As far as I can tell, it shows a bug in the earley parser where it duplicates some Tokens because of enormous amounts of ambiguities. I don't have the expertise with earley to figure out what is happening or the time to create a minimal example right now.

chanicpanic · 2023-06-02T03:01:31Z

I have it down to:

grammar = """
start: (a+)*
!a.1: "A" |
"""

l = Lark(grammar)
tree = l.parse("A")
print(tree.pretty())

Output:

start
  a
  a     A
  a
  a     A

I suspect that the issue is somewhere in ForestToParseTree with resolve_ambiguity=True. I do not see the same problem in the tree returned when using ambiguity="explicit". I'll look into it more when I have a chance.

chanicpanic · 2023-06-06T05:07:04Z

The issue is that when ambiguity="resolve", the ParseTreeBuilder will use the ChildFilterLALR filters. However, the ForestToParseTree cache breaks the ChildFilterLALR assumption that it is safe to change the parse tree. We have two options:

Disable the ForestToParseTree cache when ambiguity="resolve".
Always use the regular ChildFilter for earley.

Either will work, but they may have different performance implications.

erezsh · 2023-06-07T07:10:00Z

@chanicpanic Thanks for looking into it!

Can you explain more about the performance implications?

chanicpanic · 2023-06-10T05:38:43Z

The ForestToParseTree cache and ChildFilterLALR filter are both optimizations that happen to be incompatible with each other. So, the question is: which optimization gives the most performance benefit for earley with ambiguity resolution?

While the ForestToParseTree cache provides a significant speed-up for explicit ambiguous parses, its impact on ambiguity resolution is likely minimal. My hypothesis is that the cache is only relevant when a rule matches an empty string, and even then, the forest nodes have to be visited in a particular order.

On the other hand, I believe that child filtering is an operation that is performed for many grammars and inputs. So, the ChildFilterLALR would be used much more often.

Thus, I am leaning toward option 1 because I expect the ChildFilterLALR to have a greater overall performance benefit.

erezsh · 2023-06-25T10:03:13Z

@chanicpanic Thanks for the explanation.

I think it's best to make an evidence-based choice, which makes me wonder, do you have an idea for what a good benchmark grammar&input for Earley would be?

I could run benchmarks on trick-grammars like start: start start | "(" start ")" |, and normal grammars that are low on ambiguity. But maybe you know some real-world examples that are ambiguous and would serve as a better measurement?

cc: @MegaIng

Shildifreak · 2023-07-22T20:18:28Z

I stumbled across what I think is the same problem.
Here is the minimal example from my grammar:

GRAMMAR = r'''
start : _s x _s
x : "X"?
_s : " "?
'''

l = Lark(GRAMMAR)

# behave as expected
print(l.parse( " X " ).pretty())
print(l.parse( "X" ).pretty())
print(l.parse( " " ).pretty())

# produces two x nodes
print(l.parse( "" ).pretty())

output:

start
  x

start
  x

start
  x

start
  x
  x

…parser#1283)

chanicpanic · 2023-08-01T20:17:32Z

I think it's best to make an evidence-based choice, which makes me wonder, do you have an idea for what a good benchmark grammar&input for Earley would be?

I could run benchmarks on trick-grammars like start: start start | "(" start ")" |, and normal grammars that are low on ambiguity. But maybe you know some real-world examples that are ambiguous and would serve as a better measurement?

@erezsh I don't have any real-world grammars that I think would be a better benchmark for Earley.
For this issue, I think it would be good to use the grammars in this thread, or variations that incorporate the same patterns, which are known to utilize the ForestToParseTree cache with ambiguity="resolve". Also, it would be good to benchmark with normal grammars that reflect more typical lark usage.

I made branches for the two options that can be used for comparison:

MegaIng mentioned this issue Jul 30, 2023

Empty matches appearing unnecessarily when repeating empty rules ambiguously #1312

Open

chanicpanic added a commit to chanicpanic/lark that referenced this issue Aug 1, 2023

Add tests for issue lark-parser#1283

1b6c1dd

chanicpanic added a commit to chanicpanic/lark that referenced this issue Aug 1, 2023

Always use ChildFilter for Earley (issue lark-parser#1283)

8e623ef

chanicpanic added a commit to chanicpanic/lark that referenced this issue Aug 1, 2023

Disable ForestToParseTree cache when ambiguity='resolve' (issue lark-…

18562ae

…parser#1283)

erezsh added the Earley Issues regarding the Earley parser label Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Earley parser produces wrong parse Tree #1283

Earley parser produces wrong parse Tree #1283

MegaIng commented Jun 1, 2023

chanicpanic commented Jun 2, 2023

chanicpanic commented Jun 6, 2023

erezsh commented Jun 7, 2023

chanicpanic commented Jun 10, 2023

erezsh commented Jun 25, 2023 •

edited

Shildifreak commented Jul 22, 2023

chanicpanic commented Aug 1, 2023

Earley parser produces wrong parse Tree #1283

Earley parser produces wrong parse Tree #1283

Comments

MegaIng commented Jun 1, 2023

chanicpanic commented Jun 2, 2023

chanicpanic commented Jun 6, 2023

erezsh commented Jun 7, 2023

chanicpanic commented Jun 10, 2023

erezsh commented Jun 25, 2023 • edited

Shildifreak commented Jul 22, 2023

chanicpanic commented Aug 1, 2023

erezsh commented Jun 25, 2023 •

edited