Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

example: reconstruct_python.py failing in half cases #1300

Open
zarnovican opened this issue Jul 14, 2023 · 1 comment
Open

example: reconstruct_python.py failing in half cases #1300

zarnovican opened this issue Jul 14, 2023 · 1 comment
Labels
bug Earley Issues regarding the Earley parser

Comments

@zarnovican
Copy link

Short description

When running the example examples/advanced/reconstruct_python.py, the code fails on AssertionError in approx 50% of cases.

Traceback (most recent call last):
  File "/home/zarnovic/git/lark/examples/advanced/reconstruct_python.py", line 86, in <module>
    test()
  File "/home/zarnovic/git/lark/examples/advanced/reconstruct_python.py", line 80, in test
    assert tree == tree_new
           ^^^^^^^^^^^^^^^^
AssertionError

To Reproduce

Execute the example in a loop. "1" indicate AssertionError. "0" is success.

$ cd examples/advanced/
$ for i in `seq 10`; do python reconstruct_python.py 2>&1 | grep -c AssertionError; done
1
1
1
1
0
1
0
0
1
0

Test environment

OS: Linux
Python: 3.11
Lark: 1.1.5

I have tested few older Pythons and older Larks as well. No change.

Long Description

As I understand the example, it is converting text Python to AST, then back to Python via "Reconstructor" and then, again to AST. The assertion is that the first and second AST trees should be the same.

After some debugging, I was able to isolate much smaller reproducer:

from lark import Token, Lark
from lark.reconstruct import Reconstructor
from lark.indenter import PythonIndenter

# Official Python grammar by Lark
python_parser3 = Lark.open_from_package('lark', 'python.lark', ['grammars'],
                                        parser='lalr', postlex=PythonIndenter(), start='file_input',
                                        maybe_placeholders=False    # Necessary for reconstructor
                                        )

def special(sym):
    return Token('SPECIAL', sym.name)

def test():
    r = Reconstructor(python_parser3, {'_NEWLINE': special, })
    tree = python_parser3.parse('''foo('a', 'b')\n''')
    output = r.reconstruct(tree)
    print(output)

if __name__ == '__main__':
    test()

The above code is converting foo('a', 'b') to AST and back to python. The Reconstructor produces four variations of the code

$ for i in `seq 10`; do python reconstruct_python_test.py; done | sort -u
foo('a''b',)_NEWLINE
foo('a','b',)_NEWLINE
_NEWLINE foo('a''b',)_NEWLINE
_NEWLINE foo('a','b',)_NEWLINE

Notice the missing comma "," between the two arguments. 'a''b' are strings concatenated into one argument, while 'a','b' are two arguments. The same problem happens in the original example on line:

python_parser3 = Lark.open_from_package('lark', 'python.lark', ['grammars'],

Hence the AssertionError.

Problem: Python grammar

I'm not experienced enough to understand how the Reconstructor works. Maybe the problem is ambiguity in the Python grammar defined in Lark. I cannot fathom, however, how could the Reconstructor arrive from a tree node "arguments" with two children to string concat 😕

Problem: non-determism

My initial motivation for root-cause analysis was actually finding the source of non-determinism. Why, if I put the loop inside the Python process, it returns consistent results, while the loop outside the process differ ? I haven't found "the line" where it is happening. The closest I came is this line, where the output unreduced_tree from the parser is already "random", while the input is not AFAIK.

In my opinion, the source of entrophy is Lark's usage of set()s. Every time the Python process is executed, the elements in sets are iterated over in a different order. This is expected behavior from Python point-of-view. But it has cascading effects on Larks processing, when, for example, rules are inspected in random order.

🤔 Maybe, one way of fixing it is to switch from set() to dict() and use some dummy value. AFAIK Python guarantees that the dict keys will be iterated in the same order.

@erezsh erezsh added Earley Issues regarding the Earley parser bug labels Aug 23, 2023
@erezsh
Copy link
Member

erezsh commented Oct 23, 2023

The new release (1.1.8) should solve Earley non-determinism.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Earley Issues regarding the Earley parser
Projects
None yet
Development

No branches or pull requests

2 participants