Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully. #1388

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

RossPatterson
Copy link

Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully.

Fixes #391, #1382, #1384.

  • All scenarios have tests in new file tests/test_lark_lark.py. Each test passed load_grammar.py and failed lark.lark before these changes, and passed both after. All existing un-skipped unit tests continue to pass after these changes. The skipped Earley tests have not been checked.

  • Updates to the formal Lark grammar (lark/grammars/lark.lark) to make it match the actual parser (lark/load_grammar.py)

  1. Token aliases (e.g., NAME_1 -> NAME_2) are no longer allowed.
    • load_grammar.py appears to accept them, but forces the new name to be a rule-name.
    • tests/test_grammar.py.test_alias_in_terminal says they should raise a GrammarError.
    • Making this change necessitated splitting the shared expansions rule, and all the rules it calls, into rule_expansions and token_expansions (and similarly named rules for them to call).
  2. Rule aliases (name_1 -> name_2) are now only valid for alternates at the top-level of a rule.
  3. The item in a %ignore directive must now be a single token.
    • load_grammar.py accepts multiple tokens, but only processes the first.
  4. The %extend directive is now defined for rules.
  5. The %extend directive is now defined for tokens.
  6. Token templates are no longer allowed (at least until Issue Token templates #555 is resolved).
  7. Tokens can no longer include rules.
  8. The %override directive now supports overriding token definitions.
  9. Rule modifiers are no longer allowed in rule references (e.g., rule1: !?rule2)
  10. Rule modifiers are now allowed to be ?!, in addition to the existing !, ?, and !?.
  • Several tests in tests/tests/test_grammar.py have been copied to new file tests/test_grammar_formal.py and updated to parse using lark.lark instead of load_grammar.py. Only the tests that verify how the Lark grammar is parsed could be implemented. Those that verify how a Lark-generated parser operates could not, because there's no (easy?) way to create a parser from the parse tree.

  • Oddities not changed:

    • Literal ranges are defined in both lark.lark and load_grammar.py as STRINGs, but must actually be single characters.
      • load_grammar.py initially accepts multi-character strings, but later rejects them.
      • Attempting to replace STRING ".." STRING in lark.lark with /./ ".." /./ failed, as Lark accepts Unicode escapes as strings (e.g. test_unicode_literal_range_escape2 uses "\U0000FFFF".."\U00010002").
      • Attempting to replace STRING ".." STRING in lark.lark with /.|"\\U[0-9a-fA-F]+"/ ".." /.|"\\U[0-9a-fA-F]+"/ failed, causing lots of unrelated tests to fail.
    • load_grammar.py does not allow STRING tokens to be empty (i.e., ""). There is no (easy?) way to prevent that in lark.lark. The obvious answer, adding /(?<!"")/, broke tests/test_parser.py._TestParser.test_backslash2.

@RossPatterson RossPatterson mentioned this pull request Feb 1, 2024
Copy link
Member

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some suggestions and questions I have regarding your PR. I tried to get everything down but I might think of more after you address them.

docs/grammar.md Outdated Show resolved Hide resolved
docs/grammar.md Outdated Show resolved Hide resolved
docs/grammar.md Show resolved Hide resolved
docs/grammar.md Outdated
### %override

Override a rule or terminals, affecting all references to it, even in imported grammars.

Useful for implementing an inheritance pattern when importing grammars.

**Syntax:**
```html
%override <TERMINAL> ... terminal definition ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not accurate..

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L162-L165 and https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L1260-L1262 make it look like you can override a terminal, by supplying the full definition, preceded by %override. What did I misunderstand?

Again, if the description was less words and more productions, I'd have just written <override> ::= "%override" (<rule> | <term>).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have been more specific.

The syntax is %override <terminal definition>, or %override <TERMINAL> : <items>

Either way, but the way you write it is confusing, and would make the reader think the syntax is different.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by 83a374f.

@@ -74,6 +74,7 @@ Lark will parse "((hello world))" as:
The brackets do not appear in the tree by design. The words appear because they are matched by a named terminal.


<a name="shaping_the_tree"></a>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be automatic, no?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd hope so, but I saw you used a similar construct at https://github.com/lark-parser/lark/blob/master/docs/grammar.md?plain=1#L242, and Sphinx ain't one of my strong suits, so I assumed an HTML anchor was necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny, I'm really not sure why it's there. But as you can see, it isn't needed: https://lark-parser.readthedocs.io/en/stable/grammar.html#directives

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by 83a374f.

tests/test_grammar_formal.py Outdated Show resolved Hide resolved
# '%ignore expects a value', '%ignore %import\n'
self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..__ANON_2., .%import..', l.parse, '%ignore %import\n')

# def test_empty_literal(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why all the commented out parts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted it to be clear which test_grammar tests weren't implementable for lark.lark. Sort of a "show your work" thing. But I can remove them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9493f81.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate your aim for completeness, but I think it was more confusing than anything.

If you think it's important information, perhaps just write a list of the names, without Python syntax. like

# List of non-testable constructs: empty literal, kraken, foobar, etc.


name: RULE
| TOKEN

_VBAR: _NL? "|"
OP: /[+*]|[?](?![a-z])/
RULE: /!?[_?]?[a-z][_a-z0-9]*/
RULE: /_?[a-z][_a-z0-9]*/
RULE_MODIFIERS: /!|![?](?=[a-z])|[?]!?(?=[a-z])/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need the ?=s here? I don't recall that ? or ! are valid anywhere else in the grammar..?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I originally had a comment explaining that, but it would have been the only part of the entire grammar that had commentary, so I removed it. The ? rule modifier is not allowed on a rule whose name begins with _ (https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L1057-L1059), and that bit or regex blocks that usage. So maybe I should put the comment back in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. But wouldn't users be able to bypass it by adding whitespace between them?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drat! I missed a test case. But no, this test fails similarly in load_grammar.py, the original lark.lark, and my modified version:

            rule1: rule2
            ?! rule2: "a"
        """

load_grammar.py:

Traceback (most recent call last):
  File "C:\Ross\Source\lark\lark\parsers\lalr_parser_state.py", line 77, in feed_token
    action, arg = states[state][token.type]
KeyError: 'OP'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Ross\Source\lark\tests\test_lark_lark.py", line 169, in test_11_rule_modifier_query_bang_space_lg
    Lark(g)
...
  File "C:\Ross\Source\lark\lark\lexer.py", line 674, in lex
    raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name)
lark.exceptions.UnexpectedToken: Unexpected token Token('OP', '?') at line 3, column 13.
Expected one of:
        * "%declare"
        * "%import"
        * "%extend"
        * RULE
        * "%override"
        * _NL
        * $END
        * "%ignore"
        * TOKEN
        * RULE_MODIFIERS
Previous tokens: [Token('_NL', '\n            ')]

Original lark.lark:

Traceback (most recent call last):
  File "C:\Ross\Source\lark\lark\lexer.py", line 665, in lex
    yield lexer.next_token(lexer_state, parser_state)
  File "C:\Ross\Source\lark\lark\lexer.py", line 598, in next_token
    raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column,
lark.exceptions.UnexpectedCharacters: No terminal matches '?' in the current parser context, at line 3 col 13

            ?! rule2: "a"
            ^
...
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Ross\Source\lark\tests\test_lark_lark.py", line 176, in test_11_rule_modifier_query_bang_space_ll
    self.lark_parser.parse(g)
...
  File "C:\Ross\Source\lark\lark\lexer.py", line 674, in lex
    raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name)
lark.exceptions.UnexpectedToken: Unexpected token Token('OP', '?') at line 3, column 13.
Expected one of:
        * _NL
        * "%import"
        * "%ignore"
        * "%override"
        * TOKEN
        * $END
        * "%declare"
        * RULE
Previous tokens: [Token('_NL', '\n            ')]

Modified lark.lark:

Traceback (most recent call last):
  File "C:\Ross\Source\lark\lark\lexer.py", line 665, in lex
    yield lexer.next_token(lexer_state, parser_state)
  File "C:\Ross\Source\lark\lark\lexer.py", line 598, in next_token
    raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column,
lark.exceptions.UnexpectedCharacters: No terminal matches '?' in the current parser context, at line 3 col 13

            ?! rule2: "a"
            ^
...
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Ross\Source\lark\tests\test_lark_lark.py", line 176, in test_11_rule_modifier_query_bang_space_ll
    self.lark_parser.parse(g)
...
  File "C:\Ross\Source\lark\lark\lexer.py", line 674, in lex
    raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name)
lark.exceptions.UnexpectedToken: Unexpected token Token('OP', '?') at line 3, column 13.
Expected one of:
        * "%declare"
        * RULE
        * RULE_MODIFIERS
        * $END
        * "%ignore"
        * "%extend"
        * "%import"
        * TOKEN
        * _NL
        * "%override"
Previous tokens: [Token('_NL', '\n            ')]

?atom: "(" expansions ")"
| "[" expansions "]" -> maybe
| value
?token_atom: "(" token_expansions ")"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

token_atom and rule_atom are the same, why not merge them using templates? Same with rule_expr and atom_expr, rule_expansion etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the original lark.lark can derive atom -> value -> template_usage, but templates aren't allowed for terminals (pending resolution of Issue #555). I didn't see any way, with templates, to prevent that derivation. As far as I can tell, templates are effectively a text-substitution macro capability (although I understand the implementation is a little more sophisticated).

I suppose I could have done something like this:

_expansion{expr_type}: expr_type*
_expr{atom_type}: atom_type [OP | "~" NUMBER [".." NUMBER]]
_atom{expansion_type, value_type}: "(" expansion_type ")"
                                 | "[" expansion_type "]" -> maybe
                                 | value_type

?rule_expansions: rule_alias (_VBAR rule_alias)*
?rule_inner_expansions: rule_expansion (_VBAR rule_expansion)*
?rule_alias: rule_expansion ["->" RULE]
?rule_expansion: _expansion(rule_expr)
?rule_expr: _expr(rule_atom)
?rule_atom: _atom(rule_inner_expansions, rule_value)
?rule_value: RULE "{" rule_value ("," rule_value)* "}" -> rule_template_usage
           | RULE
           | value

?token_expansions: token_expansion (_VBAR token_expansion)*
?token_expansion: _expansion(token_expr)
?token_expr: _expr(token_atom)
?token_atom: _atom(token_expansions, token_value)

?value: STRING ".." STRING -> literal_range
      | TOKEN
      | (REGEXP | STRING) -> literal

Is it what you're suggesting? It doesn't read as cleanly to my eyes, but I'll grant that I've been studying Lark pretty hard lately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we dig into that, let's come to an agreement in the main conversation (regarding restrictive-validating grammar vs relaxed grammar + validating visitor).

Afterwards if it's still relevant, I will try to craft a solution to this problem.

@MegaIng
Copy link
Member

MegaIng commented Feb 1, 2024

My general opinion is that lark.lark should reflect the syntax as defined by the BNF inside of load_grammar.py and not the semantic checks implemented on top of that (and therefore most of this PR shouldn't be done in it's current form). But if @erezsh is of a different mind, I am not going to argue too much. I also wouldn't be opposed to making the syntax specfified in load_grammar.py more restrictive and therefore removing semantic checks in other parts of the code. My "dream" would be that the BNF hard coded in load_grammar.py could be generated from lark.lark and we only have one source of truth. Further diverging the two grammars is IMO not a good idea.

@RossPatterson
Copy link
Author

RossPatterson commented Feb 1, 2024

My general opinion is that lark.lark should reflect the syntax as defined by the BNF inside of load_grammar.py and not the semantic checks implemented on top of that (and therefore most of this PR shouldn't be done in it's current form).

I see your point, but I think for anyone who isn't writing code in the Lark codebase, lark.lark is a much more accessible definition of the Lark grammar than the implementation in load_grammar.py. Some of the things I noted (e.g., literal_range being STRINGs) sounds more to me like syntax than semantics, but that may just be bike-shedding.

My "dream" would be that the BNF hard coded in load_grammar.py could be generated from lark.lark and we only have one source of truth.

Mine as well. I was actually quite surprised at how different they were. It was your comments in Issue #1355 that pointed me down this path.

Further diverging the two grammars is IMO not a good idea.

I guess I see it more as specifying the same grammar in two forms, not two grammars. But you're right about semantics - there are some things (e.g., my comments about literal_range) that Lark can't easily express today, and that require post-processing.

@RossPatterson
Copy link
Author

I missed two lines that I should have deleted in the test files, which I used for before/after testing (... 'grammars/lark.lark-ORIG' ...). I'll remove them.

…rk_lark.py (trailing whitespace in test_06_*())

2. Remove what was left of "Literals can be one of: ..." under "Terminals" in grammar.md.

3. Address @erezsh's point about inlined terminals under "Terminals" in grammar.md.

4. Remove "lark.lark-ORIG" references in test_lark_lark.py and test_grammar_formal.py that shouldn't have been pushed.

5. Address @erezsh's point about f.readlines() in test_grammar_formal.py.

6. Address @erezsh's point commented-out tests in test_grammar_formal.py.
@RossPatterson
Copy link
Author

I missed two lines that I should have deleted in the test files, which I used for before/after testing (... 'grammars/lark.lark-ORIG' ...). I'll remove them.

Fixed in 9493f81.

@erezsh
Copy link
Member

erezsh commented Feb 2, 2024

@MegaIng @RossPatterson I can see both points of view, and I think the determining factor should be practicality above all else.

In an ideal world, load_grammar would only preload a minimal parser, that would parse lark.lark and use it to create the Lark parser that supports the full syntax. That would be the cleanest approach imho. But it would also be slower, probably in a very noticeable way. (did we ever do this experiment?)

As for the BNF in load_grammar, it is non-restrictive on purpose. By moving the validation from the parser to the post-parsing code, we can perform more sophisticated validation, and provide custom error messages that are more helpful and friendly.

As for what to do with lark.lark, in my mind it's still an open question, and depends heavily on what are its main uses. Is it to parse and manipulate lark grammars programatically? Is it to validate correct grammars? Is it to be used in a lark editor? etc.

I think the options that we are facing are:

  1. Leave lark.lark as a non-validating grammar, that is still capable of parsing every correct lark grammar

  2. Same as 1, but pair it with an "official" visitor that performs extra validation with better error messages

  3. Make lark.lark as restrictive as possible, to match load_grammar

I'm leaning towards (2), because I'm can't really think of good use-cases for 3, except than it being a "reference grammar" for human eyes. But I don't know if that's a good enough reason, especially since that can be accomplished using better docs, or with comments within lark.lark

What do you guys think?

(needless to say, regardless of what we choose, I'm all for improving the documentation regarding the grammar)

@RossPatterson
Copy link
Author

As for what to do with lark.lark, in my mind it's still an open question, and depends heavily on what are its main uses. Is it to parse and manipulate lark grammars programatically?

That's part of what drove me down this path. I've got a use case where I'm playing with @MegaIng's lark2railroad, combined with my own recent contributions to railroad-diagrams, to generate documentation for small programs that use Lark to parse their input.

Is it to validate correct grammars?

That's another part of my motivation. As noted in Issue #1355, I attempted to use aliases on rule-alternates inside a rule-group, and received a weird result (from load_grammar.py, which as you implied, has great contextual error messages). It turns out, although neither lark.lark nor the documetation says so, that's invalid syntax. Now, I can simply avoid doing that again, but I try to contribute incremental improvements to open source projects that I use, so here we are :-)

I think the options that we are facing are:
...
What do you guys think?

I think having an inaccurate reference grammar for human purposes is almost worse than not having one at all. I think having a partial grammar for parsing purposes is somewhat useful, but will be surprising to anyone who tries to use it reliably. And I think better documentation might obviate the need for a perfect reference grammar, except for the cases like lark2railroad that are actually trying to parse Lark grammars from random authors.

I'm not going to be offended by however the decision turns out. If it's something I can help with, fine, if not, also fine. This isn't my project, I'm just a (so far very happy) user who has the skills to contribute a bit.

@erezsh
Copy link
Member

erezsh commented Feb 2, 2024

I think having an inaccurate reference grammar for human purposes

@RossPatterson Can you explain which option this refers to? I don't think I advocated at any point for an inaccurate grammar, only a non-validating one

which as you implied, has great contextual error messages

The load_grammar error messages are definitely better than the generic lark error messages. If I overlooked or missed something, it's not proof that the approach is bad

Do you have any objections to option 2 that I suggested? Do you think it doesn't serve your efforts as well as option 3? Do you think it's more confusing?

Frankly, I don't accept or reject PR based on who will be offended. I'm just trying to make Lark as good as it can be. (with the limited time that I can spare for it)

@RossPatterson
Copy link
Author

I think having an inaccurate reference grammar for human purposes

@RossPatterson Can you explain which option this refers to? I don't think I advocated at any point for an inaccurate grammar, only a non-validating one

Sorry, I misunderstood.

The load_grammar error messages are definitely better than the generic lark error messages. If I overlooked or missed something, it's not proof that the approach is bad

No, in fact, like I said, it's great. The load_grammar.py code is hard to understand, but the error messages are great.

Do you have any objections to option 2 that I suggested? Do you think it doesn't serve your efforts as well as option 3? Do you think it's more confusing?

No, I have no objections to option 2. My efforts will be served as long as lark.lark can parse legitimate Lark grammars, and as long as I know the things I shouldn't do.

Frankly, I don't accept or reject PR based on who will be offended. I'm just trying to make Lark as good as it can be. (with the limited time that I can spare for it)

I understand completely. I was trying to say that I'm OK with whatever you decide to do.

@erezsh
Copy link
Member

erezsh commented Feb 5, 2024

@MegaIng Any objections to option 2? i.e. keep lark.lark non-validating and pair it with a visitor that performs the same extra validation as load_grammar?

@MegaIng
Copy link
Member

MegaIng commented Feb 5, 2024

Currently no, I think that is the ideal solution. Preferably the same visitor is used both internally and externally and is the only source of GrammerErrors in load_grammer to make sure that everything is in sync.

@erezsh
Copy link
Member

erezsh commented Feb 5, 2024

Preferably the same visitor is used both internally and externally and is the only source of GrammerErrors

That seems a bit optimistic, but I agree it would be nice if we could get there.

@RossPatterson Would you be okay adapting your PR to solution 2?

That means getting lark.lark as close as possible to load_grammar, and adding a validating visitor for the rest (or somehow using the one in load_grammar)

If you'd like, we can split this PR to two, one for the docs, which I think we all agree can be merged already as is (or with tiny changes), and a separate one for lark.lark. But it's up to you.

@MegaIng
Copy link
Member

MegaIng commented Feb 5, 2024

That means getting lark.lark as close as possible to load_grammar

To note: lark.lark should probably still use ebnf and not restrict itself to bnf like the hardcoded grammar does, and IMO it's ok to change the grammar in load_grammar a bit if it makes stuff easier to sync up (if it doesn't have a performance hit and ofcourse doesn't break anything)

@RossPatterson
Copy link
Author

@RossPatterson Would you be okay adapting your PR to solution 2?

That means getting lark.lark as close as possible to load_grammar, and adding a validating visitor for the rest (or somehow using the one in load_grammar)

Yup, I can do that. I've already had to do the analysis to find the differences, and to fix them. So it looks like I'll just be backing out some of the fixes, and creating their equivalent in a visitor.

If you'd like, we can split this PR to two, one for the docs, which I think we all agree can be merged already as is (or with tiny changes), and a separate one for lark.lark. But it's up to you.

I'm fine with keeping them together.

@RossPatterson
Copy link
Author

To note: lark.lark should probably still use ebnf and not restrict itself to bnf like the hardcoded grammar does,

I'm personally biased in favor of BNF, because it's so well defined, whereas "EBNF" means any of a variety of possibilities. But I agree, lark.lark should continue to be in the Lark variant of EBNF.

IMO it's ok to change the grammar in load_grammar a bit if it makes stuff easier to sync up (if it doesn't have a performance hit and ofcourse doesn't break anything)

I don't anticipate that being necessary. But I've been wrong before :-)

@RossPatterson
Copy link
Author

I sat down tonight to start implementing option 2, and quickly found myself categorizing most of my changes (at the top of this PR) as "making lark.lark accept the same syntax as load_grammar", and therefore I was planning not to revert them. In fact, when I was done, the only changes I felt weren't either adding omitted directives or removing illegal syntax were three of the last four:

  • 7 Tokens can no longer include rules.
  • 9 Rule modifiers are no longer allowed in rule references (e.g., rule1: !?rule2)
  • 10 Rule modifiers are now allowed to be ?!, in addition to the existing !, ?, and !?.

I somehow doubt this is what either of you expect.

I think the changes that correct the %extend and %override syntax (4, 5, and 8) should obviously be retained. The changes that correct the token rule syntax (1 and 6) seem appropriate to keep as well. The rule aliases change (2), %ignore syntax (3) and tokens-can't-include-rules (7) are on the fence, and my personal feeling is that 2 and 3 should be kept and 7 should be reverted and moved to the new visitor. I'm conflicted on rule references (9) - it seems like it's a syntax issue, but it's only implied by the RULE terminal definition, it isn't explicitly stated in any rule. So I'm inclined to revert it an move it to the visitor. I think the last one, the query-bang rule modifier (10), is a real nit, and ought to be in the visitor instead.

So, before I go off and start messing things up, can you tell me if I'm reading things right or wrong, and if wrong, how so?

@MegaIng
Copy link
Member

MegaIng commented Feb 7, 2024

@RossPatterson I think you misunderstood what we (at least I) meant with option 2. At the top of load_grammar.py, there is a BNF + terminals description of the exact grammar (the syntax) that is parsed by the file. The idea is that lark.lark should exactly match that, ignoring every other post processing step (the semantics) done in load_grammar.py. Then we also provide an extra visitor somewhere that does all the other validation (i.e. the semantic checks). If you think it's reasonable to move stuff from semantics to syntax, then you can do that, but please that also do the move in the BNF grammar defined in load_grammar.py.

@RossPatterson
Copy link
Author

I reverted all my changes to lark.lark, and did what @MegaIng suggested - made it do exactly what the BNF in load_grammar.py does.

  1. In rule, split out the modifiers to rule_modifiers.
  2. In token, removed token_params.
  3. In statement, added %extend.
  4. In statement, added %override token.
  5. In alias, the RULE usage now does not accept ! and ? modifiers (see RULE_MODIFIERS below).
  6. In value > template_usage, changed name to RULE, to prevent accepting TOKENs.
  7. In RULE, split out the rule modifiers to RULE_MODIFIERS.

I built the validating visitor, lark_validator_vistor.py::LarkValidatorVisitor. It is invoked by a class-level validate(tree) method. It checks:

  1. That tokens don't have aliases.
  2. That %ignore has only one token specified. Note that load_grammar doesn't do this, you get unexpected results from %ignore A B.
  3. That rules don't have aliass on "inner expansions". Note that load_grammar doesn't do this, you get unexpected results instead.

There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.

I also moved the doc for templates in grammar.md from the "Terminals" section to the "Rules" section, since you can't use templates with terminals, and left behind the "Templates are not allowed with terminals." part

Some notes:

  1. lark.lark inlines several rules that load_grammar does not. I haven't changed this, because it will change trees produced by lark.lark, but I think I should.
    • expansions
    • expansion
    • value
  2. load_grammar inlines one rule that lark.lark does not. I haven't changed this, because it will change trees produced by lark.lark, but I think I should.
    • name
  3. load_grammar defines OP as [+*]|[?](?![a-z_]), while lark.lark omits the underscore in the not-followed-by set. I assume load_grammar does this to help in recognizing the difference between atom? and ?rulename:, but it's a little too arcane for me to be sure.
  4. load_grammar defines RULE_MODIFIERS as (!|![?]?|[?]!?)(?=[_a-z]). Why does it use both ! and ![?]?? Why not just ![?]??
  5. lark.lark imports ESCAPED_STRING from common.lark as _STRING, while load_grammar, of course, defines it itself. They're expressed differently, but I believe they accept the same set of values. I have left them both as they were originally.
  6. load_grammar defines STRING as "(\\"|\\\\|[^"\n])*?"i?. Why use *?? Isn't that the same as *?
  7. lark.lark imports SIGNED_INT from common.lark as NUMBER, and WS_INLINE from common.lark, while load_grammar, of course, defines them itself. They're expressed differently, but they accept the same set of values. I have left them both as they were originally.
  8. lark.lark and load_grammar define COMMENT slightly differently, but the resulting regexp is identical. I have left them both as they were originally.

@MegaIng
Copy link
Member

MegaIng commented Feb 9, 2024

Very nice, this is about what I imagined :-) Thank you for going through this effort.

I haven't changed this, because it will change trees produced by lark.lark, but I think I should.

I think that's fine. I don't think there are too many people out there relying on the details of that grammar, and I think we changed it before without warning.

lark.lark inlines several rules that load_grammar does not.

alias and expr are also never inlined by lark.lark despite having a ? because the [] makes it so that we always have a None as a second child. I think this should both be changed to use ()? instead and be inlined. IMO that is more inline with that would be expected from the parsing, matches load_grammar better and is probably? easier to work with. But it would require changes in the validator you wrote.

It might make sense to write an Interpreter instead of a Visitor which allows slighter better control about what happens when.

There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.

We should have a doc page with a reference list for our "stdlib" grammars, but that can be a different PR. An example using the validator might be nice.

Copy link
Member

@MegaIng MegaIng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With regard to tests, also see that we are already testing lark.lark:

# Check that the `lark.lark` grammar represents can parse every example used in these tests.

It might make sense to extend the tests there to include LarkValidatorVisitor and make sure that both also raise the same error messages although it might be hard to check that we don't expect grammar analysis errors. I don't think earley should be raising any? But I might be wrong with that.

docs/grammar.md Outdated Show resolved Hide resolved
docs/grammar.md Outdated Show resolved Hide resolved
docs/grammar.md Show resolved Hide resolved
tests/test_lark_lark.py Outdated Show resolved Hide resolved
lark/grammars/lark.lark Outdated Show resolved Hide resolved
@RossPatterson
Copy link
Author

RossPatterson commented Feb 10, 2024

I haven't changed this, because it will change trees produced by lark.lark, but I think I should.

I think that's fine. I don't think there are too many people out there relying on the details of that grammar, and I think we changed it before without warning.

OK, I removed inlining from expansions, expansion, and value in lark.lark.

Fixed by 2ec5ef3.

@RossPatterson
Copy link
Author

It might make sense to write an Interpreter instead of a Visitor which allows slighter better control about what happens when.

The Visitor model seems to fit well for the validator. By its nature, it finds all the tokens, aliases, and ignores, which I'd have to write code to hunt down if I switched to Interpreter. Is there a subtlty I'm missing here?

@RossPatterson
Copy link
Author

There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.

We should have a doc page with a reference list for our "stdlib" grammars, but that can be a different PR. An example using the validator might be nice.

If someone wants to write that page, I'll certainly write of an example for the validator.

@erezsh
Copy link
Member

erezsh commented Feb 10, 2024

@RossPatterson Nice work. Let me know once you've cleaned it up a bit, and I'll give it another review.

I'll see if I find the time to write that doc page. I imagine it should be pretty short and straight-forward.

@RossPatterson
Copy link
Author

alias and expr are also never inlined by lark.lark despite having a ? because the [] makes it so that we always have a None as a second child. I think this should both be changed to use ()? instead and be inlined. IMO that is more inline with that would be expected from the parsing, matches load_grammar better and is probably? easier to work with. But it would require changes in the validator you wrote.

Fixing the validator was pretty easy. Fixed in e9c026e.

@RossPatterson
Copy link
Author

With regard to tests, also see that we are already testing lark.lark:

# Check that the `lark.lark` grammar represents can parse every example used in these tests.

It might make sense to extend the tests there to include LarkValidatorVisitor and make sure that both also raise the same error messages although it might be hard to check that we don't expect grammar analysis errors. I don't think earley should be raising any? But I might be wrong with that.

I'm working on blending my test_lark_lark.py and test_grammar_formal.py into test_parser.py and test_grammar.py. I'm using the technique used in test_parser.py for running a set of tests against multiple parsers.

@RossPatterson
Copy link
Author

I'm working on blending my test_lark_lark.py and test_grammar_formal.py into test_parser.py and test_grammar.py. I'm using the technique used in test_parser.py for running a set of tests against multiple parsers.

While doing that, I found that neither the original lark.lark nor my version pass test_grammar.py testcase test_line_breaks. Apparently load_grammar.py accepts escaping of newlines (i.e., backslash-newline) to continue a statement. If I'm following things correctly, that's because load_grammar includes BACKSLASH in its ignore-set, while lark.lark does not.

Fixed in 7f02bd1.

@RossPatterson
Copy link
Author

I'll be offline for the next few weeks, but I'll wrap this up after I return in mid-March.

@RossPatterson
Copy link
Author

While testing and verifying my changes, I realized there are a few more differences between load_grammar.py and lark.lark, which result in differently-shaped parse trees. Specifically:

  1. The load_grammar.py version of rule pushes the optionality of rule_modifiers and prioritydown into rule_modifiers and priority, while lark.lark does not. That means the load_grammar.py tree always contains nodes for rule_modifiers and priority node, while the lark.lark tree may not. I'm going to change lark.lark to do what load_grammar.py does, because the validator is affected by this difference. FYI, term does not do the same thing with priority as rule does. Both load_grammar.py and lark.lark agree that the priority is optional in term.
  2. load_grammar.py calls token names TERMINAL, and calls token definitions term, while lark.lark uses TOKEN and token. I'm not going to change this, because it seems like a big change.
  3. The load_grammar.py version of value splits the recognition of RULE and TERMINAL into nonterminal and terminal, while lark.lark uses the combined name (which exists in both grammars). I'm not going to change this, because leaving 2 above intact means this will be different anyway.
  4. In load_grammar.py, name is inlined, while in lark.lark, it is not. I don't know what to do about this, if anything.

@RossPatterson
Copy link
Author

  1. The load_grammar.py version of rule pushes the optionality of rule_modifiers and prioritydown into rule_modifiers and priority, while lark.lark does not.

FIxed by 4f7a5eb.

@RossPatterson
Copy link
Author

@erezsh @MegaIng
Can one of you take a look at test_grammar.TestGrammar.test_undefined_term? It expects a GrammarError, but the error that is actually raised seems wrong. It says Rule 'A' used but not defined (in rule start)", but the entire grammar is just start: A. Shouldn't it raise "Terminal 'A' ..." instead?

@MegaIng
Copy link
Member

MegaIng commented Mar 15, 2024

@erezsh @MegaIng Can one of you take a look at test_grammar.TestGrammar.test_undefined_term? It expects a GrammarError, but the error that is actually raised seems wrong. It says Rule 'A' used but not defined (in rule start)", but the entire grammar is just start: A. Shouldn't it raise "Terminal 'A' ..." instead?

That's a bug that was introducded in #1018 that we both missed, specfically see here:

lark/lark/load_grammar.py

Lines 1367 to 1369 in 7061908

for sym in _find_used_symbols(exp):
if sym not in self._definitions and sym not in params:
self._grammar_error(d.is_term, "{Type} '{name}' used but not defined (in {type2} {name2})", sym, name)

That shouldn't be d.is_term, that is referring to whether we are in the definition of a symbol/rule, but _grammar_error uses it to refer to the name itself.

@RossPatterson
Copy link
Author

That's a bug that was introducded in #1018 that we both missed, specfically see here:
...
That shouldn't be d.is_term, that is referring to whether we are in the definition of a symbol/rule, but _grammar_error uses it to refer to the name itself.

Fixed in 40576d2.

@RossPatterson
Copy link
Author

As of daac65d, everything I intended to do is done. @erezsh @MegaIng, I think this meets your expectations, and is ready for code review.

@RossPatterson
Copy link
Author

Is this work still of interest? I think it's ready for final review, and probably for merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lark grammar?
3 participants