Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully. #1388

RossPatterson · 2024-02-01T01:54:02Z

Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully.

All scenarios have tests in new file tests/test_lark_lark.py. Each test passed load_grammar.py and failed lark.lark before these changes, and passed both after. All existing un-skipped unit tests continue to pass after these changes. The skipped Earley tests have not been checked.
Updates to the formal Lark grammar (lark/grammars/lark.lark) to make it match the actual parser (lark/load_grammar.py)

Token aliases (e.g., NAME_1 -> NAME_2) are no longer allowed.
- load_grammar.py appears to accept them, but forces the new name to be a rule-name.
- tests/test_grammar.py.test_alias_in_terminal says they should raise a GrammarError.
- Making this change necessitated splitting the shared expansions rule, and all the rules it calls, into rule_expansions and token_expansions (and similarly named rules for them to call).
Rule aliases (name_1 -> name_2) are now only valid for alternates at the top-level of a rule.
- load_grammar.py accepts them at all levels, but as Issue GrammarError: Rule 'anycase' used but not defined (in rule pipesyn) #1355 points out, they are not supported below the top level.
The item in a %ignore directive must now be a single token.
- load_grammar.py accepts multiple tokens, but only processes the first.
The %extend directive is now defined for rules.
The %extend directive is now defined for tokens.
Token templates are no longer allowed (at least until Issue Token templates #555 is resolved).
Tokens can no longer include rules.
The %override directive now supports overriding token definitions.
Rule modifiers are no longer allowed in rule references (e.g., rule1: !?rule2)
Rule modifiers are now allowed to be ?!, in addition to the existing !, ?, and !?.

Several tests in tests/tests/test_grammar.py have been copied to new file tests/test_grammar_formal.py and updated to parse using lark.lark instead of load_grammar.py. Only the tests that verify how the Lark grammar is parsed could be implemented. Those that verify how a Lark-generated parser operates could not, because there's no (easy?) way to create a parser from the parse tree.
Oddities not changed:
- Literal ranges are defined in both lark.lark and load_grammar.py as STRINGs, but must actually be single characters.
  - load_grammar.py initially accepts multi-character strings, but later rejects them.
  - Attempting to replace STRING ".." STRING in lark.lark with /./ ".." /./ failed, as Lark accepts Unicode escapes as strings (e.g. test_unicode_literal_range_escape2 uses "\U0000FFFF".."\U00010002").
  - Attempting to replace STRING ".." STRING in lark.lark with /.|"\\U[0-9a-fA-F]+"/ ".." /.|"\\U[0-9a-fA-F]+"/ failed, causing lots of unrelated tests to fail.
- load_grammar.py does not allow STRING tokens to be empty (i.e., ""). There is no (easy?) way to prevent that in lark.lark. The obvious answer, adding /(?<!"")/, broke tests/test_parser.py._TestParser.test_backslash2.

…ammar.md document it more fully.

erezsh

Here are some suggestions and questions I have regarding your PR. I tried to get everything down but I might think of more after you address them.

docs/grammar.md

erezsh · 2024-02-01T11:15:18Z

docs/grammar.md

 ### %override

 Override a rule or terminals, affecting all references to it, even in imported grammars.

 Useful for implementing an inheritance pattern when importing grammars.

+**Syntax:**
+```html
+%override <TERMINAL> ... terminal definition ...


Not accurate..

https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L162-L165 and https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L1260-L1262 make it look like you can override a terminal, by supplying the full definition, preceded by %override. What did I misunderstand?

Again, if the description was less words and more productions, I'd have just written <override> ::= "%override" (<rule> | <term>).

Sorry, I should have been more specific.

The syntax is %override <terminal definition>, or %override <TERMINAL> : <items>

Either way, but the way you write it is confusing, and would make the reader think the syntax is different.

Fixed by 83a374f.

erezsh · 2024-02-01T11:15:46Z

docs/tree_construction.md

@@ -74,6 +74,7 @@ Lark will parse "((hello world))" as:
 The brackets do not appear in the tree by design. The words appear because they are matched by a named terminal.


+<a name="shaping_the_tree"></a>


This should be automatic, no?

I'd hope so, but I saw you used a similar construct at https://github.com/lark-parser/lark/blob/master/docs/grammar.md?plain=1#L242, and Sphinx ain't one of my strong suits, so I assumed an HTML anchor was necessary.

Funny, I'm really not sure why it's there. But as you can see, it isn't needed: https://lark-parser.readthedocs.io/en/stable/grammar.html#directives

Fixed by 83a374f.

tests/test_grammar_formal.py

erezsh · 2024-02-01T11:21:24Z

tests/test_grammar_formal.py

+        # '%ignore expects a value', '%ignore %import\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..__ANON_2., .%import..', l.parse, '%ignore %import\n')
+
+    # def test_empty_literal(self):


Why all the commented out parts?

I wanted it to be clear which test_grammar tests weren't implementable for lark.lark. Sort of a "show your work" thing. But I can remove them.

Fixed in 9493f81.

I appreciate your aim for completeness, but I think it was more confusing than anything.

If you think it's important information, perhaps just write a list of the names, without Python syntax. like

# List of non-testable constructs: empty literal, kraken, foobar, etc.

erezsh · 2024-02-01T11:24:45Z

lark/grammars/lark.lark


 name: RULE
    | TOKEN

 _VBAR: _NL? "|"
 OP: /[+*]|[?](?![a-z])/
-RULE: /!?[_?]?[a-z][_a-z0-9]*/
+RULE: /_?[a-z][_a-z0-9]*/
+RULE_MODIFIERS: /!|![?](?=[a-z])|[?]!?(?=[a-z])/


Why do you need the ?=s here? I don't recall that ? or ! are valid anywhere else in the grammar..?

Yeah, I originally had a comment explaining that, but it would have been the only part of the entire grammar that had commentary, so I removed it. The ? rule modifier is not allowed on a rule whose name begins with _ (https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L1057-L1059), and that bit or regex blocks that usage. So maybe I should put the comment back in.

Oh I see. But wouldn't users be able to bypass it by adding whitespace between them?

Drat! I missed a test case. But no, this test fails similarly in load_grammar.py, the original lark.lark, and my modified version:

rule1: rule2 ?! rule2: "a" """

load_grammar.py:

Traceback (most recent call last): File "C:\Ross\Source\lark\lark\parsers\lalr_parser_state.py", line 77, in feed_token action, arg = states[state][token.type] KeyError: 'OP' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "C:\Ross\Source\lark\tests\test_lark_lark.py", line 169, in test_11_rule_modifier_query_bang_space_lg Lark(g) ... File "C:\Ross\Source\lark\lark\lexer.py", line 674, in lex raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name) lark.exceptions.UnexpectedToken: Unexpected token Token('OP', '?') at line 3, column 13. Expected one of: * "%declare" * "%import" * "%extend" * RULE * "%override" * _NL * $END * "%ignore" * TOKEN * RULE_MODIFIERS Previous tokens: [Token('_NL', '\n ')]

Original lark.lark:

Traceback (most recent call last): File "C:\Ross\Source\lark\lark\lexer.py", line 665, in lex yield lexer.next_token(lexer_state, parser_state) File "C:\Ross\Source\lark\lark\lexer.py", line 598, in next_token raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column, lark.exceptions.UnexpectedCharacters: No terminal matches '?' in the current parser context, at line 3 col 13 ?! rule2: "a" ^ ... During handling of the above exception, another exception occurred: Traceback (most recent call last): File "C:\Ross\Source\lark\tests\test_lark_lark.py", line 176, in test_11_rule_modifier_query_bang_space_ll self.lark_parser.parse(g) ... File "C:\Ross\Source\lark\lark\lexer.py", line 674, in lex raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name) lark.exceptions.UnexpectedToken: Unexpected token Token('OP', '?') at line 3, column 13. Expected one of: * _NL * "%import" * "%ignore" * "%override" * TOKEN * $END * "%declare" * RULE Previous tokens: [Token('_NL', '\n ')]

Modified lark.lark:

Traceback (most recent call last): File "C:\Ross\Source\lark\lark\lexer.py", line 665, in lex yield lexer.next_token(lexer_state, parser_state) File "C:\Ross\Source\lark\lark\lexer.py", line 598, in next_token raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column, lark.exceptions.UnexpectedCharacters: No terminal matches '?' in the current parser context, at line 3 col 13 ?! rule2: "a" ^ ... During handling of the above exception, another exception occurred: Traceback (most recent call last): File "C:\Ross\Source\lark\tests\test_lark_lark.py", line 176, in test_11_rule_modifier_query_bang_space_ll self.lark_parser.parse(g) ... File "C:\Ross\Source\lark\lark\lexer.py", line 674, in lex raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name) lark.exceptions.UnexpectedToken: Unexpected token Token('OP', '?') at line 3, column 13. Expected one of: * "%declare" * RULE * RULE_MODIFIERS * $END * "%ignore" * "%extend" * "%import" * TOKEN * _NL * "%override" Previous tokens: [Token('_NL', '\n ')]

erezsh · 2024-02-01T11:27:53Z

lark/grammars/lark.lark

-?atom: "(" expansions ")"
-     | "[" expansions "]" -> maybe
-     | value
+?token_atom: "(" token_expansions ")"


token_atom and rule_atom are the same, why not merge them using templates? Same with rule_expr and atom_expr, rule_expansion etc.

The problem is that the original lark.lark can derive atom -> value -> template_usage, but templates aren't allowed for terminals (pending resolution of Issue #555). I didn't see any way, with templates, to prevent that derivation. As far as I can tell, templates are effectively a text-substitution macro capability (although I understand the implementation is a little more sophisticated).

I suppose I could have done something like this:

_expansion{expr_type}: expr_type* _expr{atom_type}: atom_type [OP | "~" NUMBER [".." NUMBER]] _atom{expansion_type, value_type}: "(" expansion_type ")" | "[" expansion_type "]" -> maybe | value_type ?rule_expansions: rule_alias (_VBAR rule_alias)* ?rule_inner_expansions: rule_expansion (_VBAR rule_expansion)* ?rule_alias: rule_expansion ["->" RULE] ?rule_expansion: _expansion(rule_expr) ?rule_expr: _expr(rule_atom) ?rule_atom: _atom(rule_inner_expansions, rule_value) ?rule_value: RULE "{" rule_value ("," rule_value)* "}" -> rule_template_usage | RULE | value ?token_expansions: token_expansion (_VBAR token_expansion)* ?token_expansion: _expansion(token_expr) ?token_expr: _expr(token_atom) ?token_atom: _atom(token_expansions, token_value) ?value: STRING ".." STRING -> literal_range | TOKEN | (REGEXP | STRING) -> literal

Is it what you're suggesting? It doesn't read as cleanly to my eyes, but I'll grant that I've been studying Lark pretty hard lately.

Before we dig into that, let's come to an agreement in the main conversation (regarding restrictive-validating grammar vs relaxed grammar + validating visitor).

Afterwards if it's still relevant, I will try to craft a solution to this problem.

MegaIng · 2024-02-01T18:18:21Z

My general opinion is that lark.lark should reflect the syntax as defined by the BNF inside of load_grammar.py and not the semantic checks implemented on top of that (and therefore most of this PR shouldn't be done in it's current form). But if @erezsh is of a different mind, I am not going to argue too much. I also wouldn't be opposed to making the syntax specfified in load_grammar.py more restrictive and therefore removing semantic checks in other parts of the code. My "dream" would be that the BNF hard coded in load_grammar.py could be generated from lark.lark and we only have one source of truth. Further diverging the two grammars is IMO not a good idea.

RossPatterson · 2024-02-01T18:34:06Z

My general opinion is that lark.lark should reflect the syntax as defined by the BNF inside of load_grammar.py and not the semantic checks implemented on top of that (and therefore most of this PR shouldn't be done in it's current form).

I see your point, but I think for anyone who isn't writing code in the Lark codebase, lark.lark is a much more accessible definition of the Lark grammar than the implementation in load_grammar.py. Some of the things I noted (e.g., literal_range being STRINGs) sounds more to me like syntax than semantics, but that may just be bike-shedding.

My "dream" would be that the BNF hard coded in load_grammar.py could be generated from lark.lark and we only have one source of truth.

Mine as well. I was actually quite surprised at how different they were. It was your comments in Issue #1355 that pointed me down this path.

Further diverging the two grammars is IMO not a good idea.

I guess I see it more as specifying the same grammar in two forms, not two grammars. But you're right about semantics - there are some things (e.g., my comments about literal_range) that Lark can't easily express today, and that require post-processing.

RossPatterson · 2024-02-01T18:43:31Z

I missed two lines that I should have deleted in the test files, which I used for before/after testing (... 'grammars/lark.lark-ORIG' ...). I'll remove them.

@erezsh

…rk_lark.py (trailing whitespace in test_06_*()) 2. Remove what was left of "Literals can be one of: ..." under "Terminals" in grammar.md. 3. Address @erezsh's point about inlined terminals under "Terminals" in grammar.md. 4. Remove "lark.lark-ORIG" references in test_lark_lark.py and test_grammar_formal.py that shouldn't have been pushed. 5. Address @erezsh's point about f.readlines() in test_grammar_formal.py. 6. Address @erezsh's point commented-out tests in test_grammar_formal.py.

RossPatterson · 2024-02-01T19:05:20Z

I missed two lines that I should have deleted in the test files, which I used for before/after testing (... 'grammars/lark.lark-ORIG' ...). I'll remove them.

Fixed in 9493f81.

erezsh · 2024-02-02T13:13:26Z

@MegaIng @RossPatterson I can see both points of view, and I think the determining factor should be practicality above all else.

In an ideal world, load_grammar would only preload a minimal parser, that would parse lark.lark and use it to create the Lark parser that supports the full syntax. That would be the cleanest approach imho. But it would also be slower, probably in a very noticeable way. (did we ever do this experiment?)

As for the BNF in load_grammar, it is non-restrictive on purpose. By moving the validation from the parser to the post-parsing code, we can perform more sophisticated validation, and provide custom error messages that are more helpful and friendly.

As for what to do with lark.lark, in my mind it's still an open question, and depends heavily on what are its main uses. Is it to parse and manipulate lark grammars programatically? Is it to validate correct grammars? Is it to be used in a lark editor? etc.

I think the options that we are facing are:

Leave lark.lark as a non-validating grammar, that is still capable of parsing every correct lark grammar
Same as 1, but pair it with an "official" visitor that performs extra validation with better error messages
Make lark.lark as restrictive as possible, to match load_grammar

I'm leaning towards (2), because I'm can't really think of good use-cases for 3, except than it being a "reference grammar" for human eyes. But I don't know if that's a good enough reason, especially since that can be accomplished using better docs, or with comments within lark.lark

What do you guys think?

(needless to say, regardless of what we choose, I'm all for improving the documentation regarding the grammar)

RossPatterson · 2024-02-02T15:57:53Z

As for what to do with lark.lark, in my mind it's still an open question, and depends heavily on what are its main uses. Is it to parse and manipulate lark grammars programatically?

That's part of what drove me down this path. I've got a use case where I'm playing with @MegaIng's lark2railroad, combined with my own recent contributions to railroad-diagrams, to generate documentation for small programs that use Lark to parse their input.

Is it to validate correct grammars?

That's another part of my motivation. As noted in Issue #1355, I attempted to use aliases on rule-alternates inside a rule-group, and received a weird result (from load_grammar.py, which as you implied, has great contextual error messages). It turns out, although neither lark.lark nor the documetation says so, that's invalid syntax. Now, I can simply avoid doing that again, but I try to contribute incremental improvements to open source projects that I use, so here we are :-)

I think the options that we are facing are:
...
What do you guys think?

I think having an inaccurate reference grammar for human purposes is almost worse than not having one at all. I think having a partial grammar for parsing purposes is somewhat useful, but will be surprising to anyone who tries to use it reliably. And I think better documentation might obviate the need for a perfect reference grammar, except for the cases like lark2railroad that are actually trying to parse Lark grammars from random authors.

I'm not going to be offended by however the decision turns out. If it's something I can help with, fine, if not, also fine. This isn't my project, I'm just a (so far very happy) user who has the skills to contribute a bit.

erezsh · 2024-02-02T16:27:07Z

I think having an inaccurate reference grammar for human purposes

@RossPatterson Can you explain which option this refers to? I don't think I advocated at any point for an inaccurate grammar, only a non-validating one

which as you implied, has great contextual error messages

The load_grammar error messages are definitely better than the generic lark error messages. If I overlooked or missed something, it's not proof that the approach is bad

Do you have any objections to option 2 that I suggested? Do you think it doesn't serve your efforts as well as option 3? Do you think it's more confusing?

Frankly, I don't accept or reject PR based on who will be offended. I'm just trying to make Lark as good as it can be. (with the limited time that I can spare for it)

RossPatterson · 2024-02-02T22:29:42Z

I think having an inaccurate reference grammar for human purposes

@RossPatterson Can you explain which option this refers to? I don't think I advocated at any point for an inaccurate grammar, only a non-validating one

Sorry, I misunderstood.

The load_grammar error messages are definitely better than the generic lark error messages. If I overlooked or missed something, it's not proof that the approach is bad

No, in fact, like I said, it's great. The load_grammar.py code is hard to understand, but the error messages are great.

Do you have any objections to option 2 that I suggested? Do you think it doesn't serve your efforts as well as option 3? Do you think it's more confusing?

No, I have no objections to option 2. My efforts will be served as long as lark.lark can parse legitimate Lark grammars, and as long as I know the things I shouldn't do.

Frankly, I don't accept or reject PR based on who will be offended. I'm just trying to make Lark as good as it can be. (with the limited time that I can spare for it)

I understand completely. I was trying to say that I'm OK with whatever you decide to do.

erezsh · 2024-02-05T04:07:22Z

@MegaIng Any objections to option 2? i.e. keep lark.lark non-validating and pair it with a visitor that performs the same extra validation as load_grammar?

MegaIng · 2024-02-05T07:48:42Z

Currently no, I think that is the ideal solution. Preferably the same visitor is used both internally and externally and is the only source of GrammerErrors in load_grammer to make sure that everything is in sync.

erezsh · 2024-02-05T08:05:29Z

Preferably the same visitor is used both internally and externally and is the only source of GrammerErrors

That seems a bit optimistic, but I agree it would be nice if we could get there.

@RossPatterson Would you be okay adapting your PR to solution 2?

That means getting lark.lark as close as possible to load_grammar, and adding a validating visitor for the rest (or somehow using the one in load_grammar)

If you'd like, we can split this PR to two, one for the docs, which I think we all agree can be merged already as is (or with tiny changes), and a separate one for lark.lark. But it's up to you.

MegaIng · 2024-02-05T08:10:57Z

That means getting lark.lark as close as possible to load_grammar

To note: lark.lark should probably still use ebnf and not restrict itself to bnf like the hardcoded grammar does, and IMO it's ok to change the grammar in load_grammar a bit if it makes stuff easier to sync up (if it doesn't have a performance hit and ofcourse doesn't break anything)

RossPatterson · 2024-02-06T19:07:46Z

@RossPatterson Would you be okay adapting your PR to solution 2?

That means getting lark.lark as close as possible to load_grammar, and adding a validating visitor for the rest (or somehow using the one in load_grammar)

Yup, I can do that. I've already had to do the analysis to find the differences, and to fix them. So it looks like I'll just be backing out some of the fixes, and creating their equivalent in a visitor.

If you'd like, we can split this PR to two, one for the docs, which I think we all agree can be merged already as is (or with tiny changes), and a separate one for lark.lark. But it's up to you.

I'm fine with keeping them together.

RossPatterson · 2024-02-06T19:10:42Z

To note: lark.lark should probably still use ebnf and not restrict itself to bnf like the hardcoded grammar does,

I'm personally biased in favor of BNF, because it's so well defined, whereas "EBNF" means any of a variety of possibilities. But I agree, lark.lark should continue to be in the Lark variant of EBNF.

IMO it's ok to change the grammar in load_grammar a bit if it makes stuff easier to sync up (if it doesn't have a performance hit and ofcourse doesn't break anything)

I don't anticipate that being necessary. But I've been wrong before :-)

RossPatterson · 2024-02-07T02:06:39Z

I sat down tonight to start implementing option 2, and quickly found myself categorizing most of my changes (at the top of this PR) as "making lark.lark accept the same syntax as load_grammar", and therefore I was planning not to revert them. In fact, when I was done, the only changes I felt weren't either adding omitted directives or removing illegal syntax were three of the last four:

7 Tokens can no longer include rules.
9 Rule modifiers are no longer allowed in rule references (e.g., rule1: !?rule2)
10 Rule modifiers are now allowed to be ?!, in addition to the existing !, ?, and !?.

I somehow doubt this is what either of you expect.

I think the changes that correct the %extend and %override syntax (4, 5, and 8) should obviously be retained. The changes that correct the token rule syntax (1 and 6) seem appropriate to keep as well. The rule aliases change (2), %ignore syntax (3) and tokens-can't-include-rules (7) are on the fence, and my personal feeling is that 2 and 3 should be kept and 7 should be reverted and moved to the new visitor. I'm conflicted on rule references (9) - it seems like it's a syntax issue, but it's only implied by the RULE terminal definition, it isn't explicitly stated in any rule. So I'm inclined to revert it an move it to the visitor. I think the last one, the query-bang rule modifier (10), is a real nit, and ought to be in the visitor instead.

So, before I go off and start messing things up, can you tell me if I'm reading things right or wrong, and if wrong, how so?

MegaIng · 2024-02-07T02:26:26Z

@RossPatterson I think you misunderstood what we (at least I) meant with option 2. At the top of load_grammar.py, there is a BNF + terminals description of the exact grammar (the syntax) that is parsed by the file. The idea is that lark.lark should exactly match that, ignoring every other post processing step (the semantics) done in load_grammar.py. Then we also provide an extra visitor somewhere that does all the other validation (i.e. the semantic checks). If you think it's reasonable to move stuff from semantics to syntax, then you can do that, but please that also do the move in the BNF grammar defined in load_grammar.py.

…a visitor that enforces some of its restrictions.

RossPatterson · 2024-02-09T03:10:43Z

I reverted all my changes to lark.lark, and did what @MegaIng suggested - made it do exactly what the BNF in load_grammar.py does.

In rule, split out the modifiers to rule_modifiers.
In token, removed token_params.
In statement, added %extend.
In statement, added %override token.
In alias, the RULE usage now does not accept ! and ? modifiers (see RULE_MODIFIERS below).
In value > template_usage, changed name to RULE, to prevent accepting TOKENs.
In RULE, split out the rule modifiers to RULE_MODIFIERS.

I built the validating visitor, lark_validator_vistor.py::LarkValidatorVisitor. It is invoked by a class-level validate(tree) method. It checks:

That tokens don't have aliases.
That %ignore has only one token specified. Note that load_grammar doesn't do this, you get unexpected results from %ignore A B.
That rules don't have aliass on "inner expansions". Note that load_grammar doesn't do this, you get unexpected results instead.

There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.

I also moved the doc for templates in grammar.md from the "Terminals" section to the "Rules" section, since you can't use templates with terminals, and left behind the "Templates are not allowed with terminals." part

Some notes:

lark.lark inlines several rules that load_grammar does not. I haven't changed this, because it will change trees produced by lark.lark, but I think I should.
- expansions
- expansion
- value
load_grammar inlines one rule that lark.lark does not. I haven't changed this, because it will change trees produced by lark.lark, but I think I should.
- name
load_grammar defines OP as [+*]|[?](?![a-z_]), while lark.lark omits the underscore in the not-followed-by set. I assume load_grammar does this to help in recognizing the difference between atom? and ?rulename:, but it's a little too arcane for me to be sure.
load_grammar defines RULE_MODIFIERS as (!|![?]?|[?]!?)(?=[_a-z]). Why does it use both ! and ![?]?? Why not just ![?]??
lark.lark imports ESCAPED_STRING from common.lark as _STRING, while load_grammar, of course, defines it itself. They're expressed differently, but I believe they accept the same set of values. I have left them both as they were originally.
load_grammar defines STRING as "(\\"|\\\\|[^"\n])*?"i?. Why use *?? Isn't that the same as *?
lark.lark imports SIGNED_INT from common.lark as NUMBER, and WS_INLINE from common.lark, while load_grammar, of course, defines them itself. They're expressed differently, but they accept the same set of values. I have left them both as they were originally.
lark.lark and load_grammar define COMMENT slightly differently, but the resulting regexp is identical. I have left them both as they were originally.

MegaIng · 2024-02-09T04:12:35Z

Very nice, this is about what I imagined :-) Thank you for going through this effort.

I haven't changed this, because it will change trees produced by lark.lark, but I think I should.

I think that's fine. I don't think there are too many people out there relying on the details of that grammar, and I think we changed it before without warning.

lark.lark inlines several rules that load_grammar does not.

alias and expr are also never inlined by lark.lark despite having a ? because the [] makes it so that we always have a None as a second child. I think this should both be changed to use ()? instead and be inlined. IMO that is more inline with that would be expected from the parsing, matches load_grammar better and is probably? easier to work with. But it would require changes in the validator you wrote.

It might make sense to write an Interpreter instead of a Visitor which allows slighter better control about what happens when.

There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.

We should have a doc page with a reference list for our "stdlib" grammars, but that can be a different PR. An example using the validator might be nice.

MegaIng

With regard to tests, also see that we are already testing lark.lark:

lark/tests/test_parser.py

Line 1047 in 7646fb3

    
           # Check that the `lark.lark` grammar represents can parse every example used in these tests.

It might make sense to extend the tests there to include LarkValidatorVisitor and make sure that both also raise the same error messages although it might be hard to check that we don't expect grammar analysis errors. I don't think earley should be raising any? But I might be wrong with that.

docs/grammar.md

tests/test_lark_lark.py

lark/grammars/lark.lark

RossPatterson · 2024-02-10T00:24:21Z

I haven't changed this, because it will change trees produced by lark.lark, but I think I should.

I think that's fine. I don't think there are too many people out there relying on the details of that grammar, and I think we changed it before without warning.

OK, I removed inlining from expansions, expansion, and value in lark.lark.

Fixed by 2ec5ef3.

…alue in lark.lark'

RossPatterson · 2024-02-10T00:48:44Z

It might make sense to write an Interpreter instead of a Visitor which allows slighter better control about what happens when.

The Visitor model seems to fit well for the validator. By its nature, it finds all the tokens, aliases, and ignores, which I'd have to write code to hunt down if I switched to Interpreter. Is there a subtlty I'm missing here?

RossPatterson · 2024-02-10T00:50:02Z

There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.

We should have a doc page with a reference list for our "stdlib" grammars, but that can be a different PR. An example using the validator might be nice.

If someone wants to write that page, I'll certainly write of an example for the validator.

erezsh · 2024-02-10T12:08:32Z

@RossPatterson Nice work. Let me know once you've cleaned it up a bit, and I'll give it another review.

I'll see if I find the time to write that doc page. I imagine it should be pretty short and straight-forward.

… they can be inlined.

RossPatterson · 2024-02-10T18:11:25Z

alias and expr are also never inlined by lark.lark despite having a ? because the [] makes it so that we always have a None as a second child. I think this should both be changed to use ()? instead and be inlined. IMO that is more inline with that would be expected from the parsing, matches load_grammar better and is probably? easier to work with. But it would require changes in the validator you wrote.

Fixing the validator was pretty easy. Fixed in e9c026e.

…with a nice message.

RossPatterson · 2024-02-13T20:18:00Z

With regard to tests, also see that we are already testing lark.lark:

lark/tests/test_parser.py

Line 1047 in 7646fb3

# Check that the `lark.lark` grammar represents can parse every example used in these tests.

It might make sense to extend the tests there to include LarkValidatorVisitor and make sure that both also raise the same error messages although it might be hard to check that we don't expect grammar analysis errors. I don't think earley should be raising any? But I might be wrong with that.

I'm working on blending my test_lark_lark.py and test_grammar_formal.py into test_parser.py and test_grammar.py. I'm using the technique used in test_parser.py for running a set of tests against multiple parsers.

…_grammar does.

RossPatterson · 2024-02-13T20:24:06Z

I'm working on blending my test_lark_lark.py and test_grammar_formal.py into test_parser.py and test_grammar.py. I'm using the technique used in test_parser.py for running a set of tests against multiple parsers.

While doing that, I found that neither the original lark.lark nor my version pass test_grammar.py testcase test_line_breaks. Apparently load_grammar.py accepts escaping of newlines (i.e., backslash-newline) to continue a statement. If I'm following things correctly, that's because load_grammar includes BACKSLASH in its ignore-set, while lark.lark does not.

Fixed in 7f02bd1.

RossPatterson · 2024-02-16T20:01:28Z

I'll be offline for the next few weeks, but I'll wrap this up after I return in mid-March.

RossPatterson · 2024-03-15T01:25:25Z

While testing and verifying my changes, I realized there are a few more differences between load_grammar.py and lark.lark, which result in differently-shaped parse trees. Specifically:

The load_grammar.py version of rule pushes the optionality of rule_modifiers and prioritydown into rule_modifiers and priority, while lark.lark does not. That means the load_grammar.py tree always contains nodes for rule_modifiers and priority node, while the lark.lark tree may not. I'm going to change lark.lark to do what load_grammar.py does, because the validator is affected by this difference. FYI, term does not do the same thing with priority as rule does. Both load_grammar.py and lark.lark agree that the priority is optional in term.
load_grammar.py calls token names TERMINAL, and calls token definitions term, while lark.lark uses TOKEN and token. I'm not going to change this, because it seems like a big change.
The load_grammar.py version of value splits the recognition of RULE and TERMINAL into nonterminal and terminal, while lark.lark uses the combined name (which exists in both grammars). I'm not going to change this, because leaving 2 above intact means this will be different anyway.
In load_grammar.py, name is inlined, while in lark.lark, it is not. I don't know what to do about this, if anything.

…ers and priority, like load_grammar does

RossPatterson · 2024-03-15T01:28:16Z

The load_grammar.py version of rule pushes the optionality of rule_modifiers and prioritydown into rule_modifiers and priority, while lark.lark does not.

FIxed by 4f7a5eb.

RossPatterson · 2024-03-15T01:42:27Z

@erezsh @MegaIng
Can one of you take a look at test_grammar.TestGrammar.test_undefined_term? It expects a GrammarError, but the error that is actually raised seems wrong. It says Rule 'A' used but not defined (in rule start)", but the entire grammar is just start: A. Shouldn't it raise "Terminal 'A' ..." instead?

MegaIng · 2024-03-15T04:19:51Z

@erezsh @MegaIng Can one of you take a look at test_grammar.TestGrammar.test_undefined_term? It expects a GrammarError, but the error that is actually raised seems wrong. It says Rule 'A' used but not defined (in rule start)", but the entire grammar is just start: A. Shouldn't it raise "Terminal 'A' ..." instead?

That's a bug that was introducded in #1018 that we both missed, specfically see here:

lark/lark/load_grammar.py

Lines 1367 to 1369 in 7061908

    
           for sym in _find_used_symbols(exp): 
        
               if sym not in self._definitions and sym not in params: 
        
                   self._grammar_error(d.is_term, "{Type} '{name}' used but not defined (in {type2} {name2})", sym, name)

That shouldn't be d.is_term, that is referring to whether we are in the definition of a symbol/rule, but _grammar_error uses it to refer to the name itself.

RossPatterson · 2024-03-15T18:30:36Z

That's a bug that was introducded in #1018 that we both missed, specfically see here:
...
That shouldn't be d.is_term, that is referring to whether we are in the definition of a symbol/rule, but _grammar_error uses it to refer to the name itself.

Fixed in 40576d2.

RossPatterson · 2024-03-15T19:27:02Z

As of daac65d, everything I intended to do is done. @erezsh @MegaIng, I think this meets your expectations, and is ready for code review.

RossPatterson · 2024-04-21T11:54:52Z

Is this work still of interest? I think it's ready for final review, and probably for merging.

Make lark.lark parse the same grammar as load_grammar.py, and make gr…

db1a5a5

…ammar.md document it more fully.

RossPatterson mentioned this pull request Feb 1, 2024

Lark grammar? #391

Open

erezsh requested changes Feb 1, 2024

View reviewed changes

DOH!

7a2880f

Remove unnessary anchor; coalesce ENBF item sets; fix %override grammar

83a374f

RossPatterson added 2 commits February 8, 2024 21:21

Revert lark.lark to its original form.

fdffb5f

Make lark.lark accept the same input as load_grammar.py, and provide …

95c5742

…a visitor that enforces some of its restrictions.

MegaIng suggested changes Feb 9, 2024

View reviewed changes

docs/grammar.md Outdated Show resolved Hide resolved

docs/grammar.md Outdated Show resolved Hide resolved

docs/grammar.md Show resolved Hide resolved

tests/test_lark_lark.py Outdated Show resolved Hide resolved

lark/grammars/lark.lark Outdated Show resolved Hide resolved

RossPatterson added 2 commits February 9, 2024 18:28

Address some review comments.

200d6b5

Fix review comment re: templates in terminals.

0fb28f9

Fix review comment: Remove inlining from expansions, expansion, and v…

2ec5ef3

…alue in lark.lark'

Address review comment: Make alias and expr optionals, not maybes, so…

e9c026e

… they can be inlined.

Address review comment: Make '%declare rule' fail in post-processing …

9bf7ddf

…with a nice message.

lark.lark doesn't allow backslash-nl as a line-continuation, but load…

7f02bd1

…_grammar does.

Push optionality of rule_modifiers and priority down into rule_modifi…

4f7a5eb

…ers and priority, like load_grammar does

Fix bug introduced in lark-parser#1018

40576d2

Issue lark-parser#1388 is ready for review.

daac65d

		@@ -74,6 +74,7 @@ Lark will parse "((hello world))" as:
		The brackets do not appear in the tree by design. The words appear because they are matched by a named terminal.


		<a name="shaping_the_tree"></a>

Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully. #1388

Are you sure you want to change the base?

Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully. #1388

Conversation

RossPatterson commented Feb 1, 2024

erezsh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RossPatterson Feb 1, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MegaIng commented Feb 1, 2024

RossPatterson commented Feb 1, 2024 • edited

RossPatterson commented Feb 1, 2024

RossPatterson commented Feb 1, 2024

erezsh commented Feb 2, 2024 • edited

RossPatterson commented Feb 2, 2024

erezsh commented Feb 2, 2024 • edited

RossPatterson commented Feb 2, 2024

erezsh commented Feb 5, 2024

MegaIng commented Feb 5, 2024

erezsh commented Feb 5, 2024 • edited

MegaIng commented Feb 5, 2024

RossPatterson commented Feb 6, 2024

RossPatterson commented Feb 6, 2024

RossPatterson commented Feb 7, 2024

MegaIng commented Feb 7, 2024

RossPatterson commented Feb 9, 2024

MegaIng commented Feb 9, 2024

MegaIng left a comment

Choose a reason for hiding this comment

RossPatterson commented Feb 10, 2024 • edited

RossPatterson commented Feb 10, 2024

RossPatterson commented Feb 10, 2024

erezsh commented Feb 10, 2024

RossPatterson commented Feb 10, 2024

RossPatterson commented Feb 13, 2024

RossPatterson commented Feb 13, 2024

RossPatterson commented Feb 16, 2024

RossPatterson commented Mar 15, 2024

RossPatterson commented Mar 15, 2024

RossPatterson commented Mar 15, 2024

MegaIng commented Mar 15, 2024

RossPatterson commented Mar 15, 2024

RossPatterson commented Mar 15, 2024

RossPatterson commented Apr 21, 2024

RossPatterson Feb 1, 2024 •

edited

RossPatterson commented Feb 1, 2024 •

edited

erezsh commented Feb 2, 2024 •

edited

erezsh commented Feb 2, 2024 •

edited

erezsh commented Feb 5, 2024 •

edited

RossPatterson commented Feb 10, 2024 •

edited