Reimplementation of end symbol (Issue #237) #880

erezsh · 2021-04-18T03:47:03Z

No description provided.

MegaIng

Do you remember what the problem was last time you tried to implement this? And is it fixed?

MegaIng · 2021-04-18T09:05:02Z

lark/parsers/lalr_interactive_parser.py

        for token in self.lexer_state.lex(self.parser_state):
            self.parser_state.feed_token(token)

    def feed_eof(self, last_token=None):
-        """Feed a '$END' Token. Borrows from 'last_token' if given."""
-        eof = Token.new_borrow_pos('$END', '', last_token) if last_token is not None else Token('$END', '', 0, 1, 1)
+        """Feed a END Token. Borrows from 'last_token' if given."""


Sorry, overlooked this in the review: Shouldn't we feed multiple ENDs here as well?

erezsh · 2021-04-18

Hmm, I think you're right.

erezsh · 2021-04-18T14:41:56Z

I don't remember the problem, only that it messed some grammars up. But I changed the LALR logic since, made it a little cleaner, so maybe that was solved on the way?

I'll make sure to do thorough tests before merging it in.

ThatXliner · 2021-04-18T19:58:32Z

I'm trying to test this right now. What's the terminal to use? $END?

MegaIng · 2021-04-18T19:59:18Z

No, it is just $

ThatXliner · 2021-04-18T20:40:41Z

Maybe I'm going too far, but is this supposed to work with custom Indenters? I'm hitting this error:

lark.exceptions.UnexpectedToken: Unexpected token Token('__$END$__', '') at line 1, column 1.

MegaIng · 2021-04-18T21:17:01Z

It is supposed to work. Can you give a short example script?

erezsh · 2021-04-18T23:21:35Z

on_error was broken before, but now I fixed it. So if you were using it, you should try again with the latest commit.

ThatXliner · 2021-04-20T00:28:38Z

Now it's just really slow. Or maybe it's just my grammar....

EDIT: Ok, never mind. Apparently, it was kinda hanging when I did

(value (_NL|$|_DEDENT)+ )

but removing _DEDENT works...

(I'm making a programming language-type thing. It's whitespace sensitive)

SECOND EDIT:

So I'm trying to make an error and there we go:

Traceback (most recent call last):
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 127, in feed_token
    action, arg = states[state][token.type]
KeyError: 'IDENTIFIER'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/bryanhu/projects/langs/tylaireum/tylaireum/__main__.py", line 79, in <module>
    print(parser.grammar.parse(INLINE).pretty())
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/lark.py", line 552, in parse
    return self.parser.parse(text, start=start, on_error=on_error)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parser_frontends.py", line 107, in parse
    return self.parser.parse(stream, start, **kw)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 42, in parse
    return self.parser.parse(lexer, start)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 176, in parse
    return self.parse_from_state(parser_state)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 193, in parse_from_state
    raise e
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 184, in parse_from_state
    state.feed_token(token)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 130, in feed_token
    raise UnexpectedToken(token, expected, state=self, interactive_parser=None)
^Clark.exceptions.UnexpectedToken: <exception str() failed>

But as you can see, I had to CTRL-C to stop it. It was just hanging.

I am too lazy to send a minimal reproducible example right now, so if you want my grammar, etc, I'll send the whole enchilada.

erezsh · 2021-04-20T01:31:39Z

@ThatXliner Thanks for finding that. I think I fixed it, so try now.

ThatXliner · 2021-04-20T02:12:37Z

nope.

EDIT: Though I noticed it wasn't hanging for valid input this time

erezsh · 2021-04-20T02:50:55Z

@ThatXliner Ok then, hit me with your enchilada. But make sure it's easy to see it fail.

ThatXliner · 2021-04-20T04:13:38Z

Probably it was hanging for valid input when I sent my last comment but I most likely didn't double-check. Now, when I make it valid, it hangs.

Here's the enchilada: https://gist.github.com/ThatXliner/96829ce180b5659198a822cf74b6c309

Tell me if you can reproduce or not

erezsh · 2021-04-20T15:14:22Z

I found a minimal breaking example:

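from lark import Lark

# Meant to be run against this PR's branch, where "$" is the end symbol.
p = Lark("""
    ?start: "foo" x "bar"
    x: $ x
""", parser="lalr")

# Hangs: "$" keeps being accepted and the stack never unwinds.
p.parse("foo")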

The infinite loop happens because $ is accepted forever, but has no valid route to unwind the stack.

I need to think about how to solve this. My current intuition is that this grammar shouldn't even be allowed. But I'm not sure how easy it is to detect reliably.

But maybe I just need to rethink the loop mechanism, and solve it by only providing the end token once.

ThatXliner · 2021-04-20T16:05:34Z

I'm wondering how you're implementing the end token for right now. I think it's better to parse normally, get an EOF error, see if the grammar expects it and if so, continue parsing. Unless that's what you're doing right now.

erezsh · 2021-04-20T16:41:40Z

So, it seems that this can be "solved" by testing that the parser only shifts once on END, and otherwise throwing a runtime error for an infinite loop.

But I don't like it. Lark shouldn't allow compiling a grammar that can never run to completion. But I'm not sure how to detect it at compile time.
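
To make the runtime check concrete, here is a toy sketch of a table-driven loop that allows END to be shifted only once and raises otherwise. It is not the PR's code: the table format, `feed_end`, `EndLoopError`, and both example tables are made up for illustration, and only the END column of the parse table is modelled.

```python
class EndLoopError(RuntimeError):
    """Raised when END would be shifted more often than allowed."""


def feed_end(table, start_state, max_end_shifts=1):
    """Feed END repeatedly until the toy parser accepts.

    `table` maps a state to its action on END (hypothetical format):
      ("shift", next_state)
      ("reduce", pop_count, goto_state)
      ("accept",)
    Returns the number of times END was shifted.
    """
    stack = [start_state]
    shifts = 0
    while True:
        action = table[stack[-1]]
        if action[0] == "accept":
            return shifts
        if action[0] == "shift":
            shifts += 1
            if shifts > max_end_shifts:
                raise EndLoopError("END shifted %d times" % shifts)
            stack.append(action[1])
        else:  # reduce: pop the rule body, then push the goto state
            _, pop_count, goto_state = action
            del stack[len(stack) - pop_count:]
            stack.append(goto_state)


# END is shifted once, reduced, and then accepted.
GOOD_TABLE = {0: ("shift", 1), 1: ("reduce", 1, 2), 2: ("accept",)}

# Shaped like `x: $ x`: after shifting END the parser expects END again.
LOOPING_TABLE = {0: ("shift", 1), 1: ("shift", 1)}

if __name__ == "__main__":
    print(feed_end(GOOD_TABLE, 0))
    try:
        feed_end(LOOPING_TABLE, 0)
    except EndLoopError as exc:
        print("rejected:", exc)
```

Without the counter, the second table would spin in the `while` loop forever, which is the hang reported above. The guard only turns that hang into an error at parse time; it does nothing to reject the grammar up front.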

ThatXliner · 2021-04-20T18:07:15Z

lark/parsers/lalr_parser.py

@@ -132,7 +133,7 @@ def feed_token(self, token, is_end=False):

            if action is Shift:
                # shift once and return
-                assert not is_end
+                # assert not is_end


Wonder what would happen if you undid this

It's just an assert. It doesn't matter.

seimit · 2021-06-01T17:27:03Z

Hi
Do you plan to merge this PR into master soon?
Regards!

erezsh · 2021-06-01T17:43:58Z

@seimit There are issues with this PR, so currently no.

The problem is that in some edge cases, the parser gets stuck in a loop forever. Which is something I never want Lark to be able to do.

MegaIng · 2021-09-30T12:24:49Z

@erezsh Shouldn't it be possible, at some early point, to reject all BNF rules that have a terminal following a $ symbol? E.g. if we find a rule of the form

a: $ A

we completely throw it away, since it can never actually match.

For rules of the form

a: $ b

we need to check if b can be empty, and if it cannot be empty we throw the rule away. I think we already check during grammar analysis whether a rule can be empty.
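
A rough sketch of that check, under some assumptions that are mine and not Lark's: the grammar is a plain dict from rule name to a list of alternatives, any symbol that is not a rule name counts as a terminal, and a second $ after a $ is treated like any other terminal. `nullable_rules` and `dead_alternatives` are hypothetical helper names.

```python
END = "$"


def nullable_rules(grammar):
    """Return the set of rule names that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for name, alternatives in grammar.items():
            if name in nullable:
                continue
            # An alternative is empty-able if every symbol in it is nullable
            # (terminals are never in the set; an empty body trivially is).
            if any(all(sym in nullable for sym in alt) for alt in alternatives):
                nullable.add(name)
                changed = True
    return nullable


def dead_alternatives(grammar):
    """List (rule, alternative) pairs where something must follow a $."""
    nullable = nullable_rules(grammar)
    dead = []
    for name, alternatives in grammar.items():
        for alt in alternatives:
            for i, sym in enumerate(alt):
                if sym != END:
                    continue
                # Everything after $ has to be able to vanish.
                if any(rest not in nullable for rest in alt[i + 1:]):
                    dead.append((name, alt))
                    break
    return dead


GRAMMAR = {
    "a": [["$", "A"]],   # terminal after $: can never match
    "b": [["$", "c"]],   # c can be empty: kept
    "c": [[], ["X"]],
    "d": [["$", "e"]],   # e cannot be empty: thrown away
    "e": [["Y"]],
}

if __name__ == "__main__":
    for rule, alt in dead_alternatives(GRAMMAR):
        print(rule, "->", " ".join(alt))
```

This only looks at a literal $ inside a rule body, so a $ hidden behind another rule is not caught by it.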

erezsh · 2021-09-30T13:20:56Z

It's a good point that we can detect it during grammar analysis. And then we can just throw an error if anything follows the $END symbol. I don't think there's a good enough reason to allow them, empty rules or not.

MegaIng · 2021-09-30T13:21:56Z

Yes there is. ($|_NEWLINE)* should be allowed. I think that would break if we don't allow it.

MegaIng · 2021-09-30T13:28:42Z

OTOH, maybe it shouldn't be allowed. In effect it is the same as _NEWLINE* $?. I am not sure. But we need to make sure that, for example, the Python grammar works when we also accept a $ everywhere we accept a statement-ending _NEWLINE.

But also, can we somehow detect this:

a: end A
end: $

?

erezsh · 2021-09-30T14:42:34Z

If we do it at the grammar analysis level, we can make sure FOLLOWS[$] is empty. That should take care of all edge cases (I think)
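
As a sketch of what such a check could look like: the code below is a textbook-style FIRST/FOLLOW fixpoint, not Lark's grammar analyzer, and it computes, for every symbol including terminals, which terminals can appear directly after it. The dict-based grammar format, `follow_sets`, and the three example grammars are assumptions made for illustration; any symbol that is not a rule name is treated as a terminal.

```python
from collections import defaultdict


def follow_sets(grammar):
    """Map each symbol to the set of terminals that can directly follow it."""
    nullable = set()
    first = defaultdict(set)
    follow = defaultdict(set)

    def first_of(symbols):
        """Return (FIRST of the sequence, whether the sequence can be empty)."""
        out = set()
        for sym in symbols:
            if sym not in grammar:  # terminal
                out.add(sym)
                return out, False
            out |= first[sym]
            if sym not in nullable:
                return out, False
        return out, True

    # Fixpoint for nullable rules and FIRST sets.
    changed = True
    while changed:
        changed = False
        for name, alternatives in grammar.items():
            for alt in alternatives:
                found, empty = first_of(alt)
                if empty and name not in nullable:
                    nullable.add(name)
                    changed = True
                if not found <= first[name]:
                    first[name] |= found
                    changed = True

    # Fixpoint for FOLLOW sets, tracked for terminals as well as rules.
    changed = True
    while changed:
        changed = False
        for name, alternatives in grammar.items():
            for alt in alternatives:
                for i, sym in enumerate(alt):
                    found, empty = first_of(alt[i + 1:])
                    new = found | (follow[name] if empty else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


# The indirect case from above: `a: end A` with `end: $`.
INDIRECT = {"start": [["a"]], "a": [["end", "A"]], "end": [["$"]]}

# $ only ever appears last.
TRAILING = {"start": [["stmt", "end"]], "stmt": [["X"]], "end": [["$"]]}

# ($|_NEWLINE)* expanded into plain rules.
REPEATED = {
    "start": [["items"]],
    "items": [[], ["items", "sep"]],
    "sep": [["$"], ["_NEWLINE"]],
}

if __name__ == "__main__":
    for label, g in [("indirect", INDIRECT), ("trailing", TRAILING), ("repeated", REPEATED)]:
        print(label, sorted(follow_sets(g)["$"]))
```

In this sketch the indirect grammar is caught, because A ends up in the set for $. The expanded ($|_NEWLINE)* grammar also gets a non-empty set for $, so a strict "must be empty" check would reject that pattern too.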

erezsh · 2021-09-30T14:46:45Z

Hmm actually, no, that's the crux of the problem, which I guess is why I left it open.

On one hand, the grammar has to allow the option of something after an end token, otherwise, there's no advantage in implementing it. On the other hand, it's not obvious how to block the bad uses while allowing the good ones.

MegaIng · 2021-09-30T15:15:29Z

The point is more that it has to be possible to accept nothing after $. Not sure how that can be expressed.

Erotemic · 2023-07-27T17:54:35Z

Throwing in my $0.02, just to say that this feature would be useful to me.

Reimplementation of end symbol (Issue #237)

51cde70

erezsh requested a review from MegaIng April 18, 2021 03:47

erezsh mentioned this pull request Apr 18, 2021

Add EOF symbol to match end of input #237

Open

MegaIng approved these changes Apr 18, 2021


MegaIng reviewed Apr 18, 2021


Fix: Push the end loop to feed_token

93bcca3

MegaIng approved these changes Apr 19, 2021


Fix for shift/reduce conflict on END

dbc5ad5

ThatXliner reviewed Apr 20, 2021


lark-parser deleted a comment from daniel-leicht Nov 29, 2022
