
have parser offer lexer feedback as to which tokens are expected #195

Open
nikomatsakis opened this issue Mar 15, 2017 · 6 comments · May be fixed by #673

Comments

@nikomatsakis
Collaborator

It often happens in grammars that you might want to have a distinct set of tokens at different points in the grammar; LALRPOP's intertwined syntax also leads to that expectation (see e.g. #193). This comes up a lot in LALRPOP's own grammar, and an example from Rust would be macro bodies. You can usually work around this in various ways (e.g., LALRPOP has its own tokenizer that handles it) but it'd be interesting to have the parser inform the lexer as to its current state and what set of tokens it expects at this point. We could integrate this into our overlap analysis and thus avoid rejecting grammars like the one in #193.

This would have many of the advantages that the "scanner-less" parsing mode of PEG grammars offers (#143) without the disadvantages of PEG (poorer performance, can't statically distinguish ambiguous grammars).
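As a rough sketch of what such feedback could look like (everything here is hypothetical, not LALRPOP API: the `Lexer` type, the `expected` parameter, and the `union` example are all made up for illustration), the parser could hand the lexer the set of token kinds valid in its current state, and the lexer could use that set to disambiguate, e.g. to treat a contextual keyword as an identifier where no keyword is expected:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Tok {
    Ident,
    Keyword, // e.g. a contextual keyword like `union`
}

struct Lexer<'a> {
    input: &'a str,
}

impl<'a> Lexer<'a> {
    // The parser passes `expected`, computed from its current LR state.
    fn next_token(&mut self, expected: &[Tok]) -> Option<Tok> {
        let word: String = self
            .input
            .chars()
            .take_while(|c| c.is_alphanumeric())
            .collect();
        self.input = &self.input[word.len()..];
        if word.is_empty() {
            return None;
        }
        // "union" is only a keyword where the parser expects one;
        // elsewhere it lexes as an ordinary identifier.
        if word == "union" && expected.contains(&Tok::Keyword) {
            Some(Tok::Keyword)
        } else {
            Some(Tok::Ident)
        }
    }
}

fn main() {
    let mut lx = Lexer { input: "union" };
    assert_eq!(lx.next_token(&[Tok::Ident]), Some(Tok::Ident));

    let mut lx = Lexer { input: "union" };
    assert_eq!(lx.next_token(&[Tok::Keyword, Tok::Ident]), Some(Tok::Keyword));
    println!("ok");
}
```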

@nikomatsakis
Collaborator Author

Another use case for this is contextual keywords, as described in #112.

@nikomatsakis
Collaborator Author

OK, so, I gave some more thought to this. I think it'd be fairly easy to use (e.g.) a combination of LR0 states and follow-sets to partition the input tokens into "sets that may be expected at one time" (I'm aiming for something cheaply computed and "good enough"), but it occurs to me that if you just do this naively, you will wind up parsing way more than you meant to. For example, many languages have "keywords", and you actually want to prevent them from being used as identifiers, even if that would be possible without ambiguity.

On the other hand, in some contexts (e.g., inside a comment) you do not. That is, I had hoped to use this to replace lexer states by having some kind of special nonterminal Drop:

Drop = { Whitespace, Comment };
Whitespace = r"\s*";
Comment = "/*" Commented "*/";
Commented = {
    Comment, r"."
};

(Ignoring the ambiguity there between r"." and r"/*" for the moment, which could be resolved in various ways.)

This would only work if entering the "comment" meant that other tokens were being ignored by the lexer.

It seems we need to give the user more control over this.
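To make the stateful behavior the comment above describes concrete, here is a minimal sketch (plain Rust, not LALRPOP output; `strip_comments` is a made-up helper) of what entering a comment would have to mean for the lexer: ordinary tokens are dropped, and only the comment delimiters are recognized, with the nesting depth acting as the "lexer state":

```rust
fn strip_comments(mut input: &str) -> String {
    let mut out = String::new();
    let mut depth = 0usize; // comment nesting level = "lexer state"
    while !input.is_empty() {
        if input.starts_with("/*") {
            depth += 1;
            input = &input[2..];
        } else if depth > 0 && input.starts_with("*/") {
            depth -= 1;
            input = &input[2..];
        } else {
            let c = input.chars().next().unwrap();
            if depth == 0 {
                out.push(c); // ordinary token text survives
            } // inside a comment: everything is dropped
            input = &input[c.len_utf8()..];
        }
    }
    out
}

fn main() {
    // nested comments are dropped entirely, including their contents
    assert_eq!(strip_comments("a /* x /* y */ z */ b"), "a  b");
    println!("ok");
}
```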

@remexre

remexre commented Dec 7, 2019

If you haven't read https://www-users.cs.umn.edu/~evw/pubs/vanwyk07gpce/vanwyk07gpce.pdf, it has an algorithm that works quite well for this. Copper implements this.

(Disclaimer: I'm affiliated w/ the research group.)

hiking90 added a commit to hiking90/rushell that referenced this issue Nov 23, 2020
It cannot be used for a shell parser, because LALRPOP's lexer is completely separated from the parser.

The shell tokenizer needs to know the current parser state; depending on that state, the tokenizer's behavior also differs.

If lalrpop/lalrpop#195 will be fixed, I'll try to use it again.
@elkowar

elkowar commented Jul 28, 2021

I'm currently facing exactly this issue, and having this type of contextual lexer solution would help a lot. I have a different syntax embedded inside my main language, and my current approach is to lex everything inside { } (which delimits that embedded language) as a single token, then hand that to a separate parser+lexer in the parsing rule. This works, but it makes my lexer much more complex than it needs to be, and it is highly suboptimal (also for reasons such as #392). Allowing me to specify separate sets of lexer states / expected tokens would solve this very cleanly!

My intuitive approach to designing a solution for this would be to optionally extend the interface a lexer needs to provide with a set_context method or something like that. You'd then add another generic type Context, which would most likely end up being a user-defined enum most of the time. Then, depending on which parsing rule you're currently in, the grammar could specify / override the default context, calling set_context before reading the tokens and thus getting another set of possible tokens.
This would probably require the lexer to support some sort of backtracking, which should be relatively easy to encode as a backtrack_n trait method specifying the number of tokens to backtrack. Alternatively, the set_context function could be replaced with a sort of "start_path" operation, requiring the lexer implementation to store where it was when the path was started, so it can later backtrack to that point.
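A sketch of the suggested interface might look like the following (all names here — `ContextualLexer`, `LexerContext`, `set_context`, `DemoLexer` — are hypothetical, not LALRPOP API; the demo just shows how the same input can lex differently per context, e.g. swallowing an embedded `{ ... }` region as one token):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum LexerContext {
    Main,
    Embedded, // inside `{ ... }`
}

trait ContextualLexer {
    type Token;
    // Called by the generated parser when the grammar switches context.
    fn set_context(&mut self, ctx: LexerContext);
    fn next_token(&mut self) -> Option<Self::Token>;
}

struct DemoLexer<'a> {
    input: std::str::Chars<'a>,
    ctx: LexerContext,
}

impl<'a> ContextualLexer for DemoLexer<'a> {
    type Token = String;

    fn set_context(&mut self, ctx: LexerContext) {
        self.ctx = ctx;
    }

    fn next_token(&mut self) -> Option<String> {
        match self.ctx {
            // Main context: one character per token (toy behavior).
            LexerContext::Main => self.input.next().map(|c| c.to_string()),
            // Embedded context: everything up to the closing `}` is one token.
            LexerContext::Embedded => {
                let body: String =
                    self.input.by_ref().take_while(|&c| c != '}').collect();
                if body.is_empty() { None } else { Some(body) }
            }
        }
    }
}

fn main() {
    let mut lx = DemoLexer { input: "a{xyz}b".chars(), ctx: LexerContext::Main };
    assert_eq!(lx.next_token().as_deref(), Some("a"));
    assert_eq!(lx.next_token().as_deref(), Some("{"));
    lx.set_context(LexerContext::Embedded);
    assert_eq!(lx.next_token().as_deref(), Some("xyz")); // whole embedded body
    lx.set_context(LexerContext::Main);
    assert_eq!(lx.next_token().as_deref(), Some("b"));
    println!("ok");
}
```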

@nacaclanga

Hi, I am currently having a similar issue, as I am trying to build a C parser. I would personally love LALRPOP to offer an alternative operation mode to the one currently taken. Currently, parsing is parser-driven, as in yacc: the parser is handed a lexer, and then the parser continuously calls the lexer to request new tokens.
An alternative is the lexer-driven approach, as in lemon (another classic parser generator), which exploits the fact that LR(1) parsers only look at a single lookahead token at a time.

Here the parser is an object providing a .parse_token() method that takes a single token as input. When called, the parser performs any operations until a shift is performed, at which point .parse_token() returns. The .parse_token() method must then be called for every single token. Finally, the parser also provides a .finish() method that completes parsing using the EOF lookahead and returns the reduced start symbol. To get the most out of this design, the parser should also provide two additional methods: .would_accept() takes a token and returns true unless that token would result in a parsing error, and a .context_view() method lets the user inspect the context (parameters to the grammar statement), which would allow for very easy lexer feedback.
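The method shapes described above could look roughly like this (a hand-written toy, not generated LR code; `PushParser` and the grammar "zero or more `A` followed by one `B`" are invented just to show the parse_token / would_accept / finish protocol):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Tok { A, B }

#[derive(Debug)]
enum ParseError {
    Unexpected(Tok),
    UnexpectedEof,
}

// Recognizes the toy grammar S -> A* B and "reduces" to the count of A's.
struct PushParser {
    seen_b: bool,
    count_a: usize,
}

impl PushParser {
    fn new() -> Self {
        PushParser { seen_b: false, count_a: 0 }
    }

    // Would feeding `tok` next succeed? This is the hook that lets the
    // lexer consult the parser's state before committing to a token.
    fn would_accept(&self, _tok: Tok) -> bool {
        !self.seen_b // after B, no further token is accepted
    }

    // Feed exactly one token; the parser advances until its next shift.
    fn parse_token(&mut self, tok: Tok) -> Result<(), ParseError> {
        if self.seen_b {
            return Err(ParseError::Unexpected(tok));
        }
        match tok {
            Tok::A => self.count_a += 1,
            Tok::B => self.seen_b = true,
        }
        Ok(())
    }

    // EOF lookahead: finish parsing and return the start symbol's value.
    fn finish(self) -> Result<usize, ParseError> {
        if self.seen_b {
            Ok(self.count_a)
        } else {
            Err(ParseError::UnexpectedEof)
        }
    }
}

fn main() {
    let mut p = PushParser::new();
    for t in [Tok::A, Tok::A, Tok::B] {
        assert!(p.would_accept(t));
        p.parse_token(t).unwrap();
    }
    assert_eq!(p.finish().unwrap(), 2);
    println!("ok");
}
```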

@Marwes
Contributor

Marwes commented Jul 5, 2022

Pushed a POC for this in #673 . I don't plan on completing it at the moment but it might be interesting to someone else.
