
have parser offer lexer feedback as to which tokens are expected #195

Open
nikomatsakis opened this issue Mar 15, 2017 · 6 comments · May be fixed by #673

Comments

@nikomatsakis
Collaborator

It often happens in grammars that you might want to have a distinct set of tokens at different points in the grammar; LALRPOP's intertwined syntax also leads to that expectation (see e.g. #193). This comes up a lot in LALRPOP's own grammar, and an example from Rust would be macro bodies. You can usually work around this in various ways (e.g., LALRPOP has its own tokenizer that handles it) but it'd be interesting to have the parser inform the lexer as to its current state and what set of tokens it expects at this point. We could integrate this into our overlap analysis and thus avoid rejecting grammars like the one in #193.

This would have many of the advantages that the "scanner-less" parsing mode of PEG grammars offers (#143) without the disadvantages of PEG (poorer performance, can't statically distinguish ambiguous grammars).
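As a rough sketch of what such feedback could look like (everything here is hypothetical, not LALRPOP API: the `Lexer` type, the `expected` parameter, and the `union` example are all made up for illustration), the parser could hand the lexer the set of token kinds valid in its current state, and the lexer could use that set to disambiguate, e.g. to treat a contextual keyword as an identifier where no keyword is expected:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Tok {
    Ident,
    Keyword, // e.g. a contextual keyword like `union`
}

struct Lexer<'a> {
    input: &'a str,
}

impl<'a> Lexer<'a> {
    // The parser passes `expected`, computed from its current LR state.
    fn next_token(&mut self, expected: &[Tok]) -> Option<Tok> {
        let word: String = self
            .input
            .chars()
            .take_while(|c| c.is_alphanumeric())
            .collect();
        self.input = &self.input[word.len()..];
        if word.is_empty() {
            return None;
        }
        // "union" is only a keyword where the parser expects one;
        // elsewhere it lexes as an ordinary identifier.
        if word == "union" && expected.contains(&Tok::Keyword) {
            Some(Tok::Keyword)
        } else {
            Some(Tok::Ident)
        }
    }
}

fn main() {
    let mut lx = Lexer { input: "union" };
    assert_eq!(lx.next_token(&[Tok::Ident]), Some(Tok::Ident));

    let mut lx = Lexer { input: "union" };
    assert_eq!(lx.next_token(&[Tok::Keyword, Tok::Ident]), Some(Tok::Keyword));
    println!("ok");
}
```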

@nikomatsakis
Collaborator Author

Another use case for this is contextual keywords, as described in #112.

@nikomatsakis
Collaborator Author

OK, so, I gave some more thought to this. I think it'd be fairly easy to use (e.g.) a combination of LR0 states and follow-sets to partition the input tokens into "sets that may be expected at one time" (I'm aiming for something cheaply computed and "good enough"), but it occurs to me that if you just do this naively, you will wind up parsing way more than you meant to. For example, many languages have "keywords", and you actually want to prevent them from being used as identifiers, even if that would be possible without ambiguity.

On the other hand, in some contexts (e.g., inside a comment) you do not. That is, I had hoped to use this to replace lexer states by having some kind of special nonterminal Drop:

Drop = { Whitespace, Comment };
Whitespace = r"\s*";
Comment = "/*" Commented "*/";
Commented = {
    Comment, r"."
};

(Ignoring the ambiguity there between r"." and r"/*" for the moment, which could be resolved in various ways.)

This would only work if entering the "comment" meant that other tokens were being ignored by the lexer.

It seems we need to give the user more control over this.
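To make the stateful behavior the comment above describes concrete, here is a minimal sketch (plain Rust, not LALRPOP output; `strip_comments` is a made-up helper) of what entering a comment would have to mean for the lexer: ordinary tokens are dropped, and only the comment delimiters are recognized, with the nesting depth acting as the "lexer state":

```rust
fn strip_comments(mut input: &str) -> String {
    let mut out = String::new();
    let mut depth = 0usize; // comment nesting level = "lexer state"
    while !input.is_empty() {
        if input.starts_with("/*") {
            depth += 1;
            input = &input[2..];
        } else if depth > 0 && input.starts_with("*/") {
            depth -= 1;
            input = &input[2..];
        } else {
            let c = input.chars().next().unwrap();
            if depth == 0 {
                out.push(c); // ordinary token text survives
            } // inside a comment: everything is dropped
            input = &input[c.len_utf8()..];
        }
    }
    out
}

fn main() {
    // nested comments are dropped entirely, including their contents
    assert_eq!(strip_comments("a /* x /* y */ z */ b"), "a  b");
    println!("ok");
}
```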

@remexre

remexre commented Dec 7, 2019

If you haven't read https://www-users.cs.umn.edu/~evw/pubs/vanwyk07gpce/vanwyk07gpce.pdf, it has an algorithm that works quite well for this. Copper implements this.

(Disclaimer: I'm affiliated w/ the research group.)

hiking90 added a commit to hiking90/rushell that referenced this issue Nov 23, 2020
It cannot be used for a shell parser, because LALRPOP's lexer is completely separated from the parser.

The shell tokenizer needs to know the current parser state; depending on that state, the tokenizer's behavior also differs.

If lalrpop/lalrpop#195 will be fixed, I'll try to use it again.
@elkowar

elkowar commented Jul 28, 2021

I'm currently facing exactly this issue, and having this type of contextual lexer solution would help a lot. I have a different syntax embedded inside my main language, and my current approach is to lex everything inside { } (which delimits that embedded language) as a single token, then hand that to a separate parser+lexer in the parsing rule. This works, but it makes my lexer much more complex than it needs to be, and it is highly suboptimal (also for reasons such as #392). Allowing me to specify separate sets of lexer states / expected tokens would solve this very cleanly!

My intuitive approach to designing a solution for this would be to optionally extend the interface a lexer needs to provide with a set_context method or something like that. You'd then add another generic type Context, which would most likely end up being a user-defined enum most of the time. Then, depending on which parsing rule you're currently in, the grammar could specify / override the default context, calling set_context before reading the tokens and thus getting another set of possible tokens.
This would probably require the lexer to support some sort of backtracking, which should be relatively easy to encode as a backtrack_n trait method specifying the number of tokens to backtrack. Alternatively, the set_context function could be replaced with a sort of "start_path" operation, requiring the lexer implementation to store where it was when the path was started, so it can later backtrack to that point.
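A sketch of the suggested interface might look like the following (all names here — `ContextualLexer`, `LexerContext`, `set_context`, `DemoLexer` — are hypothetical, not LALRPOP API; the demo just shows how the same input can lex differently per context, e.g. swallowing an embedded `{ ... }` region as one token):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum LexerContext {
    Main,
    Embedded, // inside `{ ... }`
}

trait ContextualLexer {
    type Token;
    // Called by the generated parser when the grammar switches context.
    fn set_context(&mut self, ctx: LexerContext);
    fn next_token(&mut self) -> Option<Self::Token>;
}

struct DemoLexer<'a> {
    input: std::str::Chars<'a>,
    ctx: LexerContext,
}

impl<'a> ContextualLexer for DemoLexer<'a> {
    type Token = String;

    fn set_context(&mut self, ctx: LexerContext) {
        self.ctx = ctx;
    }

    fn next_token(&mut self) -> Option<String> {
        match self.ctx {
            // Main context: one character per token (toy behavior).
            LexerContext::Main => self.input.next().map(|c| c.to_string()),
            // Embedded context: everything up to the closing `}` is one token.
            LexerContext::Embedded => {
                let body: String =
                    self.input.by_ref().take_while(|&c| c != '}').collect();
                if body.is_empty() { None } else { Some(body) }
            }
        }
    }
}

fn main() {
    let mut lx = DemoLexer { input: "a{xyz}b".chars(), ctx: LexerContext::Main };
    assert_eq!(lx.next_token().as_deref(), Some("a"));
    assert_eq!(lx.next_token().as_deref(), Some("{"));
    lx.set_context(LexerContext::Embedded);
    assert_eq!(lx.next_token().as_deref(), Some("xyz")); // whole embedded body
    lx.set_context(LexerContext::Main);
    assert_eq!(lx.next_token().as_deref(), Some("b"));
    println!("ok");
}
```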

@nacaclanga

Hi, I am currently having a similar issue, as I am trying to build a C parser. I would personally love LALRPOP to offer an alternative operation mode to the one currently taken. Currently, parsing is parser-driven, as in yacc: the parser is handed a lexer, and then the parser continuously calls the lexer to request new tokens.
An alternative is the lexer-driven approach, as in lemon (another classic parser generator), which exploits the fact that LR(1) parsers only look at a single lookahead token at a time.

Here the parser is an object providing a .parse_token() method that takes a single token as input. When called, the parser performs any operations until a shift is performed, at which point .parse_token() returns. The .parse_token() method must then be called for every single token. Finally, the parser also provides a .finish() method that completes parsing using the EOF lookahead and returns the reduced start symbol. To get the most out of this design, the parser should also provide two additional methods: .would_accept() takes a token and returns true unless that token would result in a parsing error, and a .context_view() method lets the user inspect the context (parameters to the grammar statement), which would allow for very easy lexer feedback.
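The method shapes described above could look roughly like this (a hand-written toy, not generated LR code; `PushParser` and the grammar "zero or more `A` followed by one `B`" are invented just to show the parse_token / would_accept / finish protocol):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Tok { A, B }

#[derive(Debug)]
enum ParseError {
    Unexpected(Tok),
    UnexpectedEof,
}

// Recognizes the toy grammar S -> A* B and "reduces" to the count of A's.
struct PushParser {
    seen_b: bool,
    count_a: usize,
}

impl PushParser {
    fn new() -> Self {
        PushParser { seen_b: false, count_a: 0 }
    }

    // Would feeding `tok` next succeed? This is the hook that lets the
    // lexer consult the parser's state before committing to a token.
    fn would_accept(&self, _tok: Tok) -> bool {
        !self.seen_b // after B, no further token is accepted
    }

    // Feed exactly one token; the parser advances until its next shift.
    fn parse_token(&mut self, tok: Tok) -> Result<(), ParseError> {
        if self.seen_b {
            return Err(ParseError::Unexpected(tok));
        }
        match tok {
            Tok::A => self.count_a += 1,
            Tok::B => self.seen_b = true,
        }
        Ok(())
    }

    // EOF lookahead: finish parsing and return the start symbol's value.
    fn finish(self) -> Result<usize, ParseError> {
        if self.seen_b {
            Ok(self.count_a)
        } else {
            Err(ParseError::UnexpectedEof)
        }
    }
}

fn main() {
    let mut p = PushParser::new();
    for t in [Tok::A, Tok::A, Tok::B] {
        assert!(p.would_accept(t));
        p.parse_token(t).unwrap();
    }
    assert_eq!(p.finish().unwrap(), 2);
    println!("ok");
}
```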

@Marwes
Contributor

Marwes commented Jul 5, 2022

Pushed a POC for this in #673 . I don't plan on completing it at the moment but it might be interesting to someone else.
