have parser offer lexer feedback as to which tokens are expected #195
Comments
Another use case for this is contextual keywords, as described in #112.
OK, so, I gave some more thought to this. I think it'd be fairly easy to use (e.g.) a combination of LR(0) states and follow sets to partition the input tokens into "sets that may be expected at one time" (I'm aiming for something cheaply computed and "good enough"). But it occurs to me that if you just do this naively, you will wind up parsing way more than you meant to. For example, many languages have "keywords", and you actually want to prevent them from being used as identifiers even where that would be possible without ambiguity; on the other hand, in some contexts (e.g., inside a comment) you do not. That is, I had hoped to use this to replace lexer states by having some kind of special nonterminal:

```
Drop = { Whitespace, Comment };
Whitespace = r"\s*";
Comment = "/*" Commented "*/";
Commented = { Comment, r"." };
```

(Ignoring the ambiguity in there.) This would only work if entering the comment meant that other tokens were being ignored by the lexer. Seems like we need to give the user more control over this.
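As a rough illustration of what "entering the comment means other tokens are dropped" could look like, here is a minimal hand-rolled sketch in Rust (all names here are illustrative, not lalrpop API): the lexer switches into a mode where everything up to the closing `*/` is skipped, much like the `Drop` nonterminal above.

```rust
// Toy lexer sketch: inside /* ... */ all would-be tokens are dropped,
// mimicking a lexer-state switch driven by the grammar.

#[derive(Debug, PartialEq)]
enum Tok<'a> {
    Ident(&'a str),
}

fn lex(input: &str) -> Vec<Tok<'_>> {
    let mut toks = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after) = rest.strip_prefix("/*") {
            // Comment mode: everything (even keyword-like text) is ignored
            // until the matching "*/".
            rest = after.split_once("*/").map(|(_, r)| r).unwrap_or("");
        } else if rest.starts_with(char::is_whitespace) {
            rest = rest.trim_start();
        } else {
            // An "identifier" runs until whitespace or a possible comment
            // opener; .max(1) guarantees progress on a lone '/'.
            let end = rest
                .find(|c: char| c.is_whitespace() || c == '/')
                .unwrap_or(rest.len())
                .max(1);
            toks.push(Tok::Ident(&rest[..end]));
            rest = &rest[end..];
        }
    }
    toks
}
```

For example, `lex("a /* skip this */ b")` yields only the identifiers `a` and `b`.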
If you haven't read https://www-users.cs.umn.edu/~evw/pubs/vanwyk07gpce/vanwyk07gpce.pdf, it has an algorithm that works quite well for this. Copper implements it. (Disclaimer: I'm affiliated with the research group.)
It cannot be used for a shell parser, because lalrpop's lexer is completely separated from the parser. A shell tokenizer needs to know the current parser state: depending on that state, the tokenizer's action differs. If lalrpop/lalrpop#195 is fixed, I'll try to use it again.
I'm currently facing exactly this issue, and having this type of contextual lexer solution would help a lot. I have a different syntax embedded inside my main language. My current approach is to lex everything inside `{ }` (which delimit that embedded language) as a single token, then hand that token to a separate parser+lexer in the parsing rule. This works (though it already means my lexer is much more complex than it deserves to be), but it is highly suboptimal (also for reasons such as #392). Allowing me to specify separate sets of lexer states / expected tokens would solve this very cleanly! My intuitive approach to designing a solution for this would be to optionally extend the interface that a lexer needs to provide, to include a
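To make the "lex everything inside `{ }` as one token, guided by the parser" idea concrete, here is a hedged sketch of what an extended lexer interface might look like. Nothing here is lalrpop API: `next_token`, the `expected` set of token names, and the `RawBlock` token are all invented for illustration.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Tok {
    Word(String),
    LBrace,
    // Everything between { and } delivered as a single token, to be handed
    // off to a separate parser for the embedded language.
    RawBlock(String),
}

struct Lexer {
    chars: Vec<char>,
    pos: usize,
}

impl Lexer {
    fn new(s: &str) -> Self {
        Lexer { chars: s.chars().collect(), pos: 0 }
    }

    /// `expected` is the set of token names the parser could shift next.
    fn next_token(&mut self, expected: &HashSet<&str>) -> Option<Tok> {
        while self.pos < self.chars.len() && self.chars[self.pos].is_whitespace() {
            self.pos += 1;
        }
        if self.pos >= self.chars.len() {
            return None;
        }
        // Parser feedback: if a raw block is shiftable here, lex up to the
        // closing brace as one token instead of tokenizing its contents.
        if expected.contains("RawBlock") {
            let start = self.pos;
            while self.pos < self.chars.len() && self.chars[self.pos] != '}' {
                self.pos += 1;
            }
            let text: String = self.chars[start..self.pos].iter().collect();
            self.pos += 1; // consume '}'
            return Some(Tok::RawBlock(text));
        }
        if self.chars[self.pos] == '{' {
            self.pos += 1;
            return Some(Tok::LBrace);
        }
        let start = self.pos;
        while self.pos < self.chars.len()
            && !self.chars[self.pos].is_whitespace()
            && self.chars[self.pos] != '{'
        {
            self.pos += 1;
        }
        Some(Tok::Word(self.chars[start..self.pos].iter().collect()))
    }
}
```

With this shape, the same lexer tokenizes normally in the main language but emits one opaque `RawBlock` when the parser reports it is inside the embedded-language delimiters.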
Hi, I am currently having a similar issue, as I am trying to build a C parser. I personally would love to have LALRPOP offer an alternative operation mode to the one currently taken. Currently parsing is parser-driven, as in yacc: the parser is handed a lexer and then continuously calls the lexer to request new tokens.

In the alternative mode, the parser is an object providing a `.parse_token()` method that takes a single token as input. When called, the parser performs any operations until a shift is performed, upon which `.parse_token()` returns. The `.parse_token()` method must then be called for every single token. Finally, the parser also provides a `.finish()` method that completes parsing using the EOF lookahead and returns the reduced start symbol.

To get the full benefit of this design, the parser should also provide two additional methods: `.would_accept()` takes a token and returns true unless the token would result in a parsing error, and a `.context_view()` method lets the user inspect the context (parameters to the grammar statement), which would allow for very easy lexer feedback.
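The push-style API described above can be sketched with a toy grammar. This is a hand-written illustration under the commenter's proposed interface, not anything lalrpop provides: `PushParser`, `parse_token`, `would_accept`, and `finish` are hypothetical names, and the "grammar" is just `Expr = Num ("+" Num)*`.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tok {
    Num(i64),
    Plus,
}

/// Toy push parser for the grammar  Expr = Num ("+" Num)* .
struct PushParser {
    total: i64,
    expect_num: bool, // true when the only shiftable token is a number
}

impl PushParser {
    fn new() -> Self {
        PushParser { total: 0, expect_num: true }
    }

    /// Would this token be shiftable in the current state?
    /// This is the hook a feedback-driven lexer could query.
    fn would_accept(&self, tok: &Tok) -> bool {
        matches!(tok, Tok::Num(_)) == self.expect_num
    }

    /// Consume exactly one token, performing any reductions, then return.
    fn parse_token(&mut self, tok: Tok) -> Result<(), String> {
        match (self.expect_num, tok) {
            (true, Tok::Num(n)) => {
                self.total += n;
                self.expect_num = false;
                Ok(())
            }
            (false, Tok::Plus) => {
                self.expect_num = true;
                Ok(())
            }
            _ => Err(format!("unexpected token {:?}", tok)),
        }
    }

    /// Complete parsing with the EOF lookahead; return the start symbol.
    fn finish(self) -> Result<i64, String> {
        if self.expect_num {
            Err("unexpected EOF".into())
        } else {
            Ok(self.total)
        }
    }
}
```

Here the caller drives the loop (token by token) instead of the parser pulling from a lexer, so the lexer can consult `would_accept` between tokens.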
Pushed a POC for this in #673. I don't plan on completing it at the moment, but it might be interesting to someone else.
It often happens in grammars that you might want to have a distinct set of tokens at different points in the grammar; LALRPOP's intertwined syntax also leads to that expectation (see e.g. #193). This comes up a lot in LALRPOP's own grammar, and an example from Rust would be macro bodies. You can usually work around this in various ways (e.g., LALRPOP has its own tokenizer that handles it) but it'd be interesting to have the parser inform the lexer as to its current state and what set of tokens it expects at this point. We could integrate this into our overlap analysis and thus avoid rejecting grammars like the one in #193.
This would have many of the advantages that the "scanner-less" parsing mode of PEG grammars offers (#143) without the disadvantages of PEG (poorer performance, can't statically distinguish ambiguous grammars).
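The contextual-keyword case mentioned in #112 gives a small concrete picture of what "the parser informs the lexer of its expected token set" buys. The sketch below is illustrative only (the expected-set strings and `classify` helper are invented): the same word lexes as a keyword or an identifier depending on which tokens the parser could shift at that point, as with Rust's contextual keyword `union`.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Tok<'a> {
    Keyword(&'a str),
    Ident(&'a str),
}

/// Decide keyword vs. identifier using the parser's expected-token set.
fn classify<'a>(word: &'a str, expected: &HashSet<&str>) -> Tok<'a> {
    // "union" is a keyword only where the grammar can actually shift it;
    // everywhere else it remains an ordinary identifier.
    if word == "union" && expected.contains("union") {
        Tok::Keyword(word)
    } else {
        Tok::Ident(word)
    }
}
```

In item position (where `union` is shiftable) the word becomes a keyword; in expression position it stays an identifier, with no global reserved-word list needed.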