Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom parser errors #831

Open
emm312 opened this issue Sep 12, 2023 · 10 comments
Open

Custom parser errors #831

emm312 opened this issue Sep 12, 2023 · 10 comments

Comments

@emm312
Copy link

emm312 commented Sep 12, 2023

Ive looked through the source and in the ParseError there is a generic for it, yet no docs.
Could this be documented or developed further?

@dburgener
Copy link
Contributor

there is a generic for it

I'm a little unclear on what you mean here. Are you referring to the generics in the ParseError definition, L, T, E? Or do you mean the ParseError::User field?

The ParseError generics are somewhat described here: https://docs.rs/lalrpop-util/0.12.1/lalrpop_util/enum.ParseError.html

The User field is described pretty thoroughly here.

There's of course always room for documentation improvement, but I'm not really clear on what you want to see improved here. Can you be more specific about what info you feel is missing from the docs that would be helpful?

@emm312
Copy link
Author

emm312 commented Sep 12, 2023

Oh, oops. I'm talking about fully custom parse errors. Like instead of this returning ParseError::UnrecognisedToken:

let a = (3;

you can use the error recovery or something similar to make it UserError::MissedBracket.

@dburgener
Copy link
Contributor

I think I'm getting closer to understanding you. Perhaps you're talking about one of these things?

  • The grammar statement at the top of a .lalrpop file can take arbitrary parameters, that become part of the generated parse functions, and can be accessed in the grammar. This is (among other things) used for error recovery. I do think this is fairly poorly documented right now, and could probably have improved documentation.
  • The ErrorRecovery struct provided by lalrpop wraps the normal ParseError enum. The ParseError enum has a User field that can be used for custom errors to your hearts content. That sounds more like what you're looking for. But I do think it's pretty thoroughly documented here. That functionality can do what you describe in terms of returning user errors. If that's what you're looking for, then I still could use more specifics about what is missing from the linked docs.

@emm312
Copy link
Author

emm312 commented Sep 13, 2023

I mean like instead of how its documented (errors on the action code) you can make errors if the parser fails, so you can have more descriptive errors.
Also it'd be nice to say the rule name instead of a list of tokens, since its normally more descriptive and readable for the average user over a bunch of regex-s.

@dburgener
Copy link
Contributor

Ive looked through the source and in the ParseError there is a generic for it,

Can you point me at what specifically in the source you're looking at? I'm not finding whatever you see here.

I mean like instead of how its documented (errors on the action code) you can make errors if the parser fails, so you can have more descriptive errors.

So this is fundamentally a feature request for an ability for users to control ParseErrors? How do you envision this working?

Also it'd be nice to say the rule name instead of a list of tokens, since its normally more descriptive and readable for the average user over a bunch of regex-s.

Unless I'm missing something, this sounds like a separate unrelated request, in which case it should get its own issue to keep discussions straight.

@emm312
Copy link
Author

emm312 commented Sep 13, 2023

Can you point me at what specifically in the source you're looking at? I'm not finding whatever you see here.

ParseError<L, T, E>

So this is fundamentally a feature request for an ability for users to control ParseErrors? How do you envision this working?

Each rule can add an attribute to the error with a more specific kind, depicting things like the rule it errored on, etc.

Unless I'm missing something, this sounds like a separate unrelated request, in which case it should get its own issue to keep discussions straight.

Alright, got it

@dburgener
Copy link
Contributor

ParseError<L, T, E>

Okay, that generic is only used in the User field, which is fully documented in the "fallible actions" section of the book I've linked a few times. It sounds like your issue is that you don't want to use the User field via action code as it is documented, you'd like to augment it with more information about the parser grammar.

Perhaps someone with more LR(1) knowledge than me will chime in with a better answer, but I think the fundamental issue around things like "which rule did it error out on?" is that the parser isn't necessarily parsing a particular rule. It has a stack of tokens encountered, and a collection of possible rules it could match. I'm guessing you're envisioning a situation like:

MyRule:
    TokA TokB TokC TokD

where we've seen TokA followed by TokB, but then we see TokE instead of TokC

The problem is that in the general case, could have:

MyRule:
    TokA TokB TokC TokD

MyOtherRule:
    TokA TokB TokF TokG

MyThirdRule:
    TokA TokB TokH TokI

etc. So if we've seen TokA followed by TokB and now we see TokE, which rule are we parsing? We have been potentially parsing all three, until the unexpected token fails to match any of them. Since we don't know the rule we're parsing, it's not clear how to apply some custom rule-specific error handling, since there are three rules, which could each have their own error handling.

I suspect the cleanest answer here is to augment your parser so that you have cases for the expected cases you would like to have special errors for, and then use the fallible actions functionality to make special errors there.

@Pat-Lafon
Copy link
Contributor

This conversation reminded me of http://moscova.inria.fr/~fpottier/menhir/manual.pdf#section.11. There has been some cool work on menhir which I think addresses this issue. Maybe there is a simplified version of what they do though(Some staticly known error messages which over approximate the set of possible errors that lead to it). I would imagine this would be a fair amount of work.

@Elias-Graf
Copy link

Elias-Graf commented Feb 5, 2024

How would I do what @emm312 suggested, and is supposed to be possible:

you can use the error recovery or something similar to make it UserError::MissedBracket.

In this code for example:

DefWindow: Statement = {
    "(defwindow" <n:Symbol> <a:DefWindowArg*> ")" => {
        DefWindow::new(n, a).into()
    },
}

I want to detect if the name (n:Symbol) failed, and somehow get a custom error (one that could be recovered from so other "defwindow" could be contined to be validated), that the name is invalid / missing.

I tried creating a "DefWindowName" rule:

DefWindowName: Result<Symbol, crate::ast::ParseError> = {
    Symbol => Ok(<>),
    ! => {
        println!("{:?}", <>);
    },
};

But the issue here is, that it get's called multiple times. Let's take this input: (defwindow :geometry "x")". It would get called once for the actual problem token :, and once for the "x".
I'm assuming this is, because larlpop considers the error to be recovered from in the DefWindowName (but doesn't really explain to me why it ties to parse another name...). Thus, I also tried to return it like this:

DefWindowName: Result<Symbol, crate::ast::ParseError> = {
    Symbol => Ok(<>),
    ! =>? {
        Err(ParseError::User { error: MyError }
    },
};

But the issue with that code is, that I can't recover from it in the parent, e.g. DefWindow, and the whole parsing exits.

@yannham
Copy link
Contributor

yannham commented Feb 9, 2024

Here is how we do it in Nickel, not sure if that would fit your use-case, but it does seem so:

  1. We have an error node in the AST. So that we can recover from a local error (for example in the LSP), and still continue to parse other parts of the source file: https://github.com/tweag/nickel/blob/d899cf5848e6ac0958062376696aa39389de580e/core/src/term/mod.rs#L220-L223
  2. In the grammar, instead of bailing out on the first error - as you said, this would abort the parsing, which we don't want - we have a global parameter where we put all the errors we encountered: https://github.com/tweag/nickel/blob/d899cf5848e6ac0958062376696aa39389de580e/core/src/parser/grammar.lalrpop#L75
  3. Whenever we encounter an error, we push the error to the global error lists, and return an AST error node without using a fallible action (so, from the point of view of LALRPOP, there's in fact to error at all): https://github.com/tweag/nickel/blob/d899cf5848e6ac0958062376696aa39389de580e/core/src/parser/grammar.lalrpop#L438-L447

Here is an example of using this Error rule: https://github.com/tweag/nickel/blob/d899cf5848e6ac0958062376696aa39389de580e/core/src/parser/grammar.lalrpop#L249-L251

Doing so, you accumulate all the errors, and produce a valid AST as well. Then you can decide what to do with it. In theory you don't even need to accumulate the errors and can reconstruct the list from walking the AST, however if the AST is huge this can be costly to do so (and there might not even be a single error), so it's easier to just keep a duplicate list that is easily accessible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants