Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] support custom state and hooks #815

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

skyzh
Copy link

@skyzh skyzh commented Feb 26, 2023

Guided-Level Explanation

This PR adds a "context" for parsing with pest (for one of my course projects where I'm using pest...). Now users can parse with a custom state as a context. One example is the following grammar, where we want ints to parse a sequence of non-decrementing numbers.

WHITESPACE   =  _{ " " | "\t" | NEWLINE }
int          =  @{ (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT+ | ASCII_DIGIT) }
__HOOK_INT   =  _{ int }
ints         =  { SOI ~ __HOOK_INT* ~ EOI }

For all terms beginning with __HOOK, the generator will call user-defined hooks before deciding whether to parse.

    #[derive(Parser)]
    #[grammar = "../examples/hook.pest"]
    #[custom_state(crate::parser::CustomState)]
    pub struct Parser;

    pub struct CustomState {
        pub max_int_visited: usize,
    }

    impl Parser {
        #[allow(non_snake_case)]
        fn hook__HOOK_INT<'a>(state: &mut CustomState, span: Span<'a>) -> bool {
            let val: usize = span.as_str().parse().unwrap();
            if val >= state.max_int_visited {
                state.max_int_visited = val;
                true
            } else {
                false
            }
        }
    }

The state will need to support snapshot / recovery so that the hook can be placed anywhere. But users can also opt-out recovery and modify their grammar to avoid this situation. Examples in hook.rs. and hook.pest.

Implementation-Level Explanation

ParseState now contains a field for storing custom state. It is by-default set to ().

pub struct ParserState<'i, R: RuleType, S = ()>
where
    S: StateCheckpoint,

We now have a new parser trait called StateParser, where we accept a state + input and return rules + state.

/// A trait with a single method that parses strings.
pub trait StateParser<R: RuleType, S: StateCheckpoint> {
    /// Parses a `&str` starting from `rule`.
    #[allow(clippy::perf)]
    fn parse_with_state(rule: R, input: &str, state: S) -> Result<(S, Pairs<'_, R>), Error<R>>;
}

The generator crate is refactored to implement custom state generic type for all traits. Some functions also have new signatures in order to accept custom state.

The API change doesn't seem to affect any existing examples.

This PR is a demo and I'm open to suggestions / changes to this new feature. If it doesn't work I can also maintain it in my own fork. Thanks for review!

Signed-off-by: Alex Chi <iskyzh@gmail.com>
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Signed-off-by: Alex Chi <iskyzh@gmail.com>
@skyzh skyzh requested a review from a team as a code owner February 26, 2023 17:11
@skyzh skyzh requested review from NoahTheDuke and removed request for a team February 26, 2023 17:11
@NoahTheDuke
Copy link
Member

This feels like a job for the compiler, not the parser. I know that some PEG libraries can do this, such as the one Python uses for interpreting Python code, but in generic libraries, makes me think we're overstepping the core premise.

I'm willing to be convinced, just want to get that out first.

@skyzh
Copy link
Author

skyzh commented Feb 26, 2023

I totally agree that this adds too much complexity for the parser. But meanwhile I feel without custom state and hooks, it will be hard to do some context-aware parsing. An example is parsing a C-like language.

typedef int foo;
int main() {
  foo * bar;
}

Whether to parse foo as an identifier (and then parse the statement as expression multiply expression), or to parse it as foo* bar, depends on if foo is type-defined before. If we don't have a context, then we will need to do some extra work after parsing to revert the wrong decision, which can be painful.

I'm also thinking of removing CheckpointState trait in this PR, and we can have something like action in peg.js. This is easier to understand from the semantic aspect. https://pegjs.org/documentation

Hence, an alternative is to support a syntax like:

Typedef = { "typedef" ~ Type ~ Ident { SaveTypeAction } ~ ";" }
Type = { Ident { CheckTypeAction } /* user def type */ | Int | Void | Bool }

so that we can run some custom action to check / save something, while error recovery becomes some kind of undefined behavior.

@CAD97
Copy link
Contributor

CAD97 commented Feb 27, 2023

It's both a strength and a limitation of pest that it sticks to pure1 PEG, without any sort of user actions in the grammar.

If pest does want to support non-PEG semantics, I believe the way to go about doing so is to add the ability to fully define custom BUILTIN style rules (e.g. PUSH/PEEK/POP, though probably without the parameterization), similar to how logos callbacks function, and the current builtins. If it wants to do more advanced parsing, it can maybe recursively call into the parser its a part of, or use a different parser.

Any newly designed language SHOULD NOT emulate C's lexer hack. The documentation should make it clear that using the functionality is discouraged and should be avoided if that's at all an option.

To make these special rules stick out more (whether they're just binary acceptance functions or more sophisticated custom rules), they should look different from normal rules. Three arbitrary proposals could be HOOK_INT = native (only possible if the entire match is done natively via the ParserState API), native HOOK_INT = { ... } (obviously sticks out, gives a place to put a pre-hook grammar), or HOOK_INT = ?{ ... } (fits with other rule modifiers, but as such very quiet; % is also available as a character). I very much would like to avoid introducing any new special name semantics; I remain sadge at the opaqueness of WHITESPACE = _{ WHITE_SPACE } as an actual meaningful thing to write without those names being mentioned anywhere else.

The save/restore interface also imho should be improved before it's exposed to consumers; it should either just be a Clone of the state (if we just care about it working) or something along the lines of fn save(&self) -> Self::Snapshot; fn restore(&mut self, &Self::Snapshot); (if minimizing cloned state matters more), with Self::Snapshot being stored on a stack managed by the parser (e.g. Stack::Snapshot = usize).

Ultimately, I don't have any strong reason to block the availability of this functionality in pest. It feels very "pest3" in that it's adding entirely new capabilities on top of the existing jank, but it's certainly a need that people feel, so it's valid to serve that need.

Footnotes

  1. Technically speaking, PUSH/PEEK/POP break the quality of pest only parsing true context-free languages (of which PEG is known to be a subset and conjectured to be a proper subset), as it adds a manipulable stack, making the matching engine computationally in the class of turing machines rather than finite automaton.

Copy link
Contributor

@tomtau tomtau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this can be added to the existing codebase in a non-breaking way. I agree with what was mentioned above that it has use cases and it's perhaps a better fit with a "simplified" / less complex grammar (as in pest3: #333 (comment) )

@skyzh would you be interested in dusting off https://github.com/pest-parser/pest3 and trying to add hooks there? Once it's somewhat useable, we can perhaps add pest3 stuff here under a feature flag and/or v3 modules until it's stabilized and there's tooling that can help people to switch to a new grammar?


/// A trait with a single method that parses strings.
pub trait Parser<R: RuleType> {
pub trait Parser<R: RuleType, S: Default + StateCheckpoint = ()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't this be a semver-breaking change? i.e. existing pest 2.X-generated code implemented the trait with one param and wouldn't compile with this/would need to be regenerated: https://github.com/pest-parser/pest/actions/runs/4276068348/jobs/7480923758#step:4:253

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, that's more breakage from the fact that the runtime and derive library versions aren't pinned to each other, isn't it... adding a new defaulted generic to a type is backwards compatible, as old code will use the default, but it's not forwards compatible, in that generated code supplying the parameter won't work with the older library versions without it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bootstrap path will always be interesting because it has a weird mix of using the local version for some stuff and the crates-io version for other stuff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yes, it's not forwards-compatible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that should be ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sans figuring out the bootstrap)

@skyzh
Copy link
Author

skyzh commented Mar 3, 2023

Thanks a lot for your information!

HOOK_INT = ?{ ... }

I personally would prefer this grammar.

The save/restore interface also imho should be improved before it's exposed to consumers; it should either just be a Clone of the state (if we just care about it working) or something along the lines of fn save(&self) -> Self::Snapshot; fn restore(&mut self, &Self::Snapshot); (if minimizing cloned state matters more), with Self::Snapshot being stored on a stack managed by the parser (e.g. Stack::Snapshot = usize).

Agree.

@skyzh would you be interested in dusting off https://github.com/pest-parser/pest3 and trying to add hooks there? Once it's somewhat useable, we can perhaps add pest3 stuff here under a feature flag and/or v3 modules until it's stabilized and there's tooling that can help people to switch to a new grammar?

I can take a look but I'm not sure if it will work 🤪 The pest3 repo doesn't seem to have development activities since 2 years ago...

won't this be a semver-breaking change?

I tried designing the API to be compatible but it seems that this will be a semver-breaking change. If users have manually implemented the Parser trait, probably they will need to change it... Therefore, I would probably make hook behind a feature gate.


I think people still have diverged opinions on whether to have hooks built-in pest. Probably it would be a good idea to leave this PR open and see if there's any need for this feature in the future?

Meanwhile, to make hooks more useable in pest, I'll probably maintain my version of pest in this branch and see if there're potential users that are interested in using that...

  • Add everything behind a feature gate "custom-state"?
  • Add pest grammar support for int = ?{ ... } that will eventually call hook_int in the parser implementation, instead of the current __HOOK hacks.
  • Make the StateCheckpoint API more intuitive, or clone the full user snapshot instead of giving maximum control to user.

Thanks for your reviews and comments!

@tomtau
Copy link
Contributor

tomtau commented Mar 3, 2023

I can take a look but I'm not sure if it will work 🤪 The pest3 repo doesn't seem to have development activities since 2 years ago...

Yeah, the pest3 repo is somewhat abandoned -- so it'd rather mean restarting the work of that reworked / simplified pest codebase there.

In any case, you can probably proceed as you outlined by continuing the updates in this PR. I think potentially this PR could be merged if the hooks functionality is exposed under a non-default feature flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants