Add support for user defined member variables in the grammar #223

sandeep-datta · 2020-04-05T01:17:09Z

Example ...

peg::parser!{
  grammar my_grammar() for str {
    // These variables will be updated by specific rules in the grammar
    indentation_level: u32,
    line_number: u32,
    ...
  }
}

I understand rule arguments can be used to simulate this feature but IMO it makes the grammar unwieldy.

…#223.

kevinmehall · 2020-04-05T16:03:50Z

This is something that has come up before and I've been thinking about for a while. The closest prior art I know of is the C# library Pegasus. I never came up with a design for it I was happy with, or that seemed to be the best approach to solving the problems it would be used for, but it would be great to hear what you have in mind.

How do you envision these working with backtracking and caching? We basically assume that rule and grammar arguments are immutable, but for a feature explicitly designed for mutation, that's asking for trouble. As I understand it, Pegasus saves the variables before entering alternation or repetition, and restores them on backtracking. It would also need to include the values of all variables (along with input position) in the key for memoization.

What use cases do you have in mind for this? The primary one I've seen is parsing languages with significant whitespace, but I think you're way better off handling that in a custom lexer.

sandeep-datta · 2020-04-05T16:38:44Z

How do you envision these working with backtracking and caching?

Yes, this complicates things. I think saving the member variables on the parse stack would be the correct way of handling this (not familiar with Pegasus but it seems this is what it does).

But IMO this is overkill for most purposes. I'd be happy if the parser generator can just fail with an error message when it detects a grammar where the member variables cannot be safely used with backtracking or caching.

As for use cases, I needed this to deal with significant white-space (like you correctly guessed).

For example in the line count grammar (see member_variables.rs in PR #224) the line count is updated only when there is a successful match. As far as I can tell backtracking does not invalidate the line count at any point. But it would have been great if rust-peg could confirm that. I am of-course assuming any such static analysis will not run afoul of the decidability problem.

kevinmehall · 2020-04-05T16:56:48Z

That analysis would break the encapsulation of rules. A rule may be safe on its own, but if it uses variables, it would be an error to call that rule within alternation or repetition. While we theoretically could do a whole-grammar analysis starting from the pub rules, this wouldn't play well with the future ability to import rules across grammars.

It would also mean parsing the Rust code to determine where a variable was read and written. Right now we just paste the action code blocks into the output without looking at their content.

sandeep-datta · 2020-04-06T04:36:03Z

I think saving state on a "parse stack" (like Pegasus) is probably the best way to move forward with this.

It would also mean parsing the Rust code to determine where a variable was read and written.

What if we use some kind of sigil to flag member variable use? For example ...

 rule line()
        = (terminated_line() / unterminated_line()) { @line_count += 1; }

IIRC there is precedence for this in other parser generators and should not be too jarring.

kevinmehall · 2020-04-11T17:15:54Z

Have you tried parsing a language with significant indentation with a custom lexer? (Please suggest improvements for custom lexer support; it's the bare minimum right now).

For a feature that would be as invasive and inelegant as this one, I'm reluctant to add it when the only thing people want to use it for can probably be done more elegantly in a different way.

sandeep-datta · 2020-04-16T09:18:10Z

Sure, let me try a custom lexer and get back to you.

neunenak · 2021-11-13T23:16:58Z

This is something that has come up before and I've been thinking about for a while. The closest prior art I know of is the C# library Pegasus. I never came up with a design for it I was happy with, or that seemed to be the best approach to solving the problems it would be used for, but it would be great to hear what you have in mind.

How do you envision these working with backtracking and caching? We basically assume that rule and grammar arguments are immutable, but for a feature explicitly designed for mutation, that's asking for trouble. As I understand it, Pegasus saves the variables before entering alternation or repetition, and restores them on backtracking. It would also need to include the values of all variables (along with input position) in the key for memoization.

What use cases do you have in mind for this? The primary one I've seen is parsing languages with significant whitespace, but I think you're way better off handling that in a custom lexer.

A use case that I have where this feature would come in handy is creating a programming language AST where each node in the tree is tagged with a unique numeric ID upon creation (I have a bunch of AST processing code that assumes these unique IDs already exist). This needs state so that I can make sure I'm generating a fresh ID for each created tree node.

kevinmehall · 2021-11-21T19:57:51Z

@neunenak There's existing (but not documented) support for grammar arguments that can help with this. They're not saved and restored on backtracking, and are not really supposed to be mutable, but as long as you're either only assigning IDs after completing rules that can't backtrack into an alternative, or don't care about skipped IDs, it works:

extern crate peg;

pub struct Item {
    id: usize,
    name: String,
}

peg::parser! {
    grammar p(counter: &mut usize) for str {
        pub rule list() -> Vec<Item> = "[" items: item() ** "," "]" { items }
        rule item() -> Item = name: $(['a'..='z']+) {
            Item {
                id: { *counter += 1; *counter },
                name: name.to_owned()
            }
        }
    }
}

fn main() {
    let mut counter = 0;
    let r = p::list("[abcd,defg]", &mut counter).unwrap();
    assert_eq!(r[0].id, 1);
    assert_eq!(r[1].id, 2);
}

sandeep-datta added a commit to sandeep-datta/rust-peg that referenced this issue Apr 5, 2020

Implemented feature: member variables. Fixes github issue kevinmehall…

7404fec

…#223.

sandeep-datta mentioned this issue Apr 5, 2020

Implemented feature: member variables. Fixes github issue #223. #224

Closed

sandeep-datta added a commit to sandeep-datta/rust-peg that referenced this issue Apr 5, 2020

Implemented feature: member variables. Fixes github issue kevinmehall…

77ff5e4

…#223.

kevinmehall added the feature Something not supported that perhaps should be label May 20, 2020

kevinmehall mentioned this issue Jan 21, 2022

How do I count the tab depth #285

Closed

mcandre mentioned this issue Apr 9, 2023

include line numbers in AST nodes mcandre/unmake#72

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for user defined member variables in the grammar #223

Add support for user defined member variables in the grammar #223

sandeep-datta commented Apr 5, 2020

kevinmehall commented Apr 5, 2020

sandeep-datta commented Apr 5, 2020

kevinmehall commented Apr 5, 2020

sandeep-datta commented Apr 6, 2020

kevinmehall commented Apr 11, 2020

sandeep-datta commented Apr 16, 2020

neunenak commented Nov 13, 2021

kevinmehall commented Nov 21, 2021

Add support for user defined member variables in the grammar #223

Add support for user defined member variables in the grammar #223

Comments

sandeep-datta commented Apr 5, 2020

kevinmehall commented Apr 5, 2020

sandeep-datta commented Apr 5, 2020

kevinmehall commented Apr 5, 2020

sandeep-datta commented Apr 6, 2020

kevinmehall commented Apr 11, 2020

sandeep-datta commented Apr 16, 2020

neunenak commented Nov 13, 2021

kevinmehall commented Nov 21, 2021