Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for user defined member variables in the grammar #223

Open
sandeep-datta opened this issue Apr 5, 2020 · 8 comments
Open

Add support for user defined member variables in the grammar #223

sandeep-datta opened this issue Apr 5, 2020 · 8 comments
Labels
feature Something not supported that perhaps should be

Comments

@sandeep-datta
Copy link
Contributor

Example ...

peg::parser!{
  grammar my_grammar() for str {
    // These variables will be updated by specific rules in the grammar
    indentation_level: u32,
    line_number: u32,
    ...
  }
}

I understand rule arguments can be used to simulate this feature but IMO it makes the grammar unwieldy.

@kevinmehall
Copy link
Owner

This is something that has come up before and I've been thinking about for a while. The closest prior art I know of is the C# library Pegasus. I never came up with a design for it I was happy with, or that seemed to be the best approach to solving the problems it would be used for, but it would be great to hear what you have in mind.

How do you envision these working with backtracking and caching? We basically assume that rule and grammar arguments are immutable, but for a feature explicitly designed for mutation, that's asking for trouble. As I understand it, Pegasus saves the variables before entering alternation or repetition, and restores them on backtracking. It would also need to include the values of all variables (along with input position) in the key for memoization.

What use cases do you have in mind for this? The primary one I've seen is parsing languages with significant whitespace, but I think you're way better off handling that in a custom lexer.

@sandeep-datta
Copy link
Contributor Author

How do you envision these working with backtracking and caching?

Yes, this complicates things. I think saving the member variables on the parse stack would be the correct way of handling this (not familiar with Pegasus but it seems this is what it does).

But IMO this is overkill for most purposes. I'd be happy if the parser generator can just fail with an error message when it detects a grammar where the member variables cannot be safely used with backtracking or caching.

As for use cases, I needed this to deal with significant white-space (like you correctly guessed).

For example in the line count grammar (see member_variables.rs in PR #224) the line count is updated only when there is a successful match. As far as I can tell backtracking does not invalidate the line count at any point. But it would have been great if rust-peg could confirm that. I am of-course assuming any such static analysis will not run afoul of the decidability problem.

@kevinmehall
Copy link
Owner

That analysis would break the encapsulation of rules. A rule may be safe on its own, but if it uses variables, it would be an error to call that rule within alternation or repetition. While we theoretically could do a whole-grammar analysis starting from the pub rules, this wouldn't play well with the future ability to import rules across grammars.

It would also mean parsing the Rust code to determine where a variable was read and written. Right now we just paste the action code blocks into the output without looking at their content.

@sandeep-datta
Copy link
Contributor Author

I think saving state on a "parse stack" (like Pegasus) is probably the best way to move forward with this.

It would also mean parsing the Rust code to determine where a variable was read and written.

What if we use some kind of sigil to flag member variable use? For example ...

 rule line()
        = (terminated_line() / unterminated_line()) { @line_count += 1; }

IIRC there is precedence for this in other parser generators and should not be too jarring.

@kevinmehall
Copy link
Owner

Have you tried parsing a language with significant indentation with a custom lexer? (Please suggest improvements for custom lexer support; it's the bare minimum right now).

For a feature that would be as invasive and inelegant as this one, I'm reluctant to add it when the only thing people want to use it for can probably be done more elegantly in a different way.

@sandeep-datta
Copy link
Contributor Author

Sure, let me try a custom lexer and get back to you.

@kevinmehall kevinmehall added the feature Something not supported that perhaps should be label May 20, 2020
@neunenak
Copy link
Contributor

This is something that has come up before and I've been thinking about for a while. The closest prior art I know of is the C# library Pegasus. I never came up with a design for it I was happy with, or that seemed to be the best approach to solving the problems it would be used for, but it would be great to hear what you have in mind.

How do you envision these working with backtracking and caching? We basically assume that rule and grammar arguments are immutable, but for a feature explicitly designed for mutation, that's asking for trouble. As I understand it, Pegasus saves the variables before entering alternation or repetition, and restores them on backtracking. It would also need to include the values of all variables (along with input position) in the key for memoization.

What use cases do you have in mind for this? The primary one I've seen is parsing languages with significant whitespace, but I think you're way better off handling that in a custom lexer.

A use case that I have where this feature would come in handy is creating a programming language AST where each node in the tree is tagged with a unique numeric ID upon creation (I have a bunch of AST processing code that assumes these unique IDs already exist). This needs state so that I can make sure I'm generating a fresh ID for each created tree node.

@kevinmehall
Copy link
Owner

@neunenak There's existing (but not documented) support for grammar arguments that can help with this. They're not saved and restored on backtracking, and are not really supposed to be mutable, but as long as you're either only assigning IDs after completing rules that can't backtrack into an alternative, or don't care about skipped IDs, it works:

extern crate peg;

pub struct Item {
    id: usize,
    name: String,
}

peg::parser! {
    grammar p(counter: &mut usize) for str {
        pub rule list() -> Vec<Item> = "[" items: item() ** "," "]" { items }
        rule item() -> Item = name: $(['a'..='z']+) {
            Item {
                id: { *counter += 1; *counter },
                name: name.to_owned()
            }
        }
    }
}

fn main() {
    let mut counter = 0;
    let r = p::list("[abcd,defg]", &mut counter).unwrap();
    assert_eq!(r[0].id, 1);
    assert_eq!(r[1].id, 2);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Something not supported that perhaps should be
Projects
None yet
Development

No branches or pull requests

3 participants