
Add a way to capture the delimiters in a delimited repeat. #259

Open
tomprince opened this issue Apr 26, 2021 · 6 comments

@tomprince

I'm looking at migrating full-moon to use rust-peg for parsing. However, since it captures the entire text (including whitespace and comments), I need to be able to capture the delimiters as well as the main items if I were to use ** or ++.

@kevinmehall
Owner

You could do something like:

rule list<I, S>(item: rule<I>, sep: rule<S>) -> (Option<I>, Vec<(S, I)>)
        = first:item() items:(s:sep() i:item() { (s, i) })* { (Some(first), items) }
        / { (None, vec![]) }

rule use_it() = list(<expr()>, <comma()>)

which is kind of like what ** expands to.
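
For reference, a rough, self-contained sketch of how this might be wired into a grammar over str (the expr, comma, and exprs names are just placeholders):

    peg::parser! {
        grammar example() for str {
            rule comma() -> &'input str = $(",")
            rule expr() -> &'input str = $(['a'..='z']+)

            rule list<I, S>(item: rule<I>, sep: rule<S>) -> (Option<I>, Vec<(S, I)>)
                = first:item() items:(s:sep() i:item() { (s, i) })* { (Some(first), items) }
                / { (None, vec![]) }

            pub rule exprs() -> (Option<&'input str>, Vec<(&'input str, &'input str)>)
                = list(<expr()>, <comma()>)
        }
    }

    fn main() {
        // The first item comes back on its own; each separator is paired with the item it precedes.
        assert_eq!(
            example::exprs("a,b,c").unwrap(),
            (Some("a"), vec![(",", "b"), (",", "c")])
        );
    }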

I would be interested to hear your experience and pain points in using this library for a lossless parser. Are you producing a typed or untyped syntax tree?

@tomprince
Author

You could do something like: [...]

It looks like the use of rule<...> as a type of rule argument isn't documented anywhere.

I would be interested to hear your experience and pain points in using this library for a lossless parser.

I've only just started working on converting the existing hand-built parser to peg, so I don't know what pain points I'll run into. This is the first major one.

A couple of minor points:

  • I often have rule fragments like (a:e1 b:e2 {(a,b)}). It would be nice if I could instead just say (e1 e2).
  • Parsing against a complex [T] [1] requires defining a helper trait with a bunch of methods and using the undocumented ## to call them. I'm not sure if there is something that could be done to make this more ergonomic.[2] (A simplified sketch of the helper-trait approach is shown below.)

[1] I'm adapting an existing split lexer + parser that clusters trivia like whitespace/comments with the adjacent tokens before parsing, so I'm parsing these token clusters (which also include position information), but only care about the root token for determining the parse.
[2] I realized as I was writing this that I could also use [token] {? if token ... } but something like

rule number() -> TokenReference<'text>
    = [token] {? if let TokenType::Number { number } = *token {
            Ok(token.with_value(number))
        } else {
            Err("not a number")
        }
     }

still feels a little bit awkward.
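
For comparison, here is roughly what the helper-trait + ## approach mentioned above looks like, heavily simplified. The token types and the number method are illustrative (not full-moon's), and I'm assuming the shape ## expects: the method receives the current position and returns a peg::RuleResult.

    use peg::RuleResult;

    // Illustrative token types, not full-moon's.
    #[derive(Clone, Debug)]
    pub enum TokenType {
        Number(f64),
        Symbol(String),
        // other kinds elided
    }

    #[derive(Clone, Debug)]
    pub struct Token {
        pub ty: TokenType,
        // trivia and position info elided
    }

    // Helper trait whose methods the grammar calls via `##number()`.
    // Each method takes the current position and reports either a match
    // (the new position plus a value) or a failure.
    pub trait TokenInput {
        fn number(&self, pos: usize) -> RuleResult<f64>;
    }

    impl TokenInput for [Token] {
        fn number(&self, pos: usize) -> RuleResult<f64> {
            match self.get(pos) {
                Some(Token { ty: TokenType::Number(n), .. }) => RuleResult::Matched(pos + 1, *n),
                _ => RuleResult::Failed,
            }
        }
    }

    peg::parser! {
        grammar tokens() for [Token] {
            pub rule number() -> f64 = ##number()
        }
    }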

Are you producing a typed or untyped syntax tree?

I'm not sure what you mean by this?

@tomprince
Author

I would be interested to hear your experience and pain points in using this library for a lossless parser.

I just discovered that I can't implement ParseLiteral for my [T]. I was going to experiment with using this to allow matching symbols in the parser using string literal syntax. Though, even if I could, that would allow me to write a grammar with an invalid symbol that would only be detected at runtime.

@godmar

godmar commented Jul 3, 2021

You could do something like:

rule list<I, S>(item: rule<I>, sep: rule<S>) -> (Option<I>, Vec<(S, I)>)
        = first:item() items:(s:sep() i:item() { (s, i) })* { (Some(first), items) }
        / { (None, vec![]) }

rule use_it() = list(<expr()>, <comma()>)

which is kind of like what ** expands to.

I also have a use case where I'd like to collect the delimiters.
For instance, in a bash-style shell grammar, pipelines are separated by & or ; and within a pipeline, commands may be separated by | or |&. Before stumbling on this issue, my solution required 4 rules instead of 1 in each case; in general, with n choices of delimiters, it would be 2*n rules if I'm seeing this correctly.

So adding syntactic sugar may be useful. Also, it should probably return the separator that follows an item rather than the separator that precedes it (at least for my use case).

I'm currently successfully using the list<> rule given above. Very elegant.
For reference, the resulting code is:

    pub rule cmdline() -> Result<CommandLine, &'input str>
      = delimited_cmdline: list(<pipeline()>, <pipeline_separator()>) {
            let (pipe0, rest) = delimited_cmdline;
            let mut pipelines = vec![pipe0?];

            for (i, (sep, pipe)) in rest.into_iter().enumerate() {
                if matches!(sep, "&") {
                    let last = &mut pipelines[i];
                    last.bg_job = true;
                }
                pipelines.push(pipe.unwrap());
            }

            Ok(CommandLine {
                pipelines
            })
        }

    rule pipeline_separator() -> &'input str
        = $(";") / $("&")

@kevinmehall
Owner

Also, it should probably return the separator that follows an item rather than the separator that precedes it (at least for my use case).

Yeah, one argument against making this some kind of built-in syntax is the number of different return types you might want, depending on how the separators associate with the items and whether empty lists and leading/trailing separators should be allowed:

  • (I, Vec<(S, I)>)
  • (Vec<(I, S)>, I)
  • Vec<(I, Option<S>)>
  • Vec<(Option<S>, I)>
  • (Option<I>, Vec<(S, I)>)
  • (Vec<(I, S)>, Option<I>)
  • etc

(where I is the item and S is the separator)
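
For illustration, one of those variants written as a user-defined rule in the same style as list above: a non-empty list where each separator is paired with the item it follows, with an optional trailing separator allowed (an untested sketch):

    rule list_trailing<I, S>(item: rule<I>, sep: rule<S>) -> Vec<(I, Option<S>)>
        = first:item() rest:(s:sep() i:item() { (s, i) })* trailing:sep()? {
            // Pair each separator with the item that precedes it; only the last
            // item may end up without a separator.
            let mut out = Vec::new();
            let mut prev = first;
            for (s, i) in rest {
                out.push((prev, Some(s)));
                prev = i;
            }
            out.push((prev, trailing));
            out
        }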

@godmar

godmar commented Jul 19, 2021

The better alternative may then in fact be to improve the documentation for the technique that uses rule<...> arguments; if the example is included in the README, users should be able to quickly create whichever variant is best for them.
