Improve documentation and examples how to handle strings properly #314

Open
jbx1 opened this issue Jul 24, 2022 · 5 comments

jbx1 commented Jul 24, 2022

The documentation only shows examples of parsing numbers and single characters. Almost all the tests also don't parse strings, which makes it hard to know what one needs to do, especially if one is a bit of a beginner in Rust and rust-peg.

The information about the 'input lifetime annotation is a bit elusive (not documented), and it is not clear how this affects the lifetime annotations needed for any structs receiving the parsed str or Vec of str.

It would also be great if there were some recommendations as to how strings should be parsed, and if zero-copy can be achieved in any way.

Some proper documentation with a few examples of parsing a single string or a Vec of strings (with operators such as ** and ++) would be really helpful.

@kevinmehall
Owner

The $() operator returns an &'input str slice of the input string corresponding to the text matched by the expression inside, and is zero-copy:

pub rule alphanumeric1() -> &'input str = $(['a'..='z' | 'A'..='Z' | '0'..='9']+)

though if you want to copy it into an owned String you can do so in an action:

pub rule alphanumeric2() -> String = v:$(['a'..='z' | 'A'..='Z' | '0'..='9']+) { v.to_owned() }

You can compose these into something that parses a sequence of strings:

pub rule alphanumeric_seq1() -> Vec<&'input str> = alphanumeric1() ** ","
pub rule alphanumeric_seq2() -> Vec<String> = alphanumeric2() ** ","

or inline the rule if you don't want the separate rule:

pub rule alphanumeric_seq2a() -> Vec<String> = (v:$(['a'..='z' | 'A'..='Z' | '0'..='9']+) { v.to_owned() }) ** ","

If by "string" you mean something like a quoted string literal, it gets a little more complicated to handle escape sequences rather than a simple slice of the input:

pub rule double_quoted_string() -> String
    = "\"" s:double_quoted_character()* "\"" { s.into_iter().collect() }

rule double_quoted_character() -> char
    = [^ '"' | '\\' | '\r' | '\n']
    / "\\n" { '\n' }
    / "\\u{" value:$(['0'..='9' | 'a'..='f' | 'A'..='F']+) "}" {?
        u32::from_str_radix(value, 16).ok().and_then(char::from_u32).ok_or("valid unicode code point")
    }
    / expected!("valid escape sequence")
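
For readers unfamiliar with the {? ... } block above: it is a conditional action that returns a Result, and an Err makes the rule fail with the given message as the expected item. The conversion itself is plain standard-library Rust and can be tried outside the grammar. A minimal sketch, where the function name decode_unicode_escape is hypothetical and not part of rust-peg:

```rust
/// Decode the hex digits of a `\u{...}` escape into a `char`.
/// Hypothetical helper name; the body mirrors the action in the grammar above.
fn decode_unicode_escape(hex: &str) -> Result<char, &'static str> {
    u32::from_str_radix(hex, 16) // Err on non-hex input or overflow
        .ok()
        .and_then(char::from_u32) // None for surrogates and values > 0x10FFFF
        .ok_or("valid unicode code point")
}
```

Values such as D800 (a surrogate) or 110000 (out of range) parse as hex but are rejected by char::from_u32, which is why the grammar needs the fallible action rather than a plain { ... } block.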

Hope that helps. Leaving this issue open for these examples to be integrated somewhere in the documentation.


jbx1 commented Jul 28, 2022

That's great. Maybe a bit more details about the semantics of the 'input lifetime would be helpful.

@kevinmehall
Owner

The 'input lifetime is simply the lifetime of the input argument in the generated parse function. So a rule like

pub rule x() -> Vec<&'input str> = ($(['a'..='z'])) ** ","

expands into a function like

fn x<'input>(input: &'input str) -> Result<Vec<&'input str>, ParseError>

In #299 (probably for 0.9), the name will be customizable instead of hard-coded, making it seem a little less magical.
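
On the original question of how this affects structs that receive the parsed slices: a struct holding &'input str needs its own lifetime parameter, while one holding String does not. The sketch below is plain Rust with no peg dependency; Record, OwnedRecord, parse_record and parse_owned are hypothetical names, and the hand-written functions only stand in for a generated parser.

```rust
/// Hypothetical AST node that borrows from the input instead of copying.
#[derive(Debug, PartialEq)]
struct Record<'input> {
    fields: Vec<&'input str>,
}

/// Hand-written stand-in for a generated parse function: every returned
/// slice points into `input`, so the result cannot outlive it.
fn parse_record<'input>(input: &'input str) -> Record<'input> {
    Record { fields: input.split(',').collect() }
}

/// Owned variant: each field is copied, so no lifetime parameter is needed.
#[derive(Debug, PartialEq)]
struct OwnedRecord {
    fields: Vec<String>,
}

/// Copies every field into a `String`, so the result may outlive `input`.
fn parse_owned(input: &str) -> OwnedRecord {
    OwnedRecord { fields: input.split(',').map(str::to_owned).collect() }
}
```

The borrowed form is zero-copy but ties the result to the input buffer; the owned form costs one allocation per field and can be returned or stored freely.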


YingboMa commented Apr 12, 2023

How can we match unicode identifiers? Is it possible to use unicode-ident in the grammar?

@kevinmehall
Owner

Yes, [ ] patterns accept a boolean if guard, like the arms of a Rust match, so you can do something like

rule identifier() -> &'input str = $([c if is_xid_start(c)] [c if is_xid_continue(c)]*)
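
To make the matching behaviour of that rule concrete, here is a rough hand-written equivalent in plain Rust. It is only a sketch: the two predicates are local stand-ins approximated with standard-library character classes (an assumption made to avoid a dependency), not the unicode-ident functions, and the function name identifier_prefix is hypothetical.

```rust
// Local stand-ins for the unicode-ident predicates (assumption: approximated
// with std character classes; the real XID tables differ in edge cases).
fn is_xid_start(c: char) -> bool {
    c.is_alphabetic()
}

fn is_xid_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Return the longest identifier at the start of `input` as a zero-copy
/// slice, or `None` if the first character cannot start an identifier.
fn identifier_prefix(input: &str) -> Option<&str> {
    let mut chars = input.char_indices();
    let (_, first) = chars.next()?;
    if !is_xid_start(first) {
        return None;
    }
    // Byte offset of the first character that cannot continue the identifier.
    let end = chars
        .find(|&(_, c)| !is_xid_continue(c))
        .map_or(input.len(), |(i, _)| i);
    Some(&input[..end])
}
```

As with $() in the grammar, the result is a slice of the input, so multi-byte characters are handled by byte offset rather than by copying.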
