Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for context-sensitive parsing #158

Open
rtoal opened this issue Sep 20, 2016 · 14 comments
Open

Allow for context-sensitive parsing #158

rtoal opened this issue Sep 20, 2016 · 14 comments

Comments

@rtoal
Copy link
Contributor

rtoal commented Sep 20, 2016

It would be helpful to support state during parsing to enable such things as the off-side rule. This seems to be already called out as a planned extension. From the MSA paper:

While we have found Ohm’s expressive power to be sufficient for most tasks, several users have requested better support for parsing context-sensitive languages like HTML and Python. We plan to investigate whether a solution like parsing contexts (as implemented in PetitParser) could be adapted to Ohm.

@MuhammedZakir
Copy link

MuhammedZakir commented Aug 8, 2021

Apologiies if what I am asking for isn't releated to this issue.


Is it possible to get something similiar to regex capture group? Examples (syntax can be anything):

Using @rule for capturing, and $rule for using a captured rule.

html_elem = "<" @tag ">" anythingExceptClosing<$tag> "</" $tag ">"
fenced_code_block = @f_c_marker lineWithout<$f_c_marker> $f_c_marker

In the above, the scope of a captured rule need to considered. I am not that familiar with parsing, so I am not sure if I would be of any help regarding this. However, if you need further clarification, or any input, feel free to ask it. I will do what I can. :-)

Edit: One more thing needs to considered: To use same rule for multiple captures in same scope, for the above syntax, captured index can be mentioned.

Reason:

Like I mentioned in #215 [1], I am trying to write a grammar for Markdown. I am stuck at writing a rule for fenced code blocks. From CommonMark spec (above example-124 [2]):

The closing code fence must be at least as long as the opening fence

Without capture group/node, If I am right, the only way I can think of to solve this is to use a global state. That will make this more complicated, as I will have to save source string of some parsed non-code-blocks, and re-parse some code blocks. I haven't implemented this, so I am not sure how feasible this is. If I am doing this in a wrong way, or over-complicating this, please do say so! :-)

[1] #215 (comment)
[2] https://spec.commonmark.org/0.30/#example-124

@MuhammedZakir
Copy link

I think the above will also work nicely with indentation-sensitive languages.

Simple example:

class = @class_indent class_rule
func = @func_indent func_rule ($func_indent func_rule)*
statement = @stmt_indent statement_rule ($stmt_indent statatement_rule)*

class_indent = indent
// optional indent because both function and statement
// could also be in main scope.
func_indent = $class_indent? indent
stmt_indent = $func_indent? indent

indent = space+ | tab+

For this to work, a capture's scope should be such that the a rule should be able to access captures in it's parent scope/rules - like functions in most programming languages, which can access variables in its' outer scope. Therefore, in the above grammar, class_rule should contain func for func rule to access @class_indent capture, and func_rule should contain statement for statement to access @func_indent capture.

A random thought. Hope this makes sense! :-)

@pdubroy
Copy link
Contributor

pdubroy commented Aug 10, 2021

@MuhammedZakir Thanks for contributing your thoughts here! That's an interesting suggestion, and maybe something like this could work for Ohm. I still need some time to think through all of the details, but this has inspired me to see what I could find in the literature for solving this problem. Here's what I found after a quick search:

  • Indentation Sensitive Languages by Brunauer and Mühlbacher. They propose a concept called counter binding.
  • Indentation-Sensitive Parsing for Parsec by Adams and Agacan proposes an extension to PEGs to support indentation-sensitive languages. About Brunauer and Mühlbacher's approach, they write:

    While encoding indentation sensitivity this way is formally precise, it comes at a cost. The YAML specification (Ben-Kiki et al. 2009) uses the approach proposed by Brunauer and Mühlbacher (2006) and as a result has about a dozen and a half different nonterminals for various sorts of whitespace and comments. With this encoding, the grammar cannot use a separate tokenizer and must be scannerless, each possible occurrence of whitespace must be explicit in the grammar, and the grammar must carefully track which non-terminals produce or expect what sorts of whitespace. The authors of the YAML grammar establish naming conventions for nonterminals that help manage this, but the result is still a grammar that is difficult to comprehend and even more difficult to modify.

I'd like to take the time to read this papers and think through the implications for Ohm. Maybe you're interested to read these as well!

@MuhammedZakir
Copy link

MuhammedZakir commented Aug 28, 2021

A Symbol-Based Extension of Parsing Expression Grammars and Context-Sensitive Packrat Parsing

Paper: https://dl.acm.org/doi/10.1145/3136014.3136025
Talk: https://www.youtube.com/watch?v=4QrZxfgZFDg
Implementation: http://nez-peg.github.io/
Example: https://github.com/nez-peg/nez-grammar/blob/master/xml-sym.nez

  • Edit: add link to paper.

All symol operations in SPEG:

  • Symbol Table Mutation
    • <symbol A>: captures and binds A
    • <reset A>: remove all A entries
  • Symbol Table Scoping
    • <block e>: narrowing scope of all symbols in e
  • Symbol Table Matching (no state mutation)
    • <match A>: matches with A
    • <exists A>: tests whether A exists
    • <is A>: tests whether the capture is A
    • <isa A>: tests whether the capture is one of A's entries

--

Edit:

Might be helpful (haven't read this):

Is stateful packrat parsing really linear in practice? a counter-example, an improved grammar, and its parsing algorithms

https://dl.acm.org/doi/10.1145/3377555.3377898

@MuhammedZakir
Copy link

Any updates/thoughts? 👀

@pdubroy
Copy link
Contributor

pdubroy commented Feb 2, 2022

@MuhammedZakir Unfortunately I haven't had the time to investigate this deeply yet. But, the SPEG approach seems nice and I could imagine it fitting well into Ohm.

I do have some more time in the coming weeks/months so this may be something that I can dig into soon.

@MuhammedZakir
Copy link

Glad to hear that! :-)

FYI: Pest supports context-sensitive parsing: https://docs.rs/pest_derive/latest/pest_derive/#push-pop-drop-and-peek.

@haikyuu
Copy link

haikyuu commented Apr 12, 2022

I'm trying to extend the grammar of Ohm to include some indentation specific operators as specified in the paper Indentation-sensitive parsing for Parsec also linked above.

This will change the following part of Ohm grammar:

-    Seq = Iter*
-
+    Seq = IterWithIndentation*
+
+  IndentationRel = Eq | Ge | Gt | Any
+  
+  Eq = "@="
+  Ge = "@>="
+  Gt = "@>"
+  Any = "@*"
+ 
+  IterWithIndentation
+    = Iter Ge -- indent_ge
+    | Iter Gt -- indent_gt
+    | Iter Eq -- indent_eq
+    | Iter Any -- indent_any
+    | Iter

With these changes, we can write the following grammar for (one of the productions of) while statement in python:

While_stm = "while"@> Test ":"@> Suite

Suite
  = newline@> Block@>
  | Stm_list newline@>

Block = Statement@=*

We can now write

while a < 100:
    a = a + 1
    a = a + 4
while a < 10: a=a+1
  • The operator ":"@>'s intent is to say that after ":", code is meant to be at a strictly greater indentation in order to be part of the Rule. That's why the second while is detected as a separate statement.
  • Block = Statement@=*'s intent is to say that the block is a sequence of statements that all have their first token start at the same column
  • X@>= is an extension of the @> that allows being at the same level of indentation
  • Y@* allows for tokens to be at any level of indentation. It would be used to specify the following code
b = """e
eee
1
"""

In order to be able to handle this logic, Ohm would need to make two properties available at each parse:

  • The allowed indentation set: e.g{2, 3 ...}. This covers @>, @>= and @*
  • A boolean flag for wether the current parse is inside a @= or not

Each parse needs these properties and produces a new set of properties for the next parse.
One thing to note is that terminals default to have @> in order to avoid making it cumbersome to write the grammar. That would be configurable but it's a sane default for an indentation sensitive language.

Side note

The suggested operators are two characters long, but I believe they read well @> :at a greater indentation and we could make the @ light gray in syntax highlighting to make it stand less or maybe even have it displayed like the paper using the mathematical form if the cursor isn't on it. But that's not time for these considerations yet.
Stm*>

@MuhammedZakir
Copy link

Wouldn't a general method/operator such as the one I mentioned above [1] solves this?

[1] #158 (comment)

@pdubroy
Copy link
Contributor

pdubroy commented Apr 13, 2022

@haikyuu Thanks for taking a go at this! This is definitely an interesting experiment.

I've been thinking a bit in the past few weeks about how we can handle indentation-sensitive language, and other context-sensitive languages. I'll try to share some more substantial thoughts in the next day or two. For now, I can share my initial thoughts/impressions:

  • I find the indentation annotations — both in the paper and your example — to be fairly hard to read. What do you think? Do you have an ideas for a syntax that might be more readable? To me this is a pretty important factor.

  • All things being equal, I'd prefer to add a feature that could be used for other context-sensitive languages, not just indentation-sensitive languages. For example, the paper A Symbol-Based Extension of Parsing Expression Grammars and Context-Sensitive Packrat Parsing describes a more general feature that would be useful for things like parsing matching XML tags.

@haikyuu
Copy link

haikyuu commented Apr 14, 2022

@pdubroy I have watched the SPEG video and I find it very powerful and readable syntax.

  • It is definitely more expressive. I'll read the paper and try to use it to assess its usability and readability.
  • It seems to be easier to implement as well as it requires adding one shared data structure that only the added operators access and mutate.

@pdubroy
Copy link
Contributor

pdubroy commented Apr 14, 2022

@haikyuu Sounds great! Btw, if I were to experiment with this in Ohm, I'd initially try to do this without changing any syntax at all. I'd create a "dummy" grammar with empty rules. Once it's instantiated in JS, I'd replace the rule bodies with a custom subclass of PExpr. Something like this:

const g = ohm.grammar(`
  SPEG {
    symbol<a> =
    is<a> =
    match<a> =
    ...
  }
`)
g.rules.symbol.body = new SpegSymbol();
g.rules.is.body = new SpegIs();
g.rules.match.body = new SpegMatch();

Then of course you'd have to implement the SpegSymbol, SpegIs, SpegMatch classes. The existing CaseInsensitiveTerminal class is a good example of a self-contained PExpr subclass.

Maybe that's helpful in case you or anyone else ends up experimenting with this.

@todwang
Copy link

todwang commented Jun 22, 2022

Any progress?

@haikyuu
Copy link

haikyuu commented Jun 22, 2022

No, I didn't make any progress. Feel free to jump on it if you will 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants