Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parsing of indentation-based languages #217

Open
tiye opened this issue Oct 16, 2013 · 34 comments
Open

Support parsing of indentation-based languages #217

tiye opened this issue Oct 16, 2013 · 34 comments
Labels
Milestone

Comments

@tiye
Copy link

tiye commented Oct 16, 2013

I was using a bunch of indentation-based languages, like CoffeeScript, Jade, and want to create DSLs by myself.
I find some tricks maintaining indentations in pegjs by searching, was was wondering if there's a consistant solution:
http://stackoverflow.com/questions/11659095/parse-indentation-level-with-peg-js
http://stackoverflow.com/questions/4205442/peg-for-python-style-indentation
https://gist.github.com/jakubkulhan/3192844
https://groups.google.com/forum/#!searchin/pegjs/indent/pegjs/RkbAB4rPlfU/xxafrY5wGCEJ
But will pegjs support this feature?

@krisnye
Copy link

krisnye commented Feb 11, 2014

It's very dangerous to rely on side-effects that you add in custom handlers to parse indentation based grammars. Just don't do it. Pegjs would have to add some ability to push and pop conditional state in order to make parsing indentations (and other context sensitive grammars) safe.

This is what I do for now, and I recommend you do this: Preprocess the input file and insert your own indent/outdent tokens. I use {{{{ and }}}} respectively. Then your grammar is context free and can be parsed normally. It may mess up your line/column values, but you can correct those in a postprocessor.

@otac0n
Copy link

otac0n commented Feb 11, 2014

If you aren't needing to target javascript, Pegasus, my pegjs clone for C#, has support for pushing/popping state. Here's a wiki article on how to do exactly what you want: https://github.com/otac0n/Pegasus/wiki/Significant-Whitespace-Parsing

I would like to propose that pegjs use my syntax as a starting point for state-based parsing.

@krisnye
Copy link

krisnye commented Feb 11, 2014

The ability to safely push and pop state is nice. I would use that if it was Javascript based. Just not worth it to integrate a CLR just for parsing.

@otac0n
Copy link

otac0n commented Feb 12, 2014

That's what I figured. I think, in that case, that I should probably try to back-port my improvements into pegjs.

However, I don't necessarily want to do that without having a conversation with @dmajda.

@tiye
Copy link
Author

tiye commented Feb 12, 2014

@otac0n It's nice. I don't write C# . JavaScript is much better for me.

@dmajda
Copy link
Contributor

dmajda commented Apr 21, 2014

Indentation-based languages are important. I want to look at simplifying their parsing after 1.0.0.

@dmajda dmajda added the Feature label Apr 21, 2014
@dmajda dmajda added this to the post-1.0 milestone Apr 21, 2014
@lydell
Copy link

lydell commented Feb 15, 2015

I think this problem is best solved by allowing state in general, just like Pegasus does and as suggested in #285. Here is an idea (the following is Pegasus’ significant whitespace grammar translated to pegjs and with my syntax idea added):

{var indentation = 0}

program
  = s:statements eof { return s }

statements
  = line+

line
  = INDENTATION s:statement { return s }

statement
  = s:simpleStatement eol { return s }
  / "if" _ n:name _? ":" eol INDENT !"bar " s:statements UNDENT {
      return { condition: n, statements: s }
    }
  / "def" _ n:name _? ":" eol INDENT s:statements UNDENT {
      return { name: n, statements: s }
    }

simpleStatement
  = a:name _? "=" _? b:name { return { lValue: a, expression: b } }

name
  = [a-zA-Z] [a-zA-Z0-9]* { return text() }

_ = [ \t]+

eol = _? comment? ("\r\n" / "\n\r" / "\r" / "\n" / eof)

comment = "//" [^\r\n]*

eof = !.

INDENTATION
  = spaces:" "* &{ return spaces.length == indentation }

INDENT
  = #STATE{indentation}{ indentation += 4 }

UNDENT
  = #STATE{indentation}{ indentation -= 4 }

Note the #STATE{indentation} blocks near the bottom (obviously inspired by Pegasus). I call those state blocks. The idea is to allow a state block before actions. Here is a more complicated state block:

#STATE{a, b, arr: {arr.slice()}, obj: {shallowCopy(obj)}, c}

It is shorthand for:

#STATE{a: {a}, b: {b}, arr: {arr.slice()}, obj: {shallowCopy(obj)}, c: {c}}

In other words, after the shorthand expansion has been applied, the contents of a state block is a list of identifier ":" "{" code "}". Adding a state block before an action tells pegjs that this action will modify the identifiers listed, and if the rule is backtracked those identifiers should be reset to the code between the braces.

Here are the compiled functions for INDENT and UNDENT from the above grammar above, with resetting of the indentation variable added:

    function peg$parseINDENT() {
      var s0, s1, t0;

      s0 = peg$currPos;
      t0 = indentation;
      s1 = [];
      if (s1 !== peg$FAILED) {
        peg$reportedPos = s0;
        s1 = peg$c41();
      } else {
        indentation = t0;
      }
      s0 = s1;

      return s0;
    }

    function peg$parseUNDENT() {
      var s0, s1, t0;

      s0 = peg$currPos;
      t0 = indentation;
      s1 = [];
      if (s1 !== peg$FAILED) {
        peg$reportedPos = s0;
        s1 = peg$c42();
      } else {
        indentation = t0;
      }
      s0 = s1;

      return s0;
    }

And here’s a bit of how the “complicated state block” from above could be compiled:

s0 = peg$currPos;
t0 = a;
t1 = b;
t2 = arr.slice();
t3 = shallowCopy(obj);
t4 = c;
// ...
if (s1 !== peg$FAILED) {
  // ...
} else {
  peg$currPos = s0;
  a = t0;
  b = t1;
  arr = t2;
  obj = t3;
  c = t4;
}

What do you think about this idea of being able to:

  • Tell pegjs about which stateful variables will be modified by an action.
  • Supply the code needed to store those variables if they need to be reset. (Including shorthand syntax for the simple case where the variable is a primitive value.)

And what do you think about the syntax?

Edit: Here’s the proposed syntax grammar (just for fun):

diff --git a/src/parser.pegjs b/src/parser.pegjs
index 08f6c4f..09e079f 100644
--- a/src/parser.pegjs
+++ b/src/parser.pegjs
@@ -116,12 +116,31 @@ ChoiceExpression
     }

 ActionExpression
-  = expression:SequenceExpression code:(__ CodeBlock)? {
+  = expression:SequenceExpression code:((__ StateBlock)? __ CodeBlock)? {
       return code !== null
-        ? { type: "action", expression: expression, code: code[1] }
+        ? {
+            type:       "action",
+            expression: expression,
+            code:       code[2],
+            stateVars:  (code[0] !== null ? code[0][1] : [])
+          }
         : expression;
     }

+StateBlock "state block"
+  = "#STATE{" __ first:StateBlockItem rest:(__ "," __ StateBlockItem)* __ "}" {
+      return buildList(first, rest, 3);
+    }
+
+StateBlockItem
+  = varName:Identifier expression:(__ ":" __ CodeBlock)? {
+      return {
+        type:       "stateVar",
+        name:       varName,
+        expression: expression !== null ? expression[3] : varName
+      };
+    }
+
 SequenceExpression
   = first:LabeledExpression rest:(__ LabeledExpression)* {
       return rest.length > 0

@dmajda dmajda changed the title Will pegjs support parsing indentation-based grammer? Support parsing of indentation-based languages Aug 14, 2015
@hoho
Copy link

hoho commented Nov 8, 2015

Hi guys,
Just to clarify, am I correct that it is better not to use PEG.js (with workarounds from the top of this issue) with indentation-based languages till this issue is closed?
Thanks.

@tiye
Copy link
Author

tiye commented Nov 9, 2015

@hoho I don't get you meaning.. But I later found another solution to parse indentations with parser combinator like solutions and it worked. And I think my original indention to parse indentations with PEG.js gone.

@hoho
Copy link

hoho commented Nov 10, 2015

I mean there are workarounds to parse indentation, but the comments say that these workarounds will fail in some certain cases.

@dmajda
Copy link
Contributor

dmajda commented Nov 27, 2015

Let me clarify the situation: Parsing indentation-based languages in PEG.js is possible. There are various solutions mentioned above and I just created another one as I tried to get a “feel” for this (it’s a grammar of a simple language with two statements, one of which can contain indented sub-statements — similar to e.g. if in Python).

One thing common to all the solutions is that they need to track the indentation state manually (because PEG.js can’t do that). This means there are two limitations:

  1. You can’t compile the grammar with caching safely (because the parser could use cached results instead of executing state-manipulating code).
  2. You can’t backtrack across indentation levels (because there is currently no way to unroll the state when backtracking). In other words, you can’t parse a language where there are two valid constructs which can be disambiguated only after a newline and indentation level change.

Limitation 1 can cause performance issues in some cases, but I don’t think there are many languages for which limitation 2 would be a problem.

I’m OK with this state until 1.0.0 and I plan to circle back to this topic sometime afterwards. The first level of improvement could be getting rid of limitation 2 using more explicit state tracking (as suggested above) or by providing a backtracking hook (so that one can unroll the state correctly). The second level could be getting rid of the need to track indentation state manually by providing some declarative way to do so. This could help with limitation 1.

@tebbi
Copy link

tebbi commented Nov 27, 2015

H, I wrote a (tiny, hacky) patch for PEG.js that supports proper backtracking, as I explained here: #45

@futagoza
Copy link
Member

sorry for the bump 😜

I was just looking into creating CSON and YAML parsers for a language I am designing, and while looking on ways to create a indention based parser with PEG.js, I came up with a simple method that:

  1. doesn't rely on push/pop state's
  2. asserting indention levels via code within actions

It had occurred to me that either of the above 2 solutions actually adds performance problems to the generated parsers. Additionally in my opinion:

  1. relying on state's not only adds a ugly PEG.js syntax but also can affect what type of parsers that can be generated as they would need to support action based state handing.
  2. sometimes adding some code in actions results in a language dependent rule, and for some developers that means they can't use plugin's to generate parsers for other languages like C or PHP without resorting to more plugin's to handle actions on rules, which just means a bigger build system just to support 1 or 2 changes.

After a while I started about creating my own variant of the PEG.js parser and thought: why not just use increment (“++”) and decrement (“--”) prefix operators (++ expression and -- expression) to handle the results of match expressions (expression * or expression +).

The following is a example grammar based on @dmajda's Simple intentation-based language, rewritten to use the new ++ expression and -- expression instead of & { predicate }:

Start
  = Statements

Statements
  = Statement*

Statement
  = Indent* statement:(S / I) { return statement; }

S
  = "S" EOS {
      return "S";
    }

I
  = "I" EOL ++Indent statements:Statements --Indent { return statements; }
  / "I" EOS { return []; }

Indent "indent"
  = "\t"
 / !__ "  "

__ "white space"
 = " \t"
 / " "

EOS
  = EOL
  / EOF

EOL
  = "\n"

EOF
  = !.

Much more pleasing to the eye, no? Easer to understand too, both for humans and software.

How does it work? simple:

  1. Indent* tells the parser that we want 0 or more of what Indent is returning
  2. ++Indent tells the parser to increase the minimum amount of matches required for Indent
  3. Now any time the parser is about to return the matches for Indent, it first expects it to be 1 more match then before, otherwise peg$SyntaxError gets thrown.
  4. --Indent tells the parser to decrease the minimum amount of matches required for Indent
  5. Now any time the parser looks for Indent and returns the matches it expects 1 less match then before, otherwise peg$SyntaxError gets thrown.

This solution is the best way to add support for 'Significant Whitespace Parsing' without adding an ugly syntax to PEG.js grammars or blocking 3rd party generators.

Here's the changed rules to add support for parsing this in src/parser.pegjs:

{
  const OPS_TO_PREFIXED_TYPES = {
    "$": "text",
    "&": "simple_and",
    "!": "simple_not",
    "++": "increment_match",
    "--": "decrement_match"
  };
}

PrefixedOperator
  = "$"
  / "&"
  / "!"
  / "++"
  / "--"

SuffixedOperator
  = "?"
  / "*"
  / "+" !"+"

Am I right to assume that to support it compiler/generator side we will have to:

  1. add a compiler pass that ensures ++ expression or -- expression are only being used on expression * or expression +, where expression must be of types: choice, sequence or rule_ref
  2. add a cache based check in the generated parser for expression * or expression + that asserts the minimum required match is met before returning the matches
  3. optionally add a helper method for generated parsers to implement that returns number of matches required for a given rule, eg. nMatches( name: String ): Number

@krisnye
Copy link

krisnye commented Mar 16, 2017

@futagoza, this is clean and clever. I like it. I am working on a parser that handles state, but the only state we really need is indentation levels. I may use this idea and give you credit for it. Tracking the indentation level still effectively requires pushing/popping state and so it may still prevent some optimizations but the semantics of this are very nice.

If you're adding operators to a grammar, I recommend adding the @ prefix operator as well. It's purpose is to simply extract a single rule result out of a sequence. Using that the sample grammar becomes even cleaner. No more trivial { return x } Actions.

Start
  = Statements

Statements
  = Statement*

Statement
  = Indent* @(S / I)

S
  = "S" EOS {
      return "S";
    }

I
  = "I" EOL ++Indent @Statements --Indent
  / "I" EOS { return []; }

Indent "indent"
  = "\t"
 / !__ "  "

__ "white space"
 = " \t"
 / " "

EOS
  = EOL
  / EOF

EOL
  = "\n"

EOF
  = !.

@KodyJKing what do you think of this?

@kristianmandrup
Copy link

@futagoza Do you have a fork/branch with the indentation patch enabled and a small sample grammar?

@kristianmandrup
Copy link

kristianmandrup commented Mar 29, 2017

I'm working on this fork/branch indentation

@krinye "I recommend adding the @ prefix operator as well. It's purpose is to simply extract a single rule result out of a sequence"

Could one of you guy have a look and make and make a comment or PR with the fix. Thanks :)

Readme: fork changes

@kristianmandrup
Copy link

I debugged... increment_match-not-found

@kristianmandrup
Copy link

Ah, didn't notice the caveat:

Am I right to assume that to support it compiler/generator side we will have to:

  • add a compiler pass that ensures ++ expression or -- expression are only being used on expression * or expression +, where expression must be of types: choice, sequence or rule_ref
  • add a cache based check in the generated parser for expression * or expression + that asserts the minimum required match is met before returning the matches
  • optionally add a helper method for generated parsers to implement that returns number of matches required for a given rule, eg. nMatches( name: String ): Number

@kristianmandrup
Copy link

Just for kicks, I tried adding this in visitor.js

      increment_match: visitExpression,
      decrement_match: visitExpression,

Now I get Invalid opcode: undefined.

@krisnye
Copy link

krisnye commented Mar 29, 2017

@kristianmandrup concerning the @ operator to extract single values from sequences, I have a fork with just that feature added to PegJS available here:

https://github.com/krisnye/pegjs

It's a pretty simple addition.

futagoza referenced this issue in kristianmandrup/pegjs Mar 30, 2017
@futagoza
Copy link
Member

@krisnye +1 for the grammar based implementation, nice and simple. If you don't mind, I will be adding this to my variant of PEG.js 😄

@kristianmandrup see you commit for my suggestion

@krisnye
Copy link

krisnye commented Mar 30, 2017

@futagoza Please do.

I was discussing the indentation logic with a coworker and we suggest the following syntactic elements

// increments the named state variable
identifier++
// decrements the named state variable
identifier--

// repeat constant amount or state variable (with default zero if variable not yet incremented)
rule{integer | identifier}
// repeat min/max using constants or state variables
rule{integer | identifier, integer | identifier}

The parser we are working on can handle arbitrary state, but honestly, the above is all that is needed for indentation parsing.

@kristianmandrup
Copy link

Thanks a lot guys! If its so easy to do, why not make a "dedicated" fork where the things you mention "just works" ;)
Cheers!

@futagoza
Copy link
Member

@krisnye

  1. Using identifier++ can easily cause a messy error if the developer meant identifier+, that's why I chose to go with ++identifier and, for consistency, --identifier
  2. As mentioned in a different issue about ranges, rule{ STATE_REPEAT / RANGE } can be confused with rule{ ACTION }, especially if you are building a syntax highlighter for PEG.js, so this approach has been denied by @dmajda

@kristianmandrup (OFF TOPIC) I am good at designing features, and sometimes making implementations, but horrible at testing and benchmarking them, so usually I make working variants (without any tests or benchmarks) in private non-repo directories on my pc, and then forget about them 🤣. For my current variant of PEG.js (named ePEG.js 😝, an extended rewrite of PEG.js) I am adding the stuff mentioned here, as well as other features (ranges, imports, templates, etc), so I am adding tests and benchmarks, but I am also currently working on a C++ project which is taking my time, so there's no ETA on that.

@kristianmandrup
Copy link

kristianmandrup commented Mar 30, 2017

@futagoza Thanks mate :) Looking at the feature extensions, but doesn't mention indentation support. Is that included but undocumented or coming down the road?

Someone else from this list pointed me at other parser builder/generator solutions that I might look into as well. Keep me posted! Cheers!

@futagoza
Copy link
Member

@kristianmandrup It's not included as far as I can tell, but @dmajda said 3 years ago that he will look into it after releasing PEG.js v1, but from what I can tell that wont be for another 2 years, unless he plans to release more minor versions of PEG.js v0 (0.12, 0.13, etc)

@kristianmandrup
Copy link

I meant if you had included indentation support in ePEG already or on the roadmap?

@futagoza
Copy link
Member

futagoza commented Mar 30, 2017

@kristianmandrup oh 😆, its on the roadmap. I haven't updated the ePEG.js repo for a while, and just recently decided to turn it into a complete rewrite of PEG.js instead of a plugin.

@krisnye
Copy link

krisnye commented Mar 30, 2017

@futagoza agreed on the ++/-- as pre operations, and I forgot about action syntax, we changed it to [min,max]

So

++identifier
--identifier
rule[integer | identifier]
rule[integer | identifier, integer | identifier]

@futagoza
Copy link
Member

@krisnye [ ... ] is used for character classes, see https://github.com/pegjs/pegjs#characters

What I'm doing in ePEG.js is adding ranges (also on the roadmap) to achieve what I think your describing:

space = [ \t]*
rule = range|expression
  • expression can be any of these: ++space, --space or space
  • range can be min.., min..max, ..max or exact
  • min, max or exact can only be an unsigned integer
  • using a range with an expression (e.g. 2|expression) can allow us to set the total amount of expression required for rule to parse successfully.
  • using the exact range with ++expression or --expression (e.g. 3|++expression) can allow us to set the integer amount for ++ or --, which is by default 1
  • using either min or max range with ++expression or --expression throws a syntax error.

I'm not using state variables as that would just be confusing with rule identifiers.

Using a combination of range, ++, -- or @ I'm hoping to create PEG grammar files that rely less on a rules action for their result, which should increase the development time of whitespace (e.g. indentation-based, ASCII art, etc) languages as the language designers and/or implementers don't have to worry about trying to confirm if the correct amount of whitespace has been parsed.
This should also allow plugin developers to create parser generators that can optimise with less fear of what our action's language is (by default JavaScript, but with plugin's can change it to CoffeeScript, PHP, etc).

@mindjuice
Copy link

So it seems like it's still not possible today to parse Python with PEG.js out of the box, is that right?

If not, is this something that will be coming soon? Is there a set of tasks required to make this work that people could contribute?

I have a project where I'd love to be able to get a Python AST in JS, modify it and then convert it back to source code with the same formatting, so I'd be interested in making this happen if there is a clear roadmap.

@krisnye
Copy link

krisnye commented Apr 14, 2017

@mindjuice Not yet, it's planned for post 1.0.

If you cannot wait, my nephew and I created a parser written in TypeScript that uses the same syntax and handles indentation. It's not documented yet, but it's pretty simple. There is still work ongoing as we're using it as the parser for a new language design.

https://github.com/krisnye/pegs

@kristianmandrup
Copy link

kristianmandrup commented Apr 14, 2017

You can also try another parser generator with support such as chevrotain

Python indentation example

@appcypher
Copy link

appcypher commented Oct 20, 2017

I believe it's very possible to parse python-like indentation with PEGjs.
The example below uses only four-space-based indentation, but it can be extended to cover arbitrarily-spaced tabs and other whitespace characters.
In fact, the language I'm working on has a slightly more complicated indentation story than Python has and this grammar works well for it.

{
    let prevIndentCount = 0;
    function print(...s) { console.log(...s); }
}

Indent 'indent'
    = i:("    "+) { 
        let currentIndentCount = i.toString().replace(/,/g, "").length;
        if (currentIndentCount === prevIndentCount + 4) { 
            // DEBUG //
            print("=== Indent ===");
            print("    current:"+currentIndentCount); 
            print("    previous:"+prevIndentCount);
            print("    lineNumber:"+location().start.line); 
            // DEBUG //
            prevIndentCount += 4;
            return "[indent]";
        }
        error("error: expected a 4-space indentation here!")
    } // 4 spaces 

Samedent 'samedent'
    = s:("    "+ / "") &{
        let currentIndentCount = s.toString().replace(/,/g, "").length;
        if (currentIndentCount === prevIndentCount) {
            print("=== Samedent ===");
            return true;
        }
        return false;
    }

Dedent 'dedent'
    = d:("    "+ / "") {
        let currentIndentCount = d.toString().replace(/,/g, "").length;
        if (currentIndentCount < prevIndentCount) {
            // DEBUG //
            print("=== Dedent ===");
            print("    current:"+currentIndentCount); 
            print("    previous:"+prevIndentCount);
            print("    lineNumber:"+location().start.line); 
            // DEBUG //
            prevIndentCount -= 4;
            return "[dedent]";
        }
        error("error: expected a 4-space dedentation here!");
    }

With the grammar above, you can create an indented block rule like this:

FunctionDeclaration 
    = 'function' _ Identifier _ FunctionParameterSection _ ":" _ FunctionBody

FunctionBody
    = Newline Indent FunctionSourceCode (Newline Samedent FunctionSourceCode)* Dedent 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests