Make &... and !... not return any value #66

dmajda · 2012-02-03T07:42:33Z

In sequences, returning a value from predicates only confuses users. For example, start = "a" &. "b" returns ["a", "", "b"], on input "ab", but users expect ["a", "b"].

This means predicates won’t be allowed outside of sequences or in one-element sequences because such expressions would have undefined value.

The text was updated successfully, but these errors were encountered:

WardCunningham · 2013-07-27T16:36:55Z

Perhaps it could return undefined. I would revise my whitespace rule to return undefined too. A parse that consisted of only whitespace would then parse as undefined.

Update: $ prefix (described with this commit 5e146fc) might lessen my desire to have whitespace productions ignored.

dmajda · 2014-04-20T09:48:49Z

In PEG.js 0.8.0 I made all predicates return undefined. You can read detailed reasons for the change in the relevant commit.

Meanwhile, I was thinking about this issue a lot, and I became somewhat stuck. I'll try to describe my thinking and maybe someone will be able to push me into the right direction.

Clear Cases

Making the predicates not return any value has clear semantics in case of sequences with 2 or more non-predicate elements. For example start = "a" !"x" "b" on input ab would return ["a", "b"] instead of current ["a", undefined, "b"].

The semantics is also clear in case of sequences with only one non-predicate element. There the match result wouldn't be an array, but just a match result of the non-predicate element. For example, start = !"x" "b" on input b would return "b" instead of current [undefined, "b"].

Tricky Case

The tricky case is when a predicate is not part of a sequence, but stands in an expression alone. For example, what should start = !"x" return on an empty input?

There are several possible approaches:

One-element Sequence

One approach is to look at such expression as if it was a one-element sequence. With the predicate match result removed from the resulting array (similarly to the clear cases), the match result would be an empty sequence.

This approach is possible, but it creates an inconsistency. This is best illustrated by an example:

start = !x "a" "b" "c"   // returns ["a", "b", "c"] on successful match
start = !x "a" "b"       // returns ["a", "b"] on successful match
start = !x "a"           // returns "a" on successful match
start = !x               // returns [] on successful match

To be consistent, the third example should return a one-element array — something which is not desired.

Not Returning Any Value

Another approach is not returning any value at all. If a predicate is used outside a sequence, its expression would be marked with a “no-value” flag, which would “spread outward” through the expression and rule tree, until absorbed by a sequence (or until it bubbles to the top, which would be an error).

For example, the following grammars would return ["a", "b"] on input ab:

start = "a" !"x" "b"

start = "a" notX "b"
notX  = !"x"

On the other hand, the following grammars would cause an error:

start = !"x"

start = notX
notX  = !"x"

Although there are some unclear boundary cases (e.g. how should the / operator behave with respect to the “no-value” flag), the approach seems viable. However, I think it brings a lot of complexity into PEG.js core and the behavior would be non-transparent to users.

Is This Really Relevant?

One may ask, is the case of using predicates outside a sequence really relevant? What about just forbidding that case?

It turns out there is at least one important pattern that uses predicates outside a sequence — the EOF rule:

EOF = !.

This pattern is used in many grammars and disallowing it would complicate lives of users and force duplications in their grammars.

Note that encapsulating this pattern into some built-in pseudo-rule wouldn't help, because all the questions related to predicates would simply re-appear for this rule.

What Now?

At this point, the solution with not returning any value seems the most appealing to me, but it scares me a little, so I'm not 100% sure. I'll need to think about his more. I welcome any comments, be it new ideas or opinions about solutions described above.

WardCunningham · 2014-04-20T16:48:29Z

I don't have a specific suggestion so I will ramble on instead.

I assume the parser knows more than easily encoded into results so any decision will involve some compromise. I would think the day to day user values coding convenience most. But irregular behavior leads to the ultimate inconvenience of special cases and out-guessing the parser.

That rules automatically return anything is a risky convenience. This is one of pegjs' charms. I often tell colleagues that they can understand parsers and turn to http://pegjs.majda.cz/online to show them how simple it can be. That [a-z]+ returns an array of characters becomes a point of enlightenment.

I call it automatic returns a risky convenience because, first, it's hard to please everyone, and, second, any work done that need not be done raises performance issues. I noticed in the prefix $ discussion I cited above that you mention possibly avoiding allocating single character strings. All of those allocations cost, eventually, and when they are built-in there is no recourse.

Ian Piumarta's pegleg is very careful about storage allocation. He adds < ... > syntax to peg which works much like your optimized $( ... ) if I understand it correctly. By the time strings are constructed details of predicates are history.

Aside: I hacked Ian's parser generator to handle << ... >> which captured text and fed it straight into my downstream analysis for exploratory parsing of the whole wikipedia. blog post

I hope this is helpful. I'm happy to elaborate on pegleg if that would be useful. However, you're already set a course so I would not suggest you deviate too far. You have a sweet package. Nice work.

andreineculau · 2014-04-21T20:08:35Z

but users expect ["a", "b"].

I got stuck when reading the above. Maybe I'm not a common user, but I wouldn't expect that at all. I value the uniformity actually: every expression returns something, may it be undefined.

Regardless, what is the quintessence of this request? What I hear is something along the lines of making the parser return a close-to-the-intended AST, and leave aside constructs that appeared because of grammar definition or PEGjs internals.

Currently the $ prefix does a neat job in that direction (toString(), flattening, remove undefined), but there isn't a e.g. € prefix to just flatten and remove undefined, without toString().

Somehow I think that's what's missing, and it would allow the grammar definition to have influence on the AST form. At least that's what I'd prefer rather than introducing rules that break uniformity.

startA1 = !x "a" "b" "c" // [undefined, "a", "b", "c"]
startA2 = €(!x "a" "b" "c") // ["a", "b", "c"]

startB1 = !x "a" ("b" "c") // [undefined, "a", ["b", "c"]]
startB2 = €(!x "a" ("b" "c")) // ["a", "b", "c"]

Phrogz · 2014-10-10T05:04:39Z

I recently used LPeg in Lua to do some parsing, and found its "capture" mechanism exceptionally elegant. In PEG.js nesting productions results in nested output. Consider this PEG.js grammar:

pairs = a:pair b:(wsp pair)* { console.log(a); console.log(b) }
pair  = x:num wsp y:num { return [x,y] }
num   = n:[0-9]+ { return n*1 }
wsp   = [ ,]+  { return ''  } // I don't really want this value at all!

When you parse 1,2 3,4 5,6 with the above you get:

[ 1, 2 ] // a
[ [ '', [ 3, 4 ] ], [ '', [ 5, 6 ] ] ] // b (OUCH)

Look at that horrific undesirable nesting. In LPeg, captures bubble up through the productions. If you have a complicated grammar and capture only numbers in a leaf production, then your final result from parsing is just an array of the matched numbers. There is no nesting introduced by a production. If you add a production you don't have to worry about removing an extra level of nesting. There is no need for a flatten() function.

I encourage you to check out LPeg. The syntax is insane-looking, largely because it's a righteous hack that allows PEG directly within the language itself, and thus needs to use standard operators (and their existing precedence) to combine productions. But separate from the syntax, the way it handles captures is really, really elegant IMHO.

For comparison, here's the above grammar in LPeg (using the bottom-up, variable-based approach; there's another way to write the above grammar top-down that supports recursion and loops):

function jointwo(x,y) return {x,y} end
L = require('lpeg')
set, range = L.S, L.R             -- more understandable function names
wsp   = set(' ,')^1               -- ^1 is "one or more of", i.e. + in PEG
num   = range('09')^1/tonumber    -- call tonumber(), use the result as the capture
pair  = (num * wsp * num)/jointwo -- * is "followed by", i.e. space in PEG
pairs = pair * (wsp * pair)^0     -- ^0 is "zero or more of", i.e. * in PEG

dump{ pairs:match('1,2 3,4 5,6 7,8') }
--> { {1,2}, {3,4}, {5,6}, {7,8} }

atavakoli · 2015-06-10T17:55:09Z

I think there's value in returning something; for consistency, but also because there will be some use-cases (maybe uncommon ones) where you actually want access to that value.

For example, suppose I'm going through a list of words & want to classify them based on how they start:

ClassifiedWords
  = first:ClassifiedWord rest:(_ ClassifiedWord)* {
    var result = [first];
    result = result.concat(rest.map(function(e) { return e[1]; }));
    return result;
  }

ClassifiedWord
  = reason:&StartingCharsOfInterest w:Word {
    return {
      interesting: true,
      reason: reason,
      word: w
    };
  }
  / w:Word {
    return {
      interesting: false,
      word: w
    };
  }

StartingCharsOfInterest
  = $("sch"i / "qu"i / "ae"i) {
    return "foo";
  }
  / $("kn"i / "gn"i) {
    return "bar";
  }

Word = $([a-z]i+)

_ = [ \t\r\n]+

If I had input like "school blah knight", I'd get the following output:

[
   {
      "interesting": true,
      "reason": undefined, <-- expecting "foo"
      "word": "school"
   },
   {
      "interesting": false,
      "word": "blah"
   },
   {
      "interesting": true,
      "reason": undefined, <-- expecting "bar"
      "word": "knight"
   }
]

Not having access would mean moving the classification logic (i.e. the reason return values) into the ClassifiedWord rule, which is possible in this example, but would hurt cohesion and might not even be possible with a more complex grammar.

The example is a bit contrived, but the point is that sometimes knowing why the positive lookahead matched (i.e. having access to its production) can be useful.

dmajda · 2016-07-04T08:21:24Z

Moving post-1.0.0. I’m becoming convinced this must be solved together with #11.

StoneCypher · 2020-02-03T03:27:30Z

I think this should actually be handled with the es6 empty token (what most people call "array holes.") There's absolutely no sensible place where they should be produced from a parser, and they convey exactly what needs to be conveyed here without any serious changes and without the loss of any scalar types

I think #11 is also easy to solve, but I think they can be solved separately, and that this will happen more safely, more easily testably, and more durably if they are

If we can get a codebase that's ready for PRs, I'll take a swing at it

StoneCypher · 2020-02-03T03:28:23Z

Array holes can, also, be detected if you really want to, but don't turn up unless you really fight for it, meaning the principle of least surprise is satisfied without making weird cases technically difficult

ghost assigned dmajda Feb 3, 2012

dmajda mentioned this issue Feb 4, 2012

Redesign labels #69

Closed

curvedmark mentioned this issue Jan 10, 2013

Non-greedy operators for * , + , and ? #57

Closed

dmajda removed their assignment Apr 20, 2014

Phrogz mentioned this issue Oct 11, 2014

Ability to ignore certain productions #11

Open

dmajda modified the milestones: 0.10.0, 0.9.0 Aug 14, 2015

dmajda modified the milestones: post-1.0.0, 0.10.0 Jul 4, 2016

Mingun mentioned this issue Sep 12, 2017

Clarify execution environment of actions. #531

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make &... and !... not return any value #66

Make &... and !... not return any value #66

dmajda commented Feb 3, 2012

WardCunningham commented Jul 27, 2013

dmajda commented Apr 20, 2014

WardCunningham commented Apr 20, 2014

andreineculau commented Apr 21, 2014

Phrogz commented Oct 10, 2014

atavakoli commented Jun 10, 2015

dmajda commented Jul 4, 2016

StoneCypher commented Feb 3, 2020

StoneCypher commented Feb 3, 2020

Make &... and !... not return any value #66

Make &... and !... not return any value #66

Comments

dmajda commented Feb 3, 2012

WardCunningham commented Jul 27, 2013

dmajda commented Apr 20, 2014

Clear Cases

Tricky Case

One-element Sequence

Not Returning Any Value

Is This Really Relevant?

What Now?

WardCunningham commented Apr 20, 2014

andreineculau commented Apr 21, 2014

Phrogz commented Oct 10, 2014

atavakoli commented Jun 10, 2015

dmajda commented Jul 4, 2016

StoneCypher commented Feb 3, 2020

StoneCypher commented Feb 3, 2020