Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make &... and !... not return any value #66

Open
dmajda opened this issue Feb 3, 2012 · 9 comments
Open

Make &... and !... not return any value #66

dmajda opened this issue Feb 3, 2012 · 9 comments
Labels
Milestone

Comments

@dmajda
Copy link
Contributor

dmajda commented Feb 3, 2012

In sequences, returning a value from predicates only confuses users. For example, start = "a" &. "b" returns ["a", "", "b"], on input "ab", but users expect ["a", "b"].

This means predicates won’t be allowed outside of sequences or in one-element sequences because such expressions would have undefined value.

@WardCunningham
Copy link

Perhaps it could return undefined. I would revise my whitespace rule to return undefined too. A parse that consisted of only whitespace would then parse as undefined.

Update: $ prefix (described with this commit 5e146fc) might lessen my desire to have whitespace productions ignored.

@dmajda
Copy link
Contributor Author

dmajda commented Apr 20, 2014

In PEG.js 0.8.0 I made all predicates return undefined. You can read detailed reasons for the change in the relevant commit.

Meanwhile, I was thinking about this issue a lot, and I became somewhat stuck. I'll try to describe my thinking and maybe someone will be able to push me into the right direction.

Clear Cases

Making the predicates not return any value has clear semantics in case of sequences with 2 or more non-predicate elements. For example start = "a" !"x" "b" on input ab would return ["a", "b"] instead of current ["a", undefined, "b"].

The semantics is also clear in case of sequences with only one non-predicate element. There the match result wouldn't be an array, but just a match result of the non-predicate element. For example, start = !"x" "b" on input b would return "b" instead of current [undefined, "b"].

Tricky Case

The tricky case is when a predicate is not part of a sequence, but stands in an expression alone. For example, what should start = !"x" return on an empty input?

There are several possible approaches:

One-element Sequence

One approach is to look at such expression as if it was a one-element sequence. With the predicate match result removed from the resulting array (similarly to the clear cases), the match result would be an empty sequence.

This approach is possible, but it creates an inconsistency. This is best illustrated by an example:

start = !x "a" "b" "c"   // returns ["a", "b", "c"] on successful match
start = !x "a" "b"       // returns ["a", "b"] on successful match
start = !x "a"           // returns "a" on successful match
start = !x               // returns [] on successful match

To be consistent, the third example should return a one-element array — something which is not desired.

Not Returning Any Value

Another approach is not returning any value at all. If a predicate is used outside a sequence, its expression would be marked with a “no-value” flag, which would “spread outward” through the expression and rule tree, until absorbed by a sequence (or until it bubbles to the top, which would be an error).

For example, the following grammars would return ["a", "b"] on input ab:

start = "a" !"x" "b"
start = "a" notX "b"
notX  = !"x"

On the other hand, the following grammars would cause an error:

start = !"x"
start = notX
notX  = !"x"

Although there are some unclear boundary cases (e.g. how should the / operator behave with respect to the “no-value” flag), the approach seems viable. However, I think it brings a lot of complexity into PEG.js core and the behavior would be non-transparent to users.

Is This Really Relevant?

One may ask, is the case of using predicates outside a sequence really relevant? What about just forbidding that case?

It turns out there is at least one important pattern that uses predicates outside a sequence — the EOF rule:

EOF = !.

This pattern is used in many grammars and disallowing it would complicate lives of users and force duplications in their grammars.

Note that encapsulating this pattern into some built-in pseudo-rule wouldn't help, because all the questions related to predicates would simply re-appear for this rule.

What Now?

At this point, the solution with not returning any value seems the most appealing to me, but it scares me a little, so I'm not 100% sure. I'll need to think about his more. I welcome any comments, be it new ideas or opinions about solutions described above.

@dmajda dmajda removed their assignment Apr 20, 2014
@WardCunningham
Copy link

I don't have a specific suggestion so I will ramble on instead.

I assume the parser knows more than easily encoded into results so any decision will involve some compromise. I would think the day to day user values coding convenience most. But irregular behavior leads to the ultimate inconvenience of special cases and out-guessing the parser.

That rules automatically return anything is a risky convenience. This is one of pegjs' charms. I often tell colleagues that they can understand parsers and turn to http://pegjs.majda.cz/online to show them how simple it can be. That [a-z]+ returns an array of characters becomes a point of enlightenment.

I call it automatic returns a risky convenience because, first, it's hard to please everyone, and, second, any work done that need not be done raises performance issues. I noticed in the prefix $ discussion I cited above that you mention possibly avoiding allocating single character strings. All of those allocations cost, eventually, and when they are built-in there is no recourse.

Ian Piumarta's pegleg is very careful about storage allocation. He adds < ... > syntax to peg which works much like your optimized $( ... ) if I understand it correctly. By the time strings are constructed details of predicates are history.

Aside: I hacked Ian's parser generator to handle << ... >> which captured text and fed it straight into my downstream analysis for exploratory parsing of the whole wikipedia. blog post

I hope this is helpful. I'm happy to elaborate on pegleg if that would be useful. However, you're already set a course so I would not suggest you deviate too far. You have a sweet package. Nice work.

@andreineculau
Copy link
Contributor

but users expect ["a", "b"].

I got stuck when reading the above. Maybe I'm not a common user, but I wouldn't expect that at all. I value the uniformity actually: every expression returns something, may it be undefined.

Regardless, what is the quintessence of this request? What I hear is something along the lines of making the parser return a close-to-the-intended AST, and leave aside constructs that appeared because of grammar definition or PEGjs internals.

Currently the $ prefix does a neat job in that direction (toString(), flattening, remove undefined), but there isn't a e.g. € prefix to just flatten and remove undefined, without toString().

Somehow I think that's what's missing, and it would allow the grammar definition to have influence on the AST form. At least that's what I'd prefer rather than introducing rules that break uniformity.

startA1 = !x "a" "b" "c" // [undefined, "a", "b", "c"]
startA2 = €(!x "a" "b" "c") // ["a", "b", "c"]

startB1 = !x "a" ("b" "c") // [undefined, "a", ["b", "c"]]
startB2 = €(!x "a" ("b" "c")) // ["a", "b", "c"]

@Phrogz
Copy link

Phrogz commented Oct 10, 2014

I recently used LPeg in Lua to do some parsing, and found its "capture" mechanism exceptionally elegant. In PEG.js nesting productions results in nested output. Consider this PEG.js grammar:

pairs = a:pair b:(wsp pair)* { console.log(a); console.log(b) }
pair  = x:num wsp y:num { return [x,y] }
num   = n:[0-9]+ { return n*1 }
wsp   = [ ,]+  { return ''  } // I don't really want this value at all!

When you parse 1,2 3,4 5,6 with the above you get:

[ 1, 2 ] // a
[ [ '', [ 3, 4 ] ], [ '', [ 5, 6 ] ] ] // b (OUCH)

Look at that horrific undesirable nesting. In LPeg, captures bubble up through the productions. If you have a complicated grammar and capture only numbers in a leaf production, then your final result from parsing is just an array of the matched numbers. There is no nesting introduced by a production. If you add a production you don't have to worry about removing an extra level of nesting. There is no need for a flatten() function.

I encourage you to check out LPeg. The syntax is insane-looking, largely because it's a righteous hack that allows PEG directly within the language itself, and thus needs to use standard operators (and their existing precedence) to combine productions. But separate from the syntax, the way it handles captures is really, really elegant IMHO.

For comparison, here's the above grammar in LPeg (using the bottom-up, variable-based approach; there's another way to write the above grammar top-down that supports recursion and loops):

function jointwo(x,y) return {x,y} end
L = require('lpeg')
set, range = L.S, L.R             -- more understandable function names
wsp   = set(' ,')^1               -- ^1 is "one or more of", i.e. + in PEG
num   = range('09')^1/tonumber    -- call tonumber(), use the result as the capture
pair  = (num * wsp * num)/jointwo -- * is "followed by", i.e. space in PEG
pairs = pair * (wsp * pair)^0     -- ^0 is "zero or more of", i.e. * in PEG

dump{ pairs:match('1,2 3,4 5,6 7,8') }
--> { {1,2}, {3,4}, {5,6}, {7,8} }

@atavakoli
Copy link
Contributor

I think there's value in returning something; for consistency, but also because there will be some use-cases (maybe uncommon ones) where you actually want access to that value.

For example, suppose I'm going through a list of words & want to classify them based on how they start:

ClassifiedWords
  = first:ClassifiedWord rest:(_ ClassifiedWord)* {
    var result = [first];
    result = result.concat(rest.map(function(e) { return e[1]; }));
    return result;
  }

ClassifiedWord
  = reason:&StartingCharsOfInterest w:Word {
    return {
      interesting: true,
      reason: reason,
      word: w
    };
  }
  / w:Word {
    return {
      interesting: false,
      word: w
    };
  }

StartingCharsOfInterest
  = $("sch"i / "qu"i / "ae"i) {
    return "foo";
  }
  / $("kn"i / "gn"i) {
    return "bar";
  }

Word = $([a-z]i+)

_ = [ \t\r\n]+

If I had input like "school blah knight", I'd get the following output:

[
   {
      "interesting": true,
      "reason": undefined, <-- expecting "foo"
      "word": "school"
   },
   {
      "interesting": false,
      "word": "blah"
   },
   {
      "interesting": true,
      "reason": undefined, <-- expecting "bar"
      "word": "knight"
   }
]

Not having access would mean moving the classification logic (i.e. the reason return values) into the ClassifiedWord rule, which is possible in this example, but would hurt cohesion and might not even be possible with a more complex grammar.

The example is a bit contrived, but the point is that sometimes knowing why the positive lookahead matched (i.e. having access to its production) can be useful.

@dmajda dmajda modified the milestones: 0.10.0, 0.9.0 Aug 14, 2015
@dmajda
Copy link
Contributor Author

dmajda commented Jul 4, 2016

Moving post-1.0.0. I’m becoming convinced this must be solved together with #11.

@StoneCypher
Copy link

I think this should actually be handled with the es6 empty token (what most people call "array holes.") There's absolutely no sensible place where they should be produced from a parser, and they convey exactly what needs to be conveyed here without any serious changes and without the loss of any scalar types

I think #11 is also easy to solve, but I think they can be solved separately, and that this will happen more safely, more easily testably, and more durably if they are

If we can get a codebase that's ready for PRs, I'll take a swing at it

@StoneCypher
Copy link

Array holes can, also, be detected if you really want to, but don't turn up unless you really fight for it, meaning the principle of least surprise is satisfied without making weird cases technically difficult

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants