New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make &... and !... not return any value #66
Comments
Perhaps it could return undefined. I would revise my whitespace rule to return undefined too. A parse that consisted of only whitespace would then parse as undefined. Update: $ prefix (described with this commit 5e146fc) might lessen my desire to have whitespace productions ignored. |
In PEG.js 0.8.0 I made all predicates return Meanwhile, I was thinking about this issue a lot, and I became somewhat stuck. I'll try to describe my thinking and maybe someone will be able to push me into the right direction. Clear CasesMaking the predicates not return any value has clear semantics in case of sequences with 2 or more non-predicate elements. For example The semantics is also clear in case of sequences with only one non-predicate element. There the match result wouldn't be an array, but just a match result of the non-predicate element. For example, Tricky CaseThe tricky case is when a predicate is not part of a sequence, but stands in an expression alone. For example, what should There are several possible approaches: One-element SequenceOne approach is to look at such expression as if it was a one-element sequence. With the predicate match result removed from the resulting array (similarly to the clear cases), the match result would be an empty sequence. This approach is possible, but it creates an inconsistency. This is best illustrated by an example:
To be consistent, the third example should return a one-element array — something which is not desired. Not Returning Any ValueAnother approach is not returning any value at all. If a predicate is used outside a sequence, its expression would be marked with a “no-value” flag, which would “spread outward” through the expression and rule tree, until absorbed by a sequence (or until it bubbles to the top, which would be an error). For example, the following grammars would return
On the other hand, the following grammars would cause an error:
Although there are some unclear boundary cases (e.g. how should the Is This Really Relevant?One may ask, is the case of using predicates outside a sequence really relevant? What about just forbidding that case? It turns out there is at least one important pattern that uses predicates outside a sequence — the
This pattern is used in many grammars and disallowing it would complicate lives of users and force duplications in their grammars. Note that encapsulating this pattern into some built-in pseudo-rule wouldn't help, because all the questions related to predicates would simply re-appear for this rule. What Now?At this point, the solution with not returning any value seems the most appealing to me, but it scares me a little, so I'm not 100% sure. I'll need to think about his more. I welcome any comments, be it new ideas or opinions about solutions described above. |
I don't have a specific suggestion so I will ramble on instead. I assume the parser knows more than easily encoded into results so any decision will involve some compromise. I would think the day to day user values coding convenience most. But irregular behavior leads to the ultimate inconvenience of special cases and out-guessing the parser. That rules automatically return anything is a risky convenience. This is one of pegjs' charms. I often tell colleagues that they can understand parsers and turn to http://pegjs.majda.cz/online to show them how simple it can be. That [a-z]+ returns an array of characters becomes a point of enlightenment. I call it automatic returns a risky convenience because, first, it's hard to please everyone, and, second, any work done that need not be done raises performance issues. I noticed in the prefix $ discussion I cited above that you mention possibly avoiding allocating single character strings. All of those allocations cost, eventually, and when they are built-in there is no recourse. Ian Piumarta's pegleg is very careful about storage allocation. He adds < ... > syntax to peg which works much like your optimized $( ... ) if I understand it correctly. By the time strings are constructed details of predicates are history. Aside: I hacked Ian's parser generator to handle << ... >> which captured text and fed it straight into my downstream analysis for exploratory parsing of the whole wikipedia. blog post I hope this is helpful. I'm happy to elaborate on pegleg if that would be useful. However, you're already set a course so I would not suggest you deviate too far. You have a sweet package. Nice work. |
I got stuck when reading the above. Maybe I'm not a common user, but I wouldn't expect that at all. I value the uniformity actually: every expression returns something, may it be Regardless, what is the quintessence of this request? What I hear is something along the lines of making the parser return a close-to-the-intended AST, and leave aside constructs that appeared because of grammar definition or PEGjs internals. Currently the $ prefix does a neat job in that direction (toString(), flattening, remove undefined), but there isn't a e.g. € prefix to just flatten and remove undefined, without toString(). Somehow I think that's what's missing, and it would allow the grammar definition to have influence on the AST form. At least that's what I'd prefer rather than introducing rules that break uniformity.
|
I recently used LPeg in Lua to do some parsing, and found its "capture" mechanism exceptionally elegant. In PEG.js nesting productions results in nested output. Consider this PEG.js grammar: pairs = a:pair b:(wsp pair)* { console.log(a); console.log(b) }
pair = x:num wsp y:num { return [x,y] }
num = n:[0-9]+ { return n*1 }
wsp = [ ,]+ { return '' } // I don't really want this value at all! When you parse [ 1, 2 ] // a
[ [ '', [ 3, 4 ] ], [ '', [ 5, 6 ] ] ] // b (OUCH) Look at that horrific undesirable nesting. In LPeg, captures bubble up through the productions. If you have a complicated grammar and capture only numbers in a leaf production, then your final result from parsing is just an array of the matched numbers. There is no nesting introduced by a production. If you add a production you don't have to worry about removing an extra level of nesting. There is no need for a I encourage you to check out LPeg. The syntax is insane-looking, largely because it's a righteous hack that allows PEG directly within the language itself, and thus needs to use standard operators (and their existing precedence) to combine productions. But separate from the syntax, the way it handles captures is really, really elegant IMHO. For comparison, here's the above grammar in LPeg (using the bottom-up, variable-based approach; there's another way to write the above grammar top-down that supports recursion and loops): function jointwo(x,y) return {x,y} end
L = require('lpeg')
set, range = L.S, L.R -- more understandable function names
wsp = set(' ,')^1 -- ^1 is "one or more of", i.e. + in PEG
num = range('09')^1/tonumber -- call tonumber(), use the result as the capture
pair = (num * wsp * num)/jointwo -- * is "followed by", i.e. space in PEG
pairs = pair * (wsp * pair)^0 -- ^0 is "zero or more of", i.e. * in PEG
dump{ pairs:match('1,2 3,4 5,6 7,8') }
--> { {1,2}, {3,4}, {5,6}, {7,8} } |
I think there's value in returning something; for consistency, but also because there will be some use-cases (maybe uncommon ones) where you actually want access to that value. For example, suppose I'm going through a list of words & want to classify them based on how they start:
If I had input like "school blah knight", I'd get the following output:
Not having access would mean moving the classification logic (i.e. the The example is a bit contrived, but the point is that sometimes knowing why the positive lookahead matched (i.e. having access to its production) can be useful. |
Moving post-1.0.0. I’m becoming convinced this must be solved together with #11. |
I think this should actually be handled with the es6 I think #11 is also easy to solve, but I think they can be solved separately, and that this will happen more safely, more easily testably, and more durably if they are If we can get a codebase that's ready for PRs, I'll take a swing at it |
Array holes can, also, be detected if you really want to, but don't turn up unless you really fight for it, meaning the principle of least surprise is satisfied without making weird cases technically difficult |
In sequences, returning a value from predicates only confuses users. For example,
start = "a" &. "b"
returns["a", "", "b"]
, on input "ab", but users expect["a", "b"]
.This means predicates won’t be allowed outside of sequences or in one-element sequences because such expressions would have undefined value.
The text was updated successfully, but these errors were encountered: