Ability to ignore certain productions #11
Agreed. Is there a clean way to do this at the moment?
@benekastah: There is no clean way as of now. This would be hard to do without changing how PEG.js works. Possible solutions include:

I won't work on this now, but it's something to think about in the future.
I would need this feature too. Maybe you could introduce a "skip" token. If a rule returns that token, it is ignored and gets no node in the AST (i.e. no entry in the array).
I am looking for a way to do this as well. I have a big grammar file (it parses the ASN.1 format for SNMP MIB files). I didn't write it, but I trivially transformed it from the original form to create a parser in PEG.js. (This is good. In fact, it's extremely slick that it took me less than 15 minutes to tweak it so that PEG.js would accept it.)

Unfortunately, the grammar was written with the ability simply to ignore whitespace and comments when it encounters them. Consequently, no real MIB files can be handled, because the parser stops at the first occurrence of whitespace. I am not anxious to have to figure out the grammar so that I can insert all the proper whitespace in all the rules (there are about 126 productions...). Is there some other way to do this?

NB: In the event that I have to modify the grammar by hand, I asked for help with some questions in a ticket in the Google Groups list: http://groups.google.com/group/pegjs/browse_thread/thread/568b629f093983b7

Many thanks!
Thanks to the folks over on Google Groups. I think I got enough information to do what I want. But I'm really looking forward to the ability in PEG.js to mark whitespace/comments as something to be ignored completely, so that I wouldn't have to take a few hours to modify an otherwise clean grammar... Thanks! Rich
I agree with the assertion that pegjs needs the ability to skip tokens. I may look into it, since if you want to write a serious grammar you will go crazy putting whitespace between every token.
Since the generated parsers are modular, as a workaround you can create a simplistic lexer and use its output as input to the for-real one, e.g. elideWS.pegjs:

```pegjs
s = input:(whitespaceCharacter / textCharacter)* {
      var result = "";
      for (var i = 0; i < input.length; i++) result += input[i];
      return result;
    }

whitespaceCharacter = [ \n\t] { return ""; }
textCharacter = c:. { return c; }  // not shown in the original comment; assumed definition
```

But that causes problems when whitespace is a delimiter, as for identifiers.
Bumping into this issue quite often. What I was thinking is to be able to define skip rules that can be used as alternatives whenever there's no match. This introduces the need for a non-breaking class though. Example with
Still digesting this. Any feedback? Might be a very stupid idea.
So the difference is that you want to distinguish when the overall engine is [...]. Is there a case when you want to not ignore whitespace when in lexer mode? Would the following be equivalent? Float [...] or otherwise extend peg to process the typical regex's directly and outside [...].

On Apr 19, 2014, at 3:22 PM, Andrei Neculau notifications@github.com wrote:
@waTeim Actually no. Traditionally the parsing process is split into lexing and parsing. During lexing every character is significant, including whitespace. But these are lexed into a "discard" token. The parser, when advancing to the next token, will then discard any discard tokens. The important part is that you can discard anything, not just whitespace. This behavior is exactly what @andreineculau is describing. The basic idea for implementing this is to additionally check against all discard rules when transitioning from one state to the next.
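The lex-then-discard flow described above can be sketched in plain JavaScript. The function names and token shapes here are invented for illustration; this is not PEG.js API:

```javascript
// Minimal sketch: the lexer tags whitespace as "discard" tokens,
// and the parser only ever sees the significant ones.
function tokenize(input) {
  const tokens = [];
  const re = /(\s+)|(\d+)|([A-Za-z_]\w*)/g;
  let m;
  while ((m = re.exec(input)) !== null) {
    if (m[1] !== undefined) tokens.push({ type: "discard", text: m[1] });
    else if (m[2] !== undefined) tokens.push({ type: "number", text: m[2] });
    else tokens.push({ type: "ident", text: m[3] });
  }
  return tokens;
}

// What "advancing to the next token" would do: skip discard tokens.
function significant(tokens) {
  return tokens.filter(t => t.type !== "discard");
}

const toks = significant(tokenize("foo 42  bar"));
console.log(toks.map(t => t.text)); // [ 'foo', '42', 'bar' ]
```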
On Apr 23, 2014, at 2:54 PM, Sean Farrell notifications@github.com wrote:
Therefore there’s no need to have glue elements (e.g. ‘#’) in the language
Ok, then I misunderstood you. There may be cases for lexer states, but that is a totally different requirement and IMHO outside the scope of peg.js.
@andreineculau Because your grammar is whitespace sensitive, this is not applicable. The discard tokens would be part of the grammar, the lexing part to be exact. I don't know what the big issue is here; this was already sufficiently solved in the 70s. Each and every language has its own skippable tokens and where they are applicable. The whitespace and comments are as much part of the language definition and thus part of the grammar. It just turns out that with most languages the skippable tokens may appear between each and every other token, and using a discard rule makes it WAY simpler than writing [...]

I understand that retconning discard rules into pegjs is not easy, but that does not mean that it is not a laudable goal.
Oh man, free response section! I have a lot to say, so sorry for the length.
The addition I'd make is an options section that may be included in any production, e.g.

    header_field whitespace(IGNORE)

The http-bis language would not be limited by this re-write (see appendix A).

It feels like you are exchanging requiring the user to fill the parser definition with a bunch [...]. Appendix A: HTTP-bis is not one of those occurrences, just badly documented.

But I can see how it would be easier on the parser definer to simply cut and paste the [...]:

    production1(state==1)
    production2(state==2)
    production3
    production4

In other words, just like lex/yacc make it possible for productions to only be available if the system is in a particular state, and allow the user to set that state value.

Or you could make it easier on the user and more apparent to the reader with another [...]:

    production(DONTIGNORE)

which would allow the parser to override the default action of discarding tokens marked to ignore, but I don't think that extra flexibility is needed.

I think all this comes down to is (I'm making some assumptions here) currently, the parameter to it [is] getNextToken(input, options).

Appendix A) That HTTP-bis spec. Ok, I've read some but have not read all of "Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing". I don't like the way they have defined their grammar. I don't suggest changing the input; it [defines]

    OWS ::== (SP | HTAB)*

which is just repetition of tabs and spaces, for no good reason. They have made the language harder to parse (require the lexical analyzer [...]). They have defined OWS as "optional white space", BWS as "bad whitespace" or otherwise optional. In their spec, the only place RWS is used is here:

    Via = 1#( received-protocol RWS received-by [ RWS comment ] )

but 'protocol-version' is numbers and maybe letters, while 'received-by' is numbers and letters. In other words, [...]

On Apr 24, 2014, at 1:23 PM, Andrei Neculau notifications@github.com wrote:
@waTeim I think you are going overboard with this. I have written quite a few parsers and I think lexer states were never really useful as such. Most cases I saw were where the lexer consumed block comments, and it was "simpler" to put the lexer into "block comment mode" and write simpler patterns than the über pattern to consume the comment (and count lines). I have never seen any proper use of lexer states stemming from the parser. The fundamental problem here is that with one token of look-ahead, when the parser sees the token to switch states, the lexer has already erroneously lexed the next token. What you propose is almost impossible to implement without back-tracking, and that is never a good feature in a parser. When writing a grammar you basically define which productions are considered parsed and what can be skipped. In @andreineculau's example there are two options: either you handle whitespace in the parser or you define the trailing ":" as part of the token.
I might suggest turning the problem into specifying a whitelist—which portions do I want to capture and transform—instead of a blacklist. Although whitespace is one problem with the current capture system, the nesting of rules is another. As I wrote in Issue #66, the LPeg system of specifying what you want to capture directly, via transforms or string captures, seems more useful to me than specifying a handful of productions to skip and still dealing with the nesting of every other production. See my comment in Issue #66 for a simple example of LPeg versus PEG.js with respect to captures. Although the names are a bit cryptic, see the Captures section of the LPeg documentation for the various ways that you can capture or transform a given production (or portion thereof).
Hello, I've created a snippet to ignore some general cases:

```pegjs
{
  var strip = require('./strip-ast');
}
```

The two ways to improve it:
@richb-hanover Where did your ASN.1 definition parser efforts land? |
@atesgoral - I bailed out. I didn't need a "real parser" - I only needed to isolate certain named elements in the target file. So I did what any wimpy guy would do - used regular expressions. (And then I had two problems :-) But it did the trick, so I was able to move on to the next challenge. Good luck in your project!
Having had a look at chevrotain and its skip option, something like this is hugely desirable. Too often we find ourselves writing something like this:
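The snippet itself did not survive here; the verbose style being complained about typically looks something like this (rule and label names are invented for illustration):

```pegjs
// Explicit whitespace threaded between every pair of tokens.
assignment = _ name:identifier _ "=" _ value:number _ {
               return { name: name, value: value };
             }
identifier = $[a-z]+
number     = n:$[0-9]+ { return parseInt(n, 10); }
_          = [ \t\n]*   // must be repeated everywhere a gap may occur
```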
Would be cool if we could write this instead:
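A chevrotain-style skip annotation, sketched here as purely hypothetical syntax (nothing like `%skip` exists in PEG.js), might look like:

```pegjs
// Hypothetical: the whitespace rule is declared skippable once,
// and no other rule needs to mention it.
assignment = name:identifier "=" value:number {
               return { name: name, value: value };
             }
identifier = $[a-z]+
number     = n:$[0-9]+ { return parseInt(n, 10); }
_ "whitespace" = [ \t\n]* %skip   // imagined annotation, not real PEG.js
```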
@richb-hanover, and anybody else who got here in search of a similar need, I ended up writing my own parsers, too: https://www.npmjs.com/package/asn1exp and https://www.npmjs.com/package/asn1-tree
A skip would be relatively easy to implement using
Just stumbled upon this too. When we write a rule, we can add a return block at the end of it. Somewhere in the code the returned value of that block goes into the output stream. So what would need to change in PEG.js if I want to skip a value returned by a rule when that value is the result of calling e.g. `skip()`? As mentioned above, skip() could return a Symbol, which is then checked by the code somewhere and removed.
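A minimal plain-JavaScript sketch of that idea; the `SKIP` sentinel and the `filterSkipped` helper are hypothetical stand-ins, not PEG.js API:

```javascript
// A unique sentinel a rule's action could return to mean "produce no node".
const SKIP = Symbol("skip");

// Stand-in for what a generated parser could do with the value
// array a sequence rule collects: drop everything marked SKIP.
function filterSkipped(values) {
  return values.filter(v => v !== SKIP);
}

// Simulated rule results: the whitespace actions returned SKIP.
const ruleResults = ["foo", SKIP, "bar", SKIP, "baz"];
console.log(filterSkipped(ruleResults)); // [ 'foo', 'bar', 'baz' ]
```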
I don't understand your question. Are you looking for a way to fail a rule under some circumstances? Use
If it helps anyone, I ignore whitespace by having my top-level rule filter the array of results. Example:
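A grammar in the style described might look like this (a sketch; the rule names and the `null` marker are assumed, not taken from the original example):

```pegjs
// The top-level rule filters whitespace results out of the match array.
start = results:(token / ws)* {
          return results.filter(function (r) { return r !== null; });
        }
token = $[a-zA-Z0-9]+
ws    = [ \t\n]+ { return null; }
```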
This will happily parse input while keeping whitespace out of the result array. |
That only works for top-level productions. You have to manually filter every parent that could contain a filterable child. |
@StoneCypher True, it does require some top-level work, but it works for me, and I think as long as the grammar isn't too complex one should be able to get away with having a top-level filter. Other than that, all I can think of is to have a top-level function that filters whitespace from input and pass every match through it. Slower for sure, and it requires a lot more calls, but easy if you (like me) pass everything into a token generator. You can call the filter function from where you generate tokens, and then you only have to worry about generating your tokens; the whitespace is more or less automatically filtered.
One of the things I liked about the current HEAD of pegjs is its (undocumented) support for picking fields without having to create labels and do return statements. It looks like this:

```pegjs
foo = @bar _ @baz
bar = $"bar"i
baz = $"baz"i
_ = " "*
```

```js
parse('barbaz') // returns [ 'bar', 'baz' ]
```

I feel like this gives nice, clean, explicit syntax for this use case plus a bunch of others.
@hildjj This is exactly what I needed in combination with parsing lists. Peggy is wonderful, thank you for your effort! I guess this example would also make a good candidate for the documentation page, as it illustrates the usage of
It would be nice to be able to tell the lexer/parser to ignore certain productions (i.e. whitespace and comment productions) so that it becomes unnecessary to litter all other productions with comment/whitespace allowances. This may not be possible though, due to the fact that lexing is built in with parsing?
Thank you