Peg.js haddock parser as replace for custom regex parser #863
Also, the Haskell parser is renamed to the haddock parser.
build.sh
# Build pegjs parsers
function build_pegjs_parsers {
bin/pegjs --export-var haddock_parser --format globals -o ../haddock-parser.js ../../web/js/haddock-parser.pegjs
Does `haddock-parser.pegjs` need to be accessible at runtime? If not, then it shouldn't go in `web`.
OK, what would be a better place for it?
Moved to the root of the project.
install.sh
@@ -255,5 +255,11 @@ run $BUILD/CodeMirror git checkout tags/5.43.0
run $BUILD/CodeMirror npm install
run $BUILD/CodeMirror npm install -s uglify-js git+ssh://git@github.com:angelozerr/CodeMirror-Extension.git

# Install pegjs

run $BUILD git clone https://github.com/pegjs/pegjs.git
You should be able to install directly using npm instead of git.
Yes, I changed this.
Okay, this is the big important comment. You wrote: "I think main plan is to create two parsers, one for codeworld and one for haskell/haddock." I do NOT think that should be the plan. I would be very disappointed if that happened. Haddock's Hoogle output format is a subset of Haskell, and I'd like to see the same parser used for both. This is not just less code (in fact, it probably isn't much less), but also a guarantee that there won't be a permanent gap between feature sets of builtin code and user code. Is there anything specific you ran into that makes you think it wouldn't be feasible to use the same parser for both? The one thing I'm worried about is handling layout. I've poked at this a little bit over the past week or so, and found a lot of requests but no answers for a good story with PEG.js and indentation-sensitive parsing.
Yes, I understand that haddock output is a subset of Haskell. But I'm not sure about the CodeWorld language (is it a language?). For example:
I guess CodeWorld should have a different parser, because CodeWorld uses parentheses and commas differently. Am I wrong?
The CodeWorld language is a subset of Haskell (with language extensions). If you just parse it as Haskell, it should parse correctly. In your examples:
OK, I got it: there should be one parser for all kinds of input, so I'll continue working on it. It's good that we clarified this.
About layout-sensitive parsing: I'll try to take a solution from here: http://hackage.haskell.org/package/haskell-src-1.0.3.0/docs/src/Language-Haskell-ParseMonad.html#getSrcLoc
I'm not sure what you mean by getSrcLoc, which looks like just a simple accessor into parser state. It tells you the current line and column, but you also need to include changes in the indent in the parsing itself. There are a few cases here, and it might get pretty complex. That's okay, because a good answer here is worth the work and lays the foundation for lots of cool features in the future.
Problem 1: Parsing correctly at all. You will need to handle layout algorithms to implement correct parsing. For example, this:
should be a syntax error, because the implied layout in the first column makes
Problem 2: Recovery. The fact that there's a syntax error on the current line (which is probably only half-typed anyway) shouldn't prevent autocomplete and docs from including symbols in the rest of the code. That means you need to isolate syntax errors and only ignore the sections with the error. Layout is probably the easiest way to do this. If you first split up the module into layout sections, then you can separately parse each one, and if it fails, drop it without ignoring the rest of the input.
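The recovery idea in Problem 2 could be sketched roughly as follows. This is only a minimal illustration under simplifying assumptions: a new section starts at any non-blank line beginning in column 0, which is cruder than the real Haskell layout rule, and `splitTopLevelSections`, `parseWithRecovery`, and `toyParseSection` are hypothetical names, with the toy parser standing in for whichever real parser is chosen.

```javascript
// Split a module into top-level sections. Assumption: any non-blank line
// that starts in column 0 begins a new section; indented and blank lines
// belong to the section before them.
function splitTopLevelSections(source) {
  const sections = [];
  let current = [];
  const flush = () => {
    const text = current.join('\n');
    if (text.trim() !== '') sections.push(text);
    current = [];
  };
  for (const line of source.split('\n')) {
    const startsAtColumnZero = line.length > 0 && !/^\s/.test(line);
    if (startsAtColumnZero) flush();
    current.push(line);
  }
  flush();
  return sections;
}

// Parse each section on its own; a failing section is recorded and dropped
// instead of aborting the whole module.
function parseWithRecovery(source, parseSection) {
  const parsed = [];
  const errors = [];
  for (const section of splitTopLevelSections(source)) {
    try {
      parsed.push(parseSection(section));
    } catch (err) {
      errors.push({ section, message: err.message });
    }
  }
  return { parsed, errors };
}

// Toy stand-in for a real parser: accepts `name = ...` or `name :: ...`
// with balanced parentheses, and throws on anything else.
function toyParseSection(text) {
  const match = /^([a-z][A-Za-z0-9_']*)\s*(::|=)/.exec(text);
  const opens = (text.match(/\(/g) || []).length;
  const closes = (text.match(/\)/g) || []).length;
  if (!match || opens !== closes) throw new Error('syntax error in section');
  return { name: match[1], kind: match[2] === '=' ? 'definition' : 'signature' };
}

const source = [
  'program = drawingOf(picture)',
  'picture = circle(', // half-typed line, should not hide the rest
  '   ',
  'radius :: Number',
  'radius = 3',
].join('\n');

const { parsed, errors } = parseWithRecovery(source, toyParseSection);
console.log(parsed.map((d) => d.name)); // names from the sections that parsed
console.log(errors.length);             // number of sections dropped
```

With this shape, the half-typed `picture` definition is the only section lost, so autocomplete could still offer `program` and `radius`.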
I think I understand the problem. I also found this and this. The last one looks too verbose, but it works. I can either continue with peg.js or try parsimmon/chevrotain. What do you think?
I can look into this, but honestly, it'll be a couple of days before I get to it. Can you evaluate some options until then? Another option that I ran across while searching is using nearley (for parsing) and moo (for lexing). There's documentation in the nearley docs for how to handle indentation with that setup.
OK, I'll make a minimal example for each of these.
Also, Nearley + Moo come last on this benchmark: https://sap.github.io/chevrotain/performance/
I like moo: it has a neat interface and good handling of indents. https://gist.github.com/nixorn/f88fb194bd92b4fee3f351e689e99e0a
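For reference, the general technique of turning indentation into tokens can be sketched in plain JavaScript. This does not use moo's API and is not the code from the gist; `layoutTokens` is a hypothetical name, and the sketch assumes indentation is written with spaces only. It follows the Python-style indent/dedent scheme, which is simpler than Haskell's layout rule (Haskell keys layout on the column of the first token after `where`, `let`, `do`, or `of`).

```javascript
// Emit `indent` / `dedent` tokens around `line` tokens by tracking a stack
// of indentation widths. Assumption: indentation uses spaces only.
function layoutTokens(source) {
  const tokens = [];
  const indentStack = [0];
  const top = () => indentStack[indentStack.length - 1];

  for (const rawLine of source.split('\n')) {
    if (rawLine.trim() === '') continue; // blank lines do not affect layout
    const width = rawLine.length - rawLine.trimStart().length;

    if (width > top()) {
      // Deeper than the enclosing block: open a new block.
      indentStack.push(width);
      tokens.push({ type: 'indent' });
    } else {
      // Shallower: close every block that is indented deeper than this line.
      while (width < top()) {
        indentStack.pop();
        tokens.push({ type: 'dedent' });
      }
    }
    tokens.push({ type: 'line', text: rawLine.trim() });
  }

  // Close any blocks still open at end of input.
  while (indentStack.length > 1) {
    indentStack.pop();
    tokens.push({ type: 'dedent' });
  }
  return tokens;
}

const example = [
  'f x = y',
  '  where',
  '    y = x',
  'main = f 1',
].join('\n');

console.log(layoutTokens(example).map((t) => t.type).join(' '));
```

A parser can then treat `indent` and `dedent` like opening and closing braces, which keeps the grammar itself free of whitespace handling.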
This is a chevrotain lexer for Python-like indentation: https://github.com/SAP/chevrotain/blob/master/examples/lexer/python_indentation/python_indentation.js
Advantages of chevrotain
Disadvantages
Sample of a chevrotain-based lexer: https://gist.github.com/nixorn/a8f1acb9c957f513e6072c0fcde8141b. Now trying the chevrotain parser.
Hey! This PR is not ready to merge yet; it's just to show how it's going, and maybe you will notice some errors.
This code does not solve most of the problems described in #798, but it creates a base for solving them. I think main plan is to create two parsers, one for codeworld and one for haskell/haddock. This PR is about the haddock parser.
I intentionally left the old regex-based code and some testing code in place to show how I made sure it works. I will delete the old and testing code if you say everything is OK. I tested this parser using https://github.com/flitbit/diff
Also, this code does not handle some conditions from the old parser. For example, there are no `class` and `instance` keywords in the current version of codeworld-base.txt.
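The testing approach mentioned above (comparing the old regex parser's output with the new parser's output) can be illustrated with a small hand-written structural diff. This is a stand-in written for illustration, not the API of the flitbit/diff library, and the two sample objects are invented; the real parser output shape may differ.

```javascript
// Return the paths at which two JSON-like values differ.
// A hand-rolled stand-in for a deep-diff library, for illustration only.
function diffPaths(expected, actual, path = '') {
  const bothObjects =
    expected !== null && actual !== null &&
    typeof expected === 'object' && typeof actual === 'object';
  if (!bothObjects) {
    return expected === actual ? [] : [path || '(root)'];
  }
  const keys = new Set([...Object.keys(expected), ...Object.keys(actual)]);
  const differences = [];
  for (const key of keys) {
    const childPath = path ? `${path}.${key}` : key;
    differences.push(...diffPaths(expected[key], actual[key], childPath));
  }
  return differences;
}

// Hypothetical outputs of the old and new parsers for the same input.
const fromRegexParser = {
  circle: { type: 'Number -> Picture', doc: 'A thin circle' },
};
const fromPegParser = {
  circle: { type: 'Number -> Picture', doc: 'A thin circle.' },
};

// An empty list means the two parsers agree on this input.
console.log(diffPaths(fromRegexParser, fromPegParser));
```

Running both parsers over the same file and checking that the list of differing paths is empty (or contains only expected differences) gives a quick regression check while the old code is still in the tree.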