Peg.js haddock parser as replace for custom regex parser #863

nixorn · 2019-03-03T11:36:47Z

Hey! This PR is not ready to merge yet; it's just to show how it's going, and maybe you will notice some errors.

This code does not solve most of the problems described in #798, but it creates a base for solving them. I think main plan is to create two parsers, one for codeworld and one for haskell/haddock. This PR is about the haddock parser.
I intentionally left the old regex-based code and some testing code in place, to show how I made sure it works. I will delete the old code and the testing code once you say everything is OK. I tested this parser using https://github.com/flitbit/diff

Also, this code does not handle some cases that the old parser handled, for example the class and instance keywords, which do not appear in the current version of codeworld-base.txt.

Also, the haskell parser has been renamed to haddock parser.

cdsmith · 2019-03-03T14:16:47Z

build.sh

+
+# Build pegjs parsers
+function build_pegjs_parsers {
+    bin/pegjs --export-var haddock_parser --format globals -o ../haddock-parser.js ../../web/js/haddock-parser.pegjs


Does haddock-parser.pegjs need to be accessible at runtime? If not, then it shouldn't go in web.

OK, what would be a better place for it?

Moved to the root of the project.

cdsmith · 2019-03-03T14:18:03Z

install.sh

@@ -255,5 +255,11 @@ run $BUILD/CodeMirror git checkout tags/5.43.0
 run $BUILD/CodeMirror npm install
 run $BUILD/CodeMirror npm install -s uglify-js git+ssh://git@github.com:angelozerr/CodeMirror-Extension.git

+# Install pegjs
+
+run $BUILD       git clone https://github.com/pegjs/pegjs.git


You should be able to install directly using npm instead of git.

Yes, I changed this.

cdsmith · 2019-03-03T15:02:35Z

Okay, this is the big important comment. You wrote: "I think main plan is to create two parsers, one for codeworld and one for haskell/haddock."

I do NOT think that should be the plan. I would be very disappointed if that happened. Haddock's hoogle output format is a subset of Haskell, and I'd like to see the same parser used for both. This is not just less code (in fact, it probably isn't much less), but also a guarantee that there won't be a permanent gap between feature sets of builtin code and user code. Is there anything specific you ran into that makes you think it wouldn't be feasible to use the same parser for both?

The one thing I'm worried about is handling layout. I've poked at this a little bit over the past week or so, and found a lot of requests but no answers for a good story with PEG.js and indentation-sensitive parsing.

nixorn · 2019-03-03T15:52:26Z

Yes, I understand that haddock output is a subset of Haskell. But I'm not sure about the codeworld language (is it a language?). For example:

codeworld shipPic(ship(w), direction(w), effectiveThrust(w))
haskell shipPic (ship w) (direction w) (effectiveThrust w)

I guess codeworld would need a different parser, because it uses parentheses and commas differently. Am I wrong?

cdsmith · 2019-03-03T16:23:46Z

The CodeWorld language is a subset of Haskell (with language extensions). If you just parse it as Haskell, it should parse correctly.

In your examples:

ship(w) is the same as ship w, but just has extra parentheses. CodeWorld requires those parentheses, but Haskell allows them, too.
shipPic(a, b, c) is just the function shipPic applied to one argument, the tuple (a, b, c). CodeWorld programmers consider this a function with three arguments, but it's the same thing. Just a different way of encoding multiple arguments, using product types rather than currying.
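
As a rough analogy only, the two encodings of "three arguments" can be sketched in JavaScript. The names below (shipPicCurried, shipPicTupled, curry3, uncurry3) are made up for this illustration and are not part of CodeWorld; an array stands in for the Haskell tuple.

```javascript
// Curried style, like Haskell's `shipPic a b c`: one argument at a time.
const shipPicCurried = a => b => c => `pic(${a}, ${b}, ${c})`;

// Tupled style, like CodeWorld's `shipPic(a, b, c)`: a single argument
// that happens to be a 3-element tuple (modelled here as an array).
const shipPicTupled = ([a, b, c]) => `pic(${a}, ${b}, ${c})`;

// Converting between the two encodings is mechanical.
const uncurry3 = f => ([a, b, c]) => f(a)(b)(c);
const curry3 = f => a => b => c => f([a, b, c]);
```

Both styles carry the same information, which is why a single parser can treat the CodeWorld call as ordinary function application to a tuple.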

nixorn · 2019-03-03T18:59:23Z

OK, I got it: there should be one parser for all kinds of input. I'll continue working on that. It's good that we clarified this.

nixorn · 2019-03-04T10:21:25Z

About layout-sensitive parsing: I'll try to adapt the solution from here: http://hackage.haskell.org/package/haskell-src-1.0.3.0/docs/src/Language-Haskell-ParseMonad.html#getSrcLoc
I have access to the parsing context inside the parser via the location and text functions, so I can split the source into lines, cache the result, and calculate the indent level using location + line index + column index. I think it is possible (or I don't understand the problem).
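
A minimal sketch of that idea, assuming the location object has a 1-based line field, as the start position returned by PEG.js's location() does. The helper name makeIndentLookup is hypothetical, and the sketch only reports an indent width; it does not implement any layout rule by itself.

```javascript
// Hypothetical helper: indentation of the line that contains a location.
// `location` mimics the shape of PEG.js's location().start ({ line, column }).
function makeIndentLookup(source) {
  const lines = source.split('\n');   // split once, reuse for every lookup
  const cache = new Map();            // line number -> indent width
  return function indentAt(location) {
    const lineNo = location.line;
    if (!cache.has(lineNo)) {
      const text = lines[lineNo - 1] || '';
      // Count leading spaces and tabs; a tab counts as one column here,
      // which is a simplification.
      cache.set(lineNo, text.match(/^[ \t]*/)[0].length);
    }
    return cache.get(lineNo);
  };
}
```

Inside a grammar action this could be called as indentAt(location().start), provided the lookup was created from the same source text that is being parsed.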

cdsmith · 2019-03-04T15:32:46Z

I'm not sure what you mean by getSrcLoc, which looks like just a simple accessor into parser state. It tells you the current line and column, but you also need to include changes in the indent in the parsing itself.

There are a few cases here, and it might get pretty complex. That's okay, because a good answer here is worth the work and lays the foundation for lots of cool features in the future.

Problem 1: Parsing correctly at all. You will need to handle layout algorithms to implement correct parsing. For example, this:

foo = 5
    bar = 5

should be a syntax error, because the implied layout in the first column makes bar = 5 part of the previous line.

Problem 2: Recovery. The fact that there's a syntax error on the current line (which is probably only half-typed anyway) shouldn't prevent autocomplete and docs from including symbols in the rest of the code. That means you need to isolate syntax errors and only ignore the sections with the error. Layout is probably the easiest way to do this. If you first split up the module into layout sections, then you can separately parse each one, and if it fails, drop it without ignoring the rest of the input.
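
The recovery idea can be sketched roughly as follows. This is a simplification under stated assumptions: a section is taken to start at any non-blank line beginning in column 0 (so a column-0 comment also starts one), and parseSection is a placeholder for whichever parser ends up being used. Both function names are hypothetical.

```javascript
// Split a module into top-level sections: a new section starts at every
// line that begins with a non-whitespace character; indented and blank
// lines are attached to the section before them.
function splitTopLevel(source) {
  const sections = [];
  for (const line of source.split('\n')) {
    const startsSection = /^\S/.test(line);
    if (startsSection || sections.length === 0) {
      sections.push([line]);
    } else {
      sections[sections.length - 1].push(line);
    }
  }
  return sections
    .map(ls => ls.join('\n'))
    .filter(s => s.trim() !== '');   // drop sections that are only blank lines
}

// Parse each section independently; a failure discards only that section.
function parseWithRecovery(source, parseSection) {
  const results = [];
  for (const section of splitTopLevel(source)) {
    try {
      results.push(parseSection(section));
    } catch (err) {
      // Syntax error in this section: skip it and keep going.
    }
  }
  return results;
}
```

With the example from Problem 1, the indented bar = 5 line stays attached to foo = 5, so a correct section parser rejects that one section while later definitions are still collected.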

nixorn · 2019-03-05T13:18:46Z

I think I understood the problem.
This (tl;dr) issue contains a lot of good detail about the problem and some solutions that use custom state to handle layout with peg.js. But these solutions are more like hacks with potentially dangerous side effects. At the end of the discussion, people recommend using other parsers.

I also found this and this. The last one looks too verbose, but it works.

I can either continue with peg.js or try parsimmon/chevrotain. What do you think?

cdsmith · 2019-03-05T15:35:03Z

I can look into this, but honestly, it'll be a couple of days before I get to it. Can you evaluate some options until then? Another that I ran across while searching is using nearley (for parsing) and moo (for lexing). There's documentation in the nearley docs for how to handle indentation with that setup.

nixorn · 2019-03-05T18:21:37Z

OK, I'll make a minimal example for each of these.

nixorn · 2019-03-06T09:53:11Z

Also, Nearley + Moo come last in this benchmark: https://sap.github.io/chevrotain/performance/

nixorn · 2019-03-06T12:31:31Z

I like moo: it has a neat interface and handles indents well. https://gist.github.com/nixorn/f88fb194bd92b4fee3f351e689e99e0a
It is also possible to use regexes to define tokens, which is convenient. Pegjs uses its own character class implementation.

nixorn · 2019-03-06T13:21:30Z

This is a chevrotain lexer for Python-like indentation: https://github.com/SAP/chevrotain/blob/master/examples/lexer/python_indentation/python_indentation.js

Advantages of chevrotain

It is much faster (according to benchmarks).
It doesn't require a build step, so the lexer and parser can live in a single file, without building the parser or touching build.sh/install.sh.

Disadvantages

If we add any JS build tool to the project, the chevrotain code will break because of it.
It looks verbose.
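
For reference, the stack-based mechanism behind Python-like indentation lexing can be sketched without any library. This is a hand-rolled illustration, not chevrotain's or moo's API, and the function name indentTokens and the token shapes are made up here. Haskell's layout rule differs (blocks are opened by keywords such as where, let, do and of), so the sketch shows only the indent stack, not a Haskell lexer.

```javascript
// Turn leading spaces into INDENT / DEDENT tokens using a stack of widths.
function indentTokens(source) {
  const tokens = [];
  const stack = [0];                        // currently open indent levels
  for (const line of source.split('\n')) {
    if (line.trim() === '') continue;       // blank lines do not affect layout
    const width = line.match(/^ */)[0].length;
    if (width > stack[stack.length - 1]) {
      stack.push(width);
      tokens.push({ type: 'INDENT' });
    }
    while (width < stack[stack.length - 1]) {
      stack.pop();
      tokens.push({ type: 'DEDENT' });
    }
    if (width !== stack[stack.length - 1]) {
      // Dedent to a column that was never opened.
      throw new Error('inconsistent dedent');
    }
    tokens.push({ type: 'LINE', text: line.trim() });
  }
  while (stack.length > 1) {                // close anything still open at EOF
    stack.pop();
    tokens.push({ type: 'DEDENT' });
  }
  return tokens;
}
```

A parser consuming these tokens can treat INDENT and DEDENT like explicit braces, which keeps the grammar itself context-free.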

nixorn · 2019-03-07T14:07:49Z

Sample of a chevrotain-based lexer: https://gist.github.com/nixorn/a8f1acb9c957f513e6072c0fcde8141b . Now trying the chevrotain parser.

nixorn added 3 commits March 3, 2019 12:22

Append peg.js to project.

5519b76

Merge branch 'master' of github.com:google/codeworld into 798

c22beb7

Fix error with parsed var not in scope, fix parsing of data keyword.

4ceb96d

cdsmith reviewed Mar 3, 2019

Rename haddock-parser to haskell-parser, move haskell-parses.pegjs ou…

085b978

…t from web.

Commit to just don't lost changes.

f7ddf98

cdsmith mentioned this pull request Mar 21, 2019

Tool tip with argument help when typing a function #850

Closed

devanshbatra04 mentioned this pull request Apr 23, 2019

Tooltips point to an incorrect definition #907

Open

nixorn closed this May 27, 2019

nixorn deleted the 798 branch May 27, 2019 08:36
