Peg.js haddock parser as replace for custom regex parser #863

Closed
wants to merge 5 commits into from

Conversation

nixorn
Contributor

@nixorn nixorn commented Mar 3, 2019

Hey! This PR is not ready to merge yet; it's just to show how it's going, and maybe you will notice some errors.

This code does not solve most of the problems described in #798, but it creates a base for solving them. I think the main plan is to create two parsers, one for codeworld and one for haskell/haddock. This PR is about the haddock parser.
I intentionally left the old regex-based code and some testing code in place, to show how I made sure it works. I will delete the old code and the testing code once you say everything is OK. I tested this parser using https://github.com/flitbit/diff

Also, this code does not handle some cases from the old parser; for example, there are no class and instance keywords in the current version of codeworld-base.txt.

build.sh Outdated

# Build pegjs parsers
function build_pegjs_parsers {
bin/pegjs --export-var haddock_parser --format globals -o ../haddock-parser.js ../../web/js/haddock-parser.pegjs
Collaborator

Does haddock-parser.pegjs need to be accessible at runtime? If not, then it shouldn't go in web.

Contributor Author

OK, what would be a better place for it?

Contributor Author

Moved to the root of the project.

install.sh Outdated
@@ -255,5 +255,11 @@ run $BUILD/CodeMirror git checkout tags/5.43.0
run $BUILD/CodeMirror npm install
run $BUILD/CodeMirror npm install -s uglify-js git+ssh://git@github.com:angelozerr/CodeMirror-Extension.git

# Install pegjs

run $BUILD git clone https://github.com/pegjs/pegjs.git
Collaborator

You should be able to install directly using npm instead of git.

Contributor Author

Yes, I changed this.

@cdsmith
Collaborator

cdsmith commented Mar 3, 2019

Okay, this is the big important comment. You wrote: "I think main plan is to create two parsers, one for codeworld and one for haskell/haddock."

I do NOT think that should be the plan. I would be very disappointed if that happened. Haddock's hoogle output format is a subset of Haskell, and I'd like to see the same parser used for both. This is not just less code (in fact, it probably isn't much less), but also a guarantee that there won't be a permanent gap between feature sets of builtin code and user code. Is there anything specific you ran into that makes you think it wouldn't be feasible to use the same parser for both?

The one thing I'm worried about is handling layout. I've poked at this a little bit over the past week or so, and found a lot of requests but no answers for a good story with PEG.js and indentation-sensitive parsing.

@nixorn
Contributor Author

nixorn commented Mar 3, 2019

Yes, I understand that haddock output is a subset of Haskell. But I'm not sure about the CodeWorld language (is it a language?). For example:

  • codeworld: shipPic(ship(w), direction(w), effectiveThrust(w))
  • haskell: shipPic (ship w) (direction w) (effectiveThrust w)

I guess CodeWorld needs a different parser, because it uses parentheses and commas differently. Am I wrong?

@cdsmith
Collaborator

cdsmith commented Mar 3, 2019

The CodeWorld language is a subset of Haskell (with language extensions). If you just parse it as Haskell, it should parse correctly.

In your examples:

  1. ship(w) is the same as ship w, but just has extra parentheses. CodeWorld requires those parentheses, but Haskell allows them, too.
  2. shipPic(a, b, c) is just the function shipPic applied to one argument, the tuple (a, b, c). CodeWorld programmers consider this a function with three arguments, but it's the same thing. Just a different way of encoding multiple arguments, using product types rather than currying.
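
The two points above can be checked with a toy parser. The following is a minimal sketch, not the parser from this PR: `tokenize`, `parseExpr`, and the AST shape (`name`, `fn`/`arg`, `tuple`) are hypothetical names chosen for illustration, and the grammar covers only identifiers, parentheses, and commas. It treats juxtaposition as function application, drops redundant parentheses, and turns a comma-separated group into a tuple.

```javascript
// Tokenize identifiers, parentheses and commas; any other character is ignored.
function tokenize(src) {
  return src.match(/[A-Za-z_][A-Za-z0-9_']*|[(),]/g) || [];
}

// Parse a single expression into a small AST (hypothetical shape):
//   { name }            an identifier
//   { fn, arg }         function application (juxtaposition, left-associative)
//   { tuple: [...] }    a parenthesized, comma-separated group
function parseExpr(src) {
  const toks = tokenize(src);
  let pos = 0;

  function atom() {
    const t = toks[pos++];
    if (t === "(") {
      const items = [expr()];
      while (toks[pos] === ",") {
        pos++;
        items.push(expr());
      }
      if (toks[pos++] !== ")") throw new SyntaxError("expected )");
      // A single parenthesized expression is just that expression.
      return items.length === 1 ? items[0] : { tuple: items };
    }
    if (t === undefined || t === ")" || t === ",") {
      throw new SyntaxError("unexpected " + t);
    }
    return { name: t };
  }

  function expr() {
    let e = atom();
    // Keep applying to following atoms until a closing token or end of input.
    while (pos < toks.length && toks[pos] !== ")" && toks[pos] !== ",") {
      e = { fn: e, arg: atom() };
    }
    return e;
  }

  const result = expr();
  if (pos !== toks.length) throw new SyntaxError("trailing input");
  return result;
}
```

With this sketch, `parseExpr("ship(w)")` and `parseExpr("ship w")` produce the same tree, while `parseExpr("shipPic(a, b, c)")` is an application of `shipPic` to one tuple argument. The CodeWorld and Haskell spellings from the earlier example both parse under the same grammar; they simply yield a tuple application and a curried application respectively.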

@nixorn
Contributor Author

nixorn commented Mar 3, 2019

OK, I got it: there should be one parser for all kinds of input. I'll continue working on it. It's good that we clarified this.

@nixorn
Contributor Author

nixorn commented Mar 4, 2019

About layout-sensitive parsing: I'll try to borrow the solution from here: http://hackage.haskell.org/package/haskell-src-1.0.3.0/docs/src/Language-Haskell-ParseMonad.html#getSrcLoc
Inside the parser I have access to the parsing context via the location and text functions, so I can split the source into lines, cache the result, and calculate the indent level from the location (line index and column index). I think it is possible (or I don't understand the problem).
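
As a rough illustration of the line-cache idea described above, here is a minimal sketch in plain JavaScript. The names `linesOf` and `indentOfLine` are hypothetical; the only assumption about PEG.js is that a parser action can obtain a 1-based line number from `location().start.line`. It only looks up the indent of a line and does not by itself make the grammar layout-aware.

```javascript
// Cache of split lines, keyed by the source text, so repeated lookups
// during one parse do not re-split the input.
const lineCache = new Map();

function linesOf(source) {
  if (!lineCache.has(source)) {
    lineCache.set(source, source.split(/\r?\n/));
  }
  return lineCache.get(source);
}

// Number of leading whitespace characters on a 1-based line number.
// Tabs count as one character each; a real implementation would have
// to decide how to expand them.
function indentOfLine(source, line) {
  const text = linesOf(source)[line - 1];
  if (text === undefined) {
    throw new RangeError("no such line: " + line);
  }
  return text.length - text.trimStart().length;
}
```

Inside a grammar action this might be called as `indentOfLine(input, location().start.line)`, where `input` stands for the full source text made available to the action (an assumption about how the grammar is wired up).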

@cdsmith
Collaborator

cdsmith commented Mar 4, 2019

I'm not sure what you mean by getSrcLoc, which looks like just a simple accessor into parser state. It tells you the current line and column, but you also need to include changes in the indent in the parsing itself.

There are a few cases here, and it might get pretty complex. That's okay, because a good answer here is worth the work and lays the foundation for lots of cool features in the future.

Problem 1: Parsing correctly at all. You will need to handle layout algorithms to implement correct parsing. For example, this:

foo = 5
    bar = 5

should be a syntax error, because the implied layout in the first column makes bar = 5 part of the previous line.

Problem 2: Recovery. The fact that there's a syntax error on the current line (which is probably only half-typed anyway) shouldn't prevent autocomplete and docs from including symbols in the rest of the code. That means you need to isolate syntax errors and only ignore the sections with the error. Layout is probably the easiest way to do this. If you first split up the module into layout sections, then you can separately parse each one, and if it fails, drop it without ignoring the rest of the input.
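
The split-then-parse approach described above could look roughly like the following. This is a minimal sketch under simplifying assumptions: a new top-level section starts at every non-blank line that begins in column 0, comments and string literals are ignored, and `toyParse` is a hypothetical stand-in for the real declaration parser that accepts only `name = value` with a single `=`.

```javascript
// Split a module into top-level layout sections. A new section starts at
// each non-blank line that begins in column 0; indented and blank lines
// are attached to the section before them.
function splitSections(source) {
  const sections = [];
  for (const line of source.split(/\r?\n/)) {
    const startsSection = line.length > 0 && !/^\s/.test(line);
    if (startsSection || sections.length === 0) {
      sections.push([line]);
    } else {
      sections[sections.length - 1].push(line);
    }
  }
  return sections.map((lines) => lines.join("\n"));
}

// Parse every section independently and drop the ones that fail, so that
// one broken (perhaps half-typed) declaration does not hide the rest.
function parseWithRecovery(source, parseSection) {
  const parsed = [];
  for (const section of splitSections(source)) {
    if (section.trim() === "") continue; // nothing to parse
    try {
      parsed.push(parseSection(section));
    } catch (err) {
      // Syntax error in this section only: skip it and carry on.
    }
  }
  return parsed;
}

// Hypothetical stand-in parser: accepts `name = value` containing exactly
// one `=` and returns the declared name.
function toyParse(section) {
  const m = /^([a-z]\w*)\s*=\s*([^=\s][^=]*)$/.exec(section.trim());
  if (!m) throw new SyntaxError("cannot parse: " + section);
  return m[1];
}
```

For the input `"foo = 5\n    bar = 5\nbaz = 1\nqux =\nquux = 2\n"`, the indented `bar = 5` line is attached to the `foo` section, which then fails to parse as a whole (Problem 1), and the half-typed `qux =` is dropped (Problem 2), leaving the names `baz` and `quux` available.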

@nixorn
Contributor Author

nixorn commented Mar 5, 2019

I think I understood the problem.
This tl;dr issue contains a lot of good detail about the problem, and some solutions that use custom state to handle layout with peg.js. But these solutions are more like hacks, with potentially dangerous side effects. At the end of the discussion, people recommend using other parsers.

Also I found this and this. The last one looks too verbose, but it works.

I can either continue with peg.js or try parsimmon/chevrotain. What do you think?

@cdsmith
Collaborator

cdsmith commented Mar 5, 2019

I can look into this, but honestly, it'll be a couple of days before I get to it. Can you evaluate some options until then? Another that I ran across while searching is using nearley (for parsing) and moo (for lexing). There's documentation in the nearley docs for how to handle indentation with that setup.

@nixorn
Contributor Author

nixorn commented Mar 5, 2019

OK, I'll make a minimal example for each of these.

@nixorn
Contributor Author

nixorn commented Mar 6, 2019

Also, Nearley + Moo come last in this benchmark: https://sap.github.io/chevrotain/performance/

@nixorn
Contributor Author

nixorn commented Mar 6, 2019

I like moo: it has a neat interface and handles indents well. https://gist.github.com/nixorn/f88fb194bd92b4fee3f351e689e99e0a
It is also possible to use regexes to define tokens, which is convenient. PEG.js uses its own implementation of character classes.
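
For readers unfamiliar with indentation-aware lexing, the general technique can be sketched without any library. The code below is not the linked gist and does not use the moo API; `layoutTokens` and its token shapes are hypothetical. It keeps a stack of open indentation levels and emits `indent`/`dedent` tokens in the Python style. Haskell's layout rule differs (blocks are opened by keywords such as `where`, `let`, `of`, and `do`), so this only illustrates the indent-stack idea.

```javascript
// Turn source text into a flat token list with explicit indent/dedent
// markers, using a stack of currently open indentation widths.
function layoutTokens(source) {
  const tokens = [];
  const indents = [0];
  for (const line of source.split(/\r?\n/)) {
    if (line.trim() === "") continue; // blank lines do not affect layout
    const width = line.length - line.trimStart().length;
    if (width > indents[indents.length - 1]) {
      // Deeper than the current block: open a new one.
      indents.push(width);
      tokens.push({ type: "indent" });
    } else {
      // Close every block that is deeper than this line.
      while (width < indents[indents.length - 1]) {
        indents.pop();
        tokens.push({ type: "dedent" });
      }
      if (width !== indents[indents.length - 1]) {
        throw new SyntaxError("dedent does not match any open block");
      }
    }
    // Whitespace-separated words stand in for real tokens here.
    for (const word of line.trim().split(/\s+/)) {
      tokens.push({ type: "word", text: word });
    }
    tokens.push({ type: "newline" });
  }
  // Close any blocks still open at end of input.
  while (indents.length > 1) {
    indents.pop();
    tokens.push({ type: "dedent" });
  }
  return tokens;
}
```

A parser built on top of such a token stream can then treat `indent` and `dedent` like opening and closing braces, which keeps the grammar itself context-free.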

@nixorn
Contributor Author

nixorn commented Mar 6, 2019

This is a chevrotain lexer for Python-like indentation: https://github.com/SAP/chevrotain/blob/master/examples/lexer/python_indentation/python_indentation.js

Advantages of chevrotain

  1. It is much faster (according to benchmarks).
  2. It doesn't require a build step, so the lexer and parser can live in a single file, without generating a parser or touching build.sh/install.sh.

Disadvantages

  1. If we ever add a JS build tool to the project, the chevrotain code will be broken by it.
  2. It looks verbose.

@nixorn
Contributor Author

nixorn commented Mar 7, 2019

Sample of a chevrotain-based lexer: https://gist.github.com/nixorn/a8f1acb9c957f513e6072c0fcde8141b . Now trying the chevrotain parser.


2 participants