Chapter 9: Parser combinators

Contents

  • Notes: Chapter notes and links to further reading related to the content in this chapter
  • FAQ: Questions and answers related to the chapter content. Feel free to add your own questions and/or answers here.

Notes

Different approaches to functional design

There are many different approaches to functional design. We call the approach we take in this chapter "algebraic design" to emphasize that we are mainly concerned with the abstract data types and laws of our API. The actual representation is a later concern, and we choose a representation that best facilitates our algebra. This perspective is common in category theory, where mathematical objects (or types, in the more concrete programming setting) are defined solely by their relationships to other objects, not by their "internal" structure.

Conal Elliott advocates an approach he calls denotational design; see also this nice exposition by Luke Palmer. In this style of design, the choice of a precise meaning or denotation (see denotational semantics) for the data types in our library guides the design process. This denotation is not (necessarily) the same as the concrete representation we ultimately select when implementing our library--that choice happens later. The denotation is instead a hopefully simple, precise mathematical object that lets us understand what the data types of our library and their various operations mean, in terms of how they transform these denotations. In this style of design, we begin with some initial idea of a denotation for our data type and refine this denotation in response to the operations we wish to support.
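
For instance, one simple denotation we might start from for a Parser[A] (a hypothetical sketch, not the chapter's final representation) is a function from the input to either a parse error or a result paired with the unconsumed input:

// A possible denotation for parsers: given the input string, a parser
// either fails with an error or produces a value of type A along with
// whatever input it did not consume. `ParseError` is the error type
// developed in this chapter.
type Parser[+A] = String => Either[ParseError, (A, String)]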

There is no "right" answer in approaches to the design of software, and the higher your perspective, the more the lines between different approaches blur.

Design issues in parser combinator libraries

Parser combinator libraries are very common in the FP world, and there are a lot of different variations. A few of the key design issues:

Error reporting: Generating good error messages is a delicate problem, and is easy to get wrong. The overall error-reporting strategy developed in this chapter is most similar to Parsec, a widely used parsing library in Haskell, though Parsec does not support the nested error messages we developed here. A key factor for good error reporting is having a clear model for when backtracking is allowed to occur and giving the programmer the ability to control this when specifying the grammar. We chose to adopt Parsec's convention that all parsers commit by default if they consume at least one character, and that the attempt combinator "undoes" this commit. Another choice is to not have parsers commit by default, and introduce a separate commit combinator (which is still "undone" by attempt). This has the same expressiveness but we find it is not usually the best default since most grammars require little lookahead. Interestingly, the grammar ends up looking almost the same with the introduction of an explicit commit: the leaf-level token parser is usually wrapped in commit, and higher levels of the grammar that require more lookahead use attempt.
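
For example, using the chapter's combinators (and assuming string literals are implicitly promoted to parsers, and that spaces parses zero or more whitespace characters), we might write:

// Both branches begin with "abra", so after parsing the first "abra"
// the first branch has consumed input and is committed by default.
// Wrapping it in `attempt` undoes that commit, so if the branch later
// fails (say on "cadabra"), `or` can still try the second branch.
val spell = attempt("abra" ** spaces ** "abra" ** spaces ** "cadabra") or
            ("abra" ** spaces ** "cadabra!")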

Input handling: Is the input type fixed to be String, or can parsers be polymorphic over any sequence of tokens? In this chapter we chose String for simplicity and speed. Separate tokenizers are not as commonly used with parser combinators--the main reason to use a separate tokenizer would be speed, which is usually addressed by building a few fast low-level primitives like regex (which checks whether a regular expression matches a prefix of the input string).
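
As a sketch (placed here inside a Parsers trait in the chapter's style; the exact signature is an illustration), the regex primitive might look like this:

import scala.util.matching.Regex

trait Parsers[Parser[+_]] {
  // Sketch of the `regex` primitive: succeeds with the matched prefix
  // when `r` matches a prefix of the remaining input, fails otherwise.
  implicit def regex(r: Regex): Parser[String]

  // Fast leaf-level token parsers can then be built in a single step:
  def digits: Parser[String] = regex("[0-9]+".r)
}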

Having the input type be an arbitrary Seq[A] results in a library that could in principle be used for other sorts of computations (for instance, we can "parse" a sum from a Seq[Int]). There are usually better ways of doing these other sorts of tasks (see the discussion of stream processing in part 4 of the book) and we have found that there isn't much advantage in generalizing the library we have here to other input types.

We can define a similar set of combinators for parsing binary formats. Binary formats rarely have to deal with lookahead, ambiguity, and backtracking, and have simpler error reporting needs, so often a more specialized binary parsing library is sufficient. Some examples of binary parsing libraries are Data.Binary in Haskell, which has inspired similar libraries in other languages, for instance scodec in Scala.

You may be interested in exploring how to generalize or adapt the library we wrote here to handle binary parsing as well (to get good performance, you may want to try using the @specialized annotation to avoid the overhead of boxing and unboxing that would otherwise occur).
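
As a hypothetical sketch (none of these names come from the chapter), a parser generalized over its token type might be declared like this:

// `@specialized(Byte, Char)` asks the compiler to generate unboxed
// versions of this trait for Byte and Char, so a binary parser running
// over an Array[Byte] need not box each input token.
trait Parser[@specialized(Byte, Char) T, +A] {
  def run(input: Array[T], offset: Int): Either[String, A]
}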

Streaming parsing and push vs. pull: It is possible to make a parser combinator library that can operate on huge inputs, larger than what is possible or desirable to load fully into memory. Related to this is the ability to produce results in a streaming fashion--for instance, constructing the parse result as the raw bytes of an HTTP request are being read. There are two general approaches here: allow the parser to pull more input from its input source, or invert control and push chunks of input to our parsers (this is the approach taken by the popular Haskell library attoparsec), which may report that they are finished, failed, or in need of more input after each chunk. We will discuss this more in part 4 when discussing stream processing.
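
The push-based shape can be sketched roughly as follows (a hypothetical illustration; attoparsec's actual types differ in detail):

// After each chunk of input is pushed, the parser reports one of
// three outcomes.
sealed trait Result[+A]
// Finished successfully, possibly with unconsumed input left over.
case class Done[+A](result: A, unconsumed: String) extends Result[A]
// Failed with an error message.
case class Fail(message: String) extends Result[Nothing]
// Needs more input: holds a continuation awaiting the next chunk
// (pushing an empty chunk can be used to signal end of input).
case class Partial[+A](push: String => Result[A]) extends Result[A]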

Handling of ambiguity and left-recursion: The parsers developed in this chapter scan input from left to right and use a left-biased or combinator, which selects the "leftmost" parse if the grammar is ambiguous--these are called LL parsers. This is a design choice. There are other ways to handle ambiguity in the grammar--in GLR parsing, roughly, the or combinator explores both branches simultaneously, and the parser can return multiple results if the input is ambiguous. (See Daniel Spiewak's series on GLR parsing, and also his paper.) GLR parsing is more complicated, there are difficulties around supporting good error messages, and we generally find that its extra capabilities are not needed in most parsing situations.

Related to this, the parser combinators developed here cannot directly handle left-recursion, for instance:

def expr = (int or double) or
           (expr ** "+" ** expr) or
           ("(" ** expr ** ")")

This will fail with a stack overflow error, since we keep recursing into expr without ever consuming any input. The invariant that prevents this is that every branch of a parser must consume at least one character before recursing. In this example, we can restore that invariant by simply factoring the grammar a bit differently:

def leaf = int or double or ("(" ** expr ** ")")
def expr = leaf.sep("+")

Here sep is a combinator that parses a list of elements, ignoring some delimiter ("separator") that comes between each element; it can be implemented in terms of many, as sketched below. This same basic approach can be used to handle operator precedence--for instance, if leaf were a parser that included binary expressions involving multiplication, this would make the precedence of multiplication higher than that of addition. Of course, it is possible to write combinators to abstract out patterns like this, and it is also possible to write a monolithic combinator that uses a more efficient (but specialized) algorithm to build operator parsers.
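
One possible implementation (a sketch using the chapter's map2, many, or, and succeed combinators, and assuming *> discards its left result):

// `sep1` parses one element followed by zero or more
// (separator, element) pairs, discarding the separators.
def sep1[A](p: Parser[A], separator: Parser[Any]): Parser[List[A]] =
  map2(p, many(separator *> p))(_ :: _)

// `sep` additionally accepts an empty list of elements.
def sep[A](p: Parser[A], separator: Parser[Any]): Parser[List[A]] =
  sep1(p, separator) or succeed(List())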

Other interesting areas to explore related to parsing

  • Monoidal and incremental parsing: This deals with the much harder problem of incrementally reparsing input in the presence of inserts and deletes, as might be desirable in a text editor or IDE. See Edward Kmett's articles on this as a starting point.
  • Purely applicative parsers: If we stopped before adding flatMap, we would still have a useful library--context-sensitivity is not often required, and there are some interesting tricks that can be played here. See this Stack Overflow answer for a discussion and links to more reading.

Single pass parsing

Lastly, we mentioned that the JSON parser we wrote in this chapter was rather dumb in that it only constructed a parse tree. This is unfortunate--we will typically then have to traverse the tree after the fact to extract some more meaningful type from it. We could instead build up whatever type we wish as we go. The general idea is to parameterize our grammar on functions that describe what to do at each level of the parse tree:

def array[A,B](value: Parser[A])(f: List[A] => B): Parser[B] = 
  "[" *> value.sep(",").map(f) <* "]"
  // `*>` ignores the left parse result, in this case the `"["`, 
  // and `<*` ignores the right parse result, in this case the `"]"`.

This parses a JSON array, but rather than building a JArray, it immediately converts the result to a B using the given function, where B might be some more meaningful type. The same strategy can be applied to the rest of the grammar.
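
For instance, assuming a double parser for numeric literals, we could sum an array of numbers during the parse, without ever building a JArray:

// Parses e.g. "[1,2,3]" directly to the sum of its elements.
val sumOfArray: Parser[Double] = array(double)(_.sum)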

FAQ