Skip to content
This repository has been archived by the owner on Nov 4, 2019. It is now read-only.

Text parser 2

rsms edited this page Jan 10, 2011 · 22 revisions

Text parser 2

This is a sort of scrap book for the work on text-parser2.

Research approaches based on AST/TST

Research into using a universal AST (Abstract Syntax Tree) to represent a document. A complex language parser can provide a rich and complete AST while a simplistic parser can simply build a flat tree.

Example implementation of a patching AST parser for JavaScript

Current work is done in the text-parser-2 branch.

Outline:

  • The text parser runs in the embedded nodejs runtime thread.
  • Kod can send an "edited text" message to the parser at any given time.
  • The parser can send a "tree updated" message to kod at any given time.

AST format

We should strive for the most minimal representation, especially when it comes to memory performance when copying the tree and/or transforming it.

Required features:

  • Location in the source
  • Kind (e.g. "variable declaration", "binary operation", etc)
  • Value (i.e. part of the source string)
  • Enumerable children
  • Cheap transformation/access between nodejs and objc --OR-- objc-land need to be able to read the tree.

Example:

Source:

function foo (bar, x) {
  return this;
}

AST:

{ kind: 'root',
  children: [
    { kind: 'declaration.function',
      range: [0,42],
      children: [
        { kind: 'name', value: 'foo', range: [9,3] },
        { kind: 'declaration.parameters',
          range: [13,10],
          children: [
            { kind: 'name', value: 'bar', range: [1,3] },
            { kind: 'name', value: 'x',   range: [5,1] }
          ],
        },
        { kind: 'statement.return',
          range: [28,11],
          children: [
            { kind: 'expression.this', range: [7,4] }
          ]
        }
      ] // end of "children" of "declaration.function"
    }
  ]
}

This tree could be implemented using a plain data structure or by using stateful objects.

Source ranges:

Range are defined in the space of a node's parent (like rects in the cocoa view hierarchy). This way we achieve a relatively low complexity for both modifying and reading absolute source locations.

  • When modifying the tree with a positive length delta (addition):

    • Update the location of each direct sibling following the modified node
  • When modifying the tree with a negative length delta (removal):

    • First traverse the tree upwards and stop at the fisrt node which location is less than or equal to the affected node - number of characters removed
    • Update the location of each direct sibling following that node
  • When reading the absolute source location for a node in the tree, traverse the tree upwards starting at the node, and calculate the sum of parent node locations

  • When locating a node for an absolute source location, traverse the tree top-down, chosing nodes which location+length is beyond the location absolute source location

To illustrate this, let's insert a variable declaration at the start of our example source with the following properties:

  • Addition with a change delta of 19 (number of characters added)
  • Modified range is: [0,0] (added to the beginning of the document)

This is the source:

var hello = 43110;
function foo (bar, x) {
  return this;
}

With this information, we can derive the following properties of the edit:

  • The addition to the source is "var hello = 43110;\n"
  • Everything beyond location 19 need to have their locations updated with +19

Here's the AST for the new source:

{ kind: 'root',
  children: [
    { kind: 'declaration.variable'
      range: [0,17],
      children: [
        { kind: 'expression.assignment',
          range: [4,13],
          children: [
            { kind: 'name', value: 'hello', range: [0,5] },
            { kind: 'expression.literal.number', value: 43110, range: [8,5] },
          ]
        }
      ]
    },
    { kind: 'declaration.function',
      range: [19,42],                     // <-- changed from [0,42] to [19,42]
      children: [
        { kind: 'name', value: 'foo', range: [9,3] },
        { kind: 'declaration.parameters',
          range: [13,10],
          children: [
            { kind: 'name', value: 'bar', range: [1,3] },
            { kind: 'name', value: 'x',   range: [5,1] }
          ],
        },
        { kind: 'statement.return',
          range: [28,11],
          children: [
            { kind: 'expression.this', range: [7,4] }
          ]
        }
      ] // end of "children" of "declaration.function"
    }
  ]
}

Notice how the only thing needed updating was the location of the "declaration.function" node -- nothing else.

Informal protocol

Kod sending a "text edited" message to the parser system:

[ui -> parser] "text edited" {
                  document: [KDocument],
                  modifiedRange: [123, 4],
                  changeDelta: 4,
                  version: 12345
               }
  • document is a shared object which represents the document which text was altered. This document can be queried for:
    • language type
    • modification timestamp
    • complete source text
  • modifiedRange represent the character range in the source before the edit occured. E.g. removing "re" from "fred" results in the range [1,2] ("edit started at position 1 and extended 2 characters").
  • changeDelta tells how many characters where added or removed. In the above case of removing "re" from "fred", the value is -2 ("two characters where removed")
  • version is an opaque value which uniquely identifies this edit. The version is simply passed along back to the ui when the paser is done working.

This information should be sufficient to calculate the (probably domain-specific) effective extent of the edit. The parser should also be able to consult the AST on what node(s) of the tree was affected by the edit and need to be replaced or re-evaluated.

At this point in time, the parser should know the following:

  • What node in the AST need to be replaced
  • The full semantically complete range of characters in the new source which need to be parsed

Next, the parser simply parses the sbustring of the new source into a partial AST and finally replaces the old node with the new root node from the partial AST.

Picture yourself a tree... Imagine the AST is a tree in nature and the edit is a scratched branch on that tree. We need to cut of that branch from the tree, but we want to do as little damage as possible, thus we cut of the branch as far out as possible. Then we consult your local god to create a new branch without scratches and insert that one where we cut off the damaged branch.

Finally, the parser sends a message to Kod to inform "the UI" that the semantics of document has changed:

[parser -> ui] "ast changed" {
                  document: [KDocument],
                  affectedRange: [120, 12],
                  version: version,
                  ast: [ast]
               }
  • document is a shared object representing the document in question.
  • affectedRange represents a character range (in the current source text space) which the update affected.
  • version carries the opaque version value we received from the UI when starting our parsing.
  • ast is the new/current AST (TODO: maybe the ast could be a persistent property of the document instead so we don't need to pass it along all the time?)

When the UI receives a "ast changed" message it compares the current version of the document to the version received.

editAccepted = false
document.lock() // temporarily retains exclusive access to the document
currentVersion = document.version
if (currentVersion == version) {
  editAccepted = true
  updateAST(ast)
}
document.unlock()

If editAccepted is false, then another edit occured while the parser was working and the parser is currently busy interpreting that change. Here we need some way to coalesce edits. Here's one idea for an approach to a solution:

failedEdits = []
on("edit failed"), {
  failedEdits.push(affectedRange)
}
on("ast changed", {
  unionRange = calculateUnionOfRanges(failedEdits)
  failedEdits = []
  if (  unionRange.start < affectedRange.start
     || unionRange.end > affectedRange.end ) {
    // some of the failed edits where supposed to cause a change beyond what was
    // just changed. As there might be "dirty" parts outside of what wehere just
    // updated
    [ui -> parser] "text edited" {... modifiedRange: unionRange ...}
  }
}

The UI updates displayed text in the affectedRange to have its style updated to visually reflect the containg elements' semantic meaning (i.e. its kind, relation, etc). The UI can do other things as well, like updating a visual tree representation of the AST in which the user can click to jump between the visual tree and the source text.

Revision 2 of the informal protocol

We limit the number of threads involved to exactly 2: {main} and {nodejs}.

  • {main} makes changes to a document's text and enqueues "text updated" for {nodejs} in a sequential fashion using a shared atomic LIFO queue.
  • {nodejs} dequeues "text updated" events--when notified--from the shared LIFO.

This reasonable limitation would remove the need for version passing and parse-retrying. However, one problem still exists and is that of mutual access to a document's text.

One obvious solution is that {main} makes a complete copy of the text when enqueuing the "text edited" event. Such a solution have the following characteristics and effects:

  • Less performant as {main} need to block during the copy-ing of it's text
  • Removes the need for versioning and complex managed coalescing of edits, simpifying both the implementation and the API
  • Removes the need to implement a special v8 C++ wrapper object to allow mutual access to text between objc and v8, which could potentially be very messy since one would have to create a custom implementation of NSString where the backing memory is accessible.

Update on text copying performance: This has been implemented and tested in Kod with very optimistic-looking results -- passing and copying 400k characters long text from kod to node, let node read it and then send a "received" message back to kod takes about 11ms on a MacBook Pro 15 from 2009.

Given this revision, the API messages would instead look like this:

[ui -> parser] "text edited" {
                  document: [KDocument],
                  text: "complete current source...",
                  modifiedRange: [123, 4],
                  changeDelta: 4
               }
[parser -> ui] "ast changed" {
                  document: [KDocument],
                  affectedRange: [120, 12],
                  ast: [ast]
               }

After a successful parse in {nodejs}, the "ast changed" message need to be delayed by one runloop tick. This important since there might be queued "text edited" messages which need to be processed before we send "ast changed". Something like this:

# runloop notified to run an iteration

recv> edit1()                  # finQ: []        sendQ: []
push> fin1                     # finQ: [ fin1 ]  sendQ: []
if finQ.length == 0:
  # not happening since finQ isn't empty
  # ...

yield
  # a new edit was made while we processed edit1, thus the event got queued on
  # the runloop with a higher priority than our nextTick closure

recv> edit2()                  # finQ: [ fin1 ]  sendQ: []
pop>  fin1() -> res1 > sendQ   # finQ: []        sendQ: [ res1 ]
push> fin2                     # finQ: [ fin2 ]  sendQ: [ res1 ]
if finQ.length == 0:
  # not happening since finQ isn't empty
  # ...

yield
  # this time no new edit was made, but the runloop still triggers an iteration
  # since we queued a nextTick()

pop>  fin2() -> res1 > sendQ   # finQ: []        sendQ: [ res1, res2 ]
if finQ.length == 0:
  # this do happe since finQ is empty
  for each res in sendQ:
    res()
  clear sendQ                  # finQ: []        sendQ: []

yield
  # runloop goes into idle mode...

Which could be implemented like this:

var finQ = []
var sendQ = []

function dequeue() {
  var queuedFinalizer = finQ.shift()
  if (queuedFinalizer) {
    var result = queuedFinalizer()
    sendQ.push(result)
  }
  if (finQ.length == 0) {
    sendQ.forEach(function (result) {
      send(result)
    })
    sendQ = []
  }
}

on("text edited", message, {
  var finalizer = parse(message)
  finQ.push(finalizer)
  process.nextTick(dequeue)
})

Deriving meaning from plain text

As a last resort, kod-core could use a simple algorithm to derive a syntactical tree from any text type lacking an explicit parser. For instace, we could consider a written text structure of paragraphs, sentences, words, etc.

Example source:

Hello everyone. My name is Bulgur Röv and I like carrots.

When I'm not eating carrots I usually listen to death metal, or sometimes
even Beethoven. There are too few songs in the 16th-century-death-metal
genre.

Example derived structure:

Document
|
+-- Paragraph      "Hello everyone. My name is Bulgur Röv and I like 
|   |               carrots."
|   |
|   +-- Sentence   "Hello everyone."
|   |   |
|   |   +-- Word   "Hello"
|   |   |
|   |   +-- Word   "everyone"
|   |
|   + ...
|
+-- Paragraph      "When I'm not eating carrots I usually listen to death 
|   |               metal, or sometimes even Beethoven. There are too few
|   |               songs in the 16th-century-death-metal genre."
|   |
|   +-- Sentence   "When I'm not eating carrots I usually listen to death 
|   |   |           metal, or sometimes even Beethoven."
|   |   |
|   |   +-- Word   "When"
|   |
...

The above was implemented at Jan 8, 2011 with the following results (timings are in real/actual time):

  • Test matter was the monster.c file (500 kB, 18500 lines)
  • Parsing the complete file from scratch
  • Step 1: KDocument text to V8 string: 0.8ms (min 0.4ms, max 1.2ms)
  • Step 2: Calling into node from kod-core: 0.08ms (too low to be an accurate number -- probably faster)
  • Step 2: Parsing: 210ms
    • Resolving and locating an appropriate parser for the document by UTI
    • Tokenizing the complete source
    • Building an abstract syntax tree consisting of paragraphs and words

These preliminary results looks very promising and we will continue on this path with using lex -> parse -> build parsing.

Ideas

Mathematica-style semantic expansion of the cursor

image

In Mathematica frontend, a user can press a key [Ctrl+.], and the token the cursor is on will be selected (highlighted), when user presses the key again, the selection expands to highlight the next smallest semantic unit. When the key is pressed again, it extends further.

-- http://xahlee.org/emacs/syntax_tree_walk.html