Paragraph Recognition

Engelbert Niehaus edited this page Aug 15, 2018 · 19 revisions

Way forward to implement paragraph support

To follow the wtf_wikipedia structure the following implementation is suggested.

  • Create a subdirectory src/paragraph
  • A section basically has a title, a depth and paragraphs.
  • Output to HTML, LaTeX, ... generates the section, subsection, ... heading from the title according to the depth and then iterates over the paragraphs.
  • A paragraph consists of an array of content elements that are stored in order of appearance.
  • The content elements of a section are just an array of paragraphs, while the content elements of a paragraph are images, mathematical expressions, tables, lists, ...
  • Output generation checks the type of each content element and determines the appropriate method for output generation.

The implementation of a paragraph is related to a more generic element of an Abstract Syntax Tree (AST) called ContentList. The following explanation describes how a paragraph can be implemented as a ContentList of type Paragraph. Furthermore, it shows how all parsed elements (List, Table, ...) can be described as extensions of a ContentList. Even a Section can be described as an extension of ContentList.

Order of Content Elements is lost during parsing - Release 5.0 and before

During parsing of the wiki source, the order of the content elements gets lost. The introduction of the ContentList fixes that. The following example shows the loss of page order:

==Soccer==
The soccer game consists of the following components:
* 2 Teams with 11 players each,
* 3 referees
The game last 90 min.

Release 5.0 renders this source to the following HTML, in which the order of the text blocks is lost:

<h1>Soccer</h1>
<ul>
  <li>2 Teams with 11 players each,</li>
  <li>3 referees</li>
</ul>
The soccer game consists of the following components:
The game last 90 min.

The introductory text The soccer game ... must logically appear before the list, and the concluding remark must appear after the list, for the section to be comprehensible to the reader. The order of appearance must therefore be preserved by the paragraphs and, in general, on every level of the Abstract Syntax Tree (AST). Even if paragraphs are not introduced in wtf_wikipedia, the order of appearance must be preserved in a ContentList:

  • AST Type: SectionHeader value: Soccer
  • AST Type: TextBlock value: The soccer game consists of the following components:
  • AST Type: List
  • AST Type: TextBlock value: The game last 90 min.

The key challenge is to preserve the order of content elements in a Section or Paragraph object.
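The order-preserving list above can be sketched as follows. This is only an illustration: the function name parseInOrder() and the node layout are assumptions made for this example and are not part of wtf_wikipedia. It handles only `==` headers, `*` list items and plain text lines.

```javascript
// Hypothetical sketch: walk the section line by line and push every
// content element onto ONE array, so the source order is kept.
function parseInOrder(wiki) {
  const nodes = [];
  let list = null; // the list currently being collected, if any
  for (const line of wiki.split('\n')) {
    const header = line.match(/^==\s*(.*?)\s*==$/);
    if (header) {
      list = null;
      nodes.push({ type: 'SectionHeader', value: header[1] });
    } else if (line.startsWith('*')) {
      if (!list) {
        // first item of a new list: the list node takes its place in the order
        list = { type: 'List', items: [] };
        nodes.push(list);
      }
      list.items.push(line.replace(/^\*\s*/, ''));
    } else {
      list = null; // any non-list line ends the current list
      if (line.trim() !== '') {
        nodes.push({ type: 'TextBlock', value: line.trim() });
      }
    }
  }
  return nodes;
}

const soccer = [
  '==Soccer==',
  'The soccer game consists of the following components:',
  '* 2 Teams with 11 players each,',
  '* 3 referees',
  'The game last 90 min.'
].join('\n');

const order = parseInOrder(soccer).map(node => node.type);
console.log(order.join(' > '));
// SectionHeader > TextBlock > List > TextBlock
```

Because every element is pushed in the moment it is encountered, the list stays between the two text blocks instead of being collected separately.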

Proposed Steps to implement Paragraphs

  • change doSection() in file /src/section/index.js:
const paragraph_reg = /\n\s*\n\s*/g; // two or more newlines (optional blanks, tabs, ... in between) -> paragraph break
//const paragraph_reg = /\n[\s\S]*\n/g; // not usable: [\s\S]* is greedy and matches everything between the first and the last newline

const doSection = function(section, wiki, options) {
  // parse XML templates
  wiki = parse.xmlTemplates(section, wiki, options);
  // parse out all {{templates}}
  wiki = parse.templates(section, wiki, options);

  // the aggregation of references is currently done on the section level
  // * to preserve the design of Spencer, provide 'section' as a parameter of paragraph parsing
  // * handle the <ref></ref> tags on deeper levels of the AST (Abstract Syntax Tree) with
  // wiki = parse.references(section, wiki, options);

  // now split the paragraphs and add them to the ContentList
  let split = wiki.split(paragraph_reg); //.filter(s => s);
  let paragraphs = new ContentList();
  for (let i = 0; i < split.length; i++) {
    let paragraph = {
      type: 'paragraph',
      contentlist: new ContentList()
    };
    // the contentlist of a paragraph can contain different types of content elements:
    //   "table", "list", "image", "math", ...
    const content = split[i] || '';
    // section is a parameter of doParagraph, so that references and citations can be handled
    // on deeper levels of parsing the AST and it is still possible to add references and citations
    // to the section the paragraph belongs to.

    // parse the content of the paragraph and populate the paragraph.contentlist
    paragraph = doParagraph(section, paragraph, content, options);
    // add the parsed paragraph to the contentlist
    paragraphs.push(paragraph);
    // push is a method of ContentList, to emulate the expected behaviour of arrays
  }
  return paragraphs;
};
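The splitting step can be tried in isolation. The following is a minimal sketch, assuming that one or more blank lines (possibly containing blanks or tabs) separate two paragraphs; the helper name splitParagraphs() and the sample text are made up for this example.

```javascript
// Assumption: a paragraph break is a newline, optional whitespace, and another newline.
const paragraphReg = /\n\s*\n\s*/g;

// Hypothetical helper: returns the non-empty paragraphs in order of appearance.
function splitParagraphs(wiki) {
  return wiki
    .split(paragraphReg)
    .map(part => part.trim())
    .filter(part => part !== '');
}

const sample = 'First paragraph.\n\nSecond, line one\nSecond, line two\n \n\nThird paragraph.';
const parts = splitParagraphs(sample);
console.log(parts.length); // 3
console.log(parts[1]);     // the two "Second" lines, still joined by a single newline
```

A single newline does not end a paragraph, so a list that directly follows a text line stays in the same paragraph as that line.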

ContentList

In an object-oriented view, the Paragraph and Section classes extend the class ContentList. The Section class has a content list that stores paragraphs only, while a Paragraph is just a content list without the section.title and section.depth attributes. The difference mainly appears when the output is rendered, e.g. in LaTeX or HTML. In HTML, for example, a paragraph is wrapped in a p-tag:

<p>My paragraph rendered in HTML</p>

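The inheritance described above could look like the following sketch. The class layout, the internal elements array and the mapping from depth to heading level (depth 0 becomes h1) are assumptions for illustration only. HTML escaping is omitted.

```javascript
// Hypothetical base class: keeps content elements in order of appearance.
class ContentList {
  constructor() {
    this.elements = [];
  }
  push(element) {
    this.elements.push(element);
    return this;
  }
  get length() {
    return this.elements.length;
  }
  // a plain ContentList just concatenates the output of its elements
  toHTML() {
    return this.elements.map(element => element.toHTML()).join('');
  }
}

// Simplest possible content element for this example.
class TextBlock {
  constructor(value) {
    this.value = value;
  }
  toHTML() {
    return this.value;
  }
}

// A Paragraph is a ContentList that wraps its output in a p-tag.
class Paragraph extends ContentList {
  toHTML() {
    return '<p>' + super.toHTML() + '</p>';
  }
}

// A Section is a ContentList with the additional attributes title and depth.
class Section extends ContentList {
  constructor(title, depth) {
    super();
    this.title = title;
    this.depth = depth;
  }
  toHTML() {
    const level = this.depth + 1; // assumption: depth 0 is rendered as h1
    return `<h${level}>${this.title}</h${level}>` + super.toHTML();
  }
}

const paragraph = new Paragraph();
paragraph.push(new TextBlock('My paragraph rendered in HTML'));
const section = new Section('Soccer', 0);
section.push(paragraph);
console.log(section.toHTML());
// <h1>Soccer</h1><p>My paragraph rendered in HTML</p>
```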
So the ContentList object could be created in the following way:

  • Create a subdirectory src/contentlist.
  • The content list is defined in the file src/contentlist/ContentList.js.
  • Paragraphs inherit all methods and attributes from the content list.
  • The method doSection() splits the section into paragraphs and stores the paragraphs in order of appearance in the content list.
  • The methods that generate plain text, HTML, LaTeX, ... output depend on the content element type.
  • Output generation calls the output method of each content element of the ContentList. In classical JavaScript syntax it will look like this:
  
  let contentlist = new ContentList();
  // here populate the contentlist

  // mypar1, mypar2, ... are placeholders for optional, format-specific parameters
  const toHTML = function(mypar1, mypar2 /*, ... */) {
    let out = '';
    for (let i = 0; i < contentlist.length; i++) {
      out += contentlist[i].toHTML();
    }
    return out;
  };

  const toLatex = function(mypar1, mypar2 /*, ... */) {
    let out = '';
    for (let i = 0; i < contentlist.length; i++) {
      out += contentlist[i].toLatex();
    }
    return out;
  };

  const toMarkdown = function(mypar1, mypar2 /*, ... */) {
    let out = '';
    for (let i = 0; i < contentlist.length; i++) {
      out += contentlist[i].toMarkdown();
    }
    return out;
  };
 

To implement the content list generically, with a software design that allows adding new output formats, the code could be refactored in the following way (see the Pandoc website for other possible output formats).

  
  let contentlist = new ContentList();
  // here populate the contentlist

  // mypar1, mypar2, ... are placeholders for optional, format-specific parameters
  const toOutput = function(format, mypar1, mypar2 /*, ... */) {
    let out = '';
    for (let i = 0; i < contentlist.length; i++) {
      out += contentlist[i].toOutput(format);
    }
    return out;
  };
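One possible way to keep the set of output formats open is a lookup table from format to node type to render function. The following sketch is an assumption about how such a design could look and is not taken from wtf_wikipedia. Plain arrays stand in for ContentList, and the names renderers and renderNode are hypothetical.

```javascript
// Hypothetical renderer table: format -> node type -> render function.
// Adding a new output format means adding one more key here.
const renderers = {
  html: {
    TextBlock: node => node.value,
    Paragraph: (node, render) => '<p>' + node.contentlist.map(render).join('') + '</p>'
  },
  latex: {
    TextBlock: node => node.value,
    Paragraph: (node, render) => node.contentlist.map(render).join('') + '\n\n'
  }
};

// Looks up the renderer for the node type and recurses into the child nodes.
function toOutput(node, format) {
  const byType = renderers[format];
  if (!byType) {
    throw new Error('Unsupported format: ' + format);
  }
  const renderNode = byType[node.type];
  if (!renderNode) {
    throw new Error('No ' + format + ' renderer for node type ' + node.type);
  }
  return renderNode(node, child => toOutput(child, format));
}

const par = {
  type: 'Paragraph',
  contentlist: [{ type: 'TextBlock', value: 'The game last 90 min.' }]
};
console.log(toOutput(par, 'html'));
// <p>The game last 90 min.</p>
```

With this layout the tree nodes stay plain data, and an unsupported format or node type fails loudly instead of silently producing empty output.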
  

Even the section headers can be modelled as the first content element of a ContentList. The introduction of ContentList

  • fixes the loss of content element order on the section level
  • builds a generic structure for building the Abstract Syntax Tree (AST)

List as generic ContentList

The ContentList provides a generic structure for the Abstract Syntax Tree (AST). Referring to the example above, we have to perform 3 major steps:

  • open the bullet list or enumeration
  • create a content element in the ContentList for each item of the bullet list or enumeration and populate the content list with the items of the list
  • close the bullet list or enumeration
  //let bulletlist = new ContentList();
  let bulletlist = createNode4AST("BulletList");
  //Attribute: bulletlist.type = "bulletlist"
  bulletlist.push(createNode4AST("OpenBulletList"))
  bulletlist.push(createNode4AST("ItemBulletList",parseContentList("2 Teams with 11 players each,")))
  bulletlist.push(createNode4AST("ItemBulletList",parseContentList("3 referees")))
  bulletlist.push(createNode4AST("CloseBulletList"))

Because bulletlist.type = "bulletlist" already marks the contained items as bullet items, separate tree nodes for OpenBulletList and CloseBulletList may seem unnecessary. The tree node OpenBulletList can be used to store formatting attributes if desired, but all attributes of the bullet list may just as well be stored as attributes of the tree node BulletList.
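One use of explicit open and close nodes is that output generation becomes a plain concatenation over the content list. A minimal sketch, assuming plain arrays in place of ContentList and a value attribute on each item (both are assumptions for this example):

```javascript
// Hypothetical HTML renderers for the three node types inside a BulletList.
const htmlByType = {
  OpenBulletList: () => '<ul>\n',
  ItemBulletList: node => '  <li>' + node.value + '</li>\n',
  CloseBulletList: () => '</ul>\n'
};

// Concatenates the output of all nodes in order of appearance.
function bulletListToHTML(list) {
  return list.contentlist.map(node => htmlByType[node.type](node)).join('');
}

const bulletlist = {
  type: 'BulletList',
  contentlist: [
    { type: 'OpenBulletList' },
    { type: 'ItemBulletList', value: '2 Teams with 11 players each,' },
    { type: 'ItemBulletList', value: '3 referees' },
    { type: 'CloseBulletList' }
  ]
};

console.log(bulletListToHTML(bulletlist));
// <ul>
//   <li>2 Teams with 11 players each,</li>
//   <li>3 referees</li>
// </ul>
```

Without the open and close nodes, the renderer for BulletList itself would have to emit the surrounding tags, which works equally well.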

Create a Node for the AST

The method createNode4AST() creates a very simple node for the Abstract Syntax Tree (AST) by returning a hash with just the type attribute. This AST node can be populated with additional attributes that may be relevant for the generation of output formats.

  const createNode4AST = function(nodeid) {
    return {
               "type":nodeid
           }
  }

A tree node for Paragraph will be created with createNode4AST("Paragraph") and populated with more content. A node-specific constructor could use a switch statement to add type-specific attributes.

  const createNode4AST = function(nodeid) {
    let ast_node = {
      "type": nodeid
    };
    switch (nodeid) {
      // stacked case labels: each of these types gets an empty content list
      case "Paragraph":
      case "BulletList":
      case "EnumList":
      case "TextBlock":
      case "Sentence":
        ast_node.contentlist = new ContentList();
        break;
      case "Section":
        ast_node.title = "";
        ast_node.depth = -1;
        ast_node.contentlist = new ContentList();
        break;
      default:
        // no additional attributes for other node types
    }
    return ast_node;
  };
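Other calls on this page pass a second argument to createNode4AST(), e.g. createNode4AST("ItemBulletList", parseContentList(...)), which the one-parameter definitions ignore. The following sketch shows one way the constructor could accept already parsed content. The optional second parameter, the use of plain arrays instead of ContentList and the treatment of ItemBulletList as a container are all assumptions for this example.

```javascript
// Assumption: these node types carry a content list.
const CONTAINER_TYPES = [
  'Paragraph', 'BulletList', 'EnumList', 'TextBlock', 'Sentence', 'ItemBulletList'
];

// Hypothetical variant with an optional, already parsed content list.
function createNode4AST(nodeid, contentlist) {
  const astNode = { type: nodeid };
  if (nodeid === 'Section') {
    astNode.title = '';
    astNode.depth = -1;
    astNode.contentlist = contentlist || [];
  } else if (CONTAINER_TYPES.includes(nodeid)) {
    astNode.contentlist = contentlist || [];
  }
  return astNode;
}

const item = createNode4AST('ItemBulletList', [{ type: 'Sentence', value: '3 referees' }]);
const open = createNode4AST('OpenBulletList');
console.log(item.contentlist.length); // 1
console.log('contentlist' in open);   // false
```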

parseContentList() Method

var n1 = createNode4AST("ItemBulletList",parseContentList("2 Teams with 11 players each,"))
var n2 = createNode4AST("ItemBulletList",parseContentList("3 referees"))

The method parseContentList() decomposes a string into tree nodes of the AST. Parsing a ContentList or even a TextBlock is not necessary in the example above, because the TextBlock contains just one sentence. If an item contains a substructure of the AST, e.g. a TextBlock, then parseContentList() will split the TextBlock into a ContentList of sentences. The basic example from above is used again as the wiki source to be parsed:

==Soccer==
The soccer game consists of the following components:
* 2 Teams with 11 players each,
* 3 referees
The game last 90 min.

Parsing will create the following AST of the section above:

  • AST Type: SectionHeader value: Soccer
  • AST Type: TextBlock value: The soccer game consists of the following components:
  • AST Type: OpenBulletList
  • AST Type: ItemBulletList value: 2 Teams with 11 players each,
  • AST Type: ItemBulletList value: 3 referees
  • AST Type: CloseBulletList
  • AST Type: TextBlock value: The game last 90 min.

This example is a linear concatenation of content elements in a ContentList, which somewhat violates the syntactic structure of the document. It is therefore recommended to design the BulletList as an extension of the ContentList:

  • AST Type: SectionHeader value: Soccer
  • AST Type: TextBlock value: The soccer game consists of the following components:
  • AST Type: BulletList as extension of ContentList
    • AST Type: OpenBulletList
    • AST Type: ItemBulletList value: 2 Teams with 11 players each,
    • AST Type: ItemBulletList value: 3 referees
    • AST Type: CloseBulletList
  • AST Type: TextBlock value: The game last 90 min.
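Written as data, the nested structure could look like the following sketch. The attribute names value and contentlist, the enclosing Section node and the helper outline() are assumptions for illustration. Plain arrays stand in for ContentList.

```javascript
// The soccer section as a nested AST (hypothetical node layout).
const soccerAST = {
  type: 'Section',
  contentlist: [
    { type: 'SectionHeader', value: 'Soccer' },
    { type: 'TextBlock', value: 'The soccer game consists of the following components:' },
    {
      type: 'BulletList',
      contentlist: [
        { type: 'OpenBulletList' },
        { type: 'ItemBulletList', value: '2 Teams with 11 players each,' },
        { type: 'ItemBulletList', value: '3 referees' },
        { type: 'CloseBulletList' }
      ]
    },
    { type: 'TextBlock', value: 'The game last 90 min.' }
  ]
};

// Recursively lists the node types, indented by tree depth.
function outline(node, depth = 0) {
  const label = '  '.repeat(depth) + node.type + (node.value ? ': ' + node.value : '');
  const children = node.contentlist || [];
  return [label].concat(...children.map(child => outline(child, depth + 1)));
}

console.log(outline(soccerAST).join('\n'));
```

The same recursive walk is what an output generator would perform, with the label replaced by format-specific output.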

Difference between Paragraph, TextBlock and ContentList

  • A Paragraph is rendered differently than a plain ContentList. The ContentList, as a list of content elements, just concatenates the generated output of its elements, while a Paragraph wraps the generated output, e.g. with a p-tag for HTML.
  • A TextBlock is basically a ContentList of Sentences, Citations and References. Tables, Infoboxes, ... have already been parsed at this point. Whether inline images are allowed as content elements of the content list of a Sentence must be decided by the wtf_wikipedia maintainers (see Images). Inline images are helpful for text comprehension, as in the following wiki source:
The icon [[File:warning_icon.png]] visualizes a warning in the upcoming paragraph.
...
* [[File:warning_icon.png]] be aware of chemical X, it is nephrotoxic and destroys the kidney.
* self-protection should be applied with Y ...
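The description of a TextBlock as a ContentList of Sentences can be illustrated with a deliberately naive sketch. The function name textBlockFromString() and the splitting rule are assumptions. The rule simply cuts after ".", "!" or "?", so abbreviations such as "min." or "e.g." would be split incorrectly, and citations, references and inline images are not handled.

```javascript
// Hypothetical sketch: split plain text into Sentence nodes.
function textBlockFromString(text) {
  // a run of non-terminators followed by any sentence terminators
  const parts = text.match(/[^.!?]+[.!?]*/g) || [];
  return {
    type: 'TextBlock',
    contentlist: parts
      .map(part => part.trim())
      .filter(part => part !== '')
      .map(value => ({ type: 'Sentence', value }))
  };
}

const block = textBlockFromString('Soccer is a team sport. The game lasts 90 minutes.');
console.log(block.contentlist.map(sentence => sentence.value));
// [ 'Soccer is a team sport.', 'The game lasts 90 minutes.' ]
```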

Tables as generic ContentList

  • A Table is a ContentList of a TableHeader and a TableBody
  • A TableHeader is a ContentList of THCell
  • A TableBody is a ContentList of TableRow
  • A TableRow is a ContentList of TableCell
  • A TableCell is a Sentence, a TextBlock or again a ContentList
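The nesting above can be rendered with a single recursive function. The following sketch assumes that a table holds a TableHeader and a TableBody, that plain arrays stand in for ContentList, and that cells carry a plain value. The sample data and the mapping to HTML tags are made up for this example.

```javascript
// Hypothetical mapping from node type to opening and closing HTML tags.
const tags = {
  Table: ['<table>', '</table>'],
  TableHeader: ['<thead><tr>', '</tr></thead>'],
  THCell: ['<th>', '</th>'],
  TableBody: ['<tbody>', '</tbody>'],
  TableRow: ['<tr>', '</tr>'],
  TableCell: ['<td>', '</td>']
};

// Container nodes render their children; leaf cells render their value.
function tableToHTML(node) {
  const [open, close] = tags[node.type];
  const inner = node.contentlist
    ? node.contentlist.map(child => tableToHTML(child)).join('')
    : node.value;
  return open + inner + close;
}

// Made-up sample table.
const table = {
  type: 'Table',
  contentlist: [
    {
      type: 'TableHeader',
      contentlist: [
        { type: 'THCell', value: 'Team' },
        { type: 'THCell', value: 'Players' }
      ]
    },
    {
      type: 'TableBody',
      contentlist: [
        {
          type: 'TableRow',
          contentlist: [
            { type: 'TableCell', value: 'Home' },
            { type: 'TableCell', value: '11' }
          ]
        }
      ]
    }
  ]
};

console.log(tableToHTML(table));
// <table><thead><tr><th>Team</th><th>Players</th></tr></thead><tbody><tr><td>Home</td><td>11</td></tr></tbody></table>
```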