Paragraph Recognition
To follow the wtf_wikipedia structure, the following implementation is suggested.
- Create a subdirectory src/paragraph.
- A section basically has a title, a depth and paragraphs.
- Output to HTML, LaTeX, ... generates the section, subsection, ... heading with the title according to the depth and then iterates over the paragraphs.
- A paragraph consists of an array of content elements that are stored in order of appearance.
- The content elements of a section are just an array of paragraphs, while the content elements of a paragraph are images, mathematical expressions, tables, lists, ...
- Output generation checks the type of each content element and determines the appropriate method for output generation.
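The structure described in the list above can be sketched with plain JavaScript objects. This is only a minimal sketch and not the wtf_wikipedia implementation: the field names (title, depth, paragraphs, content), the element types "text" and "image", the mapping of depth 0 to h1, and the helper functions elementToHTML(), paragraphToHTML() and sectionToHTML() are assumptions made for illustration.

```javascript
// Hypothetical content elements; only "text" and "image" are handled in this sketch.
const elementToHTML = function (element) {
  switch (element.type) {
    case 'text':
      return element.value;
    case 'image':
      return '<img src="' + element.file + '">';
    default:
      return ''; // unknown element types are skipped
  }
};

// A paragraph renders its content elements in order of appearance.
const paragraphToHTML = function (paragraph) {
  return '<p>' + paragraph.content.map(element => elementToHTML(element)).join('') + '</p>';
};

// A section renders its heading according to the depth, then iterates over the paragraphs.
// Assumption: depth 0 maps to <h1>, as in the Soccer example of this document.
const sectionToHTML = function (section) {
  const level = Math.min(section.depth + 1, 6);
  let out = '<h' + level + '>' + section.title + '</h' + level + '>\n';
  for (const paragraph of section.paragraphs) {
    out += paragraphToHTML(paragraph) + '\n';
  }
  return out;
};

const section = {
  title: 'Soccer',
  depth: 0,
  paragraphs: [
    { content: [{ type: 'text', value: 'The game last 90 min.' }] }
  ]
};
console.log(sectionToHTML(section));
```

Supporting a new content element type only requires a new case in elementToHTML(); the section and paragraph loops stay unchanged.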
The implementation of a paragraph is related to a more generic element of an Abstract Syntax Tree (AST) called ContentList. The following explanation describes how a paragraph can be implemented as a ContentList of type Paragraph. Furthermore, it shows how all parsed elements, List, Table, ..., can be described as extensions of a ContentList. Even a Section can be described as an extension of ContentList.
During parsing of the wiki source, the order of the content elements gets lost. The introduction of the ContentList fixes that. The following example shows the loss of page order:
==Soccer==
The soccer game consists of the following components:
* 2 Teams with 11 players each,
* 3 referees
The game last 90 min.
In release 5.0 the output is rendered in HTML as follows, and the order of the text blocks is lost:
<h1>Soccer</h1>
<ul>
<li>2 Teams with 11 players each,</li>
<li>3 referees</li>
</ul>
The soccer game consists of the following components:
The game last 90 min.
Especially when the preceding text The soccer game ... must appear logically before the list and the concluding remarks must appear after the list to be comprehensible to the reader, the order of appearance must be preserved by the paragraphs and in general on every level of the Abstract Syntax Tree (AST). Even if paragraphs are not introduced in wtf_wikipedia, the order of appearance must be preserved in a contentlist. For the example above, the contentlist holds the following nodes in this order:
- AST Type: SectionHeader, value: Soccer
- AST Type: TextBlock, value: The soccer game consists of the following components:
- AST Type: List
- AST Type: TextBlock, value: The game last 90 min.
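How such an ordered node list could be produced is sketched below. The function parseBlocks() is hypothetical and far simpler than the real parser: it only recognizes level-2 headers, bullet lines and plain text lines, and it returns a plain array instead of a ContentList. The point it illustrates is that appending nodes while walking the source from top to bottom keeps the order of appearance.

```javascript
// Hypothetical line-based block parser: walks the wiki source top to bottom
// and appends one node per block, so the order of appearance is kept.
const parseBlocks = function (wiki) {
  const nodes = [];
  for (const line of wiki.split('\n')) {
    const text = line.trim();
    if (text === '') {
      continue; // blank lines carry no content in this sketch
    }
    const header = text.match(/^==\s*(.*?)\s*==$/);
    const last = nodes[nodes.length - 1];
    if (header) {
      nodes.push({ type: 'SectionHeader', value: header[1] });
    } else if (text.startsWith('*')) {
      // consecutive bullet lines are collected into one List node
      if (last && last.type === 'List') {
        last.items.push(text.slice(1).trim());
      } else {
        nodes.push({ type: 'List', items: [text.slice(1).trim()] });
      }
    } else {
      nodes.push({ type: 'TextBlock', value: text });
    }
  }
  return nodes;
};

const wiki = [
  '==Soccer==',
  'The soccer game consists of the following components:',
  '* 2 Teams with 11 players each,',
  '* 3 referees',
  'The game last 90 min.'
].join('\n');

console.log(parseBlocks(wiki).map(node => node.type));
```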
The key challenge is to preserve the order of content elements in a Section or Paragraph object.
- Change doSection() in the file /src/section/index.js:
// Assumed module locations, following the subdirectories proposed in this document.
const ContentList = require('../contentlist/ContentList');
const doParagraph = require('../paragraph');
// a blank line (two newlines with optional blanks, tabs, ... in between) separates two paragraphs
const paragraph_reg = /\n\s*\n/g;
const doSection = function(section, wiki, options) {
  // parse XML templates
  wiki = parse.xmlTemplates(section, wiki, options);
  // parse-out all {{templates}}
  wiki = parse.templates(section, wiki, options);
  // the aggregation of references is currently done on the section level
  // * to preserve the design of Spencer, provide 'section' as parameter of paragraph parsing
  // * handle the <ref></ref> tags on deeper levels of the AST (Abstract Syntax Tree) with
  // wiki = parse.references(section, wiki, options);
  // now split the paragraphs and add them to the ContentList
  let split = wiki.split(paragraph_reg); //.filter(s => s);
  let paragraphs = new ContentList();
  for (let i = 0; i < split.length; i++) {
    let paragraph = {
      type: 'paragraph',
      contentlist: new ContentList()
    };
    // the contentlist of a paragraph could contain different types of content elements:
    // "table", "list", "image", "math", ...
    const content = split[i] || '';
    // section is a parameter of doParagraph, so that references and citations can be handled
    // on deeper levels of parsing the AST and it is still possible to add references and
    // citations to the section the paragraph belongs to.
    // parse the content of the paragraph and populate the paragraph.contentlist
    paragraph = doParagraph(section, paragraph, content, options);
    // add the parsed paragraph to the contentlist;
    // push is a method of ContentList that emulates the expected behaviour of arrays
    paragraphs.push(paragraph);
  }
  return paragraphs;
};
In an object-oriented view, the Paragraph and Section classes extend the class ContentList. The Section class has a content list for storing paragraphs only, while a paragraph is just a contentlist without the section.title and section.depth attributes. The difference mainly appears when the output is rendered, e.g. in LaTeX or HTML, where a paragraph is wrapped in a p-tag:
<p>My paragraph rendered in HTML</p>
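The object-oriented view can be sketched with ES6 classes. This is a sketch under assumptions, not the actual ContentList of wtf_wikipedia: the internal elements array, the TextBlock helper class and the mapping of depth 0 to h1 are chosen only for illustration.

```javascript
// Sketch: a ContentList keeps its elements in order of appearance.
class ContentList {
  constructor() {
    this.elements = [];
  }
  // emulates Array.prototype.push
  push(element) {
    this.elements.push(element);
    return this.elements.length;
  }
  get length() {
    return this.elements.length;
  }
  // a plain ContentList just concatenates the output of its elements
  toHTML() {
    return this.elements.map(element => element.toHTML()).join('');
  }
}

// Hypothetical leaf element that holds plain text.
class TextBlock {
  constructor(value) {
    this.value = value;
  }
  toHTML() {
    return this.value;
  }
}

// A Paragraph is a ContentList whose output is wrapped in a p-tag.
class Paragraph extends ContentList {
  toHTML() {
    return '<p>' + super.toHTML() + '</p>';
  }
}

// A Section is a ContentList of paragraphs with a title and a depth.
class Section extends ContentList {
  constructor(title, depth) {
    super();
    this.title = title;
    this.depth = depth;
  }
  toHTML() {
    const level = this.depth + 1; // assumption: depth 0 is rendered as <h1>
    return '<h' + level + '>' + this.title + '</h' + level + '>' + super.toHTML();
  }
}

const paragraph = new Paragraph();
paragraph.push(new TextBlock('My paragraph rendered in HTML'));
const section = new Section('Soccer', 0);
section.push(paragraph);
console.log(section.toHTML());
```

Only the toHTML() methods differ between the classes; storage and ordering are inherited from ContentList.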
So the object ContentList could be generated in the following way:
- Create a subdirectory src/contentlist.
- The content list is defined in the file src/contentlist/ContentList.js.
- Paragraphs inherit all the methods and attributes from the contentlist.
- The method doSection() splits the section into paragraphs and stores the paragraphs in order of appearance in the contentlist.
- The methods for generating output in plain text, HTML, LaTeX, ... depend on the content element type.
- Output generation calls the output method of each content element of the ContentList. In classical JavaScript syntax it will look like this:
let contentlist = new ContentList();
// here populate the contentlist
// mypars collects the optional parameters mypar1, mypar2, ...
// contentlist[i] assumes that ContentList supports index access like an array
const toHTML = function(...mypars) {
  var out = '';
  for (var i = 0; i < contentlist.length; i++) {
    out += contentlist[i].toHTML();
  }
  return out;
};
const toLatex = function(...mypars) {
  var out = '';
  for (var i = 0; i < contentlist.length; i++) {
    out += contentlist[i].toLatex();
  }
  return out;
};
const toMarkdown = function(...mypars) {
  var out = '';
  for (var i = 0; i < contentlist.length; i++) {
    out += contentlist[i].toMarkdown();
  }
  return out;
};
To implement the content list in a generic way, with a software design that allows adding new output formats, the refactoring could be implemented as follows (see other possible output formats on the PanDoc website).
let contentlist = new ContentList();
// here populate the contentlist
// mypars collects the optional parameters mypar1, mypar2, ...
const toOutput = function(format, ...mypars) {
  var out = '';
  for (var i = 0; i < contentlist.length; i++) {
    out += contentlist[i].toOutput(format);
  }
  return out;
};
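One possible way to keep toOutput(format) open for new formats is a lookup table with one renderer per format and node type. The sketch below is an assumption about how this could be organized, not part of wtf_wikipedia: the renderers table, the node types used in it and the use of a plain array in place of a ContentList are all hypothetical.

```javascript
// Hypothetical renderer table: one entry per output format, one function per node type.
const renderers = {
  html: {
    SectionHeader: node => '<h1>' + node.value + '</h1>',
    TextBlock: node => '<p>' + node.value + '</p>'
  },
  latex: {
    SectionHeader: node => '\\section{' + node.value + '}\n',
    TextBlock: node => node.value + '\n\n'
  }
};

// Iterates over the content elements in order and dispatches on format and node type.
const toOutput = function (contentlist, format) {
  const renderer = renderers[format];
  if (!renderer) {
    throw new Error('unknown output format: ' + format);
  }
  let out = '';
  for (const node of contentlist) {
    const render = renderer[node.type];
    out += render ? render(node) : ''; // node types without a renderer are skipped
  }
  return out;
};

// Adding a format means adding one entry to the table; the loop stays unchanged.
renderers.markdown = {
  SectionHeader: node => '## ' + node.value + '\n\n',
  TextBlock: node => node.value + '\n\n'
};

// A plain array stands in for the ContentList in this sketch.
const contentlist = [
  { type: 'SectionHeader', value: 'Soccer' },
  { type: 'TextBlock', value: 'The game last 90 min.' }
];
console.log(toOutput(contentlist, 'markdown'));
```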
Even the headers of the sections can be designed as the first content element of a ContentList. The introduction of ContentList
- fixes the loss of content element order on the section level,
- builds a generic structure for building the Abstract Syntax Tree (AST).
The ContentList provides a generic structure for the Abstract Syntax Tree (AST). Referring to the bullet list in the example above, we have to perform 3 major steps:
- open the bullet list or enumeration,
- create a content element in the ContentList for each item of the bullet list or enumeration and populate the content list with the items of the list,
- close the bullet list or enumeration.
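//let bulletlist = new ContentList();
let bulletlist = createNode4AST("BulletList");
// Attribute: bulletlist.type = "BulletList"
// The nodes are pushed to the content list of the BulletList node; the second
// argument of createNode4AST() is the already parsed content of the item.
bulletlist.contentlist.push(createNode4AST("OpenBulletList"));
bulletlist.contentlist.push(createNode4AST("ItemBulletList", parseContentList("2 Teams with 11 players each,")));
bulletlist.contentlist.push(createNode4AST("ItemBulletList", parseContentList("3 referees")));
bulletlist.contentlist.push(createNode4AST("CloseBulletList"));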
Because bulletlist.type = "BulletList" already defines the following items as bullet items, separate tree nodes for OpenBulletList and CloseBulletList may seem unnecessary. The tree node OpenBulletList can be used to store formatting attributes if desired, but all attributes of the bullet list may also be stored as attributes of the tree node BulletList.
The method createNode4AST() creates a very simple node for the Abstract Syntax Tree (AST) by returning a hash with just the type attribute. This AST node can be populated with additional attributes that may be relevant for the generation of output formats.
const createNode4AST = function(nodeid) {
  return {
    "type": nodeid
  };
};
A tree node for Paragraph will be created with createNode4AST("Paragraph") and populated with more content. A node-specific constructor could use a switch statement to add type-specific additional attributes.
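// contentlist is optional: an already parsed content list, e.g. the result of parseContentList()
const createNode4AST = function(nodeid, contentlist) {
  let ast_node = {
    "type": nodeid
  };
  switch (nodeid) {
    // one case label per type: a comma-separated list would only match the last entry
    case "Paragraph":
    case "BulletList":
    case "EnumList":
    case "TextBlock":
    case "Sentence":
      ast_node.contentlist = new ContentList();
      break;
    case "Section":
      ast_node.title = "";
      ast_node.depth = -1;
      ast_node.contentlist = new ContentList();
      break;
    default:
  }
  if (contentlist) {
    ast_node.contentlist = contentlist;
  }
  return ast_node;
};
var n1 = createNode4AST("ItemBulletList", parseContentList("2 Teams with 11 players each,"));
var n2 = createNode4AST("ItemBulletList", parseContentList("3 referees"));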
The method parseContentList() decomposes a string into tree nodes of the AST. Parsing a ContentList or even a TextBlock is not necessary in the example mentioned above, because the TextBlock contains just one sentence. If the item contains a substructure of the AST, e.g. a TextBlock, then parseContentList() will split the TextBlock into a ContentList of sentences. The basic example mentioned above is used again as the parsing source of the wiki:
==Soccer==
The soccer game consists of the following components:
* 2 Teams with 11 players each,
* 3 referees
The game last 90 min.
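The sentence splitting attributed to parseContentList() could be sketched as follows. This is a hypothetical, deliberately naive version: the sentence boundary rule (a ".", "!" or "?" followed by whitespace) is an assumption that mis-splits abbreviations such as "e.g. ", and the result is a plain array of Sentence nodes rather than a ContentList.

```javascript
// Hypothetical parseContentList(): splits a text block into Sentence nodes.
// Boundary rule (simplification): ".", "!" or "?" followed by whitespace.
const parseContentList = function (text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map(sentence => sentence.trim())
    .filter(sentence => sentence.length > 0)
    .map(sentence => ({ type: 'Sentence', value: sentence }));
};

// A list item with a single sentence yields a single node.
console.log(parseContentList('3 referees'));
// A text block with two sentences yields two nodes in order of appearance.
console.log(parseContentList('First sentence. Second sentence!'));
```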
Parsing will create the following AST of the section above:
- AST Type: SectionHeader, value: Soccer
- AST Type: TextBlock, value: The soccer game consists of the following components:
- AST Type: OpenBulletList
- AST Type: ItemBulletList, value: 2 Teams with 11 players each,
- AST Type: ItemBulletList, value: 3 referees
- AST Type: CloseBulletList
- AST Type: TextBlock, value: The game last 90 min.
This example is a linear concatenation of content elements in a ContentList. But this somewhat violates the syntactical structure of the document, so it is recommended to design the BulletList as an extension of the ContentList.
- AST Type: SectionHeader, value: Soccer
- AST Type: TextBlock, value: The soccer game consists of the following components:
- AST Type: BulletList as extension of ContentList
  - AST Type: OpenBulletList
  - AST Type: ItemBulletList, value: 2 Teams with 11 players each,
  - AST Type: ItemBulletList, value: 3 referees
  - AST Type: CloseBulletList
- AST Type: TextBlock, value: The game last 90 min.
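A recursive renderer for this nested AST could look like the following sketch. It assumes plain objects with type, value and contentlist attributes and uses arrays in place of ContentList; the function renderHTML() is hypothetical. The OpenBulletList and CloseBulletList nodes emit the opening and closing ul-tag, and the order of the output follows the order of the nodes.

```javascript
// Hypothetical recursive renderer: each node type decides how it is rendered,
// nested content lists are rendered in order of appearance.
const renderHTML = function (node) {
  switch (node.type) {
    case 'SectionHeader':
      return '<h1>' + node.value + '</h1>\n';
    case 'TextBlock':
      return node.value + '\n';
    case 'BulletList':
      // a BulletList is itself a content list: render its children in order
      return node.contentlist.map(child => renderHTML(child)).join('');
    case 'OpenBulletList':
      return '<ul>\n';
    case 'ItemBulletList':
      return '<li>' + node.value + '</li>\n';
    case 'CloseBulletList':
      return '</ul>\n';
    default:
      return ''; // unknown node types are skipped in this sketch
  }
};

const ast = [
  { type: 'SectionHeader', value: 'Soccer' },
  { type: 'TextBlock', value: 'The soccer game consists of the following components:' },
  {
    type: 'BulletList',
    contentlist: [
      { type: 'OpenBulletList' },
      { type: 'ItemBulletList', value: '2 Teams with 11 players each,' },
      { type: 'ItemBulletList', value: '3 referees' },
      { type: 'CloseBulletList' }
    ]
  },
  { type: 'TextBlock', value: 'The game last 90 min.' }
];

const html = ast.map(node => renderHTML(node)).join('');
console.log(html);
```

In contrast to the release 5.0 output shown earlier, the introductory text is rendered before the list and the concluding remark after it.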
- A Paragraph will be rendered differently than just a ContentList. The ContentList as a list of content elements just concatenates the generated output of the elements of the list, while a Paragraph wraps the generated output, e.g. for HTML with a p-tag.
- A TextBlock is basically a ContentList of Sentences, Citations and References. Tables, Infoboxes, ... were already parsed. Whether inline images are allowed as content elements of the content list Sentence must be decided by the wtf_wikipedia maintainers (see Images). Inline images are helpful for text comprehension, for example:
The icon [[File:warning_icon.png]] visualizes a warning in the upcoming paragraph.
...
* [[File:warning_icon.png]] be aware of chemical X, it is nephrotoxic and destroys the kidney.
* self-protection should be applied with Y ...
- A table is a ContentList of TableHeader and TableRows.
- A TableHeader is a ContentList of THCell.
- A TableBody is a ContentList of TableRow.
- A TableRow is a ContentList of TableCell.
- A TableCell is a Sentence, a TextBlock or again a ContentList.
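The table decomposition above can be sketched as a nested structure in which every level is a content list. The sketch is an assumption for illustration only: arrays stand in for ContentList, table cells hold plain strings, the table consists of one TableHeader and one TableBody, and the wrappers table and the function tableToHTML() are hypothetical.

```javascript
// Hypothetical opening and closing markup per node type.
const wrappers = {
  Table: ['<table>', '</table>'],
  TableHeader: ['<thead><tr>', '</tr></thead>'],
  TableBody: ['<tbody>', '</tbody>'],
  TableRow: ['<tr>', '</tr>'],
  THCell: ['<th>', '</th>'],
  TableCell: ['<td>', '</td>']
};

// Renders a node by wrapping either its nested content list or its value.
const tableToHTML = function (node) {
  const [open, close] = wrappers[node.type];
  const inner = node.contentlist
    ? node.contentlist.map(child => tableToHTML(child)).join('')
    : node.value;
  return open + inner + close;
};

const table = {
  type: 'Table',
  contentlist: [
    {
      type: 'TableHeader',
      contentlist: [
        { type: 'THCell', value: 'Component' },
        { type: 'THCell', value: 'Count' }
      ]
    },
    {
      type: 'TableBody',
      contentlist: [
        {
          type: 'TableRow',
          contentlist: [
            { type: 'TableCell', value: 'Teams' },
            { type: 'TableCell', value: '2' }
          ]
        },
        {
          type: 'TableRow',
          contentlist: [
            { type: 'TableCell', value: 'Referees' },
            { type: 'TableCell', value: '3' }
          ]
        }
      ]
    }
  ]
};
console.log(tableToHTML(table));
```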
- Parsing concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
- Output: based on concepts of PanDoc, the swiss-army knife of document conversion developed by John MacFarlane - https://www.pandoc.org