Tokenizer for Wiki Transformation Framework - WTF

Tokenizer replaces document elements with tokens and stores the parsed content elements in a JSON. The library wtf_tokenizer was designed to work together with the great Wiki Markdown parser wtf_wikipedia developed by Spencer Kelly. Without his work for wtf_wikipedia this library wtf_tokenizer would not exist. The tokenizer may be implemented based on the description of micro libraries for wtf_tokenizer as part of the Wiki Transformation Framework (WTF).

Tokenize Wiki Markdown

With this tokenizer you will be able to replace the following content elements:

Mathematical Expressions,
Citations and References

Mathematical Expression

Mathematical expression are defined with the math-tag in Wiki markdown syntax:

text before the mathematical expression <MATH>\sum_{i=1}^{\infty} [x_i]
: v_i
</MATH> text after math.

Tokenizing with the encode()-call for mathematical expressions will create the following content as output.

text before the mathematical expression ___MATH_INLINE_7238234792_5___ text after math.

After the time index 7238234792 the enumeration ID 5 represents the 5th mathematical expression found in the Wiki markdown source text.

Citations and References

Citations and references are defined with the ref-tag in Wiki markdown syntax:

text before the reference <ref name="MyLabel">Peter Miller (2020) ...</ref> and text after the reference.
cite an already defined reference with <ref name="MyLabel"/> text after citation.

Tokenizing with the encode()-call for citations and reference will create the following content as output.

Tokenizer for Syntax Domains

The tokenizer converts a XML sections like REF-tag and mathematical expression wrapped in MATH-Tag into attribtues of the generated JSON,

text before math <MATH>
\sum_{i=1}^{\infty} [x_i]
: v_i
</MATH> text after math.
text before <ref>my reference ...</ref> and text after
cite an already defined reference with <ref name="MyLabel"/> text after citation.

into

text before math ___MATH_INLINE_7238234792_5___ text after math.
text before ___CITE_7238234792_3___ and text after    
cite an already defined reference with ___CITE_7238234792_MyLabel___ text after citation.

The challenge of parsing can be identified in the mathematical expression. The colon : in the first column of the line defines an indentation. But within a mathematical expression it is just a devision.

Uniqness of Markers with Time Stamp

The number 7238234792 is a unique integer generated by the time and date in milliseconds, to make the marker unique. Mathematical Expressions, Citation and References are extracted and replaced by an encode()-call of wtf_tokenizer. The tokenizer is defined in /src/index.js and requires the submodules. For further processing e.g. with the wtf_wikipedia library the tokens/markers are regarded as ordinary words in text.

If you want to generate different output formats with wtf_wikipedia (e.g. HTML, LaTeX, Markdown, ...) the tokens/markers can be replaced in the appropriate syntax by calling a detokenizer/decoder during post processing the generated output from other Wiki Transformation Framework tools like wtf_wikipedia or Wiki2Reveal. When the output is generated with wtf_wikipedia.html() or wtf_wikipedia.markdown(), then call because during output the final numbering of citations can be generated, if more that on articles are downloaded and aggregated.

So it makes sense, that the markers/tokens remain even in the JSON sentences, sections and paragraphs until the final output is generated. By wtf_tokenizer the corresponding data of the tokenizer will populate the doc.references in the same way as wtf_wikipedia but in addition wtf_wikipedia the label for backwards replacment in the output was appended to the records for any tokens. E.g. the record for the corresponding label (e.g. ___CITE_7238234792_3___ or ___MATH_INLINE_7238234792_5___ will also be stored for all references and mathematical expressions. This concept will allow that, that later on the markers for citations can be replaced by [6] in the IEEE citation style. If you want a decoding of tokens for citations in APA-Style, you will able to replace that e.g. by (Kelly 2018) with the call of wtf_tokenizer.text() or wtf_tokenizer.html(). The same would be performed for mathematical inline and block expressions, they need the original location of the mathematical expression e.g. in sentence (e.g. ___MATH_INLINE_7238234792_5___).

This needs the introduction of wtf_tokenizer.json() method will not replace any content in the output of the JSON-file. The replacement can be implemented if you want that to be performed in a specific use-case of your application.

Furthermore it must be mentioned that mathematical expression have different rendering styles in Wikipedia, Wikiversity. The block or inline type distinguish between mathematical expression in the text and mathematical expressions in a seperated line. The token label attribute for mathematical content incorporates the style information by adding that to the label name of the corresponding token of the mathematical expresssions in the wiki source.

Tokizer Steps and Workflow - Recommendation

Step 1: wtf_fetch() based on cross-fetch fetches the wiki source
- Input:
  - language="en" or language="de" to specify the language of the wiki source
  - domain="wikipedia" or domain="wikiversity" or domain="wikispecies" to select the wiki domain for the Wiki fetch() call to pulls the wiki sources from.
Output:
- wiki source text e.g. from wikipedia or wikiversity
- Remark: wtf_fetch extracts your wtf.fetch() in a separate module.
Step 2: wtf_tokenize()
- Input:
  - wiki source text e.g. from wikipedia or wikiversity fetched by wtf_fetch
- Output:
  - wiki source text where e.g. mathematical expressions are replaced by tokens like MATH-INLINE-839832492834_N12. wtf_wikipedia treats those tokens just as words in a sentence.
Step 3: wtf_wikipedia()
- Input:
  - wiki source text with tokenized citations and mathematical expressions
- Output: object doc of type Document. Application of output methods for text, html, latex, json containing the tokens as words in sentences. The tokens appear in the output of doc. html() or doc.latex() in wtf_wikipedia and in the JSON as well.
Step 4: wtf_tokenizer
- Input:
  - string in the export format, text with tokenized citations and mathematical expressions
- Output: detokenized export format in the out string is injected in the DeTokenizer w.g.wtf_tokenizer.html(out,data,options). In this case the output strint out is already in the HTML format. In the output out or in any other desired output format (e.g. markdown) the token replacement is performed e.g. for HTML the mathematical expressions are exported to MathJax and e.g. for latex the detokenizer replaces the word/token ___MATH_INLINE-839832492834_12___ by $\sum_{n=0}^{\infty} \frac{x^n}{n!}$ . The tokenizer can replace the tokens of type

   ___MATH_INLINE_793249879_5___
   ___MATH_BLOCK_793249879_6___

and pushes the latex code of mathematical expressions in the JSON data

Citations reference with a name

    <ref name="my citation" />

are replaced by

   ___CITE_LABEL_793249879_my_citation___

Use wtf_fetch to fetch Wiki markdown from wikipedia or Wikiversity and the apply wtf_tokenizer to tokenize

Citations or
Mathematical expression with a unique identifier in the Wiki markdown source. wtf_wikipedia developed by Spencer Kelly and <a href="https://github.com/spencermountain/wtf_wikipedia/graphs/contributors" target="ContributorsGithub>contributors can be used to generate output or the tokenizer can be used in Wiki2Reveal

wtf_wikipedia turns wikipedia's markup language into JSON while Wiki2Reveal creates a RevealJS presentation from the Wiki Markup source. In both use-cases of the wtf_tokenizer support you in handling the citations and mathematical expressions before parsing the content of a MediaWiki source. After Wiki2Reveal or

Demo HTML5-Application of `wtf_tokenizer`

The following wtf_tokenizer-demo is HTML-page, that imports the library wtf_fetch.js and the library wtf_tokenizer.js, which is generated by this modules.

wtf_fetch.js fetches articles from Wikipedia, Wikiversity, ... which are used as input files for testing the tokenizer-
uses HTML-form elements to determine the Wikipedia article and the domain from which the article should be downloaded.
Provides a Display Source button to show the current source file in the MediaWiki of Wikiversity or Wikipedia.
The download appends a source info at very end of the downloaded Wiki source, to create a reference in the text (like a citation - see function append_source_info()) :: Demo wtf_tokenizer
Wikipedia2Wikiversity uses wtf_tokenizer to download the Wikipedia markdown source into a textarea of an HTML file. The Wiki markdown source is processed and so that interwiki links from Wikiversity to Wikipedia work. Wikipedia2Wikiversity is also a demonstrator of an AppLSAC-0.

Decomposition of `wtf_wikipedia` in submodules

If you consider the source of wtf_wikipedia you can identify 3 major step:

wtf_fetch retrieving the wiki markup source from the MediaWiki API, i.e. https://www.wikipedia.org, https://www.wikiversity.org, https://www.wikivoyage.org, ...
wtf_parse, that parses wiki source into a Document object (Abstract Syntax Tree)
wtf_output, that generates/renders the output for a specific format from a given Document object.

`wtf_wikipedia` as integrator of modules

wtf_wikipedia integrates all these 3 tasks in one module. The provide module decomposes one of those tasks in this submodules. The submodules wtf_fetch, wtf_parse and wtf_output may be required independently in different project repository by a require command. Furthermore it improves maintenance, reusability of submodules and it separates the tasks in wtf_wikipedia in the submodules wtf_fetch, wtf_parse, wtf_output. If the modules are there wtf_wikipedia can be used just for chaining the tasks and other submodule can be added to the process chain in wtf_wikipedia. E.g. citation management would be a submodule called wtf_citation that could be implemented to insert a the citations in a document and fulfills a certain tasks. This modules uses the modular structure of wtf_wikipedia in the folder src/ to extract the current task in a separate repository. Later the current local require commands in wtf_wikipedia can be replaced by a remote require from npm.

`wtf_fetch` call before parsing with `wtf_tokenizer`

`wtf_wikipedia` call after `wtf_tokenizer`

Tokenizers parse specific content elements and replaces the content elements by unique identifiers. The identifiers/tokens must be handled as ordinary text elements/words that consist of characters, numbers, ... and which are not handled by the parser itsself. The unique identifiers will appear in the output format (export to HTML, Markdown, text, Open Document Format, ...) and as a final processing output the tokens will be replaced by a desired token handler that replaces mathematical expressions or citations according to requirements of the output format.

This could be documented in the README.md as developer recommendation and helps developers to understand the way forward and how they could add new wtf_modules in the chaining process. In this sense wtf_wikipedia will become the chain managment module of wtf_submodules.

Installation

The following examples uses wtf_fetch to download the Wiki source from the MediaWiki-API. The library wtf_tokenizer parses the wiki source and replaces mathematical expressions or citations by tokens, that will not be altered by wtf_wikipedia.

The decoding the tokens into the output format is dependent on the output format. Citations and mathematical expressions are handled differently according to the syntax of the output format.

Installation with NodeJS

The following installation incorporates fetching wiki sources from Wikipedia/Wikiversity.

npm install wtf_fetch npm install wtf_tokenizer

var wtf_fetch = require('wtf_fetch');
var wtf_tokenizer = require('wtf_tokenizer');

wtf_fetch.getPage('Swarm Intelligence', 'en','wikipedia' function(err, doc) {
  // doc contains the download wiki
  // options will be set that it tokenizes math expressions
  // citations will not be encoded
  var options = {
    "tokenize": {
      "math":true,
      "citations":false,
      "outformat":"html"    
    }
  };
  console.log("Source Wiki: " + doc.wiki);
  wtf_tokenizer.encode(doc,options);
  console.log("Encoded Tokens: " + doc.wiki);
  wtf_tokenizer.decode(doc,options);
  console.log("Decoded tokens: " + doc.wiki);
});

Installation/Usage in HTML page

You can just copy the library wtf_tokenizer with a script tag or just add the build of wtf_tokenizer.js or the compressed wtf_tokenizer.min.js from the repository directly and save the library e.g. into the js/ subdirectory of your HTML file. In this example we added also the library wtf_fetch.js to the example to fetch the Wiki source from the Wiki API.

<script src="js/wtf_fetch.min.js"></script>
<script src="js/wtf_tokenizer.min.js"></script>
<script>
  //(follows redirect)
  wtf_fetch.getPage('Water', 'en','wikiversity' function(err, doc) {
    // doc contains the download wiki
    // options will be set that it tokenizes math expressions
    // citations will not be encoded
    var options = {
      "tokenize": {
        "math":true,
        "citations":false,
        "outformat":"html"    
      }
    };
    console.log("Source Wiki: " + doc.wiki);
    wtf_tokenizer.encode(doc,options);
    console.log("Encoded Tokens: " + doc.wiki);
    // decode the mathematical expression
    // into HTML format with MathJax
    wtf_tokenizer.decode(doc,options);
    console.log("Decoded tokens: " + doc.wiki);
  });
</script>

What it does:

Assume you have downloaded the Wiki source code with wtf_fetch downloads Wiki markup source for an article from a MediaWiki of the Wiki Foundation
Allows different MediaWiki source, e.g. Wikipedia, Wikiversity, Wikivoyage, ...
Creates a JSON with the following format stored as example in JSON with the attributes for the fetch page in data. The JSON may look like this:

var data = {
  "wiki": "This is the content of the wiki article in wiki markdown ..."
  "title": "Swarm Intelligence",
  "lang": "en",
  "domain": "wikiversity",
  "url": "https://en.wikiversity.org/wiki/Swarm_Intelligence",
  "pageid": 2130123
}

If you want to access the Wiki markdown of the fetch article, access the data.wiki. The language and domain is stored in the JSON for the article because the attributes are helpful to expand relative links in the wiki to absolute links, that work also after having the document available on a other domain.

Processing MediaWiki Markdown with `wtf_tokenizer`

The fetched wiki markdown e.g. from Wikipedia is in general processed within the browser or in the NodeJS application.

wtf_wikipedia

The primary library for further processing is wtf_wikipedia by Spencer Kelly (see wtf_wikipedia ).

wiky.js - wiki2html.js

wiky.js wiki2html are simple libraries that convert sources from a MediaWiki to HTML. With these converters you can start with, to learn about parsing a wiki source document downloaded from a MediaWiki.

Parsoid:

Wikimedia's Parsoid javascript parser is the official wikiscript parser. It reliably turns wikiscript into HTML, but not valid XML.

To use it for data-mining, you'll need to:

parsoid(wikiText) -> [headless/pretend-DOM] -> screen-scraping

which is fine,

but getting structured data out of the Wiki source go ahead with Spencer Kelly library wtf_wikipedia

API

wtf_fetch.getPage(title, [lang], [domain], [options], [callback])

outputs:

The callback or promise will get a JSON of the following type that contains the markdown content in the wiki property of the returned JSON:

{
  "wiki": "This is the fetched markdown source of the article ...",
  "title": "My Wikipedia Title",
  "lang": "en",
  "domain": "wikpedia",
  "url": "https://en.wikipedia.org/wiki/My_Wikipedia_Title",
  "pageid": 12345  
}

Language and Domainname

You can retrieve the Wiki markdown from different MediaWiki products of the WikiFoundation. The domain name includes the Wiki product (e.g. Wikipedia or Wikiversity) and a language. The WikiID encoded the language and the domain determines the API that is called for fetching the source Wiki. The following WikiIDs are referring to the following domain name.

Language: en Domain: wikipedia: https://en.wikipedia.org
Language: de Domain: wikipedia: https://de.wikipedia.org
Language: fr Domain: wikipedia: https://fr.wikipedia.org
Language: en Domain: wikibooks: https://en.wikibooks.org',
Language: en Domain: wikinews: https://en.wikinews.org',
Language: en Domain: wikiquote: https://en.wikiquote.org',
Language: en Domain: wikisource: https://en.wikisource.org',
Language: en Domain: wikiversity: https://en.wikiversity.org',
Language: en Domain: wikivoyage: https://en.wikivoyage.org'

Examples

wtf_fetch.getPage(title, [lang], [domain] [options], [callback])

retrieves raw contents of a mediawiki article from the wikipedia action API.

This method supports the errback callback form, or returns a Promise if one is missing.

to call non-english wikipedia apis, add it's language-name as the second parameter

wtf_fetch.getPage('Toronto', 'de', 'wikipedia', function(err, doc) {
  var url = "https://" + doc.lang + "." + doc.domain + ".org";
  console.log("Wiki JSON fetched from '" +
       url + "/wiki/" + doc.title + "'\n" + JSON.stringify(doc,null,4));
  //doc.wiki = "Toronto ist mit 2,6 Millionen Einwohnern..."
});

you may also pass the wikipedia page id as parameter instead of the page title:

wtf_fetch.getPage(64646, 'de', 'wikipedia', function(err, doc) {
  console.log("Wiki JSON\n"+JSON.stringify(doc,null,4));
});

the fetch method follows redirects.

CLI

if you're scripting this from the shell, or from another language, install with a -g, and then run:

$ node ./bin/wtf_fetch.js George Clooney de wikipedia
# George Timothy Clooney (born May 6, 1961) is an American actor ...

$ node ./bin/wtf_fetch.js 'Toronto Blue Jays' en wikipedia

Command Line Interface was not implement so far.

Good practice:

The wikipedia api is pretty welcoming though recommends three things, if you're going to hit it heavily -

1️⃣ pass a Api-User-Agent as something so they can use to easily throttle bad scripts
2️⃣ bundle multiple pages into one request as an array
3️⃣ run it serially, or at least, slowly.

wtf_fetch.getPage(['Royal Cinema', 'Aldous Huxley'], 'en', 'wikipedia',{
  'Api-User-Agent': 'youremail@example.com'
}).then((docList) => {
  let allDocs = docList.map(doc => doc.wiki);
  console.log(allDocs);
});

Create Office Documents

wtf_fetch is just the first step in creating other formats directly from the Wikipedia source by "on-the-fly" conversion after downloading the Wiki source e.g. from Wikipedia.

Creating an Office document is just one example of an output file. ODT-output is currently (2018/11/04) not part of wtf_wikipedia but you may want to play around with the wtf_fetch or wtf_wikipedia to parse the Wiki source and convert the file in your browser into an Office document. The following source will support a bit in creating the Office documents.

PanDoc and ODF Editor

If you try PanDoc document conversion the key to generate Office documents is the export format ODF. LibreOffice can load and save even the OpenDocument Format and LibreOffice can load and save MICR0S0FT Office formats. So exporting to Open Document Format will be good option to start with in wtf_wikipedia. The following description are a summary of aspects that support developers in bringing the Office export format e.g. to web-based environment like the ODF-Editor. OpenDocument Format provides a comprehensive way forward for wtf_wikipedia to exchange documents from a MediaWiki source text reliably and effortlessly to different formats, products and devices. Regarding the different Wikis of the Wiki Foundation as a Content Sink e.g. the educational content in Wikiversity is no longer restricted to a single export format (like PDF) open ups access to other specific editors, products or vendors for all your needs. With wtf_wikipedia and an ODF export format the users have the opportunity to choose the 'best fit' application of the Wiki content. This section focuses on Office products.

Open Document Format ODT

Some important information to support Office Documents in the future

see WebODF how to edit ODF documents on the web or display slides. Current limitation of WebODF is, that does not render mathematical expressions, but alteration in WebODF editor does not remove the mathematical expressions from the ODF file (state 2018/04/07). WebODF does not render the mathematical expressions but this may be solved in the WebODF-Editor by using MathJax or KaTeX in the future.
The ODT-Format is the default export format of LibreOffice/OpenOffice. Supporting the Open Community Approach OpenSource office products are used to avoid commercial dependencies for using generated Office products.
- The ODT-Format of LibreOffice is basically a ZIP-File.
- Unzip shows the folder structure within the ZIP-format. Create a subdirectory e.g. with the name zipout/ and call unzip mytext.odt -d zipout (Linux, MacOSX).
- The main text content is stored in content.xml as the main file for defining the content of Office document
- Remark: Zipping the folder content again will create a parsing error when you load the zipped office document again in LibreOffice. This may be caused by an inappropriate order in the generated ZIP-file. The file mimetype must be the first file in the ZIP-archive.
- The best way to generate ODT-files is to generate an ODT-template mytemplate.odt with LibreOffice and all the styles you want to apply for the document and place a marker at specific content areas, where you want to replace the cross-compiled content with wtf_wikipedia in content.xml. The file content.xml contains the text and can be updated in ODT-ZIP-file. If you want to have a MlCR0S0FT 0ffice output, just save the ODT-file in LibreOffice as Word-file. Also marker replacement is possible in ODF-files (see also WebODF demos.
- Image must be downloaded from the MediaWiki (e.g. with an NPM equivalent of wget for fetching the image, audio or video) and add the file to the folder structure in the ZIP. Create a ODT-file with LibreOffice with an image and unzip the ODT-file to learn about way how ODT stores the image in the ODT zip-file.
JSZip: JSZip can be used to update and add certain files in a given ODT template (e.g. mytemplate.odt). Handling ZIP-files in a cross-compilation WebApp with wtf_wikipedia that runs in your browser and generates an editor environment for the cross-compiled Wiki source text (like the WebODF editor). The updating the ODT template as ZIP-file can be handled with JSZip by replacing the content.xml in a ZIP-archive. content.xml can be generated with wtf_wikipedia when the odf-export format is added to /src/output/odf (ToDo: Please create a pull request if you have done that).
LibreOffice Export: Loading ODT-files in LibreOffice allows to export the ODT-Format to
- Office documents doc- and docx-format,
- Text files (.txt),
- HTML files (.html),
- Rich Text files (.rtf),
- PDF files (.pdf) and even
- PNG files (.png).
Planing of the ODT support can be down in this README and collaborative implementation can be organized with Pull Requests PR.
Helpful Libraries: node-odt, odt

Word Export with Javascript Libraries

wtf_wikipedia supports HTML export,
the library html-docx-js supports cross-compilation of HTML into docx-format

Contributing

wtf_fetch is just a minor micro library to fetch the wiki markdown of an article in Wikipedia, Wikiversity, ... Please consider contribution to the wtf_wikipedia developed by Spencer Kelly - see wtf_wikipedia for further details and join in!

Acknowledgement

This library add the tokenizer to the Wiki Transformation Framework WTF wtf_wikipedia. The code of the library complements specific features to wtf_wikipedia was developed by Spencer Kelly. wtf_fetch is in general used to retrieve a specific article from Wikipedia, Wikiversity and wtf_fetch is based on cross-fetch which allows fetch the markdown of articles from Wikipedia, Wikiversity even from a local HTML file. This is great because you can fetch an article and process the article in a browser without the need to perform processing on a remot server. Special thanks to Spencer Kelly for creating and maintaining wtf_wikipedia. A great contribution to the OpenSource community especially for using Wiki content as Open Educational Resources.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
bin		bin
builds		builds
docs		docs
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
contributing.md		contributing.md
coverage.lcov		coverage.lcov
npm_patch.sh		npm_patch.sh
package-lock.json		package-lock.json
package.json		package.json
scratch.js		scratch.js

License

niebert/wtf_tokenizer

Folders and files

Latest commit

History

Repository files navigation