nLab corpus

This repository contains a "cleaned" version of the contents of the nLab (as of c. December 2020), intended for use as a training corpus for machine learning projects. The cleaning process strips out non-textual elements (such as bullet points) and converts LaTeX mathematics into Unicode wherever possible.

nlab_plain_normalized.txt is the concatenation of all the pages into one large text file.
nlab_plain.json has the same content as the plaintext file, but is organised into key-value pairs, with the key being the title of the page, and the value being its contents.
nlab_stats.json contains some basic statistics about the corpus, generated by spaCy.
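
As a minimal sketch of how the key-value file might be read, the snippet below loads a JSON object that maps page titles to page contents and looks up titles by substring. It assumes the top level of nlab_plain.json is a single JSON object, as described above; the helper names load_pages and find_titles are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def load_pages(path):
    """Load a title -> contents mapping from a JSON file.

    Assumes the file holds one JSON object whose keys are page titles
    and whose values are the page contents.
    """
    with Path(path).open(encoding="utf-8") as f:
        pages = json.load(f)
    if not isinstance(pages, dict):
        raise ValueError("expected a JSON object mapping titles to page contents")
    return pages


def find_titles(pages, term):
    """Return the sorted titles that contain `term`, ignoring case."""
    needle = term.lower()
    return sorted(title for title in pages if needle in title.lower())
```

With the repository checked out, `pages = load_pages("nlab_plain.json")` would give a dictionary, and `find_titles(pages, "monad")` would list any titles containing that word. Note that the whole file is read into memory at once.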

A work-in-progress ontology of categorical concepts, extracted using the root- and rule-based method of Collard et al., can be explored at http://18.222.108.184:8080/. The site may occasionally be down for updates and maintenance.

For licensing information, see the nLab licence.

Corpus Statistics

There are two types of part-of-speech tags in the corpus statistics, both generated by spaCy. The first tagset, labeled "pos" in nlab_stats.json, represents coarse-grained parts of speech and is taken from the Universal POS tag set. The second tagset, "tag", is fine-grained and specific to spaCy's pretrained English model.

Details about the different tagsets, as well as other label schemes for this model, can be found on spaCy's website.
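
To illustrate the difference between the two tagsets, here is a rough sketch that counts both kinds of label over a sequence of tokens. spaCy exposes the coarse-grained label as `token.pos_` and the model-specific label as `token.tag_`. The model name `en_core_web_sm` is an assumption (this README does not say which English model was used), the helper names are hypothetical, and the sketch does not read nlab_stats.json, whose exact layout is not described here.

```python
from collections import Counter


def count_tags(tokens):
    """Count coarse-grained ("pos") and fine-grained ("tag") labels.

    Each token only needs string attributes `pos_` and `tag_`,
    which is what spaCy tokens provide.
    """
    pos_counts = Counter()
    tag_counts = Counter()
    for tok in tokens:  # single pass, so generators work too
        pos_counts[tok.pos_] += 1
        tag_counts[tok.tag_] += 1
    return pos_counts, tag_counts


def tag_text(text, model="en_core_web_sm"):
    """Tag `text` with spaCy and count both tagsets.

    The default model name is an assumption; it must be installed
    separately before this function can run.
    """
    import spacy  # imported lazily so count_tags works without spaCy

    nlp = spacy.load(model)
    return count_tags(nlp(text))
```

Calling `tag_text("Every topos is a category.")` would return two counters, the first keyed by Universal POS labels such as NOUN and the second keyed by the model's own fine-grained tags.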
