AMALGUM v0.2

AMALGUM is a machine-annotated multilayer corpus that follows the same design and annotation layers as GUM but is substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high-quality, richly annotated, but small datasets and the larger but shallowly annotated corpora that are often scraped from the Web. Read more here: https://corpling.uis.georgetown.edu/gum/amalgum.html

Download

The latest data without Reddit texts is available under amalgum/, and some additional data beyond the target size of 4M tokens is available under amalgum_extra/. (The amalgum directory contains around 500,000 tokens for each genre, while the extra directory contains data beyond the genre-balanced corpus.)

You may download the older version 0.1 of the corpus without Reddit texts as a zip. The complete corpus, with Reddit data, is available upon request: please email lg876@georgetown.edu.

Description

AMALGUM (A Machine-Annotated Lookalike of GUM) is an English web corpus spanning 8 genres with 4,000,000 tokens and several annotation layers.

Genres

Source data was scraped from eight different sources containing stylistically distinct text. Each text's source is indicated with a slug in its filename:

academic: MDPI
bio: Wikipedia
fiction: Project Gutenberg
interview: Wikinews, Interview category
news: Wikinews
reddit: Reddit
whow: wikiHow
voyage: wikiVoyage
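
As an illustration of how the genre slug can be read back from a filename, here is a minimal Python sketch. It assumes filenames follow the pattern AMALGUM_<genre>_<name> seen in the sample document AMALGUM_news_khadr, optionally followed by a file extension. The helper name genre_of and the GENRE_SOURCES table are hypothetical and not part of this repository.

```python
from pathlib import Path

# Genre slug -> source, copied from the list above.
GENRE_SOURCES = {
    "academic": "MDPI",
    "bio": "Wikipedia",
    "fiction": "Project Gutenberg",
    "interview": "Wikinews, Interview category",
    "news": "Wikinews",
    "reddit": "Reddit",
    "whow": "wikiHow",
    "voyage": "wikiVoyage",
}


def genre_of(filename: str) -> str:
    """Return the genre slug of a document filename.

    Assumes the pattern AMALGUM_<genre>_<name>[.<ext>] (an assumption
    based on the sample document name AMALGUM_news_khadr).
    """
    # Drop any directory part, then everything from the first dot on.
    stem = Path(filename).name.split(".")[0]
    parts = stem.split("_")
    if len(parts) < 3 or parts[0] != "AMALGUM" or parts[1] not in GENRE_SOURCES:
        raise ValueError(f"unexpected filename: {filename!r}")
    return parts[1]
```

For example, genre_of("AMALGUM_news_khadr.conllu") yields the slug "news", which GENRE_SOURCES maps to Wikinews.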

Annotations

AMALGUM contains annotations for the following information:

Tokenization
UD and Extended PTB part-of-speech tags
Lemmas
UD dependency parses
(Non-)named nested entities
Coreference resolution
Rhetorical Structure Theory discourse parses (constituent and dependency versions)
Date/Time annotations in TEI format

These annotations are distributed across four file formats: GUM-style XML, CoNLL-U, WebAnno TSV, and RS3.

You can see samples of the data for AMALGUM_news_khadr: xml, conllu, tsv, rs3
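
For readers who want to inspect the token-level layers, the following is a minimal sketch of a dependency-free CoNLL-U reader in Python. It assumes the files follow the standard ten-column CoNLL-U layout; the function name read_conllu and the SAMPLE sentence are invented for illustration and are not taken from the corpus.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one list of token dicts per sentence."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment or metadata line.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Invented example sentence, not taken from AMALGUM.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)

for tokens in read_conllu(SAMPLE):
    print([(t["form"], t["xpos"], t["deprel"]) for t in tokens])
```

A dedicated library is likely a better choice for real work; this sketch only shows where tokens, lemmas, tags, and dependency relations sit in each row.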

Performance

Current scores on the GUM corpus test set per task:

task	metric	performance
tokenizer	F1	99.92
sentencer	Acc / F1	99.85 / 94.35
xpos	Acc	98.16
dependencies	LAS / UAS*	92.16 / 94.25
NNER	Micro F1	70.8
coreference	CoNLL F1	51.4
RST	S / N / R	77.98 / 61.79 / 44.07

* Parsing scores ignore punctuation attachment; punctuation is attached automatically via udapi.
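
To make the footnote concrete, here is a rough Python sketch of how labeled and unlabeled attachment scores can be computed while leaving punctuation out of the count. It is not the evaluation script used for the table above. The function name attachment_scores, the (upos, head, deprel) input layout, and the choice to identify punctuation by the gold UPOS tag PUNCT are all assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, int, str]  # (upos, head index, dependency relation)


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (LAS, UAS) as percentages, skipping gold punctuation tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token lists must be aligned")
    total = uas_hits = las_hits = 0
    for (g_pos, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if g_pos == "PUNCT":
            # Punctuation attachment is excluded from scoring.
            continue
        total += 1
        if g_head == p_head:
            uas_hits += 1  # correct head
            if g_rel == p_rel:
                las_hits += 1  # correct head and correct label
    if total == 0:
        raise ValueError("no scorable tokens")
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Invented toy example: one wrong label and one wrong punctuation head.
gold = [("NOUN", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 2, "obj"), ("PUNCT", 2, "punct")]
pred = [("NOUN", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 2, "nmod"), ("PUNCT", 1, "punct")]
las, uas = attachment_scores(gold, pred)
print(f"LAS {las:.2f} / UAS {uas:.2f}")
```

In the toy example the misattached punctuation token does not lower either score, while the wrong label on the third token lowers LAS only.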

Further Information

Please see our paper.

Citation

@inproceedings{gessler-etal-2020-amalgum,
    title = "{AMALGUM} {--} A Free, Balanced, Multilayer {E}nglish Web Corpus",
    author = "Gessler, Luke  and
      Peng, Siyao  and
      Liu, Yang  and
      Zhu, Yilun  and
      Behzad, Shabnam  and
      Zeldes, Amir",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.648",
    pages = "5267--5275",
    abstract = "We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a {``}better than NLP{''} benchmark and evaluate the accuracy of the resulting resource.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

License

All annotations under the folders amalgum/ and amalgum_extra/ are available under a Creative Commons Attribution (CC-BY) license, version 4.0. Note that their texts are sourced from the following websites under their own licenses:

academic: MDPI, CC BY 4.0
bio: Wikipedia, CC BY-SA 3.0
fiction: Project Gutenberg, The Project Gutenberg License
interview: Wikinews, CC BY 2.5
news: Wikinews, CC BY 2.5
whow: wikiHow, CC BY-NC-SA 3.0
voyage: wikiVoyage, CC BY-SA 3.0

Development

See DEVELOPMENT.md.
