Multi-layered Cross-genre Corpus

The Multi-layered Cross-genre Corpus (MLCG) is a collection of texts spanning several genres, including news articles, children's stories, and Reddit posts. The corpus is annotated at three layers (coreference, causal relations, and temporal relations) to support in-depth analysis of how these phenomena behave in different kinds of text.

MLCG covers a range of text types so that researchers and language enthusiasts can study and compare the characteristics of different genres. Each genre has distinct stylistic and structural features that pose their own challenges for human annotators and machine learning models alike.

For instance, the children's stories in the corpus exhibit a linear temporal structure and clearly defined causal relations, which gives the narrative a coherent flow. News articles, by contrast, often present events in non-linear temporal order and use few first-person pronouns and little conditional language, since their aim is to convey factual information. Reddit posts are characterized by author-centered explanations of ongoing situations and occasionally reference meta-textual aspects of the platform.

The annotation schemes used in MLCG are adapted from existing work to suit a broad range of text types. Because the corpus combines coreference, causal, and temporal annotations, it can be used to study how these forms of semantic information interact within and across genres.

Researchers can use MLCG to examine the textual characteristics of each genre and the differences between them. The corpus is released under the open-source Apache 2.0 license to encourage collaboration and to support the development of more effective natural language processing models and language understanding systems.

Corpus Breakdown

Source	Causal	Coref	Temporal
CNN¹	50	50	50
Fables²	50	50	50
Reddit³	100	100	150
Reuters⁴	50	50	50
Wind in the Willows²	-	-	50
Wizard of Oz²	50	50	50
Total	300	300	400
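
The totals in the table can be checked with a few lines of Python. This is only a sketch: the counts are copied from the breakdown above, while the `BREAKDOWN` dictionary and the `layer_totals` helper are illustrative names, not part of the repository.

```python
# Counts copied from the Corpus Breakdown table.
# None marks a layer that is not annotated for that source ("-" in the table).
BREAKDOWN = {
    "CNN": {"causal": 50, "coref": 50, "temporal": 50},
    "Fables": {"causal": 50, "coref": 50, "temporal": 50},
    "Reddit": {"causal": 100, "coref": 100, "temporal": 150},
    "Reuters": {"causal": 50, "coref": 50, "temporal": 50},
    "Wind in the Willows": {"causal": None, "coref": None, "temporal": 50},
    "Wizard of Oz": {"causal": 50, "coref": 50, "temporal": 50},
}


def layer_totals(breakdown):
    """Sum the counts per annotation layer, treating missing layers as 0."""
    totals = {}
    for counts in breakdown.values():
        for layer, n in counts.items():
            totals[layer] = totals.get(layer, 0) + (n or 0)
    return totals


print(layer_totals(BREAKDOWN))
```

The result should match the Total row: 300 causal, 300 coref, and 400 temporal.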

All data is tokenized using the ELIT Tokenizer⁵ and filtered to a length of 100-200 tokens. Reddit posts are additionally filtered using the Profanity-Check Python module⁶.
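
The filtering step described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the pipeline actually used to build the corpus: `simple_tokenize` is a whitespace stand-in for the ELIT Tokenizer, `toy_is_profane` is a hypothetical placeholder for the Profanity-Check classifier, and treating the 100-200 range as inclusive on both ends is also an assumption.

```python
from typing import Callable, Iterable, List, Optional


def simple_tokenize(text: str) -> List[str]:
    # Assumption: whitespace split as a stand-in for the ELIT Tokenizer.
    return text.split()


def toy_is_profane(text: str) -> bool:
    # Hypothetical placeholder for a real profanity classifier.
    return "darn" in text.lower()


def filter_documents(
    docs: Iterable[str],
    tokenize: Callable[[str], List[str]] = simple_tokenize,
    min_tokens: int = 100,
    max_tokens: int = 200,
    is_profane: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Keep documents within the token range and, optionally, not profane."""
    kept = []
    for doc in docs:
        n_tokens = len(tokenize(doc))
        # Assumption: both bounds are inclusive.
        if not (min_tokens <= n_tokens <= max_tokens):
            continue
        # The profanity filter is only applied when a predicate is supplied,
        # mirroring the fact that only Reddit posts receive this extra step.
        if is_profane is not None and is_profane(doc):
            continue
        kept.append(doc)
    return kept
```

News and story excerpts would be passed through `filter_documents(docs)` alone, while Reddit posts would additionally receive an `is_profane` predicate.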

Contact

Jinho D. Choi

emorynlp/MLCG
