Extraction and pre-processing of summarization datasets from the New York Times Annotated Corpus (LDC2008T19).
This library was developed and tested under Python 3.4. Feel free to report errors or send pull requests to extend compatibility to earlier versions of Python.
We depend on NLTK for first-pass sentence splitting and spaCy for verb detection via part-of-speech tagging.
$ pip install nltk
$ pip install spacy
$ python -m spacy download en_core_web_sm
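The snippet below is a minimal, illustrative sketch (not part of this repository) of how these two dependencies are typically used: NLTK for a first pass at sentence splitting and spaCy for flagging verbs via its part-of-speech tagger.

```python
# Illustrative only: NLTK for sentence splitting, spaCy for verb detection via POS tags.
import nltk
import spacy

nltk.download("punkt", quiet=True)       # sentence tokenizer models
nlp = spacy.load("en_core_web_sm")       # small English pipeline with a POS tagger

text = "The council approved the budget. A final vote is expected next week."

for sent in nltk.sent_tokenize(text):    # first-pass sentence splitting
    verbs = [tok.text for tok in nlp(sent) if tok.pos_ == "VERB"]
    print(sent, "->", verbs)
```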
The typical flow for constructing a summarization dataset consists of:
- Reading the compressed NYT corpus on disk and caching documents with the required topics and summaries in a shelf. This is skipped if the shelf already exists.
- Filtering these documents by summary properties such as length and degree of extractiveness, and pre-processing them to resolve errors and artifacts.
- Splitting the filtered dataset into a train/dev/test partition and caching it for further experimentation (a rough sketch of these steps appears below).
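As a rough sketch of the above, with hypothetical helper names, thresholds, and file layout (the actual logic and parameters live in main.py and its supporting modules), the pipeline can be pictured as:

```python
# Illustrative sketch of the pipeline; names, thresholds and file layout are
# hypothetical, not the actual API of this repository.
import glob
import random
import shelve

SHELF_PATH = "nyt_cache"                          # hypothetical shelf location

def cache_corpus(read_nyt_documents):
    """Read the compressed corpus once and cache qualifying documents in a shelf.
    Skipped if the shelf already exists."""
    if glob.glob(SHELF_PATH + "*"):               # shelve may add a platform-specific suffix
        return
    with shelve.open(SHELF_PATH) as cache:
        for doc in read_nyt_documents():          # yields dicts with 'id', 'text', 'summary'
            cache[doc["id"]] = doc

def extractive_fraction(summary, doc_text):
    """Toy proxy for degree of extractiveness: the fraction of summary
    sentences that reappear verbatim in the document body."""
    sents = [s.strip() for s in summary.split(".") if s.strip()]
    return sum(1 for s in sents if s in doc_text) / len(sents) if sents else 0.0

def filter_and_split(min_summary_words=50, min_extractive=0.5, seed=0):
    """Filter cached documents by summary properties, then split train/dev/test."""
    with shelve.open(SHELF_PATH) as cache:
        kept = [doc for doc in cache.values()
                if len(doc["summary"].split()) >= min_summary_words
                and extractive_fraction(doc["summary"], doc["text"]) >= min_extractive]
    random.Random(seed).shuffle(kept)
    n = len(kept)
    return kept[:int(0.8 * n)], kept[int(0.8 * n):int(0.9 * n)], kept[int(0.9 * n):]
```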
This flow is illustrated in main.py, with all relevant parameters exposed as command-line arguments. To get started, run:
$ python main.py --help
If you use this code in a research project, please cite:
Junyi Jessy Li, Kapil Thadani and Amanda Stent. The Role of Discourse Units in Near-Extractive Summarization. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). 2016.
@InProceedings{li-thadani-stent-edusumm16,
  author    = {Li, Junyi Jessy and Thadani, Kapil and Stent, Amanda},
  title     = {The Role of Discourse Units in Near-Extractive Summarization},
  booktitle = {Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)},
  year      = {2016},
}
Document IDs for the datasets used in this paper are available here.