Extraction and pre-processing of summarization datasets from the New York Times Annotated Corpus (LDC2008T19).
This library was developed and tested under Python 3.4. Feel free to report errors or send pull requests to extend compatibility to earlier versions of Python.
We depend on NLTK for first-pass sentence splitting and spaCy for verb detection via part-of-speech tagging.
$ pip install nltk
$ pip install spacy
$ python -m spacy download en_core_web_sm
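The snippet below is a minimal, illustrative sketch (not part of this repository) of how these two dependencies are typically used: NLTK for a first pass at sentence splitting and spaCy for flagging verbs via its part-of-speech tagger.

```python
# Illustrative only: NLTK for sentence splitting, spaCy for verb detection via POS tags.
import nltk
import spacy

nltk.download("punkt", quiet=True)       # sentence tokenizer models
nlp = spacy.load("en_core_web_sm")       # small English pipeline with a POS tagger

text = "The council approved the budget. A final vote is expected next week."

for sent in nltk.sent_tokenize(text):    # first-pass sentence splitting
    verbs = [tok.text for tok in nlp(sent) if tok.pos_ == "VERB"]
    print(sent, "->", verbs)
```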
The typical flow for constructing a summarization dataset consists of:
- Reading the compressed NYT corpus on disk and caching documents with the required topics and summaries in a shelf. This is skipped if the shelf already exists.
- Filtering these documents by summary properties such as length and degree of extractiveness, and pre-processing them to resolve errors and artifacts.
- Splitting the filtered dataset into a train/dev/test partition and caching it for further experimentation (a rough sketch of these steps appears below).
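As a rough sketch of the above, with hypothetical helper names, thresholds, and file layout (the actual logic and parameters live in main.py and its supporting modules), the pipeline can be pictured as:

```python
# Illustrative sketch of the pipeline; names, thresholds and file layout are
# hypothetical, not the actual API of this repository.
import glob
import random
import shelve

SHELF_PATH = "nyt_cache"                          # hypothetical shelf location

def cache_corpus(read_nyt_documents):
    """Read the compressed corpus once and cache qualifying documents in a shelf.
    Skipped if the shelf already exists."""
    if glob.glob(SHELF_PATH + "*"):               # shelve may add a platform-specific suffix
        return
    with shelve.open(SHELF_PATH) as cache:
        for doc in read_nyt_documents():          # yields dicts with 'id', 'text', 'summary'
            cache[doc["id"]] = doc

def extractive_fraction(summary, doc_text):
    """Toy proxy for degree of extractiveness: the fraction of summary
    sentences that reappear verbatim in the document body."""
    sents = [s.strip() for s in summary.split(".") if s.strip()]
    return sum(1 for s in sents if s in doc_text) / len(sents) if sents else 0.0

def filter_and_split(min_summary_words=50, min_extractive=0.5, seed=0):
    """Filter cached documents by summary properties, then split train/dev/test."""
    with shelve.open(SHELF_PATH) as cache:
        kept = [doc for doc in cache.values()
                if len(doc["summary"].split()) >= min_summary_words
                and extractive_fraction(doc["summary"], doc["text"]) >= min_extractive]
    random.Random(seed).shuffle(kept)
    n = len(kept)
    return kept[:int(0.8 * n)], kept[int(0.8 * n):int(0.9 * n)], kept[int(0.9 * n):]
```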
This flow is illustrated in main.py, with all relevant parameters exposed as command-line arguments. To get started, run:
$ python main.py --help
If you use this code in a research project, please cite:
Junyi Jessy Li, Kapil Thadani and Amanda Stent. The Role of Discourse Units in Near-Extractive Summarization. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). 2016.
@InProceedings{li-thadani-stent-edusumm16,
  author    = {Li, Junyi Jessy and Thadani, Kapil and Stent, Amanda},
  title     = {The Role of Discourse Units in Near-Extractive Summarization},
  booktitle = {Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)},
  year      = {2016},
}
Document IDs for the datasets used in this paper are available here.