Understanding Points of Correspondence between Sentences for Abstractive Summarization

This repository contains the dataset for our ACL SRW 2020 paper, "Understanding Points of Correspondence between Sentences for Abstractive Summarization."

Citation

@inproceedings{lebanoff-etal-2020-understanding,
    title = "Understanding Points of Correspondence between Sentences for Abstractive Summarization",
    author = "Lebanoff, Logan and Muchovej, John and Dernoncourt, Franck and Kim, Doo Soon and Wang, Lidan and Chang, Walter and Liu, Fei",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-srw.26",
    pages = "191--198",
}

Presentation Video

Watch our presentation given virtually at ACL:

Dataset

Fusing sentences containing disparate content is a remarkable human ability that helps create informative and succinct summaries. Such a simple task for humans has remained challenging for modern abstractive summarizers, substantially restricting their applicability in real-world scenarios.

We present a dataset that contains 1,599 sentence fusion examples (taken from 1,174 documents) with fine-grained Points of Correspondence annotations. Points of correspondence (PoC) are cohesive devices that tie two sentences together into a coherent text. The types of points of correspondence are delineated by text cohesion theory, covering pronominal and nominal referencing, repetition and beyond.

A point of correspondence is represented as a span of text from each sentence. Our dataset is in JSON format in the file PoC_dataset.json. Each example has the following attributes:

| Attribute | Content |
| --- | --- |
| Sentence_1 | Tokenized input sentence 1 |
| Sentence_2 | Tokenized input sentence 2 |
| Sentence_Fused | Fused sentence created by merging Sentence_1 and Sentence_2 |
| Sentence_1_Index | Position of Sentence_1 in Full_Article |
| Sentence_2_Index | Position of Sentence_2 in Full_Article |
| Sentence_Fused_Index | Position of the fused sentence in Full_Summary |
| Full_Article | Full CNN news article, with sentences separated by tabs |
| Full_Summary | Summary of the article, with sentences separated by tabs |
| PoCs | List of Points of Correspondence |
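
As a minimal sketch of reading these attributes in Python: the code below assumes that the top level of PoC_dataset.json is a JSON list of example objects, which is not stated above, so adjust it if the file is structured differently. The helper names `load_examples` and `split_sentences` and the sample record are hypothetical and only illustrate the tab-separated layout of Full_Article and Full_Summary.

```python
import json


def load_examples(path="PoC_dataset.json"):
    """Load the dataset. Assumption: the file holds a JSON list of example dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def split_sentences(tab_separated):
    """Split a tab-separated Full_Article or Full_Summary string into sentences."""
    return [s for s in tab_separated.split("\t") if s]


# Hypothetical record, made up for illustration (not taken from the dataset).
example = {
    "Full_Article": "First sentence .\tSecond sentence .\tThird sentence .",
    "Full_Summary": "A fused summary sentence .",
}

article_sentences = split_sentences(example["Full_Article"])
summary_sentences = split_sentences(example["Full_Summary"])
print(len(article_sentences), "article sentences")
print(len(summary_sentences), "summary sentences")
```

With the real file, `examples = load_examples()` would replace the hypothetical record. Whether the index attributes count from 0 or 1 is not specified here, so it is worth checking against a few examples before indexing into the split sentences.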

Each PoC has the following attributes:

| Attribute | Content |
| --- | --- |
| Sentence_1_Selection | Token indices for the beginning and end of the PoC in Sentence_1 |
| Sentence_2_Selection | Token indices for the beginning and end of the PoC in Sentence_2 |
| Sentence_Fused_Selection | Token indices for the beginning and end of the PoC in Sentence_Fused |
| PoC_Type | One of Nominal, Pronominal, Common-Noun, Repetition, or Event |
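
The sketch below shows one way the selection attributes could be turned back into text spans. It rests on several assumptions that the description above does not settle: that each selection is a `[start, end]` pair, that indices are 0-based, and that sentences are either whitespace-tokenized strings or token lists. Whether the end index is inclusive is also unspecified, so the hypothetical helper `span_tokens` exposes it as a parameter. The sample record is invented for illustration.

```python
def span_tokens(sentence, selection, end_inclusive=False):
    """Return the tokens covered by a [start, end] selection.

    Assumptions: 0-based indices; `sentence` is a whitespace-tokenized
    string or a list of tokens. Set end_inclusive=True if the end index
    turns out to point at the last token of the span.
    """
    tokens = sentence.split() if isinstance(sentence, str) else list(sentence)
    start, end = selection
    if end_inclusive:
        end += 1
    return tokens[start:end]


# Hypothetical record, made up for illustration (not taken from the dataset).
example = {
    "Sentence_1": "The mayor spoke to reporters on Monday .",
    "Sentence_2": "He promised new funding for schools .",
    "Sentence_Fused": "The mayor promised new funding for schools .",
    "PoCs": [
        {
            "Sentence_1_Selection": [0, 2],
            "Sentence_2_Selection": [0, 1],
            "Sentence_Fused_Selection": [0, 2],
            "PoC_Type": "Pronominal",
        }
    ],
}

for poc in example["PoCs"]:
    s1 = span_tokens(example["Sentence_1"], poc["Sentence_1_Selection"])
    s2 = span_tokens(example["Sentence_2"], poc["Sentence_2_Selection"])
    fused = span_tokens(example["Sentence_Fused"], poc["Sentence_Fused_Selection"])
    print(poc["PoC_Type"], s1, s2, fused)
```

Comparing a few extracted spans against the files in PoC_visualizations/ is a quick way to confirm which index convention the dataset actually uses.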

Example Visualizations

We provide visualizations of every dataset example in the directory PoC_visualizations/, which can be opened in any browser, along with the code used to create them in visualize_poc.py.

The process is easy and can be seen below:

[Image: Example visualization]

Model Outputs

The outputs of our models can be downloaded here: https://www.dropbox.com/sh/g34aj101oauwlx3/AABIdqbBXMAa8RFpb-I6Auh7a/Understanding%20Points%20of%20Correspondence%20between%20Sentences%20for%20Abstractive%20Summarization?dl=0

*Note: We tested only on examples that have at least one point of correspondence, so there are 1,494 outputs for each model.
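
To line the dataset up with these outputs, the same filter can be approximated as follows. This is a sketch that assumes each example is a dict whose PoCs attribute is a list, empty when no point of correspondence was annotated. The helper name `with_pocs` and the toy records are hypothetical.

```python
from collections import Counter


def with_pocs(examples):
    """Keep only examples that have at least one point of correspondence."""
    return [ex for ex in examples if ex.get("PoCs")]


# Toy records, made up for illustration (not taken from the dataset).
examples = [
    {"PoCs": [{"PoC_Type": "Nominal"}, {"PoC_Type": "Repetition"}]},
    {"PoCs": []},
    {"PoCs": [{"PoC_Type": "Nominal"}]},
]

kept = with_pocs(examples)
type_counts = Counter(poc["PoC_Type"] for ex in kept for poc in ex["PoCs"])
print(len(kept), "examples with at least one PoC")
print(dict(type_counts))
```

If the assumption holds, applying `with_pocs` to the full dataset should leave the same number of examples as there are outputs per model.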