REALSumm: Re-evaluating EvALuation in Summarization

Outline

Leaderboard

ExplainaBoard

Motivation

Evaluating summarization is hard. Most papers still use ROUGE, but a host of recent metrics (e.g. BERTScore, MoverScore) report better correlation with human evaluation. However, these metrics were tested on older systems (the classic TAC meta-evaluation datasets are now 6-12 years old). How do they fare on state-of-the-art models, and do the conclusions drawn there still hold for modern systems and summarization tasks?

Released Data

Including all the system variants, there are 25 system outputs in total: 11 extractive and 14 abstractive.

Please read our reproducibility instructions in addition to our paper in order to reproduce this work for another dataset.

| Type | Sys ID | System Output | Human Judgement | Paper | Variants | Bib |
|------|--------|---------------|-----------------|-------|----------|-----|
| Extractive | 1 | Download | Download | Heterogeneous Graph Neural Networks for Extractive Document Summarization | | Bib |
| | 2 | Download | Download | Extractive Summarization as Text Matching | | Bib |
| | 3 | Download | Download | Searching for Effective Neural Extractive Summarization: What Works and What’s Next | LSTM+PN+RL | Bib |
| | 4 | Download | Download | | BERT+TF+SL | |
| | 5 | Download | Download | | BERT+TF+PN | |
| | 6 | Download | Download | | BERT+LSTM+PN | |
| | 7 | Download | Download | | BERT+LSTM+PN+RL | |
| | 8 | Download | Download | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | | Bib |
| | 9 | Download | Download | Ranking Sentences for Extractive Summarization with Reinforcement Learning | | Bib |
| | 10 | Download | Download | Neural Document Summarization by Jointly Learning to Score and Select Sentences | | Bib |
| | 11 | Download | Download | BanditSum: Extractive Summarization as a Contextual Bandit | | Bib |
| Abstractive | 12 | Download | Download | Learning by Semantic Similarity Makes Abstractive Summarization Better | | Bib |
| | 13 | Download | Download | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | | Bib |
| | 14 | Download | Download | Text Summarization with Pretrained Encoders | TransAbs | Bib |
| | 15 | Download | Download | | Abs | |
| | 16 | Download | Download | | ExtAbs | |
| | 17 | Download | Download | Pretraining-Based Natural Language Generation for Text Summarization | | Bib |
| | 18 | Download | Download | Unified Language Model Pre-training for Natural Language Understanding and Generation | v1 | Bib |
| | 19 | Download | Download | | v2 | |
| | 20 | Download | Download | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Base | Bib |
| | 21 | Download | Download | | Large | |
| | 22 | Download | Download | | 11B | |
| | 23 | Download | Download | Bottom-Up Abstractive Summarization | | Bib |
| | 24 | Download | Download | Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting | | Bib |
| | 25 | Download | Download | Get To The Point: Summarization with Pointer-Generator Networks | | Bib |

Meta-evaluation Tool

  1. Calculate the metric scores for each summary and create a scores dict in the format below. See the section below on calculating scores with a new metric. Make sure to include litepyramid_recall in the scores dict, as it is the metric used by the human evaluators.
  2. Run the analysis notebook on the scores dict to get all the graphs and tables used in the paper.
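To make step 2 concrete, the core quantity the analysis computes is the correlation between a metric's scores and the human litepyramid_recall scores. The snippet below is a minimal sketch on made-up numbers; the full analysis, including system-level aggregation and the plots from the paper, lives in the analysis notebook.

# Minimal sketch: correlate a metric's per-summary scores with human scores.
# The score values below are made up purely for illustration.
from scipy.stats import kendalltau, pearsonr, spearmanr

human_scores = [0.40, 0.55, 0.35, 0.70, 0.60]   # litepyramid_recall per summary
metric_scores = [0.38, 0.50, 0.42, 0.65, 0.58]  # e.g. rouge_2_f_score per summary

tau, _ = kendalltau(metric_scores, human_scores)
rho, _ = spearmanr(metric_scores, human_scores)
r, _ = pearsonr(metric_scores, human_scores)
print(f"Kendall tau={tau:.3f}, Spearman rho={rho:.3f}, Pearson r={r:.3f}")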

Calculating scores with a new metric

  1. Update scorer.py so that (1) any setup required by your metric is done in the __init__ function of the scorer, since the same scorer is used to score all systems, and (2) your metric is added to the score function as

elif self.metric == "name_of_my_new_metric":
    scores = call_to_my_function_which_gives_scores(passing_appropriate_arguments)

where scores is a list with one entry per summary in a file. It should be a list of dictionaries, e.g. [{'precision': 0.0, 'recall': 1.0}, ...]. A hypothetical sketch of such a metric function is shown below.
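For illustration only, here is a sketch of what such a scoring function could look like for a toy token-overlap metric. The function name, argument names, and the metric itself are hypothetical placeholders, not part of this repository; the call to your actual metric goes in their place inside the elif branch above.

# Hypothetical metric function: simple token-overlap precision/recall.
# Placeholder only -- replace with a call to your actual metric.
def my_token_overlap_scores(ref_summs, system_summs):
    """Return one score dict per summary, in the format score() expects."""
    scores = []
    for ref, hyp in zip(ref_summs, system_summs):
        ref_tokens, hyp_tokens = set(ref.split()), set(hyp.split())
        overlap = len(ref_tokens & hyp_tokens)
        scores.append({
            'precision': overlap / max(len(hyp_tokens), 1),
            'recall': overlap / max(len(ref_tokens), 1),
        })
    return scores

if __name__ == "__main__":
    refs = ["the cat sat on the mat", "a quick brown fox"]
    hyps = ["the cat is on a mat", "the quick fox jumps"]
    # Prints a list of per-summary score dicts, one per input pair.
    print(my_token_overlap_scores(refs, hyps))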

  2. Calculate the scores and generate the scores dict using python get_scores.py --data_path ../selected_docs_for_human_eval/<abs or ext> --output_path ../score_dicts/abs_new_metric.pkl --log_path ../logs/scores.log -n_jobs 1 --metric <name of metric>
  3. Your scores dict is generated at the output path.
  4. Merge it with the scores dict containing the human scores, provided in scores_dicts/, using python score_dict_update.py --in_path <score dicts folder with the dicts to merge> --out_path <output path to place the merged dict pickle> -action merge
  5. Your dict will be merged with the one containing the human scores, and the merged dict will be placed at out_path. You can now run the analysis notebook on it to get all the graphs and tables used in the paper. A quick sanity check of the merged dict is sketched below.
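As an optional sanity check before running the notebook, you can load the merged pickle and confirm that every system summary carries both your new metric and litepyramid_recall. This is a rough sketch: the pickle path and metric name are placeholders for your own, and the dict layout is the one documented in the next section.

# Sanity-check a merged scores dict before running the analysis notebook.
# "score_dicts/merged_scores.pkl" and "name_of_my_new_metric" are placeholders.
import pickle

with open("score_dicts/merged_scores.pkl", "rb") as f:
    scores_dict = pickle.load(f)

missing = []
for doc_id, doc in scores_dict.items():
    for system_name, entry in doc['system_summaries'].items():
        keys = entry['scores'].keys()
        if 'litepyramid_recall' not in keys or 'name_of_my_new_metric' not in keys:
            missing.append((doc_id, system_name))

print(f"{len(scores_dict)} docs checked; {len(missing)} system summaries missing scores")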

Scores dict format used

{
    doc_id: {
            'doc_id': value of doc id,
            'ref_summ': reference summary of this doc,
            'system_summaries': {
                system_name: {
                        'system_summary': the generated summary,
                        'scores': {
                            'js-2': the actual score,
                            'rouge_l_f_score': the actual score,
                            'rouge_1_f_score': the actual score,
                            'rouge_2_f_score': the actual score,
                            'bert_f_score': the actual score
                        }
                }
            }
        }
}
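For concreteness, a minimal instance of this format, with made-up values, could be built and pickled as below. The system name and score values are purely illustrative; the real dicts produced by get_scores.py contain whichever metrics you computed.

# A tiny, made-up scores dict in the format above, pickled for later analysis.
import pickle

scores_dict = {
    1: {
        'doc_id': 1,
        'ref_summ': "the reference summary of document 1 .",
        'system_summaries': {
            'example_system': {                    # illustrative system name
                'system_summary': "a generated summary of document 1 .",
                'scores': {
                    'litepyramid_recall': 0.50,    # human judgement (required)
                    'rouge_2_f_score': 0.21,       # made-up metric values
                    'bert_f_score': 0.88,
                },
            },
        },
    },
}

with open("toy_scores_dict.pkl", "wb") as f:
    pickle.dump(scores_dict, f)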

Bib

@inproceedings{Bhandari-2020-reevaluating,
title = "Re-evaluating Evaluation in Text Summarization",
author = "Bhandari, Manik  and Narayan Gour, Pranav  and Ashfaq, Atabak  and  Liu, Pengfei and Neubig, Graham ",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2020"
}
