
Evaluating Factual Consistency of Texts with Semantic Role Labeling

Jing Fan*, Dennis Aumiller*, and Michael Gertz
Institute of Computer Science, Heidelberg University
* These authors contributed equally to this work.

You can reach us via GitHub issues, or write us an email at aumiller@informatik.uni-heidelberg.de!

2023-05-23: A pre-print of our work is now available on arXiv.
2023-05-15: Our work has been accepted at *SEM 2023! We will update the citation once the proceedings become available.

Installation

We provide an exhaustive list of required packages through the requirements.txt file. However, given the finicky dependency issues surrounding the (nowadays deprecated) AllenNLP release, as well as the spaCy versions required, we strongly suggest creating a new environment in which to install this package.
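For example, a fresh environment can be created with Python's built-in venv module (the environment name is arbitrary; a conda environment works just as well):

python3 -m venv srlscore-env
source srlscore-env/bin/activate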

You can install the required core dependencies with

python3 -m pip install -r requirements.txt

This is guaranteed to work for Python versions 3.8 and 3.9; we do not guarantee full compatibility with 3.10. Furthermore, we encountered some (possibly temporary) issues regarding the dependency on typing-extensions==4.6.0 and, by extension, pydantic. More information can be found in this GitHub issue. Should you encounter a similar problem, consider manually downgrading your typing-extensions version to typing-extensions==4.5.0.
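The downgrade itself is a one-liner:

python3 -m pip install typing-extensions==4.5.0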

Usage

The general usage of our metric SRLScore is as follows:

from SRLScore import SRLScore

# Default values are reasonable for most cases
scorer = SRLScore()

scorer.score(input_text, summary_text)

You can also refer to the example_usage.py file. Note that SRLScore relies heavily on annotations generated by a (neural) SRL tagger; processing will therefore be significantly faster if a GPU is available.
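A minimal end-to-end sketch (the texts below are made-up illustrations, and the exact score value depends on your environment; higher scores indicate better factual consistency):

from SRLScore import SRLScore

scorer = SRLScore()  # default configuration

input_text = "Peter went to the market on Friday and bought two apples."
summary_text = "Peter bought apples at the market."

score = scorer.score(input_text, summary_text)
print(score)  # a single float; higher means the summary is more consistent with the input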

Experimental Results & Data from the Paper

To reproduce the experiments from the paper, run the eval.sh script in this folder. We further experimented with leave-one-argument-out variants of our weights, which is documented in eval_leave_out-exp.sh.
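Both scripts can be invoked directly, e.g.:

bash eval.sh
bash eval_leave_out-exp.sh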

Scripts to reproduce the baseline scores (particularly for BARTScore and CoCo, the two most competitive methods with publicly available implementations) can be found in baselines/. For CoCo, you additionally need to clone the respective paper's code repository, copy our coco_commands.sh script into their main folder, and run it from there.
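The CoCo steps look roughly as follows; the repository URL is a placeholder (substitute the actual CoCo code repository), and we assume coco_commands.sh lives in baselines/:

# Placeholder URL; replace with the CoCo paper's actual code repository
git clone https://github.com/<coco-authors>/<coco-repo>.git
cp baselines/coco_commands.sh <coco-repo>/
cd <coco-repo>
bash coco_commands.sh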

significance_testing.py re-computes the significance of differences between the various methods. Note that we apply Bonferroni correction, which makes the per-comparison significance threshold fairly small!
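For intuition, Bonferroni correction divides the global significance level by the number of comparisons; the numbers below are purely illustrative, not the ones used in the script:

alpha = 0.05        # global significance level (illustrative)
n_comparisons = 15  # hypothetical number of pairwise method comparisons
threshold = alpha / n_comparisons
print(threshold)    # ~0.0033; a p-value must fall below this to count as significant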

Citation

If you found this repository helpful, please consider citing our work:

@article{fan-etal-2023-evaluating,
  title={{Evaluating Factual Consistency of Texts with Semantic Role Labeling}}, 
  author={Jing Fan and Dennis Aumiller and Michael Gertz},
  journal={CoRR},
  volume={abs/2305.13309},
  year={2023},
  eprint={2305.13309},
  eprinttype={arXiv},
  primaryClass={cs.CL}
}
