This repository is intended as a central hub for interfaces that support evaluation of generative model outputs. For more motivation on why this is important, see our recent paper on how to simplify evaluation of generative models. The Genie leaderboard leverages interfaces in this repository.
The interfaces here use the jinja2 format accepted by the `amti` library.
This repository is a work in progress: more interfaces will be added and contributing guidelines defined. The following datasets are currently supported:

- `xsum` (summarization)
Clone this repository, install Python >= 3.6, and run

```bash
pip install -r requirements.txt
```
- Obtain a model predictions file for the target dataset `$DATASET`. To generate a sample `$DATASET-sample.json` predictions file, run `python src/make_sample_predictions.py --dataset $DATASET`. (A hand-written alternative is sketched after this list.)
- Run `python src/process.py $DATASET-sample.json --dataset $DATASET` to produce `$DATASET-processed.jsonl`, which will be used to instantiate the evaluation tasks. You may substitute `$DATASET-sample.json` with your own predictions file. (A quick way to inspect the output is sketched after this list.)
- Run `amti preview-batch templates/$DATASET/$TEMPLATE $DATASET-processed.jsonl` to start a local web server that previews the evaluation tasks with the specified template `$TEMPLATE`. (To view the first task, navigate to http://127.0.0.1:8000/hits/0/.)
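For illustration, here is a minimal sketch of writing a predictions file by hand. The structure shown (a list of `id`/`prediction` records) is an assumption for illustration only; the actual `xsum` predictions file format is described in the notes below, and `src/make_sample_predictions.py` remains the easiest way to get a valid sample.

```python
import json

# ASSUMED structure for illustration: a list of {id, prediction} records.
# Consult the documented xsum predictions file format (see the notes
# below) for the real schema before using a hand-written file.
predictions = [
    {"id": "12345678", "prediction": "A short model-generated summary."},
    {"id": "87654321", "prediction": "Another model-generated summary."},
]

with open("xsum-sample.json", "w") as f:
    json.dump(predictions, f, indent=2)
```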
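After running `src/process.py`, it can help to sanity-check the processed file. The sketch below assumes only that the output is standard JSONL (one JSON object per line, one evaluation task each); the field names inside each record depend on the processing script.

```python
import json

# JSONL stores one JSON object per line, so each line is one evaluation
# task. Print the keys of the first record to see what fields it carries.
with open("xsum-processed.jsonl") as f:
    first_task = json.loads(next(f))

print(sorted(first_task.keys()))
```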
Notes:
- Currently, the above instructions work only for `DATASET=xsum` and `TEMPLATE=mturk-specs-likert`. We encourage development of additional templates for the `xsum` dataset and will be expanding to additional datasets.
- The `xsum` model predictions file format is described here.
- New evaluation templates should use the jinja2 format, as in the `templates/xsum/mturk-specs-likert/question.xml.j2` example, and should be organized according to the `templates/$DATASET/$TEMPLATE` directory pattern. For more details about how evaluation templates can be used with the `amti` tool for managing HITs on Amazon Mechanical Turk, see those docs.
- The Genie leaderboard handles model submissions and HIT management for hosted tasks.
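To get a feel for how jinja2 templating works here, the sketch below renders a toy template string against one task record. This is an illustrative assumption, not the real template: the variable names (`source`, `prediction`) are hypothetical, and the actual markup lives in `templates/xsum/mturk-specs-likert/question.xml.j2`.

```python
from jinja2 import Template

# Toy template in the spirit of an amti question template. The variable
# names (source, prediction) are hypothetical placeholders, not the names
# used by the real xsum template.
toy = Template(
    "<p>Article: {{ source }}</p>\n"
    "<p>Summary to rate: {{ prediction }}</p>"
)

print(toy.render(source="Full article text...", prediction="Model summary..."))
```

Roughly speaking, `amti` performs a similar render for each record in the processed JSONL file, producing one evaluation task per record.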