This repository is intended as a central hub for interfaces that support evaluation of generative model outputs. For more motivation on why this is important, see our recent paper on how to simplify evaluation of generative models. The Genie leaderboard leverages interfaces in this repository.
The interfaces here use the jinja2 format accepted by the `amti` library.
This repository is a work in progress: more interfaces will be added and contributing guidelines defined. The following datasets are currently supported:

- `xsum` (summarization)
Clone this repository, install Python >= 3.6, and run

```bash
pip install -r requirements.txt
```
- Obtain a model predictions file for the target dataset `$DATASET`. To generate a sample `$DATASET-sample.json` predictions file, run `python src/make_sample_predictions.py --dataset $DATASET`. (A hand-written alternative is sketched after this list.)
- Run `python src/process.py $DATASET-sample.json --dataset $DATASET` to produce `$DATASET-processed.jsonl`, which will be used to instantiate the evaluation tasks. You may substitute `$DATASET-sample.json` with your own predictions file. (A quick way to inspect the output is sketched after this list.)
- Run `amti preview-batch templates/$DATASET/$TEMPLATE $DATASET-processed.jsonl` to start a local web server that previews the evaluation tasks with the specified template `$TEMPLATE`. (To view the first task, navigate to http://127.0.0.1:8000/hits/0/.)
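For illustration, here is a minimal sketch of writing a predictions file by hand. The structure shown (a list of `id`/`prediction` records) is an assumption for illustration only; the actual `xsum` predictions file format is described in the notes below, and `src/make_sample_predictions.py` remains the easiest way to get a valid sample.

```python
import json

# ASSUMED structure for illustration: a list of {id, prediction} records.
# Consult the documented xsum predictions file format (see the notes
# below) for the real schema before using a hand-written file.
predictions = [
    {"id": "12345678", "prediction": "A short model-generated summary."},
    {"id": "87654321", "prediction": "Another model-generated summary."},
]

with open("xsum-sample.json", "w") as f:
    json.dump(predictions, f, indent=2)
```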
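After running `src/process.py`, it can help to sanity-check the processed file. The sketch below assumes only that the output is standard JSONL (one JSON object per line, one evaluation task each); the field names inside each record depend on the processing script.

```python
import json

# JSONL stores one JSON object per line, so each line is one evaluation
# task. Print the keys of the first record to see what fields it carries.
with open("xsum-processed.jsonl") as f:
    first_task = json.loads(next(f))

print(sorted(first_task.keys()))
```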
Notes:
- Currently, the above instructions work only for `DATASET=xsum` and `TEMPLATE=mturk-specs-likert`. We encourage development of additional templates for the `xsum` dataset and will be expanding to additional datasets.
- The `xsum` model predictions file format is described here.
- New evaluation templates should use the jinja2 format, as in the `templates/xsum/mturk-specs-likert/question.xml.j2` example, and should be organized according to the `templates/$DATASET/$TEMPLATE` directory pattern. For more details about how evaluation templates can be used with the `amti` tool for managing HITs on Amazon Mechanical Turk, see those docs.
- The Genie leaderboard handles model submissions and HIT management for hosted tasks.
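To get a feel for how jinja2 templating works here, the sketch below renders a toy template string against one task record. This is an illustrative assumption, not the real template: the variable names (`source`, `prediction`) are hypothetical, and the actual markup lives in `templates/xsum/mturk-specs-likert/question.xml.j2`.

```python
from jinja2 import Template

# Toy template in the spirit of an amti question template. The variable
# names (source, prediction) are hypothetical placeholders, not the names
# used by the real xsum template.
toy = Template(
    "<p>Article: {{ source }}</p>\n"
    "<p>Summary to rate: {{ prediction }}</p>"
)

print(toy.render(source="Full article text...", prediction="Model summary..."))
```

Roughly speaking, `amti` performs a similar render for each record in the processed JSONL file, producing one evaluation task per record.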