# Benchmark results on conversational response selection

This page collects benchmark results for the baseline systems and other methods on the datasets in this repository.

Please feel free to submit the results of your model as a pull request.

All of the results are for models that use only the context feature to select the correct response. Models that also use the extra contexts feature are not reported here (yet).

For a description of the baseline systems, see baselines/README.md.
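The 1-of-100 accuracy reported in the tables below measures how often a model ranks the true response highest among 100 candidates (the true response plus 99 distractors). The following is a minimal sketch of that computation, assuming the model has already scored every (context, candidate) pair; the function and array layout are illustrative and are not the repository's evaluation code.

```python
import numpy as np

def one_of_100_accuracy(scores):
    """1-of-100 accuracy from a [num_examples, 100] matrix of model scores.

    Each row holds one context's scores against 100 candidate responses;
    by convention here, column 0 is the true response and the remaining 99
    columns are distractors. An example counts as correct when the true
    response receives the highest score.
    """
    scores = np.asarray(scores)
    return float((scores.argmax(axis=1) == 0).mean())

# Toy usage: two examples, the true response wins in only one of them.
toy = np.zeros((2, 100))
toy[0, 0] = 5.0    # true response scored highest -> correct
toy[1, 42] = 3.0   # a distractor scored highest -> incorrect
print(one_of_100_accuracy(toy))  # 0.5
```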

## Reddit

These are results on the Reddit data from 2015 to 2018 inclusive (`TABLE_REGEX="^201[5678]_[01][0-9]$"`).
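The regex selects the monthly Reddit tables from January 2015 through December 2018. Here is a small sketch of what the pattern keeps, assuming (as the pattern itself suggests) that the tables are named `YYYY_MM`; the example table names are hypothetical.

```python
import re

# Restrict evaluation to monthly tables from the years 2015-2018.
table_regex = re.compile(r"^201[5678]_[01][0-9]$")

# Hypothetical table names, just to show what the pattern keeps and drops.
tables = ["2014_12", "2015_01", "2016_07", "2018_12", "2019_01"]
print([t for t in tables if table_regex.match(t)])
# ['2015_01', '2016_07', '2018_12']
```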

| Model | 1-of-100 accuracy |
| --- | --- |
| **Baselines** | |
| TF_IDF | 26.4% |
| BM25 | 27.5% |
| USE_SIM | 36.6% |
| USE_MAP | 40.8% |
| USE_LARGE_SIM | 41.4% |
| USE_LARGE_MAP | 47.7% |
| ELMO_SIM | 12.5% |
| ELMO_MAP | 20.6% |
| BERT_SMALL_SIM | 17.1% |
| BERT_SMALL_MAP | 24.5% |
| BERT_LARGE_SIM | 14.8% |
| BERT_LARGE_MAP | 24.0% |
| USE_QA_SIM | 46.3% |
| USE_QA_MAP | 46.6% |
| **Other models** | |
| N-gram dual-encoder [1] | 61.3% |
| ConveRT [2] | 68.3% |

## OpenSubtitles

| Model | 1-of-100 accuracy |
| --- | --- |
| **Baselines** | |
| TF_IDF | 10.9% |
| BM25 | 10.9% |
| USE_SIM | 13.6% |
| USE_MAP | 15.8% |
| USE_LARGE_SIM | 14.9% |
| USE_LARGE_MAP | 18.0% |
| ELMO_SIM | 9.5% |
| ELMO_MAP | 13.3% |
| BERT_SMALL_SIM | 13.8% |
| BERT_SMALL_MAP | 17.5% |
| BERT_LARGE_SIM | 12.2% |
| BERT_LARGE_MAP | 16.8% |
| USE_QA_SIM | 16.8% |
| USE_QA_MAP | 17.1% |
| **Other models** | |
| Fine-tuned N-gram dual-encoder [1] | 30.6% |
| ConveRT (not fine-tuned) [2] | 21.5% |
| ConveRT (not fine-tuned) MAP [2] | 23.1% |

## AmazonQA

| Model | 1-of-100 accuracy |
| --- | --- |
| **Baselines** | |
| TF_IDF | 51.8% |
| BM25 | 52.3% |
| USE_SIM | 47.6% |
| USE_MAP | 54.4% |
| USE_LARGE_SIM | 51.3% |
| USE_LARGE_MAP | 61.9% |
| ELMO_SIM | 16.0% |
| ELMO_MAP | 35.5% |
| BERT_SMALL_SIM | 27.8% |
| BERT_SMALL_MAP | 45.8% |
| BERT_LARGE_SIM | 25.9% |
| BERT_LARGE_MAP | 44.1% |
| USE_QA_SIM | 67.0% |
| USE_QA_MAP | 70.7% |
| **Other models** | |
| N-gram dual-encoder [1] | 71.3% |
| ConveRT (not fine-tuned) [2] | 67.0% |
| ConveRT (not fine-tuned) MAP [2] | 71.6% |
| ConveRT (fine-tuned) | 84.3% |

Note that the results for [1] reported here differ from those in the original paper, as we found a bug in the evaluation. Updated versions of the papers are in progress.

## References

[1] A Repository of Conversational Datasets. Henderson et al. Proceedings of the Workshop on NLP for Conversational AI, 2019.

[2] ConveRT: Efficient and Accurate Conversational Representations from Transformers. Henderson et al. arXiv pre-print, 2019. These results can be reproduced with `baselines/run_baseline.py --method CONVERT_[SIM|MAP]`.