sunlab-osu/covid-faq

COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval

Introduction

This repository contains the dataset for the paper "COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval".

In this work, we present COUGH, a large and challenging dataset for COVID-19 FAQ retrieval. Like a standard FAQ dataset, COUGH consists of three parts: the FAQ Bank, the User Query Bank, and the Annotated Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO). For evaluation, we introduce the User Query Bank and the Annotated Relevance Set: the former contains 1201 human-paraphrased queries, and the latter contains ~32 human-annotated FAQ items for each query. The list of websites from which we collected FAQ items can be found in List_of_websites.txt (please consult the appendix of our paper for detailed statistics).

Statistics of COUGH compared with representative counterparts:

| | FAQIR | StackFAQ | LocalGov | Sun and Sedoc | Poliak et al. | COUGH (ours) |
|---|---|---|---|---|---|---|
| Domain | Yahoo! | StackExchange | Government | COVID-19 | COVID-19 | COVID-19 |
| # of FAQs | 4313 | 719 | 1786 | 690 | 2115 | 15919 |
| # of Queries (Q) | 1233 | 1249 | 784 | 6495* | 24240* | 1201 |
| # of annotations per Q | 8.22 | Not Applicable | <10 | 5 | 5 | 32.17 |
| Query Length | 7.30 | 13.84 | ** | ** | ** | 12.97 |
| FAQ-query Length | 12.30 | 10.39 | ** | ** | ** | 13.00 |
| FAQ-answer Length | 33.00 | 76.54 | ** | ** | ** | 113.58 |
| Language | English | English | Japanese | English | Multi-lingual | Multi-lingual |
| # of sources | 1 | 1 | 1 | 12 | 34 | 55 |

*: Extracted from existing resources (e.g., COVID-19 Twitter dataset).
**: Not Applicable, as either not in English or not publicly available.

Examples from the COUGH dataset are shown in the figure below:

[Figure: COUGH Examples]

License and Terms of Use

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This dataset can be used for research and educational purposes only. It is shared under the CC BY-NC-SA 4.0 license with attribution to the source websites listed in List_of_websites.txt. If you want to use the dataset for other purposes, please check the terms of use of each individual source.

Dataset

COUGH can be freely accessed and downloaded from the data directory of this repo. The CSV files use a comma (,) as the delimiter.

Note: to replicate the results reported in the paper, please use data/FAQ_Bank_eval.csv.
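As a minimal sketch of reading one of the comma-delimited files, the snippet below uses only the Python standard library. The helper name `load_csv_rows` and the UTF-8 encoding are assumptions made for illustration; the column names are taken from the header row of the file rather than assumed here.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_csv_rows(path: str) -> List[Dict[str, str]]:
    """Read a comma-delimited CSV file into a list of dicts keyed by the header row.

    Hypothetical helper for illustration; UTF-8 encoding is an assumption.
    """
    # newline="" lets the csv module handle quoted fields that contain
    # line breaks, which long FAQ answers may well include.
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=","))
```

Calling `load_csv_rows("data/FAQ_Bank_eval.csv")` from the repository root and inspecting `rows[0].keys()` would then show which columns the file actually provides.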

Useful Code

You can refer to this package for an easy-to-use BM25 search engine.
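For readers who want to see what BM25 ranking does before reaching for a package, here is a small self-contained sketch of Okapi BM25 in plain Python. It is not the package referenced above and not necessarily the configuration used in the paper: the class name `BM25`, the `tokenize` helper, the regex tokenizer, the smoothed idf variant, and the defaults `k1=1.5`, `b=0.75` are all illustrative assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


class BM25:
    """Minimal Okapi BM25 scorer over pre-tokenized documents (illustrative only)."""

    def __init__(self, corpus: Sequence[Sequence[str]], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.term_freqs = [Counter(doc) for doc in corpus]
        self.doc_lens = [len(doc) for doc in corpus]
        total_len = sum(self.doc_lens)
        # Fall back to 1.0 so length normalization never divides by zero.
        self.avg_len = total_len / len(corpus) if total_len > 0 else 1.0
        n_docs = len(corpus)
        # Document frequency: number of documents containing each term.
        doc_freq = Counter(term for freqs in self.term_freqs for term in freqs)
        # Smoothed idf; the "1 +" inside the log keeps every weight non-negative.
        self.idf: Dict[str, float] = {
            term: math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            for term, n in doc_freq.items()
        }

    def score(self, query: Sequence[str], index: int) -> float:
        """BM25 score of the document at `index` for a tokenized query."""
        freqs = self.term_freqs[index]
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[index] / self.avg_len)
        total = 0.0
        for term in query:
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def top_k(self, query: Sequence[str], k: int = 5) -> List[int]:
        """Indices of the k highest-scoring documents, ties broken by index."""
        scores = [self.score(query, i) for i in range(len(self.term_freqs))]
        return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]
```

In an FAQ retrieval setting, the corpus would be the tokenized FAQ items and `top_k(tokenize(user_query))` would return the indices of the candidate FAQ items to evaluate.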

You can refer to this repository for handy SentenceBERT models.
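Retrieval with sentence embeddings typically ranks FAQ items by cosine similarity between the query vector and each FAQ vector. The sketch below shows only that ranking step in plain Python; the function names `cosine` and `rank_by_embedding` are hypothetical, and the `encode` argument stands in for whatever embedding model is used (it is an assumption here, not the setup from the paper).

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_embedding(
    query: str,
    faqs: Sequence[str],
    encode: Callable[[Sequence[str]], List[Vector]],
    k: int = 5,
) -> List[int]:
    """Return indices of the k FAQ items most similar to the query.

    `encode` is a placeholder: any function mapping a list of strings
    to a list of vectors (one per string) will do.
    """
    vectors = encode([query, *faqs])
    query_vec, faq_vecs = vectors[0], vectors[1:]
    sims = [cosine(query_vec, vec) for vec in faq_vecs]
    # Highest similarity first; ties broken by original position.
    return sorted(range(len(faqs)), key=lambda i: (-sims[i], i))[:k]
```

With a SentenceBERT-style model, `encode` could be a thin wrapper around the model's own encoding call; which library and checkpoint to load is left open here.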

Please consult Section 5 (Experiment) of our paper for more information about how we set up the experiments and deployed the baseline models.

Citation

Please cite our paper if you use the COUGH dataset from this repo:

@inproceedings{zhang2021cough,
    title = "{COUGH}: A Challenge Dataset and Models for {COVID}-19 {FAQ} Retrieval",
    author = "Zhang, Xinliang Frederick  and
      Sun, Heming  and
      Yue, Xiang  and
      Lin, Simon  and
      Sun, Huan",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2021",
    year = "2021",
    pages = "3759--3769",
}
