COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval

Introduction

This repository contains the dataset for the paper "COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval".

In this work, we present COUGH, a large and challenging dataset for COVID-19 FAQ retrieval. Like a standard FAQ dataset, COUGH consists of three parts: the FAQ Bank, the User Query Bank, and the Annotated Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO). For evaluation, we introduce the User Query Bank and the Annotated Relevance Set: the former contains 1201 human-paraphrased queries, and the latter contains ~32 human-annotated FAQ items for each query. The websites from which we collected FAQ items are listed in List_of_websites.txt (please consult the appendix of our paper for detailed statistics).

Statistics of COUGH compared with representative counterparts:

	FAQIR	StackFAQ	LocalGov	Sun and Sedoc	Poliak et al.	COUGH (ours)
Domain	Yahoo!	StackExchange	Government	COVID-19	COVID-19	COVID-19
# of FAQs	4313	719	1786	690	2115	15919
# of Queries (Q)	1233	1249	784	6495*	24240*	1201
# of annotations per Q	8.22	Not Applicable	<10	5	5	32.17
Query Length	7.30	13.84	**	**	**	12.97
FAQ-query Length	12.30	10.39	**	**	**	13.00
FAQ-answer Length	33.00	76.54	**	**	**	113.58
Language	English	English	Japanese	English	Multi-lingual	Multi-lingual
# of sources	1	1	1	12	34	55

*: Extracted from existing resources (e.g., COVID-19 Twitter dataset).
**: Not Applicable, as either not in English or not publicly available.

Examples from the COUGH dataset are shown in COUGH_Examples.png.

License and Terms of Use

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This dataset can be used for research and educational purposes only. It is shared under the CC BY-NC-SA 4.0 license with attribution to the source websites listed in List_of_websites.txt. If you want to use the dataset for other purposes, please check the terms of use of each individual source.

Dataset

COUGH can be freely accessed and downloaded from the data directory of this repo. The following CSV files are comma-delimited:

data/FAQ_Bank.csv is the full FAQ Bank, containing a total of 15919 FAQ items.
data/FAQ_Bank_eval.csv is the FAQ Bank used specifically for evaluation, containing a total of 7117 English non-Forum FAQ items.
data/User_Query_Bank.csv is the User Query Bank, containing a total of 1201 user queries.
data/Annotated_Relevance_Set.csv is the Annotated Relevance Set, containing a total of 39760 annotated <User Query, FAQ item> tuples.

Note: to replicate the results reported in the paper, please use data/FAQ_Bank_eval.csv.
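As a starting point, the files above can be read with Python's standard csv module. This is a minimal sketch, not part of the official release: the helper names load_rows and summarize are hypothetical, and because the column names are not documented in this README, the code only reports whatever header row each file contains. It assumes the files are UTF-8 encoded.

```python
import csv
from pathlib import Path

# File names as listed in the Dataset section of this README.
COUGH_FILES = [
    "FAQ_Bank.csv",
    "FAQ_Bank_eval.csv",
    "User_Query_Bank.csv",
    "Annotated_Relevance_Set.csv",
]


def load_rows(path):
    """Read a comma-delimited CSV file into a list of dicts keyed by its header row."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=","))


def summarize(data_dir="data"):
    """Return {file name: (row count, column names)} for each COUGH file found in data_dir."""
    summary = {}
    for name in COUGH_FILES:
        path = Path(data_dir) / name
        if not path.exists():  # skip files that are not present locally
            continue
        rows = load_rows(path)
        columns = list(rows[0].keys()) if rows else []
        summary[name] = (len(rows), columns)
    return summary
```

Calling summarize("data") from the repository root should give a quick check that the row counts match the numbers listed above before running any experiments.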

Useful Code

You can refer to this package for an easy-to-use BM25 search engine.
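For readers who want to see what a BM25 ranker does before picking a package, here is a minimal pure-Python sketch of Okapi BM25 scoring over FAQ questions. It is an illustration only and is not the API of the package referenced above; the class name BM25, the whitespace tokenizer, and the parameter defaults k1=1.5 and b=0.75 are assumptions, and the FAQ strings in the usage note are toy examples rather than items from COUGH.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 index over a list of document strings."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(d) for d in documents]
        self.doc_len = [len(d) for d in self.docs]
        # Average document length; max(..., 1) avoids division by zero on an empty index.
        self.avgdl = sum(self.doc_len) / max(len(self.docs), 1)
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each term.
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, index):
        """BM25 score of the document at position `index` for the given query string."""
        total = 0.0
        for term in tokenize(query):
            freq = self.tf[index].get(term, 0)
            if freq == 0:
                continue
            length_norm = 1 - self.b + self.b * self.doc_len[index] / self.avgdl
            total += self.idf[term] * freq * (self.k1 + 1) / (freq + self.k1 * length_norm)
        return total

    def top_k(self, query, k=5):
        """Return up to k (document index, score) pairs, highest score first."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(i, s) for s, i in scored[:k]]
```

For example, BM25(["What are the symptoms of COVID-19?", "How does the virus spread?", "Should I wear a mask in public?"]).top_k("do I need to wear a mask", k=1) ranks the third question first, since it is the only one sharing terms with the query.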

You can refer to this repository for handy SentenceBERT models.
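Once sentence embeddings are available, dense retrieval usually reduces to ranking FAQ items by cosine similarity to the query embedding. The sketch below shows only that ranking step, under the assumption that embeddings have already been computed by some sentence encoder; it does not call any SentenceBERT API, and the function names cosine and rank_by_cosine are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, faq_vecs, k=5):
    """Return up to k (FAQ index, similarity) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(faq_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, s) for s, i in scored[:k]]
```

In practice, query_vec and faq_vecs would be the encoder outputs for a user query and for the FAQ items in data/FAQ_Bank_eval.csv; which text field of each FAQ item to encode is an experimental choice described in the paper, not something this sketch decides.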

Please consult Section 5 (Experiment) of our paper for more information about how we set up the experiments and deployed the baseline models.

Citation

Please cite our paper if you use the COUGH dataset from this repo:

@inproceedings{zhang2021cough,
    title = "{COUGH}: A Challenge Dataset and Models for {COVID}-19 {FAQ} Retrieval",
    author = "Zhang, Xinliang Frederick  and
      Sun, Heming  and
      Yue, Xiang  and
      Lin, Simon  and
      Sun, Huan",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2021",
    year = "2021",
    pages = "3759--3769",
}
