COVID-19 Disinformation Twitter Dataset (COVID-19 Disinfo dataset)

This repository contains a dataset and experimental scripts associated with the work "Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society". The COVID-19 Disinfo dataset consists of tweets annotated with fine-grained labels related to disinformation about COVID-19. The labels answer seven different questions that are of interest to journalists, fact-checkers, social media platforms, policymakers, and society as a whole. There are annotations for Arabic, Bulgarian, Dutch, and English.

Table of contents:

Contents of the Distribution
Questions with Labels
List of Versions
Download
Experiments
Publication
Credits
Licensing
Contact
Acknowledgment

Contents of the Distribution

===============================================

Directory Structure

=======================
The directory contains the following files and sub-directories:

The following directories contain the data splits (train/dev/test) for both the binary and the multiclass settings. Each file is tab-separated and consists of the tweet_id and the labels for Q1-Q7. Due to privacy concerns, we are not able to release the tweet text or the associated JSON objects.

data/arabic/: contains data splits for Arabic.
data/bulgarian/: contains data splits for Bulgarian.
data/dutch/: contains data splits for Dutch.
data/english/: contains data splits for English.
data/multilang/: contains multilingual data (tweets from all languages are combined in different splits for both binary and multiclass settings).

data/LICENSE_CC_BY_NC_SA_4.0.txt: license information
bin/: please see the readme for details
Readme.md: this file
tweet_ids_with_compliance_status.json: contains tweet IDs with their compliance status.
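
As a rough illustration of the file layout described above, the following sketch reads a tab-separated split with Python's standard csv module. The header names (`tweet_id`, `q1_label`, `q2_label`) and the sample rows are assumptions made for this example only; check the header line of the actual files before relying on them.

```python
import csv
import io


def read_split(handle):
    """Read a tab-separated split and return a list of row dicts keyed by header name."""
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


# Hypothetical two-row sample; the real header names and label strings may differ.
sample = io.StringIO(
    "tweet_id\tq1_label\tq2_label\n"
    "1234567890\tyes\tno_probably_contains_no_false_info\n"
    "1234567891\tno\t\n"
)

rows = read_split(sample)
# Tweet IDs are kept as strings so that long IDs are never rounded or reformatted.
tweet_ids = [row["tweet_id"] for row in rows]
print(tweet_ids)
```

To read a real split, pass an open file handle for one of the files under the language directories (opened with `newline=""` and `encoding="utf-8"`) instead of the in-memory sample.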

Examples

============

Please don't take hydroxychloroquine (Plaquenil) plus Azithromycin for #COVID19 UNLESS your doctor prescribes it. Both drugs affect the QT interval of your heart and can lead to arrhythmias and sudden death, especially if you are taking other meds or have a heart condition.
Labels:

Q1: Yes;
Q2: NO: probably contains no false info
Q3: YES: definitely of interest
Q4: NO: probably not harmful
Q5: YES:very-urgent
Q6: NO:not-harmful
Q7: YES:discusses_cure

BREAKING: @MBuhari’s Chief Of Staff, Abba Kyari, Reportedly Sick, Suspected Of Contracting #Coronavirus | Sahara Reporters A top government source told SR on Monday that Kyari has been seriously “down” since returning from a trip abroad. READ MORE: https://t.co/Acy5NcbMzQ https://t.co/kStp4cmFlr.
Labels:

Q1: Yes;
Q2: NO: probably contains no false info
Q3: YES: definitely of interest
Q4: NO: definitely not harmful
Q5: YES:not-urgent
Q6: YES:rumor
Q7: YES:classified_as_in_question_6

Statistics

============
Initial distribution of the annotated dataset

Arabic data: 4966 tweets
Bulgarian data: 3697 tweets
Dutch data: 2665 tweets
English data: 4542 tweets

More details are available in the paper [1].

Twitter Batch compliance

============
Tweets might have been deleted for many reasons (see https://developer.twitter.com/en/docs/twitter-api/compliance/batch-compliance/introduction). In such cases, it is necessary to stay compliant, which is why tweet_ids_with_compliance_status.json records the compliance status of each tweet ID.
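
One way to apply the compliance information is to drop labelled tweet IDs whose status marks them as unavailable. The sketch below assumes, purely for illustration, that the JSON file decodes to a mapping from tweet ID to a status string and that `"deleted"` is one of the status values; neither is confirmed by this readme, so inspect the file first. The helper name `keep_compliant` is hypothetical.

```python
import json


def keep_compliant(labelled_ids, status_by_id, blocked=frozenset({"deleted"})):
    """Return the IDs from labelled_ids whose status is not in `blocked`."""
    return [tid for tid in labelled_ids if status_by_id.get(tid) not in blocked]


# Hypothetical content; the real file layout and status values may differ.
status_by_id = json.loads('{"101": "available", "102": "deleted"}')

kept = keep_compliant(["101", "102", "103"], status_by_id)
print(kept)
```

In this sketch, IDs that are missing from the mapping are kept. Dropping them instead may be the safer choice if the file is expected to cover every tweet.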

Questions with Labels

Below is the list of questions and their possible labels (answers). See the paper cited below or the Micromappers links for detailed definitions in the annotation guidelines.

1. Does the tweet contain a verifiable factual claim?
Labels:

YES: if it contains a verifiable factual claim;
NO: if it does not contain a verifiable factual claim;
Don’t know or can’t judge: the content of the tweet does not have enough information to make a judgment. It is recommended to categorize the tweet using this label when the content of the tweet is not understandable at all. For example, it uses a language (i.e., non-English) or references that are difficult to understand;

2. To what extent does the tweet appear to contain false information?
Labels:

NO, definitely contains no false information
NO, probably contains no false information
Not sure
YES, probably contains false information
YES, definitely contains false information

3. Will the tweet’s claim have an effect on or be of interest to the general public?
Labels:

NO, definitely not of interest
NO, probably not of interest
Not sure
YES, probably of interest
YES, definitely of interest

4. To what extent does the tweet appear to be harmful to society, person(s), company(s) or product(s)?
Labels:

NO, definitely not harmful
NO, probably not harmful
Not sure
YES, probably harmful
YES, definitely harmful

5. Do you think that a professional fact-checker should verify the claim in the tweet?
Labels:

NO, no need to check
NO, too trivial to check
YES, not urgent
YES, very urgent
Not sure

6. Is the tweet harmful for society and why?
Labels:

NO, not harmful
NO, joke or sarcasm
Not sure
YES, panic
YES, xenophobic, racist, prejudices, or hate-speech
YES, bad cure
YES, rumor or conspiracy
YES, other

7. Do you think that this tweet should get the attention of a government entity?
Labels:

NO, not interesting
Not sure
YES, categorized as in question 6
YES, other
YES, blame authorities
YES, contains advice
YES, calls for action
YES, discusses action taken
YES, discusses cure
YES, asks question
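
Since the splits come in both binary and multiclass settings, and most labels above start with YES or NO, a fine-grained label can be collapsed to a coarse one by looking at its prefix. The following is a minimal sketch under that assumption. The function name `to_binary` is hypothetical, and the way the released binary files treat "Not sure" or "Don't know or can't judge" may differ from what is shown here.

```python
import re

# Matches a leading YES or NO as a whole word, so "Not sure" is not mistaken for NO.
_PREFIX = re.compile(r"\s*(yes|no)\b", re.IGNORECASE)


def to_binary(label):
    """Collapse a fine-grained label to 'yes' or 'no'; return None when neither applies."""
    match = _PREFIX.match(label)
    return match.group(1).lower() if match else None


examples = ["YES, probably harmful", "NO:not-harmful", "Not sure"]
coarse = [to_binary(label) for label in examples]
print(coarse)
```

Returning `None` for undecided labels leaves the decision of whether to drop or remap them to the caller.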

List of Versions

v1.0 [2021/11/05]: initial distribution of the annotated dataset

Arabic data: 4966 tweets
English data: 4542 tweets
Bulgarian data: 3697 tweets
Dutch data: 2665 tweets

Download

Please see the data/ directory to get the tweet IDs and labels. To crawl the tweets, please use a tweet hydrator tool.

If you do not have a Twitter account or access credentials, first create a Twitter account, then follow the guide to retrieve access credentials for the Twitter API.
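
Hydration tools typically send tweet IDs to the API in fixed-size groups. The sketch below only shows the batching step and makes no network calls. The default batch size of 100 is an assumption based on the per-request limit of the tweet lookup endpoint; verify it against the current API documentation. The function name `batches` is hypothetical.

```python
def batches(ids, size=100):
    """Yield successive lists of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Placeholder IDs for illustration; real IDs come from the tweet_id column of the splits.
ids = [str(n) for n in range(250)]
sizes = [len(chunk) for chunk in batches(ids)]
print(sizes)  # [100, 100, 50]
```

Each batch can then be passed to whichever hydration tool or API client is in use.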

Publications:

Please cite the following papers if you are using the data or annotation guidelines:

Firoj Alam, Shaden Shaar, Fahim Dalvi, Hassan Sajjad, Alex Nikolov, Hamdy Mubarak, Giovanni Da San Martino, Ahmed Abdelali, Nadir Durrani, Kareem Darwish, Abdulaziz Al-Homaid, Wajdi Zaghouani, Tommaso Caselli, Gijs Danoe, Friso Stolk, Britt Bruntink, and Preslav Nakov, "Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society", Findings of EMNLP 2021.
Firoj Alam, Fahim Dalvi, Shaden Shaar, Nadir Durrani, Hamdy Mubarak, Alex Nikolov, Giovanni Da San Martino, Ahmed Abdelali, Hassan Sajjad, Kareem Darwish, and Preslav Nakov, "Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms", Proceedings of the International AAAI Conference on Web and Social Media, Vol. 15, pp. 913-922, 2021.

@inproceedings{alam2020fighting,
    title={Fighting the {COVID}-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society},
    author={Firoj Alam and Shaden Shaar and Fahim Dalvi and Hassan Sajjad and Alex Nikolov and Hamdy Mubarak and Giovanni Da San Martino and Ahmed Abdelali and Nadir Durrani and Kareem Darwish and Abdulaziz Al-Homaid and Wajdi Zaghouani and Tommaso Caselli and Gijs Danoe and Friso Stolk and Britt Bruntink and Preslav Nakov},
    booktitle = {Findings of EMNLP 2021},
    year={2021},
}

@InProceedings{alam2020call2arms,
  title		= {Fighting the {COVID}-19 Infodemic in Social Media: A
		  Holistic Perspective and a Call to Arms},
  author	= {Alam, Firoj and Dalvi, Fahim and Shaar, Shaden and
		  Durrani, Nadir and Mubarak, Hamdy and Nikolov, Alex and {Da
		  San Martino}, Giovanni and Abdelali, Ahmed and Sajjad,
		  Hassan and Darwish, Kareem and Nakov, Preslav},
  year		= {2021},
  pages		= {913-922},
  month	= {May},
  volume	= {15},
  booktitle	= {Proceedings of the International {AAAI} Conference on Web
		  and Social Media},
  series	= {ICWSM~'21},
  url		= {https://ojs.aaai.org/index.php/ICWSM/article/view/18114}
}

Credits

Firoj Alam, Qatar Computing Research Institute, HBKU, Qatar
Shaden Shaar, Qatar Computing Research Institute, HBKU, Qatar
Alex Nikolov, Sofia University, Bulgaria
Hamdy Mubarak, Qatar Computing Research Institute, HBKU, Qatar
Giovanni Da San Martino, University of Padova, Italy
Ahmed Abdelali, Qatar Computing Research Institute, HBKU, Qatar
Fahim Dalvi, Qatar Computing Research Institute, HBKU, Qatar
Nadir Durrani, Qatar Computing Research Institute, HBKU, Qatar
Hassan Sajjad, Qatar Computing Research Institute, HBKU, Qatar
Kareem Darwish, Qatar Computing Research Institute, HBKU, Qatar
Preslav Nakov, Qatar Computing Research Institute, HBKU, Qatar
Abdulaziz Al-Homaid, Qatar Computing Research Institute, HBKU, Qatar
Wajdi Zaghouani, Hamad Bin Khalifa University, Qatar
Tommaso Caselli, University of Groningen, The Netherlands
Gijs Danoe, University of Groningen, The Netherlands
Friso Stolk, University of Groningen, The Netherlands
Britt Bruntink, University of Groningen, The Netherlands

Licensing

This dataset is published under the CC BY-NC-SA 4.0 license, which means that anyone can use it for non-commercial research purposes: https://creativecommons.org/licenses/by-nc-sa/4.0/.

Contact

Please contact tanbih@qcri.org

Acknowledgment

Thanks to QCRI's Crisis Computing team for providing us with access to Micromappers.
