firojalam/COVID-19-disinformation

COVID-19 Disinformation Twitter Dataset (COVID-19 Disinfo dataset)

This repository contains a dataset and experimental scripts associated with the work "Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society". The COVID-19 Disinfo dataset consists of tweets annotated with fine-grained labels related to disinformation about COVID-19. The labels answer seven different questions that are of interest to journalists, fact-checkers, social media platforms, policymakers, and society as a whole. There are annotations for Arabic, Bulgarian, Dutch and English.

Contents of the Distribution

===============================================

Directory Structure

=======================
The directory contains the following files and sub-directories:

  1. The following directories contain different data splits (train/dev/test) for both the binary and the multiclass settings. Each file is tab-separated and consists of the tweet_id and the labels for Q1-7. Due to privacy concerns, we are not able to release the tweet text and the associated JSON objects.
  • data/arabic/: contains data splits for Arabic.
  • data/bulgarian/: contains data splits for Bulgarian.
  • data/dutch/: contains data splits for Dutch.
  • data/english/: contains data splits for English.
  • data/multilang/: contains multilingual data (tweets from all languages are combined in different splits for both binary and multiclass settings).
  2. data/LICENSE_CC_BY_NC_SA_4.0.txt: license information.
  3. bin/: please see the readme for details.
  4. Readme.md: this file.
  5. tweet_ids_with_compliance_status.json: contains tweet IDs with their compliance status.
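
As a rough illustration of working with the tab-separated split files, the sketch below parses one with Python's standard `csv` module. It assumes each file starts with a header row; the column names (`tweet_id`, `q1_label`, `q2_label`) and the sample rows are hypothetical, so check the header of an actual file in `data/` first. An in-memory string stands in for a real file.

```python
import csv
import io

# Hypothetical sample standing in for a split file; real headers may differ.
SAMPLE_TSV = (
    "tweet_id\tq1_label\tq2_label\n"
    "1001\tyes\tno\n"
    "1002\tno\t\n"
)


def read_split(handle):
    """Read a tab-separated split into a list of dicts, one per tweet."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        # Values stay strings, so large tweet IDs are never rounded.
        rows.append({key: (value or "").strip() for key, value in row.items()})
    return rows


rows = read_split(io.StringIO(SAMPLE_TSV))
print(len(rows), rows[0]["tweet_id"])
```

To read a real file, pass `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.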

Examples

============

Please don't take hydroxychloroquine (Plaquenil) plus Azithromycin for #COVID19 UNLESS your doctor prescribes it. Both drugs affect the QT interval of your heart and can lead to arrhythmias and sudden death, especially if you are taking other meds or have a heart condition.
Labels:

  1. Q1: Yes;
  2. Q2: NO: probably contains no false info
  3. Q3: YES: definitely of interest
  4. Q4: NO: probably not harmful
  5. Q5: YES:very-urgent
  6. Q6: NO:not-harmful
  7. Q7: YES:discusses_cure

BREAKING: @MBuhari’s Chief Of Staff, Abba Kyari, Reportedly Sick, Suspected Of Contracting #Coronavirus | Sahara Reporters A top government source told SR on Monday that Kyari has been seriously “down” since returning from a trip abroad. READ MORE: https://t.co/Acy5NcbMzQ https://t.co/kStp4cmFlr.
Labels:

  1. Q1: Yes;
  2. Q2: NO: probably contains no false info
  3. Q3: YES: definitely of interest
  4. Q4: NO: definitely not harmful
  5. Q5: YES:not-urgent
  6. Q6: YES:rumor
  7. Q7: YES:classified_as_in_question_6

Statistics

============
Initial distribution of the annotated dataset

  • Arabic data: 4966 tweets
  • Bulgarian data: 3697 tweets
  • Dutch data: 2665 tweets
  • English data: 4542 tweets

More details are available in the paper [1].

Twitter Batch compliance

============
Tweets might have been deleted for many reasons; see
https://developer.twitter.com/en/docs/twitter-api/compliance/batch-compliance/introduction for details. In such cases, it is necessary to keep any local copy of the data compliant.
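
A minimal sketch of filtering tweet IDs by compliance status follows. The schema of `tweet_ids_with_compliance_status.json` is not described in this readme, so the code assumes a JSON object that maps each tweet ID to a status string; the status value `"available"`, the sample data, and the function name `retained_ids` are all hypothetical and should be adapted after inspecting the real file.

```python
import json

# ASSUMPTION: a JSON object {tweet_id: status}; the real schema may differ.
SAMPLE_JSON = '{"1001": "available", "1002": "deleted", "1003": "available"}'


def retained_ids(status_by_id, keep_status="available"):
    """Return the sorted tweet IDs whose status equals keep_status."""
    return sorted(
        tweet_id
        for tweet_id, status in status_by_id.items()
        if status == keep_status
    )


status_by_id = json.loads(SAMPLE_JSON)
kept = retained_ids(status_by_id)
print(kept)
```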

Questions with Labels

Below is the list of the questions and the possible labels (answers). See the paper below or the Micromappers links above for the detailed definitions in the annotation guidelines.

1. Does the tweet contain a verifiable factual claim?
Labels:

  • YES: if it contains a verifiable factual claim;
  • NO: if it does not contain a verifiable factual claim;
  • Don’t know or can’t judge: the content of the tweet does not have enough information to make a judgment. It is recommended to categorize the tweet using this label when the content of the tweet is not understandable at all, for example, when it uses a language (e.g., non-English) or references that are difficult to understand.

2. To what extent does the tweet appear to contain false information?
Labels:

  1. NO, definitely contains no false information
  2. NO, probably contains no false information
  3. Not sure
  4. YES, probably contains false information
  5. YES, definitely contains false information

3. Will the tweet’s claim have an effect on or be of interest to the general public?
Labels:

  1. NO, definitely not of interest
  2. NO, probably not of interest
  3. Not sure
  4. YES, probably of interest
  5. YES, definitely of interest

4. To what extent does the tweet appear to be harmful to society, person(s), company(s) or product(s)?
Labels:

  1. NO, definitely not harmful
  2. NO, probably not harmful
  3. Not sure
  4. YES, probably harmful
  5. YES, definitely harmful

5. Do you think that a professional fact-checker should verify the claim in the tweet?
Labels:

  1. NO, no need to check
  2. NO, too trivial to check
  3. YES, not urgent
  4. YES, very urgent
  5. Not sure

6. Is the tweet harmful for society and why?
Labels:

  1. NO, not harmful
  2. NO, joke or sarcasm
  3. Not sure
  4. YES, panic
  5. YES, xenophobic, racist, prejudices, or hate-speech
  6. YES, bad cure
  7. YES, rumor or conspiracy
  8. YES, other

7. Do you think that this tweet should get the attention of a government entity?
Labels:

  1. NO, not interesting
  2. Not sure
  3. YES, categorized as in question 6
  4. YES, other
  5. YES, blame authorities
  6. YES, contains advice
  7. YES, calls for action
  8. YES, discusses action taken
  9. YES, discusses cure
  10. YES, asks question
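
The data directories provide both binary and multiclass splits, but this readme does not say how the binary labels were derived. One plausible reading, shown here purely as an assumption, is to collapse each fine-grained label by its leading YES or NO and to treat "Not sure" and "Don't know or can't judge" as undecided. The function name `to_binary` is hypothetical.

```python
def to_binary(label):
    """Collapse a fine-grained label to 'yes', 'no', or None (undecided)."""
    head = label.strip().lower()
    if head.startswith("yes"):
        return "yes"
    # Match "no" only as a whole token, so that "Not sure" is not caught.
    if head == "no" or head.startswith(("no,", "no:")):
        return "no"
    return None


for example in ("YES, probably harmful", "NO, too trivial to check", "Not sure"):
    print(example, "->", to_binary(example))
```

Whether undecided labels are dropped or kept as a separate class is a design choice that the paper, not this sketch, should settle.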

List of Versions

v1.0 [2021/11/05]: initial distribution of the annotated dataset

  • Arabic data: 4966 tweets
  • English data: 4542 tweets
  • Bulgarian data: 3697 tweets
  • Dutch data: 2665 tweets

Download

Please see the data directory for the tweet IDs and labels. To crawl the tweets, please use a tweet hydrator tool.

If you do not have a Twitter account or access credentials, please create a Twitter account first. Then follow Twitter's developer guide to retrieve access credentials for the Twitter API.
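
Hydration tools usually look tweets up in batches. As a sketch of that step, the helper below groups IDs into chunks of 100, which matches the per-request limit of the Twitter API v2 tweet lookup endpoint. The function name `batches` and the sample IDs are hypothetical, and no network call is made.

```python
def batches(ids, size=100):
    """Yield consecutive chunks of at most `size` tweet IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hypothetical IDs; in practice these come from the split files.
sample_ids = [str(n) for n in range(250)]
chunks = list(batches(sample_ids))
print([len(chunk) for chunk in chunks])
```

Each chunk can then be joined with commas and passed to whichever hydrator or API client you use.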

Publications:

Please cite the following papers if you are using the data or annotation guidelines:

  1. Firoj Alam, Shaden Shaar, Fahim Dalvi, Hassan Sajjad, Alex Nikolov, Hamdy Mubarak, Giovanni Da San Martino, Ahmed Abdelali, Nadir Durrani, Kareem Darwish, Abdulaziz Al-Homaid, Wajdi Zaghouani, Tommaso Caselli, Gijs Danoe, Friso Stolk, Britt Bruntink, and Preslav Nakov, "Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society", Findings of EMNLP 2021.

  2. Firoj Alam, Fahim Dalvi, Shaden Shaar, Nadir Durrani, Hamdy Mubarak, Alex Nikolov, Giovanni Da San Martino, Ahmed Abdelali, Hassan Sajjad, Kareem Darwish, and Preslav Nakov, "Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms", Proceedings of the International AAAI Conference on Web and Social Media (Vol. 15, pp. 913-922), 2021.

@inproceedings{alam2020fighting,
    title={Fighting the {COVID}-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society},
    author={Firoj Alam and Shaden Shaar and Fahim Dalvi and Hassan Sajjad and Alex Nikolov and Hamdy Mubarak and Giovanni Da San Martino and Ahmed Abdelali and Nadir Durrani and Kareem Darwish and Abdulaziz Al-Homaid and Wajdi Zaghouani and Tommaso Caselli and Gijs Danoe and Friso Stolk and Britt Bruntink and Preslav Nakov},
    booktitle = {Findings of EMNLP 2021},
    year={2021},
}

@InProceedings{alam2020call2arms,
  title		= {Fighting the {COVID}-19 Infodemic in Social Media: A
		  Holistic Perspective and a Call to Arms},
  author	= {Alam, Firoj and Dalvi, Fahim and Shaar, Shaden and
		  Durrani, Nadir and Mubarak, Hamdy and Nikolov, Alex and {Da
		  San Martino}, Giovanni and Abdelali, Ahmed and Sajjad,
		  Hassan and Darwish, Kareem and Nakov, Preslav},
  year		= {2021},
  pages		= {913-922},
  month	= {May},
  volume	= {15},
  booktitle	= {Proceedings of the International {AAAI} Conference on Web
		  and Social Media},
  series	= {ICWSM~'21},
  url		= {https://ojs.aaai.org/index.php/ICWSM/article/view/18114}
}

Credits

  • Firoj Alam, Qatar Computing Research Institute, HBKU, Qatar
  • Shaden Shaar, Qatar Computing Research Institute, HBKU, Qatar
  • Alex Nikolov, Sofia University, Bulgaria
  • Hamdy Mubarak, Qatar Computing Research Institute, HBKU, Qatar
  • Giovanni Da San Martino, University of Padova, Italy
  • Ahmed Abdelali, Qatar Computing Research Institute, HBKU, Qatar
  • Fahim Dalvi, Qatar Computing Research Institute, HBKU, Qatar
  • Nadir Durrani, Qatar Computing Research Institute, HBKU, Qatar
  • Hassan Sajjad, Qatar Computing Research Institute, HBKU, Qatar
  • Kareem Darwish, Qatar Computing Research Institute, HBKU, Qatar
  • Preslav Nakov, Qatar Computing Research Institute, HBKU, Qatar
  • Abdulaziz Al-Homaid, Qatar Computing Research Institute, HBKU, Qatar
  • Wajdi Zaghouani, Hamad Bin Khalifa University, Qatar
  • Tommaso Caselli, University of Groningen, The Netherlands
  • Gijs Danoe, University of Groningen, The Netherlands
  • Friso Stolk, University of Groningen, The Netherlands
  • Britt Bruntink, University of Groningen, The Netherlands

Licensing

This dataset is published under the CC BY-NC-SA 4.0 license, which means everyone can use this dataset for non-commercial research purposes: https://creativecommons.org/licenses/by-nc-sa/4.0/.

Contact

Please contact tanbih@qcri.org

Acknowledgment

Thanks to QCRI's Crisis Computing team for providing us with access to Micromappers.