
Hidden Biases in Unreliable News Detection Datasets

Official Code for the paper:

Hidden Biases in Unreliable News Detection Datasets

Xiang Zhou, Heba Elfardy, Christos Christodoulopoulos, Thomas Butler and Mohit Bansal

EACL 2021

Dependencies

The code is tested on Python 3.7 and PyTorch 1.6.0.

Other dependencies are listed in requirements.txt and can be installed by running pip install -r requirements.txt.

Datasets

Download Original Datasets

The experiments and results in our paper mainly involve two datasets: NELA and FakeNewsNet

For the NELA dataset, we use both the 2018 and 2019 versions. To reproduce the experiments, first download both versions (on the download page, select all and choose the original format) and put them under the data directory. Then decompress nela/2018/articles.tar.gz and nela/2019/nela-gt-2019-json.tar.bz2 in place (a decompression sketch follows the tree below). The structure of data should look like this:

data
└── nela
    ├── 2018
    │   ├── articles
    │   │   └── ... 
    │   ├── articles.db.gz
    │   ├── articles.tar.gz
    │   ├── labels.csv
    │   ├── labels.txt
    │   ├── nela_gt_2018-new_schema.tar.bz2
    │   ├── README.md
    │   └── titles.tar.gz
    └── 2019
        ├── labels.csv
        ├── nela-eng-2019
        │   └── ... 
        ├── nela-gt-2019-json.tar.bz2
        ├── nela-gt-2019.tar.bz2
        ├── README-1.md
        ├── README.md
        └── source-metadata.json
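
If it helps, the decompression can be scripted; below is a minimal Python sketch, assuming the archives sit at the paths shown in the tree above (the exact folder names produced by extraction may differ between NELA releases):

# unpack_nela.py -- minimal sketch; adjust paths to your layout.
import tarfile
from pathlib import Path

DATA = Path("data/nela")

# 2018: articles.tar.gz should yield data/nela/2018/articles/
with tarfile.open(DATA / "2018" / "articles.tar.gz", "r:gz") as tar:
    tar.extractall(DATA / "2018")

# 2019: nela-gt-2019-json.tar.bz2 should yield data/nela/2019/nela-eng-2019/
with tarfile.open(DATA / "2019" / "nela-gt-2019-json.tar.bz2", "r:bz2") as tar:
    tar.extractall(DATA / "2019")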

The FakeNewsNet dataset can be crawled using the code from its official GitHub repo. After downloading the dataset, put it under data/fakenewsnet_dataset/raw, so that the whole data folder looks like this:

data
├── fakenewsnet_dataset
│   └── raw
└── nela
    └── ...

By default, the data directory is expected under the repository root. If you prefer storing your data elsewhere, change the corresponding variables in constants.py (an illustrative sketch follows).
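
For illustration only, overriding the data location might look like the lines below; the variable names here are hypothetical stand-ins, so check constants.py itself for the actual names:

# constants.py -- hypothetical illustration; the real variable names in
# this repository may differ, so consult the file before editing.
from pathlib import Path

DATA_DIR = Path("/mnt/storage/news-data")      # hypothetical root data directory
NELA_DIR = DATA_DIR / "nela"                   # hypothetical NELA location
FNN_DIR = DATA_DIR / "fakenewsnet_dataset"     # hypothetical FakeNewsNet location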

Create Dataset Splits

To create the random/site/time split of NELA in the paper, run python data_helper.py nela {site, time, random}

To create the random label split, run python data_helper.py nela random_label. (Note: you have to manually rename the split dataset after creating it.)

To create the split of FakeNewsNet in the paper, run python data_helper.py fnn {site, time, random}
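
To make the distinction concrete, here is a toy sketch of site-based vs. random splitting (the underlying idea, not the actual data_helper.py implementation). A site split holds out entire sources, so no source appears in both train and validation; a random split shuffles at the article level:

# toy illustration of the split types; not the repo's data_helper.py logic.
import random

def site_split(articles, held_out_frac=0.2, seed=0):
    """Hold out whole sources: no source appears on both sides."""
    sites = sorted({a["source"] for a in articles})
    random.Random(seed).shuffle(sites)
    held_out = set(sites[: max(1, int(len(sites) * held_out_frac))])
    train = [a for a in articles if a["source"] not in held_out]
    val = [a for a in articles if a["source"] in held_out]
    return train, val

def random_split(articles, held_out_frac=0.2, seed=0):
    """Split at the article level: sources can appear on both sides."""
    shuffled = list(articles)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - held_out_frac))
    return shuffled[:cut], shuffled[cut:]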

Train Baseline Models

Example scripts for training the baseline models used in this paper can be found under the scripts directory (please refer to Sec. 4.1 of the paper for detailed descriptions of the baseline models). You can change the dataset path in each script to train the baselines on different splits.

To train the logistic regression baseline, run bash scripts/lr.sh
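
For orientation, a bag-of-words logistic regression over titles might look roughly like the scikit-learn sketch below; this is illustrative only, and the features and hyperparameters in scripts/lr.sh may differ:

# toy sketch of a bag-of-words logistic-regression baseline; the actual
# lr.sh pipeline may use different features and hyperparameters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = ["miracle cure doctors hate", "senate passes budget bill"]
labels = [1, 0]  # toy data: 1 = unreliable, 0 = reliable

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(titles, labels)
print(model.predict(["shocking secret they will not tell you"]))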

To train the title-only RoBERTa models, run bash scripts/roberta_title.sh

To train the title+article RoBERTa models, run bash scripts/roberta_title_article.sh
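
The difference between the two RoBERTa variants is what the model sees: the title only, or the title and article body encoded together as a sequence pair. A minimal Hugging Face sketch of the title+article input (assuming the transformers library; the repo's scripts handle the real tokenization and training):

# sketch of feeding title + article to RoBERTa as a sequence pair;
# illustrative only, not the repo's training code.
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

title = "Senate passes budget bill"
article = "The Senate voted on Thursday to approve ..."
inputs = tokenizer(title, article, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities for the toy example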

Reproducing Analysis Experiments

Source Level Results
  1. Get the predictions on the validation set (by running the eval commands in the model training scripts).
  2. To get source-level accuracies, run python source_evaluation.py --pred_file [PREDICTION_FILE] --key_file [KEY_FILE] --pred_type [PRED_TYPE]. Because the models produce different output formats, set PRED_TYPE to clean for the logistic regression and title-only RoBERTa models, and to full for the title+article RoBERTa model. Please refer to the python file for details on the other arguments; a sketch of the per-source computation follows this list.
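
Conceptually, the per-source evaluation groups articles by their source and averages correctness within each group, as in the sketch below; the file formats here (id-to-prediction and id-to-source/gold maps) are assumptions, not the repo's actual formats:

# sketch of per-source accuracy; source_evaluation.py handles the repo's
# real prediction/key file formats, which are assumed away here.
from collections import defaultdict

preds = {"a1": "1", "a2": "0"}                           # article id -> predicted label
keys = {"a1": ("cnn.com", "1"), "a2": ("cnn.com", "1")}  # article id -> (source, gold label)

correct, total = defaultdict(int), defaultdict(int)
for aid, (source, gold) in keys.items():
    total[source] += 1
    correct[source] += int(preds.get(aid) == gold)

for source in sorted(total):
    print(source, correct[source] / total[source])
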
Extracting Salient Features
  1. Train a logistic regression model using bash scripts/lr.sh and save the trained model by adding the save_model [MODEL_PATH] argument.
  2. To extract salient features from the logistic regression baselines, run python analysis_lr.py --model_path [MODEL_PATH]. Please refer to the python file for details on the other arguments; a coefficient-inspection sketch follows this list.
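
For a linear model, the salient features are simply the vocabulary items with the largest-magnitude coefficients. A sketch, assuming a scikit-learn vectorizer and classifier (the saved-model format that analysis_lr.py expects may differ):

# sketch of salient-feature extraction from a linear classifier; assumes
# a scikit-learn vectorizer/classifier pair, which may not match the
# format analysis_lr.py loads.
import numpy as np

def top_features(vectorizer, clf, k=20):
    names = np.array(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn
    order = np.argsort(clf.coef_[0])
    # most negative coefficients pull toward class 0, most positive toward class 1
    return names[order[:k]], names[order[-k:]]
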
Site Similarity Analysis
  1. Create 5 different domain splits using different seeds by running python data_helper.py nela site [SEED].
  2. To get the site similarity results in Table 7 of the paper, train 5 title+article baselines, one on each of the 5 domain splits, by running bash scripts/roberta_title_article.sh on each split, and put all the predictions under the output directory. Then change the SAVE_DIRS and SITE_PREDS variables in site_similarity.py to match your saved paths and run python site_similarity.py (a toy aggregation sketch follows this list).
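
As a toy stand-in for part of that pipeline, the sketch below averages each held-out site's accuracy across the seeded splits; the actual analysis in site_similarity.py is more involved, and the per-seed inputs here are invented:

# toy aggregation of per-site accuracy across seeds; NOT the actual
# site_similarity.py analysis, and the numbers below are invented.
from collections import defaultdict

per_seed = [
    {"siteA.com": 0.91, "siteB.com": 0.64},  # seed 0: site -> accuracy
    {"siteA.com": 0.88, "siteC.com": 0.71},  # seed 1
]

acc = defaultdict(list)
for result in per_seed:
    for site, a in result.items():
        acc[site].append(a)

for site in sorted(acc):
    print(site, sum(acc[site]) / len(acc[site]))
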
Word Cloud Visualization
  1. Save the titles with correct and wrong predictions to the files correct.title and wrong.title respectively by running python dump_titles.py --pred_file [PREDICTION_FILE] --key_file [KEY_FILE] --pred_type [PRED_TYPE]. As above, set PRED_TYPE to clean for the logistic regression and title-only RoBERTa models and to full for the title+article RoBERTa model. Then put correct.title and wrong.title in the same directory as draw_cloud_unigram.py.
  2. To draw the word cloud showing the most salient words in correctly or wrongly predicted examples (controlled by the PRINT_TYPE variable in the script), run python draw_cloud_unigram.py. A minimal sketch of the idea follows.
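
A minimal version of the word-cloud step, assuming the wordcloud and matplotlib packages are installed (draw_cloud_unigram.py is the repo's full version):

# sketch of drawing a unigram word cloud from a title file; illustrative
# only, not the repo's draw_cloud_unigram.py.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

with open("correct.title") as f:  # or "wrong.title", per PRINT_TYPE
    text = f.read()

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.savefig("cloud.png", dpi=200)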

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.
