Twitter Fake News Network

An exploration of the users responsible for the circulation of fake news on Twitter. Specifically, it looks into how likely it is that an article shared by a user is fake.

How to run

1. Collect tweets and retweets

Run the scripts for tweet and retweet collection

python collect_tweets.py
python collect_retweets.py

2. Compile a list of users

Run the next script to create a list of user ids from the collected tweets and retweets

python create_dataset.py
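As a rough illustration of what this step amounts to, the sketch below gathers the unique user ids from tweet and retweet objects. It assumes each object follows the Twitter API v1.1 layout with a nested `user.id` field. The function name `collect_user_ids` and the sample records are hypothetical and are not taken from `create_dataset.py`.

```python
def collect_user_ids(tweets, retweets):
    """Return the sorted unique ids of users who posted any tweet or retweet."""
    user_ids = set()
    for status in list(tweets) + list(retweets):
        # Assumption: v1.1-style objects carry the author under "user".
        user = status.get("user") or {}
        if "id" in user:
            user_ids.add(user["id"])
    return sorted(user_ids)


# Hypothetical sample data, for illustration only.
sample_tweets = [{"id": 1, "user": {"id": 10}}, {"id": 2, "user": {"id": 11}}]
sample_retweets = [{"id": 3, "user": {"id": 10}}, {"id": 4, "user": {"id": 12}}]
user_ids = collect_user_ids(sample_tweets, sample_retweets)
```

Using a set means a user who both tweeted and retweeted an article is counted once.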

3. Collect user information

Use tweego to collect information about the users from their ids

tweego -d "fakenewsnet_dataset" -k "keys.json" -n "all" -u

4. Filter dataset

Filter the dataset to include only users that:

- Have shared more than 2 news articles
- Follow at most 5000 other users
- Have at least 5000 followers

python filter_dataset.py

The reasons behind these constraints are:

- We want users who have shared at least a couple of articles, so that a pattern can be established
- Users who follow more than 5k accounts have most likely done so with a bot
- Anyone with at least 5k followers is bound to contribute to news spreading among a large community
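The three constraints can be expressed as a single predicate. This is a minimal sketch, not the contents of `filter_dataset.py`: `friends_count` and `followers_count` follow the Twitter v1.1 user object, while `articles_shared`, `keep_user` and the sample records are hypothetical names chosen for illustration.

```python
MIN_ARTICLES = 3      # "more than 2" shared news articles
MAX_FRIENDS = 5000    # follows at most 5000 accounts
MIN_FOLLOWERS = 5000  # has at least 5000 followers


def keep_user(user):
    """Return True if a user record satisfies all three constraints."""
    return (
        user["articles_shared"] >= MIN_ARTICLES
        and user["friends_count"] <= MAX_FRIENDS
        and user["followers_count"] >= MIN_FOLLOWERS
    )


# Hypothetical user records, for illustration only.
users = [
    {"id": 1, "articles_shared": 5, "friends_count": 1200, "followers_count": 17000},
    {"id": 2, "articles_shared": 2, "friends_count": 300, "followers_count": 9000},
    {"id": 3, "articles_shared": 8, "friends_count": 7500, "followers_count": 20000},
    {"id": 4, "articles_shared": 4, "friends_count": 5000, "followers_count": 5000},
]
filtered_ids = [u["id"] for u in users if keep_user(u)]
```

User 2 is dropped for sharing too few articles and user 3 for following too many accounts. User 4 sits exactly on both 5000 boundaries and is kept.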

5. Create user network

Use the tweego tool to construct the user network and store it as a .gml file

tweego -d "fakenewsnet_dataset" -k "keys.json" -n "all_10k" -so -g

6. Add features to graph

Add features to the nodes, such as the number of news articles shared, whether the user is verified, and the amount of fake news shared as a fraction of the total news shared

python edit_graph.py

7. Classification

Run the classification scripts in any order

GNN
node2vec

Purpose of each notebook:

- GNN - Use different Graph Neural Networks to classify fake and real users
- node2vec - Use node2vec combined with different ML models to classify fake and real users

Background

Dataset

To build a classification model that finds patterns in ego networks to detect users who share predominantly fake news, two things are required: a dataset of edges between users, and a database of tweets and retweets that have been manually classified as real or fake. No such dataset exists already, but one can be generated.

1. Tweego

Tweego is a tool that generates second-order ego networks for Twitter users. This means it collects all the friends of friends for a given set of users and generates a graph from them.
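To make "friends of friends" concrete, here is a small sketch of how a second-order ego network could be assembled from follow lists. It does not use or reproduce Tweego's code. The function `second_order_ego_edges` and the `friends_of` mapping (user to the accounts that user follows) are hypothetical, and in practice the follow lists would come from the Twitter API.

```python
def second_order_ego_edges(seed_users, friends_of):
    """Collect directed follow edges within two hops of the seed users."""
    edges = set()
    for seed in seed_users:
        for friend in friends_of.get(seed, ()):
            edges.add((seed, friend))  # first-order: seed follows friend
            for friend_of_friend in friends_of.get(friend, ()):
                # second-order: friend follows friend_of_friend
                edges.add((friend, friend_of_friend))
    return edges


# Hypothetical follow lists, for illustration only.
friends_of = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": ["e"]}
edges = second_order_ego_edges(["a"], friends_of)
```

Starting from `a`, the edge from `d` to `e` is three hops away and is therefore left out.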

2. FakeNewsNet

FakeNewsNet is a fake news data repository containing two comprehensive datasets that include news content, social context, and dynamic information. The full paper can be found here. The news, along with ground truth labels for fake and true news, is obtained from two fact-checking websites:

- PolitiFact: Journalists and domain experts review political news and provide fact-checking evaluations that mark news articles as fake or real.
- GossipCop: A website for fact-checking entertainment stories aggregated from various media outlets. GossipCop provides rating scores on a scale of 0 to 10 that grade a news story from fake to real.

The most important feature of FakeNewsNet is that it also downloads tweets and retweets sharing the news articles from Twitter. This means that we can get the profile of users that shared the tweets from Twitter, and then combine it with our list of verified users to see how many fake/real news articles every verified user shared.

Stats

Sample size of users: 9687

	Friends	Followers	Listed	Statuses	Articles shared	Fake ratio
mean	1648.99	257043.19	1299.34	137100.58	17.11	0.42
median	1189.00	17216.00	256.00	72894.00	5.00	0.33
min	0.00	5000.00	0.00	139.00	3.00	0.00
max	5000.00	72123733.00	215288.00	8318206.00	55768.00	1.00

Creating Labels

To create the labels, the ratio of fake news shared to the total number of articles shared is considered. The FakeNewsNet dataset contains real and fake news for both PolitiFact and GossipCop. First, the total number of fake and real news articles a user has shared is calculated by checking how many times their display name or id matches the display name or id of the user sharing a tweet. This gives the total number of fake and real news articles a user has shared from both sources (PolitiFact and GossipCop), from which the ratio of fake to total news shared is computed.

If more than half the news articles a user has tweeted are fake, then that user is assigned a label of 1, and if less than half are fake, they are given a label of 0. So a label of 1 means the account shares mostly fake news, and a label of 0 means the news shared is mostly real.
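The labeling rule can be written out as a short sketch. The function names `fake_ratio` and `label_user` are hypothetical. The text does not say how a user with exactly half fake articles is labeled, so assigning that case a label of 0 is an assumption made here.

```python
def fake_ratio(fake_count, real_count):
    """Fraction of a user's shared articles that are fake."""
    total = fake_count + real_count
    if total == 0:
        raise ValueError("user has not shared any labeled articles")
    return fake_count / total


def label_user(fake_count, real_count):
    """Return 1 if more than half of the shared articles are fake, else 0.

    Assumption: a ratio of exactly 0.5 is labeled 0.
    """
    return 1 if fake_ratio(fake_count, real_count) > 0.5 else 0


mostly_fake = label_user(fake_count=3, real_count=1)  # ratio 0.75
mostly_real = label_user(fake_count=1, real_count=3)  # ratio 0.25
```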

Edgelist

Using nucoll, it is possible to generate a GML file of a user's first and second degree relationships on Twitter. To generate the graph, nucoll retrieves the handle's friends (or followers) and all friends-of-friends (2nd degree relationships). It then looks for friend relationships among the friends/followers of the handle.

In this case the handle we supply to nucoll is @verified, and a file with all the 1st and 2nd degree relationships of users that are friends of @verified is generated.

Because of Twitter's very restrictive API rate limits, generating the edge list of all 330k+ verified users is not feasible, so the users are filtered. The following restrictions were applied:

- The user must have shared at least one real and one fake article
- The user must follow fewer than 10k people, since it is highly unlikely that a user with more than 10,000 friends followed that many accounts manually; they most likely used a bot

When these constraints are applied, around 3000 users are left. The edge list for these users is stored in a .gml file, which can be imported to create a networkx graph.

Classification

Two different approaches are taken to build a classification model.

Node2vec

Node2vec learns continuous representations for nodes in a graph. The implementation of node2vec used can be found here.
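For readers unfamiliar with node2vec, the sketch below shows the biased second-order random walk it relies on. It is illustrative only and is not the linked implementation. The function `node2vec_walk`, the `adj` mapping and the sample graph are hypothetical, and the graph is assumed to be undirected and unweighted. In node2vec, many such walks are sampled and fed to a skip-gram model to produce the node embeddings.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one biased random walk of at most `length` nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(adj.get(current, ()))
        if not neighbors:
            break  # dead end: nothing to walk to
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is uniform
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # return to the previous node
            elif candidate in adj.get(previous, ()):
                weights.append(1.0)      # stay near the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected graph, for illustration only.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
walk = node2vec_walk(adj, "a", length=6, p=1.0, q=2.0, rng=random.Random(0))
```

A small `p` makes the walk more likely to backtrack, while a small `q` pushes it outward, so the two parameters trade off local against more distant neighborhood structure.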

After combining node2vec with the node features, the classifiers trained are:
- Random forest
- SVM
- Logistic regression
- XGBoost

Graph neural networks

GNNs operate directly on the graph structure:
- GraphSAGE - Learns the embedding for each node in an inductive way. Each node is represented by the aggregation of its neighborhood. Thus, even if a node unseen during training appears in the graph, it can still be properly represented by its neighboring nodes.
- Graph Convolutional Networks - Neural networks in which each layer updates a node's representation by aggregating the features of its neighbors
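The neighborhood aggregation described for GraphSAGE can be sketched with a mean aggregator. This is a simplified illustration, not the model used in this repository: it omits the learned weight matrices and non-linearity that a real GraphSAGE layer applies to the concatenated vector. The names `mean_aggregate`, `sage_representation` and the sample data are hypothetical.

```python
def mean_aggregate(node, adj, features):
    """Average the feature vectors of a node's neighbors."""
    neighbors = adj.get(node, ())
    dim = len(next(iter(features.values())))
    if not neighbors:
        return [0.0] * dim  # no neighbors: fall back to a zero vector
    sums = [0.0] * dim
    for neighbor in neighbors:
        for i, value in enumerate(features[neighbor]):
            sums[i] += value
    return [s / len(neighbors) for s in sums]


def sage_representation(node, adj, features):
    """Concatenate a node's own features with its aggregated neighborhood."""
    return list(features[node]) + mean_aggregate(node, adj, features)


# "new" stands for a node that was not seen during training.
features = {"u1": [1.0, 0.0], "u2": [0.0, 1.0], "new": [0.5, 0.5]}
adj = {"new": ["u1", "u2"]}
rep = sage_representation("new", adj, features)
```

Because the representation depends only on features of the node and its neighbors, it can be computed for `new` without retraining, which is what makes the approach inductive.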

Analysis

Baseline

As a baseline, the classifiers are evaluated on only the sentiment and Empath features, without any network information.

	Accuracy	Precision	Recall	f1 Score
Naive Bayes	0.771	0.660	0.660	0.660
KNN	0.747	0.680	0.630	0.600
Logistic Reg	0.775	0.690	0.690	0.680
SVM	0.716	0.720	0.720	0.720
XGBoost	0.761	0.710	0.710	0.710
Random Forest	0.767	0.660	0.660	0.660

GNNs

	Accuracy	Precision	Recall	f1 Score
GraphSage	0.730	0.710	0.850	0.773
GCN	0.671	0.654	0.827	0.709
GAT	0.541	0.543	0.987	0.650

Node2vec

	Accuracy	Precision	Recall	f1 Score
Naive Bayes	0.732	0.630	0.620	0.610
KNN	0.727	0.670	0.670	0.670
Logistic Reg	0.787	0.680	0.680	0.670
SVM	0.728	0.730	0.730	0.720
XGBoost	0.793	0.730	0.730	0.720
Random Forest	0.785	0.670	0.660	0.660

Learnt embeddings

The classifiers are trained on the embeddings learnt by the GraphSAGE, GCN and GAT models.

GraphSAGE

	Accuracy	Precision	Recall	f1 Score
Naive Bayes	0.724	0.720	0.720	0.720
KNN	0.691	0.690	0.690	0.690
Logistic Reg	0.700	0.700	0.700	0.700
SVM	0.719	0.730	0.720	0.710
XGBoost	0.713	0.720	0.710	0.710
Random Forest	0.679	0.680	0.680	0.680

GCN

	Accuracy	Precision	Recall	f1 Score
Naive Bayes	0.702	0.700	0.700	0.700
KNN	0.621	0.670	0.620	0.520
Logistic Reg	0.716	0.730	0.720	0.710
SVM	0.725	0.740	0.730	0.720
XGBoost	0.729	0.730	0.730	0.730
Random Forest	0.695	0.690	0.690	0.690

GAT

	Accuracy	Precision	Recall	f1 Score
Naive Bayes	0.662	0.670	0.660	0.660
KNN	0.646	0.670	0.650	0.530
Logistic Reg	0.717	0.730	0.720	0.710
SVM	0.719	0.740	0.720	0.710
XGBoost	0.724	0.730	0.720	0.720
Random Forest	0.595	0.600	0.600	0.090

Results

Accuracy

	Naive Bayes	KNN	Logistic Reg	SVM	XGBoost	Random Forest
Baseline	0.659	0.628	0.686	0.716	0.710	0.662
GraphSAGE	0.724	0.691	0.700	0.719	0.713	0.679
GCN	0.702	0.621	0.716	0.725	0.729	0.695
Node2vec	0.610	0.671	0.678	0.728	0.728	0.659
GAT	0.662	0.646	0.717	0.719	0.724	0.595

The table above suggests that, for classifying users as fake news sources, the structure of the network helps increase classification accuracy: the GraphSAGE embeddings, for example, score above the baseline for all six classifiers.

Citing

If you use this repository in your research please cite

@misc{Saha_TwitterFakeNet_2020,
	author = {Saha, Aveek},
	month = {3},
	title = {{TwitterFakeNet}},
	url = {https://github.com/Aveek-Saha/TwitterFakeNet},
	year = {2020}
}
