This repository is a fork of the official implementation of DUCK. Since the original project contains numerous errors and unspecified parameters, I have modified the code and documented the dataset preparation more clearly so that the project is easier to run.
If you have any questions, please contact me at joshchang0111.ee10@nycu.edu.tw .
If you find this code useful, please feel free to let me know, thanks!
- Python 3.8.10
$ pip install transformers==4.2.1
$ pip install Cython
$ pip install scikit-learn==0.21.3
$ pip install networkx==3.0
$ pip install pyro-ppl==0.3.0
$ pip install numpy==1.24.2
$ pip install pandas==1.4.4
$ pip install matplotlib
$ pip install ipdb
Install PyTorch and PyTorch Geometric as follows.
## Env: NVIDIA GeForce GTX 1080
$ pip install torch==1.9.0+cu102 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install torch-scatter==2.0.8 -f https://data.pyg.org/whl/torch-1.9.0+cu102.html
$ pip install torch-sparse==0.6.11 -f https://data.pyg.org/whl/torch-1.9.0+cu102.html
$ pip install torch-geometric==2.2.0
## Env: NVIDIA GeForce RTX 3090
$ pip install torch==1.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install torch-scatter==2.0.9 -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
$ pip install torch-sparse==0.6.15 -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
$ pip install torch-geometric==2.2.0
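To quickly verify the installation (a sanity check, not part of the original instructions):

```python
import torch
import torch_geometric

print(torch.__version__)            # e.g. 1.11.0+cu113
print(torch.cuda.is_available())    # should print True on a GPU machine
print(torch_geometric.__version__)  # e.g. 2.2.0
```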
All datasets are publicly accessible.
Since the Twitter15 and Twitter16 datasets only provide the tweet IDs of the responses under each thread, you need to crawl the textual content of these responses via the Twitter API. For these two datasets, I use the version released by yunzhusong/AARD, where the content has already been crawled.
For training the model, you need to prepare several files in the following structure.
|__ data
    |__ {$DATASET_NAME}_5fold
        |__ data.label.txt
        |__ fold0
            |__ _x_train.pkl
            |__ _x_test.pkl
        |__ fold1
        |__ fold2
        |__ fold3
        |__ fold4
    |__ {$DATASET_NAME}graph
        |__ {$TWEET_ID_0}.npz
        |__ ...
Each fold directory (e.g. `fold0`) contains `_x_train.pkl` and `_x_test.pkl`. Both pickle files contain a list of tweet IDs, indicating the threads for training and testing respectively. `{$DATASET_NAME}graph` contains `.npz` files whose file names are the tweet IDs listed in `_x_train.pkl` and `_x_test.pkl`. Each `.npz` file contains the following information.
{
    "root": ,        ## textual content of the source post
    "rootindex": ,   ## index of the source post (always 0)
    "nodecontent": , ## textual content of all the responses
    "edgematrix": ,  ## edge index of the conversational graph
    "y":             ## label of the root (as an integer)
}
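For reference, here is a minimal sketch of how the split and graph files can be inspected; the dataset name `Twitter15` and the paths are placeholders for your own setup:

```python
import pickle
import numpy as np

# Load the list of training tweet IDs for fold 0 (example path).
with open("data/Twitter15_5fold/fold0/_x_train.pkl", "rb") as f:
    train_ids = pickle.load(f)
print(len(train_ids), "training threads")

# Load the graph file of one thread. allow_pickle=True is required
# because the arrays store Python objects (strings).
graph = np.load(f"data/Twitter15graph/{train_ids[0]}.npz", allow_pickle=True)
print(graph["root"])        # textual content of the source post
print(graph["edgematrix"])  # edge index of the conversational graph
print(graph["y"])           # numeric label of the thread
```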
These files can be generated by running the following commands.
$ python preprocess.py --make_label --dataset $DATASET_NAME
$ python preprocess.py --split_5_fold --dataset $DATASET_NAME
$ python preprocess.py --build_graph --dataset $DATASET_NAME
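After preprocessing, a quick sanity check (a sketch assuming the directory layout above; `Twitter15` is again a placeholder for `$DATASET_NAME`) can confirm that every tweet ID in the fold splits has a corresponding graph file:

```python
import os
import pickle

dataset = "Twitter15"  # placeholder for $DATASET_NAME
for fold in range(5):
    for split in ("_x_train.pkl", "_x_test.pkl"):
        with open(f"data/{dataset}_5fold/fold{fold}/{split}", "rb") as f:
            ids = pickle.load(f)
        missing = [t for t in ids if not os.path.exists(f"data/{dataset}graph/{t}.npz")]
        if missing:
            print(f"fold{fold}/{split}: {len(missing)} graph files missing")
```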
python train.py \
--datasetName $dataset \
--baseDirectory ./data \
--n_classes $n_classes \
--foldnum $fold \
--mode CommentTree \
--modelName Simple_GAT_BERT \
--batchsize $batch_size \
--learningRate $lr_bert \
--learningRateGraph $lr_gnn \
--dropout_gat $dropout \
	--n_epochs 20
You can reduce the argument `--max_tree_len` if your GPU memory is insufficient.
python train.py \
--datasetName $dataset \
--baseDirectory ./data \
--n_classes $n_classes \
--foldnum $fold \
--mode CommentTree \
--modelName CCCTNet \
--batchsize $batch_size \
--learningRate $lr_bert \
--learningRateGraph $lr_gnn \
--dropout_gat $dropout \
--n_epochs 20 \
--max_tree_len 40 \
--result_path ./result/CCCT
Detailed arguments can be found in `scripts/run.sh` and `scripts/run_ccct.sh`.
This is the source code for DUCK: Rumour Detection on Social Media by Modelling User and Comment Propagation Networks.
If you find this code useful, please let us know and cite our paper.
If you have any questions, please contact Lin at: s3795533 at student dot rmit dot edu dot au.