Paraphrase Detection: Human vs. Machine Content

This is the official repository for the paper Paraphrase Detection: Human vs. Machine Content.

Setup

We recommend using Python 3.10 for this project.

First install the requirements: pip install -r requirements.txt

To use GloVe and Fasttext, you need to place their corresponding pre-trained word vectors into the models directory.

GloVe: Get the glove.6B.11d.txt from here.
Fasttext: Get the cc.en.300.bin from here.
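
As a rough illustration of what the GloVe file contains, here is a minimal sketch of a reader for GloVe's plain-text format (one token per line, followed by its vector components). The function name `load_glove` and the sample values are hypothetical and are not taken from this repository's scripts.

```python
import io
from typing import Dict, List, TextIO


def load_glove(handle: TextIO) -> Dict[str, List[float]]:
    """Parse GloVe's text format: a token, then its space-separated vector."""
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a real vector file (values are made up).
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
glove = load_glove(sample)
print(len(glove), len(glove["cat"]))
```

For the real file, the same function could be called on a handle opened from the models directory. The fastText `.bin` file is a binary model and is usually loaded with the `fasttext` package's `load_model` function instead.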

Experiments

The project includes several scripts, each covering a separate part of the experiments.

Parse datasets from the datasets folder to a unified json format: parse.py
Create the BERT embeddings for text pairs in true_data.json and visualize them with t-SNE: embedding_handler.py
Apply detection methods (training & testing): detect_paraphrases.py
Evaluate the detection results: evaluate.py
Get examples sorted by best / worst / random performance: get_examples.py
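
To make the last step more concrete, the following is a minimal sketch of how examples could be selected by best, worst, or random performance. It assumes that each example carries a gold label and a predicted probability, and that performance is measured as the absolute error between them. The field names `label` and `pred` and the function `pick_examples` are hypothetical and may differ from what get_examples.py actually does.

```python
import random
from typing import Dict, List


def pick_examples(examples: List[Dict], k: int, mode: str, seed: int = 0) -> List[Dict]:
    """Return k examples, chosen by absolute error between prediction and label."""
    k = min(k, len(examples))
    if mode == "random":
        return random.Random(seed).sample(examples, k)
    # Smallest error first, so the best predictions lead the list.
    ranked = sorted(examples, key=lambda e: abs(e["pred"] - e["label"]))
    if mode == "best":
        return ranked[:k]
    if mode == "worst":
        return ranked[::-1][:k]  # largest error first
    raise ValueError(f"unknown mode: {mode}")


# Made-up text pairs: label 1 = paraphrase, pred = predicted probability.
examples = [
    {"id": "a", "label": 1, "pred": 0.9},
    {"id": "b", "label": 0, "pred": 0.8},
    {"id": "c", "label": 1, "pred": 0.4},
    {"id": "d", "label": 0, "pred": 0.05},
]
best_ids = [e["id"] for e in pick_examples(examples, 2, "best")]
worst_ids = [e["id"] for e in pick_examples(examples, 2, "worst")]
print(best_ids, worst_ids)
```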

Datasets

Not all datasets used in the paper are freely available to the public, which is why we do not offer the prediction results on text pairs from these datasets for download. However, you are free to rerun the experiments on all datasets from the paper once you have obtained access to them.

This study includes twelve datasets (seven human-generated and five machine-generated). For further information, please refer to the paper.

Human-generated datasets: ETPC, QQP, TURL, SaR, MSCOCO, ParaSCI, APH

Machine-generated datasets: MPC, SAv2, ParaNMT-50M, PAWS-Wiki, APT

Results

The results of our experiments are evaluated in the paper linked above. Here we provide additional material that was not included in the final version of the paper.

t-SNE visualizations of each dataset's BERT embeddings

Dataset	Acquisition Type	Mixed	Paraphrases Only
APH	Human	Live View	Live View
APT	Machine	Live View	Live View
ETPC	Human	Live View	Live View
MPC	Machine	Live View	Live View
MSCOCO	Human	Live View	Live View
PAWS-Wiki	Machine	Live View	Live View
ParaNMT-50M	Machine	Live View	Live View
ParaSCI	Human	Live View	Live View
QQP	Human	Live View	Live View
SAv2	Machine	Live View	Live View
SaR	Human	Live View	Live View
TURL	Human	Live View	Live View
All Datasets	Mixed	Live View	Live View

Grid Search Results

We ran a 2-fold randomized grid search with 25 iterations, once for each detection method. The grid search results are available in this directory.
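
The general idea of a randomized search with 2-fold cross-validation can be sketched as follows. This is a toy illustration under stated assumptions, not the search used for the paper: the "detector" is a simple score threshold, the hyperparameters `offset` and `center` are hypothetical, the data are made up, and candidates are sampled with replacement.

```python
import random
from statistics import mean, median

CENTERS = {"mean": mean, "median": median}


def fit_threshold(scores, labels, idx, params):
    """'Train' a toy detector: a cut-off between the two classes' scores."""
    pos = [scores[i] for i in idx if labels[i] == 1]
    neg = [scores[i] for i in idx if labels[i] == 0]
    if not pos or not neg:  # degenerate fold: fall back to a fixed cut-off
        return 0.5 + params["offset"]
    center = CENTERS[params["center"]]
    return (center(pos) + center(neg)) / 2 + params["offset"]


def accuracy(threshold, scores, labels, idx):
    """Share of held-out items whose thresholded score matches the label."""
    return mean(float((scores[i] >= threshold) == (labels[i] == 1)) for i in idx)


def randomized_search(grid, scores, labels, n_iter=25, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(scores)))
    rng.shuffle(idx)
    folds = [idx[0::2], idx[1::2]]  # two folds of (nearly) equal size
    best_params, best_score = None, -1.0
    for _ in range(n_iter):
        # Draw one random candidate from the grid (with replacement).
        params = {name: rng.choice(values) for name, values in grid.items()}
        fold_scores = []
        for k in range(2):
            train, test = folds[1 - k], folds[k]
            threshold = fit_threshold(scores, labels, train, params)
            fold_scores.append(accuracy(threshold, scores, labels, test))
        score = mean(fold_scores)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Made-up similarity scores and paraphrase labels.
scores = [0.9, 0.8, 0.7, 0.65, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
grid = {"offset": [-0.2, -0.1, 0.0, 0.1, 0.2], "center": ["mean", "median"]}
best_params, best_score = randomized_search(grid, scores, labels)
print(best_params, best_score)
```

Fixing the seed keeps the sampled candidates and the fold assignment reproducible between runs.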

One-on-one correlation graphs of detection methods

For a detailed view of each one-on-one correlation, please refer to this directory.
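
As an illustration of what a one-on-one correlation between detection methods means, here is a minimal sketch that computes the Pearson correlation for every pair of methods from their prediction scores on the same text pairs. The choice of Pearson correlation, the method names, and the scores are assumptions made for this example only.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlations(preds: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Correlate every unordered pair of methods exactly once."""
    return {
        (a, b): pearson(preds[a], preds[b])
        for a, b in combinations(sorted(preds), 2)
    }


# Hypothetical prediction scores of three methods on the same four text pairs.
preds = {
    "method_a": [0.1, 0.4, 0.8, 0.9],
    "method_b": [0.2, 0.8, 1.6, 1.8],  # a scaled copy of method_a
    "method_c": [0.9, 0.6, 0.2, 0.1],  # a mirrored copy of method_a
}
correlations = pairwise_correlations(preds)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```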

Citation

If you use this repository or our paper in your research, please cite it as follows:

@misc{becker2023paraphrase,
      title={Paraphrase Detection: Human vs. Machine Content}, 
      author={Jonas Becker and Jan Philip Wahle and Terry Ruas and Bela Gipp},
      year={2023},
      eprint={2303.13989},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
