jpwahle/iconf22-paraphrase

Identifying Machine-Paraphrased Plagiarism

This repository implements the paper "Identifying Machine-Paraphrased Plagiarism". It is structured in two parts: (1) classical machine learning models that rely on static word embeddings, and (2) neural language models that learn paraphrase detection end-to-end.

Detailed instructions for reproducing each experiment are in the respective README files.

Dataset

The dataset created for this publication can be downloaded from HuggingFace and Zenodo. To evaluate the neural language models, you can also use the NLM/prepare_data.sh script, which downloads and extracts the data for you.

Results

To reproduce the machine learning and word embedding experiments, follow ML-README. To reproduce the neural language model experiments, follow NLM-README.

Detailed ML results

Spinbot

Spinnerchief-DF

Spinnerchief-IF

Detailed NLM results

The checkpoints for each experiment are available on the HuggingFace model hub. Our models are named:

  • jpelhaw/bert-base-uncased-pd
  • jpelhaw/bart-base-pd
  • jpelhaw/xlnet-base-cased-pd
  • jpelhaw/electra-base-discriminator-pd
  • jpelhaw/longformer-base-4096-pd
  • jpelhaw/albert-base-uncased-pd
  • jpelhaw/distilbert-base-uncased-pd
  • jpelhaw/roberta-base-pd
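As a rough illustration of how one of these checkpoints might be used, the sketch below loads a model by name with the `transformers` Auto classes and returns class probabilities for a piece of text. It assumes the checkpoints expose a sequence-classification head and that `transformers` and `torch` are installed. The helper names `load_checkpoint` and `class_probabilities` are hypothetical and not part of this repository. Which output index corresponds to "machine-paraphrased" is not stated here, so check `model.config.id2label` before interpreting the scores.

```python
from typing import List

# Checkpoint names exactly as listed above.
CHECKPOINTS: List[str] = [
    "jpelhaw/bert-base-uncased-pd",
    "jpelhaw/bart-base-pd",
    "jpelhaw/xlnet-base-cased-pd",
    "jpelhaw/electra-base-discriminator-pd",
    "jpelhaw/longformer-base-4096-pd",
    "jpelhaw/albert-base-uncased-pd",
    "jpelhaw/distilbert-base-uncased-pd",
    "jpelhaw/roberta-base-pd",
]


def load_checkpoint(name: str):
    """Return (tokenizer, model) for one of the listed checkpoints.

    Hypothetical helper; assumes a sequence-classification head.
    """
    if name not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {name}")
    # Imported lazily so this module can be inspected without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def class_probabilities(text: str, name: str = CHECKPOINTS[0]) -> List[float]:
    """Return softmax probabilities over the model's output classes."""
    import torch

    tokenizer, model = load_checkpoint(name)
    # Truncate to the model's maximum input length.
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0].tolist()
```

Calling `class_probabilities("Some paragraph to check.")` downloads the first checkpoint on first use and returns one probability per class. For repeated predictions, call `load_checkpoint` once and reuse the returned tokenizer and model.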

The detailed results for each experiment are shown in the following table:

Citation

If you use this repository or the results from our paper for your research work, please cite us in the following way.

@inproceedings{Wahle2022b,
  title        = {{Identifying} {Machine}-{Paraphrased} {Plagiarism}},
  author       = {Wahle, Jan Philip and Ruas, Terry and Foltynek, Tomas and Meuschke, Norman and Gipp, Bela},
  year         = 2022,
  month        = {February},
  booktitle    = {Proceedings of the iConference},
  location     = {Virtual Event},
  topic        = {pd},
  doi          = {https://doi.org/10.1007/978-3-030-96957-8_34}
}

If you use the dataset, please also cite:

@inproceedings{Foltynek2020,
  title = {Detecting {Machine}-obfuscated {Plagiarism}},
  booktitle = {Proceedings of the {iConference} 2020},
  author = {Folt{\'y}nek, Tom{\'a}{\v s} and Ruas, Terry and Scharpf, Philipp and Meuschke, Norman and Schubotz, Moritz and Grosky, William and Gipp, Bela},
  year = {2020},
  series    = {Lecture Notes in Computer Science},
  publisher = {Springer},
  doi = {https://doi.org/10.5281/zenodo.3608000}
}

About

The official implementation of the iConference 2022 paper "Identifying Machine-Paraphrased Plagiarism".
