ExaPPC

ExaPPC is a large paraphrase corpus consisting of monolingual sentence-level paraphrases using different sources.

It is the first large-scale paraphrase dataset in Persian that can be used to train state of the art language models like BERT. There are 2.3M labeled sentence pairs in the corpus consisting of 1M paraphrase label and 1.3M non-paraphrase label. Efforts were made manually and semi-automatically to construct this corpus.

The advantages of this corpus compared to the existing ones are the number of pair sentences, sentence Length variation and textual diversity, including formal and dialogue sentences.

Data

Format of data is as follows and delimited by tab:

sentence1, sentence2, label(paraphrase, non-paraphrase), manner(Human-Annotation, Semi-Automatically)

Statistics of ExaPPC by number of sentence pairs and tokens are as below:

Dataset	ExaPPC
Size(sentence pairs)	2,342,145
Size(tokens)	102,149,576
Average sentence length	22
Distribution	Total paraphrase: 986k/Total non-paraphrase: 1.3 M
Number of labels	2(Paraphrase, Non-paraphrase)

Results

Our results suggest that ExaPPC will be helpful in a variety of NLP applications like paraphrase detection task.

We fine-tuned ParsBert using presented corpus and developing a paraphrase detection in Persian with an accuracy of 94% on our test data.

Future Releases

Future releases of ExaPPC will focus on expanding the paraphrase pairs regarding data size usable for paraphrase generation downstream task and increasing the size of related pairs to have a more divergent corpus. Our goal is to provide ExaPPC as a continuous updating and improvement resource.

Citation

If you are using this dataset, please cite the initial paper:

@INPROCEEDINGS{9786243,
  author={Sadeghi, Reyhaneh and Karbasi, Hamed and Akbari, Ahmad},
  booktitle={2022 8th International Conference on Web Research (ICWR)}, 
  title={ExaPPC: a Large-Scale Persian Paraphrase Detection Corpus}, 
  year={2022},
  volume={},
  number={},
  pages={168-175},
  doi={10.1109/ICWR54782.2022.9786243}}

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.gitattributes		.gitattributes
ExaPPC.txt.gz		ExaPPC.txt.gz
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitattributes

.gitattributes

ExaPPC.txt.gz

ExaPPC.txt.gz

LICENSE

LICENSE

README.md

README.md

Repository files navigation

ExaPPC

Data

Results

Future Releases

Citation

About

Releases

Packages

Contributors 3

License

exaco/exappc

Folders and files

Latest commit

History

Repository files navigation

ExaPPC

Data

Results

Future Releases

Citation

About

Resources

License

Stars

Watchers

Forks