News

  1. We are currently building large-scale multilingual paraphrase datasets. As planned, the corpus will cover 10 languages, with 50k sentence pairs per language!

  2. Currently this repository contains English paraphrases only. Please check our LanguageNet website for downloads.

Paraphrase-dataset

This repository contains the code and data used in the following paper; please cite it if you use them in your research:

@inproceedings{lan2017continuously,
  author     = {Lan, Wuwei and Qiu, Siyu and He, Hua and Xu, Wei},
  title      = {A Continuously Growing Dataset of Sentential Paraphrases},
  booktitle  = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year       = {2017},
  publisher  = {Association for Computational Linguistics},
  pages      = {1235--1245},
  location   = {Copenhagen, Denmark},
  url        = {http://aclweb.org/anthology/D17-1127}
} 

A few notes

  1. Put your own Twitter API keys into config.py and modify line 59 in main.py before running the code (see the config sketch after this list).
  2. The training and testing files are subsets of the raw data with human annotations. Both files share the same tab-separated format; each line contains: sentence1 \t sentence2 \t (n,6) \t url
  3. Each sentence pair was annotated by 6 Amazon Mechanical Turk workers, where 1 represents paraphrase and 0 represents non-paraphrase, so n out of 6 workers judged the pair to be a paraphrase. If n<=2, we treat the pair as non-paraphrase; if n>=4, we treat it as paraphrase; if n==3, we discard it (see the loading sketch after this list).
  4. After discarding pairs with n==3, this yields 42,200 pairs for training and 9,324 for testing.
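
For note 1, a minimal config.py might look like the sketch below. This is an assumption about the layout, not the repository's actual file; the variable names are illustrative, so check what main.py actually imports before relying on them.

# config.py -- minimal sketch; variable names are assumptions, not the repo's actual API.
# Fill in the credentials from your Twitter developer account.
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"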
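
To illustrate the format in notes 2 and 3, here is a minimal loading sketch in Python. It assumes the third field is written literally as "(n, 6)" with parentheses; the filename train.data is hypothetical, and you may need to adjust the parsing to match your copy of the data.

# load_pairs.py -- minimal sketch for reading the train/test files described above
def load_pairs(path):
    """Yield (sentence1, sentence2, label) after applying the n-vote rule."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            s1, s2, votes, url = line.rstrip("\n").split("\t")
            # votes looks like "(3, 6)"; n counts the workers who voted "paraphrase"
            n = int(votes.strip("()").split(",")[0])
            if n >= 4:
                yield s1, s2, 1   # paraphrase
            elif n <= 2:
                yield s1, s2, 0   # non-paraphrase
            # pairs with n == 3 are ambiguous and are discarded, as in note 3

pairs = list(load_pairs("train.data"))  # hypothetical filename
print(len(pairs))                       # expect 42200 for the training file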

License

The dataset is released for non-commercial use under the CC BY-NC-SA 3.0 license. Use of the data must abide by the Twitter Terms of Service and Developer Policy.