Data Augmentation #17

Open
andra-pumnea opened this issue Mar 21, 2020 · 4 comments

@andra-pumnea
Contributor

Experiment with different methods for data augmentation, report results and compare to baseline.

@borhenryk
Contributor

I will look into back-translation later.

@stedomedo
Contributor

stedomedo commented Mar 22, 2020

One option is to use PPDB (the Paraphrase Database) to generate additional paraphrased questions:
http://paraphrase.org/#/download
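
A minimal sketch of how PPDB entries could be turned into extra question variants. It assumes the ` ||| `-separated line layout of the PPDB text packs (syntactic label, phrase, paraphrase, features, and so on). The two sample lines, their scores, the example question, and the helper names `load_paraphrases` and `paraphrase_question` are invented for illustration and are not taken from the actual download.

```python
import re
from collections import defaultdict

# Invented sample lines in the assumed PPDB layout:
# LHS ||| phrase ||| paraphrase ||| features ||| alignment ||| entailment
SAMPLE_PPDB = """\
[NN] ||| symptoms ||| signs ||| PPDB2.0Score=4.1 ||| 0-0 ||| Equivalence
[JJ] ||| dangerous ||| harmful ||| PPDB2.0Score=3.9 ||| 0-0 ||| Equivalence
"""


def load_paraphrases(lines):
    """Map each source phrase to the list of its paraphrases."""
    table = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 3:
            continue  # skip lines that do not match the assumed layout
        table[fields[1]].append(fields[2])
    return table


def paraphrase_question(question, table):
    """Return one new question per (matching phrase, paraphrase) pair."""
    variants = []
    for phrase, alternatives in table.items():
        pattern = re.compile(r"\b%s\b" % re.escape(phrase))
        if pattern.search(question):
            for alt in alternatives:
                # A function replacement keeps backslashes in `alt` literal.
                variants.append(pattern.sub(lambda _m, a=alt: a, question))
    return variants


table = load_paraphrases(SAMPLE_PPDB.splitlines())
variants = paraphrase_question("What are the symptoms of the virus?", table)
print(variants)
```

Whole-word substitution like this ignores context, so the generated questions would still need a manual check or a score threshold before being added to the training set.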

@Timoeller
Contributor

Any updates on creating more questions?

Maybe @HenrykBorzymowski can use the MS Azure translator here for back-translation? I heard they offer 2M characters per month for free : )

@borhenryk
Contributor

I have tried the google/uda project (https://github.com/google-research/uda). It has a back-translation component that takes existing sentences, translates them into French and then back into English, using different temperature parameters. This increases the sample size of the existing dataset.

Unfortunately the repository is quite outdated and the packages with the given versions do not work anymore.

Please install these packages (with python==2.7) and then follow the instructions in the UDA readme file to make it work:

pip install tensorflow-gpu==1.15.2
pip install tensor2tensor==1.15.2
pip install tensorflow-probability==0.7.0

The following command translates the provided sample file in the directory back_translate (google/uda). It automatically divides paragraphs into sentences, translates English sentences into French, and then translates them back into English. Go to the back_translate directory and execute it:

bash download.sh
bash run.sh
  • download.sh downloads the translation model
  • run.sh performs the back-translation at a given temperature (default: 0.9)
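
The flow described above (split a paragraph into sentences, translate each into French, translate back into English) can be sketched roughly as follows. This is not the UDA code: `split_sentences`, `back_translate`, and the two stand-in translators are hypothetical names, the regex splitter is a simplification, and passing the same temperature to both directions is an assumption. In practice the two callables would wrap a real translation model or API.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def back_translate(paragraph, to_pivot, from_pivot, temperature=0.9):
    """Translate each sentence to the pivot language and back again.

    to_pivot and from_pivot are caller-supplied callables taking
    (text, temperature) and returning the translated text.
    """
    augmented = []
    for sentence in split_sentences(paragraph):
        pivot = to_pivot(sentence, temperature)
        augmented.append(from_pivot(pivot, temperature))
    return " ".join(augmented)


# Stand-in translators so the sketch runs without a model (hypothetical).
def fake_en_to_fr(text, temperature):
    return "<fr>" + text


def fake_fr_to_en(text, temperature):
    return text.replace("<fr>", "", 1)


paragraph = "How does it spread? Is it dangerous."
for temperature in (0.3, 0.5):
    print(temperature, back_translate(paragraph, fake_en_to_fr, fake_fr_to_en, temperature))
```

Keeping the translators as parameters means the same loop could be reused with a different backend without touching the splitting logic.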

I tried several temperature settings (0.3, 0.5, 0.7, 0.9) on the eval_question_similarity_en.csv table and found that lower temperatures (0.3 or 0.5) work better for our case. With 0.7 and 0.9 we get quite a lot of random translations :D
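
A small illustration of why higher temperatures tend to give more random output. It assumes the decoder samples the next token from a softmax over logits divided by the temperature, which is the usual meaning of the parameter; the UDA code itself was not checked here. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for stability."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

for temperature in (0.3, 0.5, 0.7, 0.9):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At lower temperatures most of the probability mass sits on the top-scoring token, so sampled translations stay close to the most likely one. At higher temperatures the distribution flattens and unlikely tokens get picked more often.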

Attached you will find the results if someone is interested :) This could help us to get more variance in our sentences and to be less dependent on certain words that appear in our training set.

eval_question_similarity_back_trans.xlsx
