Data Augmentation #17

andra-pumnea · 2020-03-21T10:39:09Z

Experiment with different methods for data augmentation, report results and compare to baseline.

borhenryk · 2020-03-21T10:42:37Z

I will look into back-translation later.

stedomedo · 2020-03-22T08:55:51Z

We could also use PPDB (the Paraphrase Database) to generate additional paraphrased questions:
http://paraphrase.org/#/download
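
To illustrate the idea, here is a minimal sketch of lexical substitution with a PPDB-style table. It assumes the `|||`-separated layout (`LHS ||| PHRASE ||| PARAPHRASE ||| features ||| ...`) used by the downloadable packs; the sample entries and the helper names `load_paraphrases` and `paraphrase_question` are made up for illustration and are not part of PPDB.

```python
import re

# Made-up sample lines in the assumed PPDB layout:
# LHS ||| PHRASE ||| PARAPHRASE ||| features ||| alignment ||| entailment
SAMPLE_PPDB = """\
[NN] ||| symptoms ||| signs ||| PPDB2.0Score=4.1 ||| 0-0 ||| Equivalence
[VB] ||| protect ||| safeguard ||| PPDB2.0Score=3.8 ||| 0-0 ||| Equivalence
[JJ] ||| dangerous ||| hazardous ||| PPDB2.0Score=4.4 ||| 0-0 ||| Equivalence
"""


def load_paraphrases(text):
    """Build a {phrase: [paraphrase, ...]} table from PPDB-style lines."""
    table = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        phrase, paraphrase = fields[1], fields[2]
        table.setdefault(phrase, []).append(paraphrase)
    return table


def paraphrase_question(question, table):
    """Return one variant per matching paraphrase (first occurrence replaced)."""
    variants = []
    for phrase, alternatives in table.items():
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        if pattern.search(question):
            for alt in alternatives:
                # A function replacement avoids backslash escapes in `alt`.
                variants.append(pattern.sub(lambda _m, a=alt: a, question, count=1))
    return variants


if __name__ == "__main__":
    table = load_paraphrases(SAMPLE_PPDB)
    for variant in paraphrase_question("What are the symptoms of the virus?", table):
        print(variant)
```

In practice the generated variants would probably need filtering (for example by the PPDB score), since single-word swaps can change the meaning of a question.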

Timoeller · 2020-03-23T18:19:20Z

Any updates on creating more questions?

Maybe @HenrykBorzymowski can use the MS Azure translator for back-translation here? I heard they offer 2M characters per month for free :)

borhenryk · 2020-03-25T10:54:27Z

I have tried the google-research/uda project (https://github.com/google-research/uda). It has a back-translation component that takes existing sentences, translates them into French and then back into English using different sampling temperatures, which increases the sample size of the existing dataset.

Unfortunately the repository is quite outdated and the packages with the given versions do not work anymore.

Please install these packages (with Python 2.7) and then follow the instructions in the UDA readme file to make it work:

pip install tensorflow-gpu==1.15.2
pip install tensor2tensor==1.15.2
pip install tensorflow-probability==0.7.0

The following command translates the provided sample file in the directory back_translate (google/uda). It automatically divides paragraphs into sentences, translates English sentences into French, and then translates them back into English. Go to the back_translate directory and execute it:

bash download.sh
bash run.sh

download.sh downloads the translation model.
run.sh performs the back-translation with a given sampling temperature (default: 0.9).
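
For anyone who wants to try the same flow with a different translator (e.g. an API), the steps can be sketched roughly as below. This is not the UDA code: `split_sentences` is a naive regex splitter, and `to_pivot` / `from_pivot` are hypothetical callables standing in for whatever translation model or service is used. The toy dictionaries only exist so the sketch runs on its own.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def back_translate(paragraph, to_pivot, from_pivot):
    """Translate each sentence to the pivot language and back, then rejoin."""
    sentences = split_sentences(paragraph)
    return " ".join(from_pivot(to_pivot(sentence)) for sentence in sentences)


# Toy stand-ins for real EN->FR and FR->EN models (hypothetical data).
EN_FR = {
    "How does it spread?": "Comment se propage-t-il ?",
    "Is it dangerous?": "Est-ce dangereux ?",
}
FR_EN = {
    "Comment se propage-t-il ?": "How is it spreading?",
    "Est-ce dangereux ?": "Is it dangerous?",
}

if __name__ == "__main__":
    text = "How does it spread? Is it dangerous?"
    print(back_translate(text, EN_FR.get, FR_EN.get))
```

Passing the translators in as arguments keeps the splitting and rejoining logic independent of the model, so the same function would work with any translation backend.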

I tried some temperature settings (0.3, 0.5, 0.7, 0.9) for the eval_question_similarity_en.csv table and found that rather small temperatures work better for our case (0.3 or 0.5). With 0.7 and 0.9 we get quite a lot of random translations :D
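
A small sketch of why this probably happens, assuming the usual definition of sampling temperature (logits are divided by the temperature before the softmax). The logits below are made-up numbers, not taken from the translation model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stabilised)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
    for temperature in (0.3, 0.5, 0.7, 0.9):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, [round(p, 3) for p in probs])
```

At low temperatures almost all probability mass sits on the best-scoring token, so the output stays close to the most likely translation. At higher temperatures the less likely tokens get sampled more often, which would explain the random translations.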

Attached you will find the results if someone is interested :) This could help us to get more variance in our sentences and to be less dependent on certain words that appear in our training set.

eval_question_similarity_back_trans.xlsx

tholor added the NLP / Modeling label Mar 21, 2020

tholor assigned andra-pumnea and borhenryk Mar 21, 2020
