
Cannot reproduce your results #1

Closed · george-philipp opened this issue Sep 16, 2019 · 9 comments

Comments

@george-philipp

Hey there,

Thank you for putting up this repo. I quickly ran your method, the word mover distance with unigrams, on the WMT17 de-en language pair, and the Pearson correlation is only 0.645, quite a bit worse than what you report in the paper. Can you double-check the code release?

Also, it took me 8 minutes to run on these 560 sentences. Is this expected, or am I doing something wrong?
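For reference, the number above is the segment-level Pearson r between metric scores and the WMT17 human judgments. A minimal sketch of that computation (assuming two aligned score lists; this is not the repo's evaluation script):

```python
# Minimal sketch, not the repo's evaluation code: segment-level Pearson
# correlation between metric scores and human judgments, assuming the two
# lists are aligned by segment.
from scipy.stats import pearsonr

metric_scores = [0.71, 0.42, 0.88]  # e.g. WMD-unigram scores per segment
human_scores = [0.65, 0.30, 0.90]   # e.g. WMT17 direct-assessment scores

r, _ = pearsonr(metric_scores, human_scores)
print({'pearson': r})
```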

@andyweizhao
Collaborator

Thank you for your interest. This is a preliminary web service that includes the main implementation. To reproduce the numbers in the paper, additional steps are required (the code needs small changes, which I will add soon):

  1. Use the BERT model fine-tuned on MNLI instead of the original version.
  2. Simply remove the subwords that contain "##" in the unigram setting, because the latter part of a word, such as "ing" in "watching" or "ed" in "watched", often carries little of the core meaning (see the sketch after this list).
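A minimal sketch of step 2, assuming BERT's WordPiece convention of marking continuation pieces with a leading "##" (illustration only, not the repo's exact code):

```python
# Illustrative sketch of step 2, not the repo's exact code: drop WordPiece
# continuation tokens (those starting with "##") so that only the leading
# subword of each word remains in the unigram setting.
tokens = ['watch', '##ing', 'the', 'match', '##ed', 'pairs']
filtered = [t for t in tokens if not t.startswith('##')]
print(filtered)  # ['watch', 'the', 'match', 'pairs']
```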

Due to time constraints, the current version of the web service supports CPU environments only, but more features will be released in the next update.

@george-philipp
Author

Hi @andyweizhao, thank you for your swift reply.

I understand that the code base is not using the MNLI model. However, the correlation I computed is still worse than the numbers shown in the BERT+PMEANS row.

By the way, do you apply this trick (removing subwords) in all of the studies in your paper? For example, do you also use it for HMD + BERT in Table 5?

@andyweizhao
Collaborator

Hi George,

I forgot one additional step: TF-IDF weights are required. I will try to fix these issues this week.

When combining BERT-MNLI, TF-IDF, and subword removal, you should see numbers similar to the ones below from my server (wmd-unigram):
de-en {'pearson': 0.7082533292728657}

I used this trick in all tasks and in most language pairs, except "fi-en" and "lv-en".
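To make the TF-IDF step concrete, here is a minimal sketch of the IDF part, assuming plain whitespace tokenization; it only illustrates the idea, not the repo's implementation:

```python
# Minimal IDF sketch (whitespace tokenization assumed; illustration only).
from collections import Counter
from math import log

corpus = ["the cat sat on the mat", "the dog barked"]
n_docs = len(corpus)
# Document frequency: in how many sentences does each token appear?
df = Counter(tok for doc in corpus for tok in set(doc.split()))
# Smoothed inverse document frequency per token.
idf = {tok: log((n_docs + 1) / (freq + 1)) for tok, freq in df.items()}
print(idf['the'], idf['cat'])
```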

@andyweizhao
Collaborator

I just updated the repo to support reproducibility on MT. I will close this issue; please create new ones if you have additional questions.

@george-philipp
Author

Wow, thank you for making this happen. This is very helpful.

I tried to run the code, but it seems some of the files are missing, namely the translation data. Could you be so kind as to upload them as well?

@andyweizhao
Collaborator

Sure thing. I just uploaded them.

@Alex-Fabbri

Hi, thanks for the great work!
Following up on reproducing results: when I run examples/run_MT.py with v1 of moverscore, I am able to reproduce the "WMD-1+BERTMNLI+PMeans" results from the README, but when I run v2 I get different results than "WMD-2+BERTMNLI+PMeans":

cs-en pearson: 0.67
de-en pearson: 0.66
ru-en pearson: 0.71
tr-en pearson: 0.73
zh-en pearson: 0.70

I'm attaching the result of running pip freeze > requirements.txt
requirements.txt

Do you have any ideas on the cause of the difference?

Thank you!

@andyweizhao
Collaborator

Hi Alex,
For reproducing the results, moverscore_v1 is all you need: set the parameter "n_gram" to 1 for WMD-1 and to 2 for WMD-2 (a usage sketch follows below). However, the running speed of that version is painfully slow. I made a lighter version, moverscore_v2, for acceleration: it uses DistilBERT instead of BERT, makes the code more efficient, and removes WMD-2. This drops performance a little, sadly, but it still correlates well with human judgments. Choose between the two versions according to your purpose :)
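A usage sketch based on this thread and the repo's README; the names (get_idf_dict, word_mover_score) follow the README's example, but the exact signatures may differ between versions:

```python
# Usage sketch following the README; exact signatures may vary by version.
from moverscore import get_idf_dict, word_mover_score  # moverscore_v2 for the fast variant

references = ["the cat sat on the mat"]
translations = ["a cat was sitting on the mat"]

idf_dict_ref = get_idf_dict(references)
idf_dict_hyp = get_idf_dict(translations)

# n_gram=1 reproduces WMD-1; n_gram=2 reproduces WMD-2 (v1 only).
scores = word_mover_score(references, translations, idf_dict_ref, idf_dict_hyp,
                          stop_words=[], n_gram=1, remove_subwords=True)
print(scores)
```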

@Alex-Fabbri

That makes sense. Thanks a lot for the clarification!
