
How to select the 5000/1500 words when building the dictionaries? #24

Closed
fallingstar621 opened this issue Feb 9, 2018 · 6 comments

Comments

@fallingstar621

fallingstar621 commented Feb 9, 2018

Hi, I was wondering how the 5000+ pairs and 1500+ pairs were selected to build the training/testing dictionaries? Since the full dictionary can contain 100K+ pairs, do we just take the most frequent words? I understand the pre-defined dictionary is only used in the first iteration of supervised training, but how much does the initial selection of translation pairs affect the alignment performance? Another question: why select 5000? Would it help to include more translation pairs in the training dictionary? Thanks in advance!

@glample
Contributor

glample commented Feb 10, 2018

Hello,

In the supervised approach, we generated translations for all words from the source language to the target language, and vice versa (a translation being a pair (x, y) associated with the probability that y is the correct translation of x). We then considered all pairs of words (x, y) such that y has a high probability of being a translation of x, and x also has a high probability of being a translation of y. Finally, we sorted all generated translation pairs by frequency of the source word, and took the first 5000 resulting pairs for training and the following 1500 for testing.
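For concreteness, here is a minimal Python sketch of that selection procedure; it is not the actual generation script, and the inputs (forward/backward translation probabilities, source-word frequencies) and the probability threshold are assumptions made for illustration.

```python
# Hypothetical sketch of the selection described above, not the actual script.
# Assumed inputs: p_src2tgt[(x, y)] = P(y | x), p_tgt2src[(y, x)] = P(x | y),
# and src_freq[x] = corpus frequency of the source word x.

def build_train_test_dico(p_src2tgt, p_tgt2src, src_freq,
                          threshold=0.5, n_train=5000, n_test=1500):
    # keep pairs that are confident translations in both directions
    pairs = [
        (x, y) for (x, y), p in p_src2tgt.items()
        if p >= threshold and p_tgt2src.get((y, x), 0.0) >= threshold
    ]
    # sort by source-word frequency, most frequent first
    pairs.sort(key=lambda xy: -src_freq.get(xy[0], 0))
    # first 5000 pairs for training, following 1500 for testing
    return pairs[:n_train], pairs[n_train:n_train + n_test]
```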

The initial pair selection most likely has an impact on the alignment performance, but we did not study this extensively. We did notice that the results in the supervised setting changed depending on how we selected the pairs. In particular, when we selected pairs with very little ambiguity (no multiple possible translations), the translation accuracy was better; but note that the test set was also not the same, and the difference in test pairs alone may be enough to explain the differences.
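As an illustration of the "low ambiguity" variant mentioned above (not necessarily the exact filtering used in the experiments), one could keep only pairs whose source word has a single candidate translation:

```python
from collections import Counter

def keep_unambiguous(pairs):
    # drop every pair whose source word appears with more than one translation
    counts = Counter(x for x, _ in pairs)
    return [(x, y) for x, y in pairs if counts[x] == 1]
```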

Previous work has shown that using more than 5000 pairs of words does not improve performance (Artetxe et al., 2017), and can even be detrimental (see Dinu et al., 2015). This is why we decided to consider only 5000 pairs (and also because we wanted to be consistent with previous work).

@fallingstar621
Author

@glample Thank you for providing more insights! Also, congratulations on the acceptance of the paper!

@glample
Contributor

glample commented Feb 12, 2018

Thank you :)

glample closed this as completed on Feb 12, 2018
@fallingstar621
Author

fallingstar621 commented Feb 15, 2018

@glample Can I ask another question? Why is the pre-defined dictionary only used in the first iteration of supervised training? Can we use the pre-defined dictionary rather than the one built from the embeddings in the following iterations? I tried supervised training for several language pairs. In some cases, I observed that the precision@k metric actually drops over iterations (starting from the second iteration), and the number of generated translation pairs changes as well. Does that mean Procrustes can make the alignment worse? Have you experienced this kind of "convergence" problem in your experiments? Any suggestions on changing the parameters (e.g., number of iterations, dico_threshold, dico_max_rank, etc.)? Thanks in advance!

@glample
Contributor

glample commented Feb 15, 2018

> Can we use the pre-defined dictionary rather than the one built from the embeddings in the following iterations?

Do you mean using the pre-defined dictionary in addition to the dictionary generated by the alignment, or instead of the generated dictionary? Currently we use the generated dictionary for the next iteration and completely discard the pre-defined dictionary. But it is true that you could probably use a combination of both and make the supervised + refinement model even stronger.
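To make that suggestion concrete, here is a rough numpy sketch of a refinement loop where the seed dictionary could be kept at every iteration instead of being discarded. The dictionary-generation step is a simplified mutual-nearest-neighbour stand-in (MUSE actually uses CSLS), and all function names are illustrative, not MUSE's API.

```python
import numpy as np

def procrustes(X, Y):
    # closed-form solution of min_W ||W X - Y||_F with W orthogonal:
    # W = U V^T, where U S V^T is the SVD of Y X^T
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

def generate_dico(mapped_src, tgt, max_rank=10000):
    # simplified mutual-nearest-neighbour dictionary (MUSE uses CSLS instead)
    a = mapped_src[:max_rank]
    b = tgt[:max_rank]
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T
    s2t, t2s = sims.argmax(axis=1), sims.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(s2t) if t2s[j] == i]

def refine(src_emb, tgt_emb, seed_pairs, n_iter=5, keep_seed=False):
    pairs = list(seed_pairs)
    W = np.eye(src_emb.shape[1])
    for _ in range(n_iter):
        # columns of X and Y are the paired source/target embeddings
        X = src_emb[[i for i, _ in pairs]].T
        Y = tgt_emb[[j for _, j in pairs]].T
        W = procrustes(X, Y)
        generated = generate_dico(src_emb @ W.T, tgt_emb)
        # the current behaviour discards the seed dictionary here;
        # keep_seed=True would instead combine it with the generated one
        pairs = list(seed_pairs) + generated if keep_seed else generated
    return W
```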

We sometimes observed that the iterations at step t >= 2 were not as good as the initial one, but only for language pairs whose embeddings are difficult to align, like en-ru or en-zh. For pairs of European languages we did not observe anything like this.

@fallingstar621
Author

@glample Thanks for the reply. Again, great insights!
