
Multilingual IR with Machine-Translated FAQ #46

Open
2 of 5 tasks
stedomedo opened this issue Mar 22, 2020 · 9 comments
Labels: enhancement (New feature or request)
Comments

@stedomedo (Contributor) commented Mar 22, 2020:

Building multilingual models (zero-shot, transfer learning, etc.) takes time.

So, in the meantime, as stated in #2, we could machine-translate FAQs from English into other languages and add them to the search cluster, so that they can be retrieved for foreign-language input. The background translations don't need to be perfect, just sufficient for retrieval (adequacy before fluency/grammar).

TODOs:

  • Scrape the English FAQ from the data/scrapers repo
  • Build a machine-translator tool (e.g. with https://pypi.org/project/googletrans/; see the sketch after this list)
  • Translate some samples to check quality
  • Translate the full English FAQ
  • Add the data to the Elasticsearch cluster (ESC) with columns: language, original_english_doc, is_machine_translated
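
A rough sketch of the translator tool and the proposed extra columns, assuming a simple CSV with question/answer columns (file and column names are placeholders, and the thread below notes that googletrans proved unreliable, so the translate call is kept swappable):

```python
# Hypothetical sketch, not the actual tool: translate a scraped English FAQ
# CSV and append the extra ESC columns proposed above.
import pandas as pd
from googletrans import Translator  # pip install googletrans

translator = Translator()

def translate_text(text: str, dest: str) -> str:
    # googletrans call; swap this out for another MT backend if needed
    return translator.translate(text, src="en", dest=dest).text

def translate_faq(in_csv: str = "faq_en.csv", out_csv: str = "faq_ar.csv",
                  dest: str = "ar") -> None:
    df = pd.read_csv(in_csv)
    rows = []
    for _, row in df.iterrows():
        rows.append({
            "question": translate_text(row["question"], dest),
            "answer": translate_text(row["answer"], dest),
            "language": dest,                         # proposed ESC column
            "original_english_doc": row["question"],  # proposed ESC column
            "is_machine_translated": True,            # proposed ESC column
        })
    pd.DataFrame(rows).to_csv(out_csv, index=False)

if __name__ == "__main__":
    translate_faq()
```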
@tholor (Member) commented Mar 22, 2020:

Great idea, @stedomedo! Did I get this right that we would still need language-specific models for question similarity with this approach?

Would it be an alternative to translate the user question to English live and then do the matching with our FAQs? With that approach we could easily leverage English models for question similarity.

@stedomedo (Contributor, Author):

Yes, that's an option.
Query translation quality could suffer, though, since queries are short.
I'm currently exploring translation quality.
Thanks!

@stedomedo (Contributor, Author):

The googletrans lib does not work reliably, so I made a free trial account on MS Azure, also because they offer up to 2M characters of translation for free per month.

Here is the English FAQ data including columns for Arabic:
https://github.com/stedomedo/COVID-QA/blob/auto_translators/data/faqs/MT_ar_faq_covidbert.csv

@stedomedo (Contributor, Author):

And the MS translator:
https://github.com/stedomedo/COVID-QA/blob/auto_translators/data/translators/ms_translate.py

MS Translator is supposed to be quite good for Arabic. For other languages, Google or DeepL are better options (AFAIK they don't offer free credits).

I'm still checking which real-time translation option is best to use, budget included.
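
For reference, a minimal sketch of the Azure Translator v3 REST call that a script like ms_translate.py presumably wraps; key, region, and the language pair here are placeholders, not values taken from the actual script:

```python
# Hedged sketch of the Azure Translator v3 REST API.
import requests

ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"
HEADERS = {
    "Ocp-Apim-Subscription-Key": "<your-azure-key>",  # placeholder
    "Ocp-Apim-Subscription-Region": "<your-region>",  # placeholder
    "Content-Type": "application/json",
}

def ms_translate(texts, src="en", dest="ar"):
    params = {"api-version": "3.0", "from": src, "to": dest}
    body = [{"text": t} for t in texts]
    resp = requests.post(ENDPOINT, params=params, headers=HEADERS, json=body)
    resp.raise_for_status()
    # one result object per input text, each with a list of translations
    return [item["translations"][0]["text"] for item in resp.json()]

print(ms_translate(["What are the symptoms of COVID-19?"]))
```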

@stedomedo (Contributor, Author):

@tholor @Timoeller I have a question on the (desired) search workflow.
Is it: user query -> match query to question with BERT -> search with elastic (tf-idf, BM25)?

So could a multilingual workflow be like this (sketched below)?
query -> detect lang
-> if EN -> match query to question with BERT -> search with elastic (tf-idf, BM25)
-> if AR -> search directly with elastic (tf-idf, BM25)
In this case, no multilingual BERT, other-language BERT, or real-time translation would be needed.
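
A sketch of that routing; bert_match and elastic_search are hypothetical stand-ins for the existing pipeline components, and langdetect is just one possible detector:

```python
from langdetect import detect  # pip install langdetect

def bert_match(query: str) -> str:
    """Placeholder for the BERT question-similarity step."""
    return query

def elastic_search(query: str) -> list:
    """Placeholder for tf-idf / BM25 retrieval against the ESC."""
    return [f"hits for: {query}"]

def answer(query: str) -> list:
    lang = detect(query)  # returns ISO 639-1 codes like "en", "ar"
    if lang == "en":
        return elastic_search(bert_match(query))
    # non-English: the machine-translated docs are already in the index
    return elastic_search(query)

print(answer("What are the symptoms?"))
```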

@Timoeller (Contributor):

Good points.

Can you create a PR with the translation and the script for doing so? I would merge it to have this functionality in the repo.

About the language detection and the switch between BERT + ES and ES only: we could implement it this way if multilingual isn't working well for other languages.

Do you have experience with language detection and could you write a script for it, so we can integrate it into the backend? We need language detection there anyway, because we want to adjust output texts like "source", "category", etc. The script should be rather efficient, since this will limit response time...
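
One efficient option could be the pretrained fastText LID model; a sketch, assuming lid.176.ftz (~1 MB) has been downloaded from https://fasttext.cc/docs/en/language-identification.html:

```python
import fasttext  # pip install fasttext

model = fasttext.load_model("lid.176.ftz")  # load once at backend start-up

def detect_lang(text: str) -> str:
    # predict() rejects newlines, so strip them first
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")  # e.g. "en", "ar"

print(detect_lang("What are the symptoms of COVID-19?"))  # -> "en"
```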

Timoeller self-assigned this Mar 22, 2020
Timoeller added the enhancement (New feature or request) label Mar 22, 2020
@stedomedo (Contributor, Author) commented Mar 23, 2020:

One idea for "simple" transfer learning:
In machine translation this technique is commonly used when you have a low-resource language. Basically, you build a model for language Y on top of the model for language X by just continuing the training (1-2 epochs) with the language-Y data. Vocabularies would need to be pooled across all languages, though.
This could work for small data sizes and/or machine-translated texts.
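
A sketch of that continued-training idea under stated assumptions: start from a multilingual checkpoint (so the vocab is already pooled) and continue masked-LM training for a couple of epochs on language-Y text. Model and file names are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# hypothetical file: one (machine-translated) language-Y sentence per line
ds = load_dataset("text", data_files={"train": "faq_translated.txt"})["train"]
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued_mlm", num_train_epochs=2),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer),  # random masking
)
trainer.train()
```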

@Timoeller (Contributor) commented Mar 23, 2020:

That is exactly the idea! :)
With multilingual models like mBERT or XLM-R this "zero-shot learning" is easily possible because the vocab is already pooled across all supported languages.
See e.g. Table 1 or 3 in the XLM-R paper for zero-shot transfer.

So if we train a multilingual model with Sentence-BERT on Quora, we will also be able to match all other languages - hopefully with good performance 💃
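
A sketch of how that matching step could look with sentence-transformers; "distiluse-base-multilingual-cased" is an off-the-shelf example model, not the Quora-trained one discussed here:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased")

faq_questions = [
    "What are the symptoms of COVID-19?",
    "How does the virus spread?",
]
# embed the FAQ once, then match incoming queries by cosine similarity
faq_emb = model.encode(faq_questions, convert_to_tensor=True)

query = "Wie verbreitet sich das Virus?"  # German query, matched zero-shot
scores = util.cos_sim(model.encode(query, convert_to_tensor=True), faq_emb)
print(faq_questions[int(scores.argmax())])  # -> "How does the virus spread?"
```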

@ViktorAlm (Contributor):

You are probably aware of these datasets, but here is some multilingual similarity data. I have an NMT model for English->Swedish; if you want, I could machine-translate and add some data for better performance on Scandinavian languages.

https://github.com/google-research-datasets/paws
https://www.nyu.edu/projects/bowman/xnli/
