
Multilingual IR with Machine-Translated FAQ #46

Open
2 of 5 tasks
stedomedo opened this issue Mar 22, 2020 · 9 comments
Labels: enhancement (New feature or request)
Comments

@stedomedo (Contributor) commented Mar 22, 2020:

Building multilingual models (zero-shot, transfer learning, etc.) takes time.

So, in the meantime, as stated in #2, we could machine-translate FAQs from English into other languages and add them to the search cluster, so that they can be retrieved for foreign-language input. The background translations don't need to be perfect, just sufficient for retrieval (adequacy before fluency/grammar).

TODOs:

  • Scrape the English FAQ from the data/scrapers repo
  • Build a machine-translator tool (e.g. with https://pypi.org/project/googletrans/; see the sketch after this list)
  • Translate some samples to check quality
  • Translate the full English FAQ
  • Add the data to the Elasticsearch cluster (ESC) with columns: language, original_english_doc, is_machine_translated
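
A rough sketch of the translator tool and the proposed extra columns, assuming a simple CSV with question/answer columns (file and column names are placeholders, and the thread below notes that googletrans proved unreliable, so the translate call is kept swappable):

```python
# Hypothetical sketch, not the actual tool: translate a scraped English FAQ
# CSV and append the extra ESC columns proposed above.
import pandas as pd
from googletrans import Translator  # pip install googletrans

translator = Translator()

def translate_text(text: str, dest: str) -> str:
    # googletrans call; swap this out for another MT backend if needed
    return translator.translate(text, src="en", dest=dest).text

def translate_faq(in_csv: str = "faq_en.csv", out_csv: str = "faq_ar.csv",
                  dest: str = "ar") -> None:
    df = pd.read_csv(in_csv)
    rows = []
    for _, row in df.iterrows():
        rows.append({
            "question": translate_text(row["question"], dest),
            "answer": translate_text(row["answer"], dest),
            "language": dest,                         # proposed ESC column
            "original_english_doc": row["question"],  # proposed ESC column
            "is_machine_translated": True,            # proposed ESC column
        })
    pd.DataFrame(rows).to_csv(out_csv, index=False)

if __name__ == "__main__":
    translate_faq()
```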
@tholor (Member) commented Mar 22, 2020:

Great idea, @stedomedo! Did I get this right that we would still need language-specific models for question similarity with this approach?

Would it be an alternative to translate the user question to English live and then do the matching with our FAQs? With that approach we could easily leverage English models for question similarity.

@stedomedo (Contributor, Author):

Yes, that's an option.
Query translation quality could suffer, though, since queries are short.
I'm currently exploring translation quality.
Thanks!

@stedomedo (Contributor, Author):

The googletrans lib does not work reliably, so I made a free trial account on MS Azure, also because they offer up to 2M characters of translation for free per month.

Here is the English FAQ data including columns for Arabic:
https://github.com/stedomedo/COVID-QA/blob/auto_translators/data/faqs/MT_ar_faq_covidbert.csv

@stedomedo (Contributor, Author):

And the MS translator:
https://github.com/stedomedo/COVID-QA/blob/auto_translators/data/translators/ms_translate.py

MS Translator is supposed to be quite good for Arabic. For other languages, Google or DeepL are better options (AFAIK they don't offer free credits).

I'm still checking which real-time translation option is best to use, budget included.
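
For reference, a minimal sketch of the Azure Translator v3 REST call that a script like ms_translate.py presumably wraps; key, region, and the language pair here are placeholders, not values taken from the actual script:

```python
# Hedged sketch of the Azure Translator v3 REST API.
import requests

ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"
HEADERS = {
    "Ocp-Apim-Subscription-Key": "<your-azure-key>",  # placeholder
    "Ocp-Apim-Subscription-Region": "<your-region>",  # placeholder
    "Content-Type": "application/json",
}

def ms_translate(texts, src="en", dest="ar"):
    params = {"api-version": "3.0", "from": src, "to": dest}
    body = [{"text": t} for t in texts]
    resp = requests.post(ENDPOINT, params=params, headers=HEADERS, json=body)
    resp.raise_for_status()
    # one result object per input text, each with a list of translations
    return [item["translations"][0]["text"] for item in resp.json()]

print(ms_translate(["What are the symptoms of COVID-19?"]))
```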

@stedomedo (Contributor, Author):

@tholor @Timoeller I have a question on the (desired) search workflow.
Is it: user query -> match query to question with BERT -> search with elastic (tf-idf, BM25)?

So could a multilingual workflow be like this (sketched below)?
query -> detect lang
-> if EN -> match query to question with BERT -> search with elastic (tf-idf, BM25)
-> if AR -> search directly with elastic (tf-idf, BM25)
In this case, no multilingual BERT, other-language BERT, or real-time translation would be needed.
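
A sketch of that routing; bert_match and elastic_search are hypothetical stand-ins for the existing pipeline components, and langdetect is just one possible detector:

```python
from langdetect import detect  # pip install langdetect

def bert_match(query: str) -> str:
    """Placeholder for the BERT question-similarity step."""
    return query

def elastic_search(query: str) -> list:
    """Placeholder for tf-idf / BM25 retrieval against the ESC."""
    return [f"hits for: {query}"]

def answer(query: str) -> list:
    lang = detect(query)  # returns ISO 639-1 codes like "en", "ar"
    if lang == "en":
        return elastic_search(bert_match(query))
    # non-English: the machine-translated docs are already in the index
    return elastic_search(query)

print(answer("What are the symptoms?"))
```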

@Timoeller (Contributor):

Good points.

Can you create a PR with the translation and the script for doing so? I would merge it to have this functionality in the repo.

About the language detection and the switch between BERT + ES and ES only: we could implement it this way if multilingual isn't working well for other languages.

Do you have experience with language detection and could you write a script for it, so we can integrate it into the backend? We need language detection there anyway, because we want to adjust output texts like "source", "category", etc. The script should be rather efficient, since this will limit response time...
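
One efficient option could be the pretrained fastText LID model; a sketch, assuming lid.176.ftz (~1 MB) has been downloaded from https://fasttext.cc/docs/en/language-identification.html:

```python
import fasttext  # pip install fasttext

model = fasttext.load_model("lid.176.ftz")  # load once at backend start-up

def detect_lang(text: str) -> str:
    # predict() rejects newlines, so strip them first
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")  # e.g. "en", "ar"

print(detect_lang("What are the symptoms of COVID-19?"))  # -> "en"
```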

Timoeller self-assigned this Mar 22, 2020
Timoeller added the enhancement (New feature or request) label Mar 22, 2020
@stedomedo (Contributor, Author) commented Mar 23, 2020:

One idea for "simple" transfer learning:
In machine translation this technique is commonly used when you have a low-resource language. Basically, you build a model for language Y on top of the model for language X by just continuing the training (1-2 epochs) with the language-Y data. Vocabularies would need to be pooled across all languages, though.
This could work for small data sizes and/or machine-translated texts.
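
A sketch of that continued-training idea under stated assumptions: start from a multilingual checkpoint (so the vocab is already pooled) and continue masked-LM training for a couple of epochs on language-Y text. Model and file names are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# hypothetical file: one (machine-translated) language-Y sentence per line
ds = load_dataset("text", data_files={"train": "faq_translated.txt"})["train"]
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued_mlm", num_train_epochs=2),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer),  # random masking
)
trainer.train()
```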

@Timoeller (Contributor) commented Mar 23, 2020:

That is exactly the idea! :)
With multilingual models like mBERT or XLM-R this "zero-shot learning" is easily possible because the vocab is already pooled across all supported languages.
See e.g. Table 1 or 3 in the XLM-R paper for zero-shot transfer.

So if we train a multilingual model with Sentence-BERT on Quora, we will also be able to match all other languages - hopefully with good performance 💃
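
A sketch of how that matching step could look with sentence-transformers; "distiluse-base-multilingual-cased" is an off-the-shelf example model, not the Quora-trained one discussed here:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased")

faq_questions = [
    "What are the symptoms of COVID-19?",
    "How does the virus spread?",
]
# embed the FAQ once, then match incoming queries by cosine similarity
faq_emb = model.encode(faq_questions, convert_to_tensor=True)

query = "Wie verbreitet sich das Virus?"  # German query, matched zero-shot
scores = util.cos_sim(model.encode(query, convert_to_tensor=True), faq_emb)
print(faq_questions[int(scores.argmax())])  # -> "How does the virus spread?"
```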

@ViktorAlm (Contributor):

You are probably aware of these datasets, but here is some multilingual similarity data. I have an NMT model for English->Swedish; if you want, I could machine-translate and add some data for better performance on Scandinavian languages.

https://github.com/google-research-datasets/paws
https://www.nyu.edu/projects/bowman/xnli/
