Dataset Details for OpenSubtitles #143

tuzhucheng · 2020-04-25T19:34:40Z

Hi, I have some questions about the training details of LASER. In Appendix A it is stated that:

OpenSubtitles2018: A parallel corpus of movie subtitles in 57 languages. The corpus size varies from a few thousand sentences to more than 50 million. We keep at most 2 million entries for each language pair.

For Chinese and Portuguese, there are separate entries depending on the locale: http://opus.nlpl.eu/OpenSubtitles.php

For Chinese we have 2 locales: zh_cn, zh_tw
For Portuguese we also have 2 locales: pt_br, pt

I'm wondering whether, in this case, 2 million entries are kept for each locale (for a total of 4 million for Chinese and 4 million for Portuguese), or whether 1 million are picked for each locale, for a total of 2 million per language.

In addition, how are the 2 million sentences sampled? Is it just the first 2 million for each language pair?
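
To make the sampling question concrete, here are the two strategies I have in mind: keeping the first N pairs in corpus order, or drawing a uniform random sample in one pass. This is only a sketch of my own. The function names `take_first` and `reservoir_sample`, the seed, and the toy corpus are hypothetical, and nothing here is taken from the LASER code or the paper.

```python
import itertools
import random


def take_first(pairs, cap):
    """Keep the first `cap` sentence pairs in corpus order."""
    return list(itertools.islice(pairs, cap))


def reservoir_sample(pairs, cap, seed=0):
    """Keep a uniform random sample of up to `cap` pairs in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, pair in enumerate(pairs):
        if i < cap:
            # Fill the reservoir with the first `cap` items.
            sample.append(pair)
        else:
            # Replace an existing item with probability cap / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < cap:
                sample[j] = pair
    return sample


# Toy stand-in for one language pair; the real cap would be 2_000_000.
corpus = [(f"src {i}", f"tgt {i}") for i in range(10)]
first = take_first(iter(corpus), 4)
sampled = reservoir_sample(iter(corpus), 4)
```

With the first strategy the kept subset depends on the order of the corpus file, while the second does not, which is why I'm asking which one was used.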

Thank you!
