Some language pairs unexpectedly low quality or missing #129
I just noticed that the very top sentence pairs in, e.g., Armenian-German are also terrible.
As you can see from the differing lengths, symbols and so on, these are completely different sentences. Here is one that is mostly in Latin script, so you can appreciate how different they are:
My threshold is the very conservative 1.10, and these are, as mentioned, the very top pairs. |
I would contrast this with Arabic-Chinese. It also doesn't have many quality pairs, but the vector/similarity/threshold seem to be working as expected. |
I have a theory: the sentence segmentation for many languages is terrible (whether you use P. Koehn's old Moses script or the Python implementations of it). For example, Armenian has its own sentence-final punctuation mark, the Armenian full stop (։). So most of the Armenian "sentences" got split right in the middle, wherever there was something like an abbreviated name (the equivalent of "J. Smith"), and nearly none of them end on an actual sentence boundary unless the text was improperly using borrowed Latin punctuation. Selected but representative examples from the "top" of Armenian-French:
So: an abbreviated first name, "y." for "year", and the Russian abbreviation "engl." for "English". On the other hand, only 2% (!) of the segments end with the actual Armenian full stop. (And this may explain some similar issues in BERT.) |
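To make the segmentation problem above concrete, here is a minimal sketch (my own illustration, not code from LASER or Moses) contrasting a Moses-like splitter, which keys on Latin `.!?`, with one that keys on the Armenian full stop (։, U+0589); the Armenian sentences are invented examples:

```python
import re

# Moses-like behaviour: split after Latin sentence-final punctuation.
naive_eos = re.compile(r"(?<=[.!?])\s+")

# Armenian-aware: split only after the Armenian full stop (։, U+0589);
# the Latin "." in Armenian text mostly marks abbreviations ("J. Smith").
armenian_eos = re.compile(r"(?<=։)\s+")

def split_armenian(text: str) -> list[str]:
    return [s.strip() for s in armenian_eos.split(text) if s.strip()]

# "J. Smith was born in 1950 in Yerevan. He was a writer." (two sentences)
text = "Ջ. Սմիթը ծնվել է 1950 թ. Երևանում։ Նա գրող էր։"

print(naive_eos.split(text))   # 3 fragments, cut after the abbreviations "Ջ." and "թ."
print(split_armenian(text))    # the 2 real sentences
```

The naive splitter cuts at both abbreviation periods and never finds the real boundary, which is exactly the failure mode described above.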
I can partly back up some of the claims here. The very top sentence pairs of ja-ko, from what I have seen, are also terrible, but those below are quite well aligned.
From what I can see, the alignments of sentences scoring below 1.15 and above 1.06 are still quite accurate, though I cannot back up my claim with adequate objective proof (there are simply too many sentences). But just a glimpse: of the 10 sentence pairs with the highest scores for this language pair, only two out of ten are correct. By contrast, for sentences 10001 to 10010, with scores of ~1.09, ten out of ten are correct alignments. For sentences 20001 to 20010, with scores of ~1.08, nine out of ten are correct. Sentence 20004, which is incorrect, probably has a typo in the original sentence, because the two only differ by one word. So this is already two language pairs that have horrible alignments for their top-scoring sentences. Is this also true for other language pairs? |
Hello, we have indeed observed that alignments with (unusually) high scores are wrong, for various language pairs. We still need to understand this. If you find any typical pattern, that would be very useful! E.g., some of these wrong alignments seem to contain many special characters or digits, or are sometimes in the wrong language. An ugly hack for now would be to exclude sentence pairs with such high scores.
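A minimal sketch of that hack, assuming the WikiMatrix TSV layout of `score<TAB>src<TAB>tgt`; the 1.06/1.15 cut-offs are taken from the ja-ko spot checks in this thread, not from any official recommendation:

```python
# Keep only pairs inside a "trusted" score band; the band edges are
# assumptions based on this thread's ja-ko survey, not official values.
LOWER, UPPER = 1.06, 1.15

def filter_pairs(rows):
    """rows: iterable of (score, src, tgt) strings as in the WikiMatrix TSVs."""
    for score, src, tgt in rows:
        if LOWER <= float(score) <= UPPER:
            yield float(score), src, tgt

rows = [
    ("1.2045", "ja sentence", "unrelated ko sentence"),     # suspiciously high score
    ("1.0900", "ja sentence", "its real ko translation"),
    ("1.0200", "ja sentence", "below the mining threshold"),
]
kept = list(filter_pairs(rows))
print(kept)  # only the 1.09 pair survives
```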
Quoted examples from the earlier ja-ko comment:
1.204509423478882 作者は阿川弘之(文)と岡部冬彦(絵)。 [ja: "The authors are Hiroyuki Agawa (text) and Fuyuhiko Okabe (illustrations)."] 하나님은 아브라함과 이삭과 야곱의 하나님이다. [ko: "God is the God of Abraham, Isaac and Jacob."]
1.1998532105175352 大学の略称は慈恵医大(じけいいだい)、慈恵(じけい)、慈大(じだい)。 [ja: "The university's abbreviated names are Jikei-idai, Jikei and Jidai."] 권적(權適)은 고려 충혜왕 때 문신이다. [ko: "Gwon Jeok was a civil official under King Chunghye of Goryeo."]
Would there be a reason why only the top sentences are horrible? Also, is this also true for other sentence pairs?
|
The main pattern I see is bad sentence segmentation, which correlates strongly with an alphabet/orthography that uses its own punctuation. (Note that Cyrillic, modern Greek, Georgian and Hebrew use the same sentence punctuation as Latin-alphabet languages, and even Persian and Arabic do, with the addition of the backwards question mark, which doesn't occur much in Wikipedia.) In the case of Armenian (which I know well), so many of the candidates just aren't sentences, so it's impossible to align them. Luckily, addressing it is easy, because sentence-final punctuation isn't overloaded in those orthographies the way the Latin full stop is (which also marks abbreviations).
It's elegant in some ways, maybe elegant enough to belong in the paper! (So instead of asking which threshold maximises BLEU, ask which quintile or decile maximises BLEU.) The highest-scoring pairs are those where the two sentences are simply the same string, which is almost always wrong in one direction or the other. So right now we discard pairs that are too similar by string edit distance, which is hackier in some sense. But my guess is that, doing this, certain language pairs would be reduced from a few sentence pairs to zero sentence pairs. |
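Both ideas above can be sketched in a few lines (my own reconstruction, not the actual WikiMatrix code): bucket pairs by score decile for evaluation, and drop near-copies using a string-similarity ratio as a stand-in for edit distance:

```python
from difflib import SequenceMatcher

def too_similar(src: str, tgt: str, cutoff: float = 0.9) -> bool:
    """Drop pairs whose source and target are near-identical strings:
    a very high margin score often just means the 'translation' is a copy."""
    return SequenceMatcher(None, src, tgt).ratio() >= cutoff

def score_deciles(scored_pairs):
    """Bucket (score, src, tgt) pairs into ten score deciles, highest
    scores first, so each decile can be evaluated (e.g. by BLEU) on its own."""
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    n = len(ranked)
    return [ranked[i * n // 10:(i + 1) * n // 10] for i in range(10)]

print(too_similar("Retrieved on 4 May 2009.", "Retrieved on 4 May 2009."))  # True
```

The 0.9 cutoff is an arbitrary illustrative value; a production filter would tune it per language pair.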
Survey on en-ko. TL;DR: please exclude appendix/references/bibliography sections. Section: Plagued by References
and many more. These are formatted as: (English sentence) "Retrieved on (Date)" (the same sentence again). Obviously these alignments are not of much value. After some googling, I found these sentences to be from the "Bibliography" or "References" sections of the Korean Wikipedia. Such problems are likely to become more common the more documents are translated from another language. For a more meaningful corpus, I would like to suggest excluding bibliography sections in the preprocessing steps. Section: These "English" sentences are in Korean
These sentences also come from the "References" sections of the English Wikipedia. Overall, you start seeing some meaningful alignments once you go past line 1000, such as:
but most of the upper sections are plagued by bad alignments gathered from the References sections. Once again, I strongly suggest you exclude them. |
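As a sketch of the suggested preprocessing filter (the patterns are my guesses based on the "(English sentence) Retrieved on (Date)" shape reported above, not an official heuristic):

```python
import re

# Patterns typical of bibliography/reference entries; illustrative only.
REF_PATTERNS = re.compile(
    r"Retrieved(\s+on)?\s+\d{1,2}\s+\w+\s+\d{4}"   # "Retrieved on 4 May 2009"
    r"|Archived from the original"
    r"|ISBN[\s:]*[\d-]{10,}",
    re.IGNORECASE,
)

def looks_like_reference(src: str, tgt: str) -> bool:
    """Flag a candidate pair if either side looks like a reference entry."""
    return bool(REF_PATTERNS.search(src) or REF_PATTERNS.search(tgt))

print(looks_like_reference(
    "Smith, J. (2001). Retrieved on 4 May 2009.",
    "Smith, J. (2001)."))  # True
```

A better long-term fix, as suggested above, would be to drop References/Bibliography sections during document extraction, before mining ever sees them.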
Survey on en-ja. Examples of wrong alignments with large scores:
Correct or not, there are awfully many quotes related to religion, containing words like "Lord", "hell", "Allah", "god". The problem here is that many sentences containing the word "Lord" have faulty alignments, and blatantly faulty at that. The word "Lord", with a capital 'L', is the problematic one; its effects extend quite far. In fact, even after line ~10000, there are many wrong quotes of this kind.
I suggest you look more into words like "Lord" and "Allah" that are used frequently in one setting but not in another. My disk space and linguistic skills are limited, and it would be hard for me to do more. I suspect this has to do with word-usage frequency: much of Western culture is built upon a religion that you and I know, while other cultures aren't built on it. I also suspect that "Allah" is very strongly associated with one particular group of people. Since there are so many examples of the English word "Lord" being used, LASER might have developed a bias of some kind. Or is it just because WikiMatrix was made by matching one whole wiki against another?
Oh, and the references section again. Please exclude it. |
Criteria I think you (people looking at this issue) should look for: let this word be A. If all three criteria hold, this word should probably be excluded or somehow dealt with.
|
Survey on ja-ko.
For the latter... in relation to the content pointed out by @hoschwenk ..., parentheses typically follow those terms and names, because
For the former... the problem with "Lord" seems to exist on ja-ko as well.
|
I also noticed issues with "Allah", specifically English sentences that say "He" aligned with French sentences that say "Allah". To address it, I created a blacklist of named entities that must occur on both sides in equal measure. Of course, this only works for languages with the same script and so on, unless the entities are translingual constants, like ".org". |
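A minimal sketch of such a blacklist filter, with a small illustrative entity list (the commenter's actual list is not shown in the thread):

```python
# Named entities that must occur equally often on both sides of a pair;
# these entries are illustrative assumptions only.
BLACKLIST = ["Allah", "Lord", ".org"]

def balanced_entities(src: str, tgt: str, entities=BLACKLIST) -> bool:
    """Keep a pair only if each listed entity occurs the same number of
    times on both sides; only meaningful when the two languages share the
    entity's script, or the entity is a translingual constant like '.org'."""
    return all(src.count(e) == tgt.count(e) for e in entities)

print(balanced_entities("He is great.", "Allah est grand."))      # False
print(balanced_entities("Allah is great.", "Allah est grand."))   # True
```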
I saw the same sentence segmentation issue with Georgian, which does not have its own sentence punctuation. The problems there were driven only by abbreviation periods. |
I'm not sure to what extent this is a LASER-specific issue, since I observed the same problem for certain language pairs (e.g. English-Kazakh) using Google's LaBSE, which is similar to LASER in its use cases but is trained with a Transformer (BERT-based) architecture instead of a BiLSTM. However, my experiments with LASER on Christodoulopoulos' Bible corpus in 101 languages / 5050 language pairs have yielded terrible results for many language pairs on the bitext retrieval task, much worse than LaBSE. This is bizarre, since it doesn't match LASER's performance on datasets such as Tatoeba or BUCC. I am using this port of LASER instead of the one from the original GitHub repository, and I wonder whether there's some severe tokenization issue for particular languages. |
Some language pairs are oddly missing from ...WikiMatrix/list_of_bitexts.txt, against my intuitions about which ones would have more data and thus more matching sentences.
For example, Armenian (hy) pairings exist with German, French, Russian, Italian, Spanish and Portuguese, but not with English. Another odd thing is that the Armenian Wikipedia is not exactly a low-resource Wikipedia: it is 30th by number of articles.
"Southern" Azerbaijani Turkish (azb, the variant spoken in Iran and still written in its original Perso-Arabic script) has a pairing only with French! For comparison, azb has as many articles as (post-Soviet) Azerbaijani Turkish (az), which is now written in the Latin alphabet and has pairings with 34 languages.
Similarly, it smells a bit fishy that Chinese and Hindi have only as many pairings as Galician or Esperanto. But still, that is not as suspicious as a language having pairs with French but not with English.
I know you used a relatively objective cutoff, and there are many factors in how articles are created that would affect the number of true matches. I'm just wondering if there may be a bug, most likely related to the handling of non-Latin scripts.