Some language pairs unexpectedly low quality or missing #129

Open
bittlingmayer opened this issue Feb 14, 2020 · 13 comments
@bittlingmayer

Some language pairs are oddly missing from ...WikiMatrix/list_of_bitexts.txt, contrary to my intuition about which ones would have more data and thus more matching sentences.

For example, Armenian (hy) pairings exist with German, French, Russian, Italian, Spanish and Portuguese, but not with English.

Another odd thing is that Armenian Wikipedia is not exactly a low-resource Wikipedia: it is 30th by number of articles.

"Southern" Azerbaijani Turkish (azb, the variant spoken in Iran and still written in its original Perso-Arabic script) has a pairing only with French!

For comparison, azb has about as many articles as (post-Soviet) Azerbaijani Turkish (az), which is now written in the Latin alphabet and has pairings with 34 languages.

Similarly, it smells a bit fishy that Chinese and Hindi have only as many pairings as Galician or Esperanto.

But still, not as suspicious as a language having pairs with French but not with English.

I know you used a relatively objective cutoff, and there are many factors in how articles are created that would affect the number of true matches; I'm just wondering if there may be a bug, most likely related to the handling of non-Latin scripts.
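
(For reference, this is roughly how one can check which pairings exist for a given language; I'm assuming each line of list_of_bitexts.txt starts with a file name like WikiMatrix.de-hy.tsv.gz, which may not be the exact layout.)

```python
# Sketch: list which WikiMatrix pairings exist for a given language code.
# Assumes each line of list_of_bitexts.txt begins with a name like
# "WikiMatrix.de-hy.tsv.gz" (the exact layout may differ).
import re
import sys

lang = sys.argv[1] if len(sys.argv) > 1 else "hy"
pair = re.compile(r"WikiMatrix\.([a-z_]+)-([a-z_]+)\.tsv")

with open("list_of_bitexts.txt", encoding="utf-8") as f:
    for line in f:
        m = pair.search(line)
        if m and lang in m.groups():
            print(line.rstrip())
```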

@bittlingmayer
Author

bittlingmayer commented Feb 15, 2020

I also just noticed that the very top sentence pairs in e.g. Armenian-German are terrible.

Թամիլերեն Վիքիպեդիա (թամ.՝ தமிழ் விக்கிப்பீடியா), Վիքիպեդիա բազմալեզու և ազատ հանրագիտարանի Թամիլերեն տարբերակն է։ Այսօր՝ 20 Մարտի 2019 թվականին Վիքիպեդիայի Թամիլերեն բաժինը ունի 121 174 հոդված։ Գրանցված են 151 402 մասնակից, նրանցից 403 կատարել են գոնե մեկ խմբագրում անցած 30 օրվա ընթացքում, իսկ 41 մասնակից ունեն ադմինիստրատորի կարգավիճակ։ Ընդհանուր խմբագրումների թիվը հատում է 2 673 858-ը։ Թամիլերեն Վիքիպեդիա Մեդիա ֆայլեր	Geht es um die Bestimmung des Umfangs einer Gegenleistung für eine bestimmte Leistung, so enthält das Gesetz in § 316 BGB eine Auslegungsvorschrift, nach der im Zweifel für diesen Fall das Leistungsbestimmungsrecht demjenigen zusteht, der die Gegenleistung zu fordern hat.

Մենք երբեք համաիսլամականություն չենք հիմնել։ Perhaps we said "We are establishing it and we shall complete it."	Was er sagt, werden wir erben, und er wird ganz allein zu uns kommen."

Գալֆայեան, Չօմախլու (Կեսարիա), Նիւ Եորք, Տպ.	Denn ihr wisst nicht, an welchem Tag euer Herr kommt).

Ֆարիդ Մամեդով (ադրբեջաներեն` Fərid Məmmədov, ծնվ.	(Du sollst dir kein Bildnis noch irgendein Gleichnis machen.

«Արա՛, ինչ ուզում ես» (ֆր.՝ Fais ce que voudras): Նման կարգուկանոնը եղբայր Ժանը հետևյալ կերպ է բացատրում.	Er schickt die (Blitze und) Donnerschläge und trifft damit, wen er will.

As you can see from the differing lengths, symbols and so on, these are completely different sentences. Here is one that is mostly in Latin script, so you can appreciate how different they are:

Rihanna — This Is What You Came For (նիդեր.).	Die Allaicha (russisch Аллаиха oder Аллайха, translit.

My threshold is the very conservative 1.10, and these are, as mentioned, the very top pairs.
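
For anyone who wants to reproduce this, something like the following pulls out the top pairs; I'm assuming the usual score<TAB>source<TAB>target layout of the released WikiMatrix TSVs, and the file name is just a placeholder.

```python
# Sketch: print the highest-scoring pairs from a released WikiMatrix TSV.
# Assumes "score<TAB>source<TAB>target" lines; the file name is a placeholder.
import gzip

threshold = 1.10
pairs = []
with gzip.open("WikiMatrix.de-hy.tsv.gz", "rt", encoding="utf-8") as f:
    for line in f:
        score, src, tgt = line.rstrip("\n").split("\t", 2)
        if float(score) >= threshold:
            pairs.append((float(score), src, tgt))

# The releases appear to be sorted already, but sort again to be safe.
pairs.sort(key=lambda p: p[0], reverse=True)
for score, src, tgt in pairs[:10]:
    print(f"{score:.4f}\t{src}\t{tgt}")
```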

@bittlingmayer
Author

bittlingmayer commented Feb 19, 2020

I would contrast this with Arabic-Chinese. It also doesn't have many quality pairs, but the vector/similarity/threshold seem to be working as expected.

@bittlingmayer
Author

bittlingmayer commented Feb 20, 2020

I have a theory: the sentence segmentation for many languages is terrible.

(Whether you use P. Koehn's old Moses script or the Python implementations of it.)

For example, Armenian has its own sentence-final punctuation mark ։, and the Latin/Cyrillic full stop . is only used in abbreviations, URLs etc. But the scripts and libraries split it as if it were English.

So most of the Armenian "sentences" got split right in the middle, wherever there was something like a name (the equivalent of "J. Smith"), and nearly none of them end on an actual sentence boundary unless the text improperly used a borrowed ? or !, which only occurs in relatively dirty corpora.
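
A minimal sketch of what I mean, splitting on the Armenian full stop ։ (with borrowed ? and ! as a fallback) instead of on the Latin period; this is only an illustration, not what the Moses or Python splitters actually do:

```python
import re

# Sketch: Armenian-aware sentence splitting. The Armenian full stop ։ (U+0589)
# ends a sentence; the Latin period "." mostly marks abbreviations and initials
# here, so it is deliberately NOT treated as a boundary.
BOUNDARY = re.compile(r"(?<=[։?!])\s+")

def split_armenian(text: str) -> list:
    return [s.strip() for s in BOUNDARY.split(text) if s.strip()]

# The initial "Հ." (the equivalent of "J.") stays inside its sentence
# instead of starting a bogus one.
print(split_armenian("Սա առաջին նախադասությունն է։ Հ. Սմիթը եկավ։"))
```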

Selected but representative examples from the "top" of the Armenian-French:

Պրիոնային հիվանդությունների պաթոգենեզը Պրիոններ // Успехи биологической химии, т.	Est alors fusionnée avec la chaire de physico-chimie de l'adaptation biologique.
...
Արամ Վրույր, «Պետրոս Հ.	Je me dis : « Merde !
...
Artist with a disability) և «հաշմանդամության խնդրով արտահայտված նկարիչ» (англ.	L'œuvre sans interprète est un corps sans vie (anonyme).
...
Ալթըն մեռած է 1749-ին, մինչդեռ Վ.	Les travaux ne furent achevés qu'en 1749.
...
«Ժամանակակից Վեներոլոգիա» Խաչիկ Խաչիկյան 2008 թ.	Blessé, il manque la totalité de la saison 2008.
...
Is the letter «Y» a vowel or a consonant?	L’article « La » doit-il être agglutiné ou non ?
...

So: a first-name initial, the abbreviation "թ." for "year", and the Russian abbreviation "англ." for "English".
And this issue affects not only the candidate pairs for alignment, but presumably also the sentences on which LASER itself was trained.

On the other hand, only 2% (!) end with the actual ։, but 10% have ։ in the middle of the line, i.e. combine parts of multiple sentences.

(And this may explain some similar issues in BERT.)
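
A quick count along these lines gives those kinds of percentages; the input path is a placeholder for the Armenian side of the bitext:

```python
# Sketch: what fraction of the extracted Armenian lines end with the Armenian
# full stop ։, and what fraction only contain it mid-line (i.e. glue together
# parts of several sentences). The input path is a placeholder.
total = ends_with_stop = mid_line_stop = 0
with open("wikimatrix.fr-hy.hy.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        total += 1
        if line.endswith("։"):
            ends_with_stop += 1
        elif "։" in line:
            mid_line_stop += 1

print(f"end with ։:      {100 * ends_with_stop / total:.1f}%")
print(f"։ only mid-line: {100 * mid_line_stop / total:.1f}%")
```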

@bittlingmayer changed the title from "Some language pairs oddly missing?" to "Some language pairs unexpectedly low quality or missing" on Feb 20, 2020
@stet-stet

stet-stet commented Feb 25, 2020

I can partly back some of the claims here. The very top sentence pairs of ja-ko, from what I have seen, are terrible as well. The examples below are terrible, but pairs further down the ranking are quite well aligned.

1.204509423478882 作者は阿川弘之(文)と岡部冬彦(絵)。 하나님은 아브라함과 이삭과 야곱의 하나님이다.
1.1998532105175352 大学の略称は慈恵医大(じけいいだい)、慈恵(じけい)、慈大(じだい)。 권적(權適)은 고려 충혜왕 때 문신이다.

From what I can see, the alignments of sentences with scores below 1.15 and above 1.06 are still quite accurate, though I cannot back up my claim with adequate objective proof (there are simply too many sentences).

But just a glimpse: of the 10 highest-scoring pairs for this language pair, only two out of ten are correct. By contrast, for pairs 10001 to 10010, with scores of ~1.09, ten out of ten are correct alignments. For pairs 20001 to 20010, with scores of ~1.08, nine out of ten are correct. Pair 20004, which is incorrect, probably has a typo in the original sentence, because the two sides differ by only one word.

So that is already two language pairs with horrible alignments among the top-ranked sentences. Is this also true for other language pairs?
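
For anyone who wants to repeat this kind of spot check on another pair, a rough sketch; it assumes the score-sorted score<TAB>src<TAB>tgt layout of the WikiMatrix TSVs, and the file name is a placeholder.

```python
# Sketch: pull a few rank windows out of a score-sorted WikiMatrix TSV so they
# can be checked by hand. Assumes "score<TAB>src<TAB>tgt" lines, best first;
# the file name is a placeholder.
import gzip

windows = [(0, 10), (10000, 10010), (20000, 20010)]

with gzip.open("WikiMatrix.ja-ko.tsv.gz", "rt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

for start, end in windows:
    print(f"--- ranks {start + 1} to {end} ---")
    for line in lines[start:end]:
        print(line)
```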

@hoschwenk
Contributor

hoschwenk commented Feb 26, 2020 via email

@bittlingmayer
Author

bittlingmayer commented Feb 27, 2020

If you find any typical pattern, that would be very useful!

The main pattern I see is bad sentence segmentation, which highly correlates with an alphabet/orthography using its own punctuation.

(Note that Cyrillic and modern Greek and Georgian and Hebrew use the same sentence punctuation as Latin alphabet languages, and even Persian and Arabic do, with the addition of the backwards question mark, which doesn't occur much in Wikipedia.)

In the case of Armenian (which I know well), many of the candidates just aren't sentences, so it's impossible to align them.

Luckily addressing it is easy, because sentence-final punctuation isn't overloaded in those orthographies the way . is.

An ugly hack for now would be to exclude sentence pairs with such high scores.

It's elegant in some ways, maybe elegant enough to belong in the paper! (So instead of asking which threshold maximises BLEU, ask which quintile or decile maximises BLEU.)

The highest-scoring pairs are those where the two sentences are simply identical, which is almost always wrong in one direction or the other.

So right now we discard pairs that are too similar by string edit distance, which is hackier in some sense.

But my guess is that, doing this, certain language pairs would be reduced from a few sentence pairs to zero.
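
A minimal sketch of the two filters mentioned above, purely illustrative: drop pairs whose two sides are near-identical (difflib similarity here rather than a proper edit distance), and cap by a score quantile instead of a fixed threshold. The 0.9 similarity cut-off and the 0.99 quantile are made-up knobs, not tuned values.

```python
# Sketch of the two hacks discussed above:
#   1) drop pairs whose two sides are near-identical (difflib similarity here,
#      rather than a proper edit distance);
#   2) drop the very top of the score distribution (a quantile cap instead of
#      a fixed margin threshold).
import difflib

def too_similar(src: str, tgt: str, cutoff: float = 0.9) -> bool:
    return difflib.SequenceMatcher(None, src, tgt).ratio() >= cutoff

def filter_pairs(pairs, quantile_cap: float = 0.99):
    # pairs: list of (score, src, tgt) tuples
    scores = sorted(score for score, _, _ in pairs)
    cap = scores[int(quantile_cap * (len(scores) - 1))]
    return [
        (score, src, tgt)
        for score, src, tgt in pairs
        if score <= cap and not too_similar(src, tgt)
    ]
```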

@stet-stet

stet-stet commented Mar 1, 2020

Survey on en-ko

TL;DR Please exclude appendix/references/bibliography

Section: Plagued by References

1.2068080339297096 Rock, the AP before co-founding Something Else! 2011년 7월 20일에 확인함.  Rock, the AP before co-founding Something Else!
1.1811347559307146 "Three and One-Half Centuries at a Glance". 2007년 6월 9일에 확인함.  “Three and One-Half Centuries at a Glance”.
1.1766367946322376 "Why Spain were anything but boring". 2013년 6월 30일에 확인함.  “Why Spain were anything but boring”.
1.1756872485333454 According to Article 4 of the 1994 Paris Protocol. 2013년 11월 29일에 확인함.  According to Article 4 of the 1994 Paris Protocol .
1.174842893439351 "How healthy is the air you breathe?". 2018년 5월 7일에 확인함.  “How healthy is the air you breathe?”

And many more. These are formatted as: (English sentence) "Retrieved on (date)" (the same sentence again). Obviously these alignments are not of much value. After some googling I found these sentences to be from the "Bibliography" or "References" sections of the Korean Wikipedia.

Such problems are likely to be more common the more documents are translated from another language. For a more meaningful corpus, I would suggest excluding bibliography sections in the preprocessing steps.
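
A rough pre-filter along these lines would already catch most of the examples above; the patterns are only the obvious en-ko ones and would need extending per language, and the pair layout is assumed to be (score, English, Korean).

```python
import re

# Sketch: drop pairs that look like they come from a References/Bibliography
# section. "YYYY년 M월 D일에 확인함" is the Korean Wikipedia "Retrieved on <date>"
# boilerplate visible in the examples above; the English pattern is analogous.
RETRIEVED_KO = re.compile(r"\d{4}년 \d{1,2}월 \d{1,2}일에 확인함")
RETRIEVED_EN = re.compile(r"\bRetrieved\b.*\d{4}", re.IGNORECASE)

def looks_like_reference(sentence: str) -> bool:
    return bool(RETRIEVED_KO.search(sentence) or RETRIEVED_EN.search(sentence))

def drop_reference_pairs(pairs):
    # pairs: iterable of (score, english, korean) tuples
    return [p for p in pairs
            if not looks_like_reference(p[1]) and not looks_like_reference(p[2])]
```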

Section: These "English" sentences are in Korean

1.169077450026554 "4.19 그날, 시인 신동엽도 거리에 있었다" (in Korean). “4.19 그날, 시인 신동엽도 거리에 있었다”.
1.1715741588791 "분단되지 않았다면 날파람도 이어졌으리라. "분단되지 않았다면 날파람도 이어졌으리라.
1.1573627234072754 "경제 거물들은 헬스장에서도 경쟁적". “경제 거물들은 헬스장에서도 경쟁적”.
1.1533367362276539 "치킨무에도 들어있는데…사카린 진짜 먹어도 되나?". “치킨무에도 들어있는데…사카린 진짜 먹어도 되나?”.

These sentences also come from the "References" sections of the English Wikipedia.

Overall

You start seeing some meaningful alignments when you go over line 1000, such as:

1.126812501904812 But I was one in a hundred, so I'm sure you would never be able to identify my voice. 하지만 저는 백명 중의 한 명이였기에, 제 목소리를 확인할 수 없을거라고 확신해요.
1.1133550194443553 All animals are equal. 모든 동물들은 평등하다.
1.1132613179728192 He works as a coach in the Chicago area. 그는 시카고 지역에서 코치로 일하고 있다.

but most of the upper portion is plagued by bad alignments gathered from the References sections. Once again, I strongly suggest you exclude them.

@stet-stet

stet-stet commented Mar 1, 2020

survey on en-ja
Actually, this might be what you are looking for.
Tl;dr Look into English words similar to "Lord" in en-ja.

Examples of wrong alignments with large scores:

1.2204652999675554 It will destroy everything at the bidding of its Lord." 主の為なら主にすら嘘をつく。(If for the lord, even the lord will tell a lie)
1.2093175221568597 In The Lord of the Rings, they refer to his kind as Beornings. 魂主の手首には「主の証」として具現化する。(I have no idea what a 魂主 is, but definitely not about LotR)
1.1920562664693404 Yasseen MUSA (QAT). 『クルアーン』(コーラン)ではアラビア語でムーサー (موسى Mūsā) と呼ばれる。(The Quran is, in Arabic, "Mūsā".)
1.1969704966791581 If we have been made sons of God, we have also been made gods." 『わたしたちはアッラーの子であり,かれに愛でられる。(We are sons of Allah, and are loved by him.)

Correct or not, there are awfully many quotes related to religion, containing the words Lord, hell, Allah, god. The problem here is that many sentences containing the word "Lord" have faulty alignments, and blatantly faulty at that.

But the word "Lord", with a capital 'L', this one's problematic. It's effects extend quite far. In fact, even after line ~10000, there are many wrong quotes that

  • include the word "Lord" with a capital L
  • sound like something from the Bible (O Lord~ thee~)

1.1145393686921667 And when Judas the traitor did not believe and asked, How, then, will such growth be accomplished by the Lord?, the Lord said, Those who live until those times will see. 丞相魏相は「次公は酒に酔っていなくてもおかしいではないか」と笑って言い、他の者は目配せして蓋寛饒を卑下した。(This sentence has a name(丞相魏相) the English does not, you know it's wrong without translating. Line ~2000.)
1.1033810404217756 Amana: O my lord, what is it you desire? 主よ,誰があなたの御前に立ち続けることができるでしょうか。(O lord, Who would be able to stand before you, forever? Line ~7000.)
1.1030694109966213 7:11); And the LORD spake, which belongs to , Speak unto the children of Israel, He that offers, (Lev. 家伝に岐阜中納言織田秀信の5世の孫、織田信長の7世の孫という。(line ~8000)
1.1002342490396704 They were given a Named Hero "Amdûr, Lord of Blades". なお、一期分が与えられた領主を一期領主(いちごりょうしゅ)と呼ぶ。 (line ~9600)
1.100150268234152 (Fugal chorus with orchestra) Lo, thus shall the man be blessed That feareth the Lord Blessed shall he be He shall be blessed. だが叩き上げの人間であるため同じ境遇の人間には優しく、面倒見がよい。
1.0993201262385617 The Lord said, "Who then is the faithful and wise steward, whom his lord will set over his household, to give them their portion of food at the right times? この乃公(目下の者へ向けて使う一人称)が児女の思い通りになると思うのか」と言い放った。 (line ~10500)
1.0985183981994355 And when he came to him, he found him overheated and say "Lord, send me your light to shine in front of me in this way that I do not know." 後に意識が遠のく中で、賛美歌の「主よ御許に近づかん」(Nearer, My God, to Thee)の中のフレーズを呟いた。(line 11000)

I suggest you look more into words like Lord and Allah that are used frequently in one setting but not in others. My disk space and linguistic skills are limited, and it would be hard for me to do more.

I suspect this has to do with word usage frequency. Much of Western culture is built upon a particular religion; other cultures aren't. I also suspect that "Allah" is strongly tied to one group of people. Since there are so many examples of the English word "Lord" being used, LASER might have developed a bias? Or is it just because WikiMatrix was made by matching each whole wiki against another?
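
A quick, rough way to test that hunch would be to compare how often such words appear near the top of the ranking versus in the file overall; the keywords and the file name here are just placeholders.

```python
# Sketch: check whether words like "Lord" or "Allah" are over-represented among
# the highest-scoring pairs compared to the whole file. Assumes a score-sorted
# "score<TAB>en<TAB>ja" TSV; keywords and file name are placeholders.
import gzip

KEYWORDS = ("Lord", "Allah", "God")

def keyword_rate(lines):
    hits = sum(1 for line in lines if any(k in line for k in KEYWORDS))
    return hits / max(len(lines), 1)

with gzip.open("WikiMatrix.en-ja.tsv.gz", "rt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print("top 1000:", keyword_rate(lines[:1000]))
print("overall: ", keyword_rate(lines))
```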

1.1875199846857094 "Would Steve Jobs Have Liked the New Biography? アンディ・ハーツフェルド "Would Steve Jobs Have Liked the New Biography?

Oh, and the references section again. Please exclude it.

@stet-stet

stet-stet commented Mar 1, 2020

Criteria I think you (people looking at this issue) should look for: call the word A. If all three hold, A should probably be excluded or somehow dealt with.

  • There is a counterpart of A, but it is used in different contexts in the two languages. (In English: Lord = religious figure; in Japanese: lord = mostly a feudal lord, or your master.)
  • A is a word used in a religious setting.
  • A is a word used frequently in one cultural setting, but not in other(s).

@stet-stet

survey on ja-ko.
TL;DR High-scoring but wrong pairs in ja-ko either

  • Mention figures like "the Lord" or "Allah" (하느님, 주, 그분 in Korean; アッラー in Japanese), or
  • Mention a historical figure or a country-specific historical term that is unlikely to appear in other languages (권적(權適), 권근, 안의, 마리현(馬利縣), 이안현(利安縣) in Korean; I don't know for Japanese, sorry).

It is hard to find wrong alignments that do not fit these criteria, at least in the first 100 lines. (Exception: line 3, which talks about philosophy.)

For the latter... in relation to the content pointed out by @hoschwenk ..., parentheses typically follow those terms and names, because

  • In Japanese, the pronunciation of terms must be explained (e.g. 慈恵(じけい)).
  • In Korean, people like to write the original hanja alongside the Hangul spelling, for example 권적(權適).

Is this the case for other languages too?
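
One possible hedge, purely a sketch I have not run at scale, is to strip those short parenthesized readings before scoring or inspecting pairs:

```python
import re

# Sketch: strip short parenthesized readings such as 慈恵(じけい) or 권적(權適)
# before scoring or inspecting, so the repeated name+reading does not dominate.
# Handles ASCII () and full-width （）; only short parentheticals are removed.
PAREN_READING = re.compile(r"[（(][^（）()]{1,12}[）)]")

def strip_readings(text: str) -> str:
    return PAREN_READING.sub("", text)

print(strip_readings("大学の略称は慈恵医大(じけいいだい)、慈恵(じけい)、慈大(じだい)。"))
print(strip_readings("권적(權適)은 고려 충혜왕 때 문신이다."))
```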

For the former... the problem with "Lord" seems to exist in ja-ko as well.

1.204509423478882 作者は阿川弘之(文)と岡部冬彦(絵)。 하나님은 아브라함과 이삭과 야곱의 하나님이다. (Korean: has the words Lord, Isaac and Jacob, and then the Lord again.)
1.1805857230727903 またその人物の言葉から下の名前が「信雄(のぶかつ)」であることが分かる。 그러나 하느님께서는 주님께서 청하시는 것은 무엇이나 들어주신다는 것을 저는 지금도 알고 있습니다.”
1.155025975545211 『わたしの主であり,あなたがたの主であられるアッラーに仕えなさい。 주여, 침방(寢房)에서 사귀는 사랑의 사귐의 때를 허락하소서.
1.1506467389394084 「アッラー(の道)のために,わたしを助ける者は誰か。 하느님께서 나를 보우하실 것이며, 그분의 성인들께서 나를 도우실 것이다!”
All wrong alignments.

@bittlingmayer
Author

I also noticed issues with "Allah", specifically English sentences that say "He" aligned with French sentences that say "Allah".

To address it, I created a blacklist of named entities that must occur on both sides in equal measure. Of course, it only works for languages with the same script and so on, unless the entities are translingual constants, like ".org".
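
Something along these lines, where the entity list is just a toy example:

```python
# Sketch: require that certain tokens occur the same number of times on both
# sides of a pair, and drop the pair otherwise. The entity list is a toy example.
ENTITIES = ("Allah", ".org", "UNESCO")

def entity_counts_match(src: str, tgt: str) -> bool:
    return all(src.count(e) == tgt.count(e) for e in ENTITIES)

def filter_by_entities(pairs):
    # pairs: iterable of (score, src, tgt) tuples
    return [p for p in pairs if entity_counts_match(p[1], p[2])]
```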

@bittlingmayer
Author

I saw the same sentence segmentation issue with Georgian, which does not have its own sentence punctuation. The problems were driven only by . in abbreviations being incorrectly treated as a sentence boundary, even when the preceding word had only a single letter.
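
A sketch of the kind of guard that would avoid that: do a naive split, then merge a fragment back whenever it ends in a single-letter abbreviation (an initial), so the period there is not treated as a sentence boundary. The example string is made up.

```python
import re

# Sketch: naive split on ". ", then merge a fragment back whenever it ends in a
# single-letter abbreviation (an initial such as "ჯ."), so that period is not
# treated as a sentence boundary.
def split_protecting_initials(text: str) -> list:
    parts = re.split(r"(?<=\.)\s+", text)
    merged = []
    for part in parts:
        if merged and re.search(r"(^|\s)\w\.$", merged[-1]):
            merged[-1] = merged[-1] + " " + part
        else:
            merged.append(part)
    return merged

print(split_protecting_initials("ჯ. სმითი დაიბადა 1990 წელს. მეორე წინადადება."))
```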

@AlexJonesNLP

AlexJonesNLP commented Apr 19, 2021

I'm not sure to what extent this is a LASER-specific issue, since I observed the same problem for certain language pairs (e.g. English-Kazakh) using Google's LaBSE, which is a similar model to LASER in its use cases but is trained with a Transformer (BERT-based) architecture instead of a BiLSTM. However, my experiments with LASER on Christodoulopoulos' Bible corpus in 101 languages / 5050 language pairs have yielded terrible results for many language pairs on the bitext retrieval task, much worse than LaBSE. This is bizarre, since it doesn't match up with LASER's performance on datasets such as Tatoeba or BUCC. I am using this port of LASER instead of the one from the original GitHub repo, and I wonder whether there's some severe tokenization issue for particular languages.
