Use languages' alphabets to make detection more accurate #83

Open · thorn0 opened this issue Feb 17, 2020 · 15 comments

thorn0 commented Feb 17, 2020

Что это за язык? ("What language is this?") is a Russian sentence, which is detected as Bulgarian (bul 1, rus 0.938953488372093, mkd 0.9353197674418605). However, neither Bulgarian nor Macedonian has the letters э and ы in its alphabet.

Same with Чекаю цієї хвилини. ("I'm waiting for this moment."), which is Ukrainian but is detected as Northern Uzbek with probability 1, whereas Ukrainian gets only 0.33999999999999997. However, the letters є and ї are used only in Ukrainian, and the Uzbek Cyrillic alphabet doesn't include as many as five letters from this sentence, namely ю, ц, і, є, and ї.

I know that Franc isn't expected to do well with short input strings, but taking alphabets into account seems like a promising way to improve accuracy.
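
For reference, a minimal reproduction sketch, assuming franc@5's CommonJS API, where franc() returns the best language code and franc.all() returns ranked [code, weight] pairs; the scores above are the ones reported, not re-run here:

    // Reproduction sketch: assumes franc@5 (CommonJS), where `franc.all`
    // returns language–weight pairs sorted best-first.
    const franc = require('franc');

    // Russian input, reported above as coming back Bulgarian first:
    console.log(franc('Что это за язык?'));
    console.log(franc.all('Что это за язык?').slice(0, 3));
    // e.g. [['bul', 1], ['rus', 0.9389…], ['mkd', 0.9353…]] per the report

    // Ukrainian input, reported above as Northern Uzbek (uzn) first:
    console.log(franc.all('Чекаю цієї хвилини.').slice(0, 3));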

wooorm (Owner) commented Feb 17, 2020

That’s a good idea; it’s similar to how Google works!
However, I don’t think it should be so “black and white”: a sentence like “the letter ы is not available in Bulgarian or Macedonian” should still be matched as English, even though it contains a character that isn’t in the English alphabet.

We could do something with a special character list that enhances scores of certain scripts?

I remember there is a Turkish i variant that isn’t used anywhere else as well; forgot what it was, though.

thorn0 (Author) commented Feb 17, 2020

The dotless i (ı) is not used only in Turkish: other languages whose alphabets are based on the Turkish alphabet have it too, e.g. Azerbaijani and Crimean Tatar.

thorn0 (Author) commented Feb 17, 2020

> We could do something with a special character list that enhances scores of certain scripts?

Scripts like Latin, Cyrillic, etc.? You meant languages, not scripts then, right?

thorn0 (Author) commented Feb 17, 2020

It's not only a matter of which characters the alphabet has; it's also about which ones it doesn't. In Чекаю цієї хвилини., there are 5 letters that aren't in the Uzbek alphabet, which is 31% of all the letters in the string. In no way should Uzbek get the highest ranking in such a situation.
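
A minimal sketch of that kind of check; the Ukrainian letter set below is written out here as an illustrative assumption, not data taken from franc:

    // Sketch: share of letters in the input that fall outside a candidate
    // language's letter set. The Ukrainian set is an illustrative assumption.
    const UKRAINIAN = new Set('абвгґдеєжзиіїйклмнопрстуфхцчшщьюя');

    function outOfAlphabetRatio(input, alphabet) {
      const letters = [...input.toLowerCase()].filter((ch) => /\p{L}/u.test(ch));
      const misses = letters.filter((ch) => !alphabet.has(ch));
      return letters.length ? misses.length / letters.length : 0;
    }

    console.log(outOfAlphabetRatio('Чекаю цієї хвилини.', UKRAINIAN)); // 0
    // Against an Uzbek Cyrillic set, 5 of the 16 letters would miss (≈ 0.31),
    // which could be used to penalise or cap that candidate's score.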

thorn0 (Author) commented Feb 18, 2020

@wooorm Do you happen to know a programmatic way to get the alphabet (the set of used characters) for a given language?

wooorm (Owner) commented Feb 18, 2020

I think it’s vague what even an alphabet is, but I did find this list on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Language_recognition_chart. Interesting stuff!

Franc supports the most languages possible, as it uses the biggest training set (the UDHR). It’s designed not to discriminate against languages with few speakers, and I can see how adding a feature such as this would (because there is no data about alphabets for lots of languages).

There are projects that focus on fewer languages and do things like what you’re proposing. Have you looked at https://github.com/CLD2Owners/cld2?

thorn0 (Author) commented Feb 18, 2020

I thought I saw something on the Unicode site where, for each character, there was information about which languages use it, but now I can't find it.

> I think it’s vague what even an alphabet is

Right. Some characters sometimes aren't considered separate letters of the alphabet (e.g. umlauts in German), etc. That's why I wrote "alphabet (the set of used characters)".

wooorm (Owner) commented Feb 18, 2020

I don’t think there’s an automated way to do it.


I think it could be possible to either do it character-based, e.g., like so:

  "э": [
     "bul": -3,
     "mkd": -3,
     "rus": 3,
     "bel": 3,
     // ...or so
  ]

Or based on n-grams/regexes:

  "tje$": [["nld", 2]]
  "^z": [["nld", 1]]

But this is an error-prone and “soft” approach, compared to the current “hard” data model.
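
For illustration, a sketch of how such a boost map could be layered on top of franc's ranked output; the franc.all call is real, but the boost values, the 0.01 scaling factor, and the overall shape are assumptions, not a worked-out design:

    // Sketch: nudge franc's ranked output with per-character boosts/penalties.
    // Assumes franc@5's `franc.all`; boost values and scaling are illustrative.
    const franc = require('franc');

    const BOOSTS = {
      'э': {bul: -3, mkd: -3, rus: 3, bel: 3},
      'ы': {bul: -3, mkd: -3, rus: 3, bel: 3}
      // ...or so
    };

    function rankWithBoosts(input) {
      const bonus = {};
      for (const ch of input.toLowerCase()) {
        for (const [lang, value] of Object.entries(BOOSTS[ch] || {})) {
          bonus[lang] = (bonus[lang] || 0) + value;
        }
      }
      return franc
        .all(input)
        .map(([lang, weight]) => [lang, weight + 0.01 * (bonus[lang] || 0)])
        .sort((a, b) => b[1] - a[1]);
    }

    console.log(rankWithBoosts('Что это за язык?').slice(0, 3));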


An alternative idea is to look at the TRY field in hunspell dictionaries.
E.g., the Russian dictionary defines:

TRY оаитенрсвйлпкьыяудмзшбчгщюжцёхфэъАВСМКГПТЕИЛФНДОЭРЗЮЯБХЖШЦУЧЬЫЪЩЙЁ

And Macedonian:

TRY аеоинвтрслпкудмзбчгјшцњжфхќџѓљѕѐѝАЕОИНВТРСЛПКУДМЗБЧГЈШЦЊЖФХЌЏЃЉЅЀЍ-’!.

These are already mostly ordered from frequent to infrequent.
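
A sketch of reading that field, assuming a UTF-8 encoded .aff file (real hunspell files declare their encoding in a SET line, which this ignores); the path is illustrative:

    // Sketch: pull the TRY character list out of a hunspell .aff file.
    // Assumes UTF-8 (the .aff SET line is ignored); the path is illustrative.
    const fs = require('fs');

    function tryCharacters(affPath) {
      const aff = fs.readFileSync(affPath, 'utf8');
      const match = aff.match(/^TRY\s+(\S+)/m);
      return match ? new Set(match[1].toLowerCase()) : new Set();
    }

    console.log(tryCharacters('./dictionaries/ru/index.aff'));
    // Insertion order is kept, so iterating still goes frequent → infrequent.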

thorn0 (Author) commented Feb 18, 2020

Found it! http://cldr.unicode.org/translation/-core-data/exemplars

Letter frequency is an important thing too, but on the other hand, letters that are unique to some language are often infrequent in it, e.g. ѕ (Cyrillic) in Macedonian and є in Ukrainian.

wooorm (Owner) commented Feb 19, 2020

Nice, we can crawl them from cldr: bg, ru, mk
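
A naive sketch of turning an exemplarCharacters value (CLDR's UnicodeSet notation, e.g. "[а б в …]") into a plain character set; it ignores ranges, escapes, and multi-character sequences, and the sample value below is written out for illustration rather than copied from CLDR:

    // Sketch: naive parsing of a CLDR exemplarCharacters value into a Set.
    // Ignores ranges like "а-я", escapes, and multi-character sequences.
    function parseExemplars(unicodeSet) {
      return new Set(
        unicodeSet
          .replace(/^\[|\]$/g, '') // strip the surrounding brackets
          .split(/\s+/)
          .filter((item) => [...item].length === 1)
      );
    }

    const ru = parseExemplars(
      '[а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я]'
    );
    console.log(ru.has('э'), ru.has('є')); // true false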

wooorm (Owner) commented Mar 27, 2020

@thorn0 Is this something you’d be interested in working on?

thorn0 (Author) commented Mar 27, 2020

It's unlikely I'll have time for this any time soon.

niftylettuce commented

@thorn0 @wooorm I would put a $50 bug bounty on this payable by PayPal if anyone had the time!

muratcorlu commented

> I remember there is a Turkish i variant that isn’t used anywhere else as well; forgot what it was, though.

@wooorm Yes, ı and İ are specific to Turkish.
