Add Kanji support #152

OuOu2021 · 2023-02-08T13:28:18Z

Before we start, I would like to clarify some terms. Kanji are the Japanese characters derived from Chinese characters. I will use "Chinese character" as an umbrella term for Simplified Chinese characters, Traditional Chinese characters, and Kanji.

It seems that Lingua identifies all Chinese characters as Chinese with a confidence value of 100 percent, which is not right. Some Kanji words are written identically in Chinese (such as 豆腐 (tofu) and 科学 (science)), while some Kanji are used in neither Simplified nor Traditional Chinese. For example, "economy" is written as "经济" in Simplified Chinese, "經濟" in Traditional Chinese, and "経済" in Kanji, yet Lingua 1.4 classifies all three as Chinese with 100% confidence.

This is not a big problem, because a slightly longer text such as a tweet in Japanese is likely to contain kana, which helps Lingua distinguish it. Still, it is incorrect to classify Kanji that are unambiguously used only in Japanese as 100% Chinese, so I wanted to point it out.
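To illustrate why kana make longer Japanese texts easy to separate from Chinese, here is a minimal sketch that checks for characters in the Unicode Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks. The function name `contains_kana` is hypothetical, and this is not a description of how Lingua works internally.

```python
def contains_kana(text: str) -> bool:
    """Return True if the text has at least one character from the
    Hiragana or Katakana Unicode blocks (hypothetical helper)."""
    for ch in text:
        # Hiragana: U+3040..U+309F, Katakana: U+30A0..U+30FF (adjacent blocks)
        if 0x3040 <= ord(ch) <= 0x30FF:
            return True
    return False
```

A sentence such as 勉強しています contains kana and passes this check, while kanji-only words such as 勉強中 or 経済 do not, which is exactly the case reported here.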

Also see greyblake/whatlang-rs/issues/122

OuOu2021 · 2023-02-08T13:32:06Z

経済: (Chinese, 1.0)
和製漢字: (Chinese, 1.0)
雫: (Chinese, 1.0)
労働: (Chinese, 1.0)
峠: (Chinese, 1.0)
勉強中: (Chinese, 1.0)
自動販売機: (Chinese, 1.0)

All of these are unambiguously Japanese words.

pemistahl · 2023-02-15T08:26:50Z

Hi @OuOu2021, thank you for reaching out to me. You can probably imagine how difficult it is to solve this problem. The language models I use for Chinese and Japanese are obviously insufficient for words such as your examples. Perhaps it would help to determine which characters are really unique to Chinese or Japanese and to extend the language models with this information. I will try to improve the library in this regard, but it may take significant time as the to-do list is pretty long already.
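The idea of using characters unique to one language could be sketched roughly as follows. This is only an illustration under stated assumptions: the two character sets are tiny and hand-picked from the examples in this thread rather than taken from a verified list, the names `JAPANESE_ONLY`, `CHINESE_ONLY` and `classify_han` are hypothetical, and none of this reflects Lingua's actual implementation.

```python
# Illustrative, hand-picked sets (assumption: not exhaustive, not verified).
JAPANESE_ONLY = set("経済労働峠")   # shinjitai forms and kokuji from the thread
CHINESE_ONLY = set("经济")          # simplified forms from the thread


def classify_han(text: str) -> str:
    """Return 'japanese', 'chinese', or 'ambiguous' for Han-only text,
    based solely on characters assumed unique to one language."""
    has_ja = any(ch in JAPANESE_ONLY for ch in text)
    has_zh = any(ch in CHINESE_ONLY for ch in text)
    if has_ja and not has_zh:
        return "japanese"
    if has_zh and not has_ja:
        return "chinese"
    # No unique character found, or conflicting evidence.
    return "ambiguous"
```

With these sets, 経済 is classified as Japanese and 经济 as Chinese, while shared words such as 豆腐 stay ambiguous and would still need the statistical models to decide.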

pemistahl added the enhancement (New feature or request) label on Feb 15, 2023
