Before we start, I would like to clarify some terms. Kanji are the Japanese characters derived from Chinese characters. I will use "Chinese characters" as a collective name for Simplified Chinese characters, Traditional Chinese characters and Kanji.
It seems that Lingua identifies all Chinese characters as Chinese with a confidence value of 100 percent, which is not right. Some Kanji words are written exactly the same in Chinese (such as 豆腐 (tofu) and 科学 (science)), while some Kanji are used in neither Simplified Chinese nor Traditional Chinese. For example, "economy" is written as "经济" in Simplified Chinese, "經濟" in Traditional Chinese and "経済" in Kanji, but Lingua 1.4 classifies all three as Chinese with 100% confidence.
This is not a big problem, because a slightly longer text, such as a tweet in Japanese, is likely to contain kana, which helps Lingua distinguish it. Still, it is incorrect to classify Kanji that are unambiguously used only in Japanese as 100% Chinese, so I wanted to point it out.
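To illustrate why kana makes longer texts easy to tell apart, here is a minimal sketch of a kana check based only on Unicode code point ranges. It does not use Lingua and is not how Lingua works internally; the function name `contains_kana` is made up for this example.

```python
def contains_kana(text: str) -> bool:
    """Return True if the text has at least one hiragana or katakana letter."""
    for ch in text:
        cp = ord(ch)
        # Hiragana letters: U+3041..U+3096, katakana letters: U+30A1..U+30FA.
        # Punctuation in these blocks is deliberately left out.
        if 0x3041 <= cp <= 0x3096 or 0x30A1 <= cp <= 0x30FA:
            return True
    return False


if __name__ == "__main__":
    for sample in ["経済", "経済について", "豆腐"]:
        print(sample, contains_kana(sample))
```

A Kanji-only word such as "経済" contains no kana, so a check like this cannot help with the short inputs described above.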
Hi @OuOu2021, thank you for reaching out to me. You can probably imagine how difficult it is to solve this problem. The language models I use for Chinese and Japanese are obviously insufficient for words such as your examples. Perhaps it helps to determine which characters are really unique to Chinese or Japanese and to extend the language models with this information. I will try to improve the library in this regard but it may take significant time as the todo list is pretty long already.
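As a rough sketch of the "characters unique to Chinese or Japanese" idea, a lookup over script-specific character sets could provide a hint before the statistical models are consulted. The sets below are hypothetical and contain only the examples from this issue; the names `JAPANESE_ONLY`, `CHINESE_ONLY` and `script_hint` are assumptions for illustration and are not part of Lingua. Treating "經" and "濟" as Chinese-only is itself a simplification, since they also exist as old-style forms in Japanese.

```python
from typing import Optional

# Hypothetical, hand-picked sets taken from the examples in this issue.
# A real list would have to be far larger and carefully curated.
JAPANESE_ONLY = set("経済")      # Japanese forms of the characters in "economy"
CHINESE_ONLY = set("经济經濟")    # Simplified and Traditional forms (assumption)


def script_hint(text: str) -> Optional[str]:
    """Return 'japanese', 'chinese', or None if no distinctive character is found."""
    chars = set(text)
    if chars & JAPANESE_ONLY:
        return "japanese"
    if chars & CHINESE_ONLY:
        return "chinese"
    # Shared words such as 豆腐 or 科学 stay ambiguous.
    return None


if __name__ == "__main__":
    for word in ["経済", "经济", "經濟", "豆腐"]:
        print(word, script_hint(word))
```

Words written identically in both languages return `None` here, so a hint like this could only complement the existing language models, not replace them.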
Also see greyblake/whatlang-rs/issues/122