Strange results for Chinese with Japanese #38

71sprite · 2023-04-24T09:25:17Z

To reproduce:

package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detector := lingua.NewLanguageDetectorBuilder().
		FromAllLanguages().
		Build()

	text := "上海大学是一个好大学. わー!"
	if language, exists := detector.DetectLanguageOf(text); exists {
		fmt.Println(language.String()) // Japanese
	}
}

Expected:
The detector should return Chinese for this input.

https://github.com/pemistahl/lingua-go/blob/main/detector.go#L467

This happens because the code at this line returns Japanese as soon as any character from japaneseCharacterSet is present. I'm not sure whether this is intended.
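The effect can be illustrated with a simplified sketch. This is not the code in detector.go: `hasKana` is a hypothetical helper that only mimics a rule of the form "any kana character means Japanese", using the script tables from Go's standard `unicode` package.

```go
package main

import (
	"fmt"
	"unicode"
)

// hasKana reports whether s contains at least one Hiragana or Katakana rune.
// Hypothetical helper for illustration; it is not part of lingua-go.
func hasKana(s string) bool {
	for _, r := range s {
		if unicode.In(r, unicode.Hiragana, unicode.Katakana) {
			return true
		}
	}
	return false
}

func main() {
	text := "上海大学是一个好大学. わー!"
	// A rule that short-circuits on the first kana rune treats the whole
	// string as Japanese, even though ten of its characters are Han.
	fmt.Println(hasKana(text)) // true
}
```

With a rule like this, a single わ at the end of an otherwise Chinese sentence is enough to decide the result.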

Thanks for the awesome work!


pemistahl · 2023-04-25T07:22:18Z

Hi @71sprite, thanks for your request.

I'm aware of how difficult it is to recognize Chinese and Japanese correctly. These are actually the most difficult languages. I will try to improve the algorithm, but as I don't speak these languages, it's not easy. If you speak them and have ideas for heuristics to implement, I will be glad to read about them.

71sprite · 2023-04-26T08:24:15Z

I have also read some documentation (List_of_Unicode_characters), and it is indeed impossible to distinguish accurately among Chinese, Japanese and Korean. Perhaps we can decide based on the Unicode range.

func isChinese(c rune) bool {
	// Chinese Unicode ranges (upper bounds widened to the full blocks)
	return (c >= '\u3400' && c <= '\u4dbf') || // CJK Unified Ideographs Extension A
		(c >= '\u4e00' && c <= '\u9fff') || // CJK Unified Ideographs
		(c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs
}

func isJapanese(c rune) bool {
	// Japanese Unicode ranges
	return (c >= '\u3021' && c <= '\u3029') || // Hangzhou numerals (in CJK Symbols and Punctuation)
		(c >= '\u3040' && c <= '\u309f') || // Hiragana
		(c >= '\u30a0' && c <= '\u30ff') || // Katakana
		(c >= '\u31f0' && c <= '\u31ff') || // Katakana Phonetic Extensions
		(c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs (also matched by isChinese)
}
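One way such range checks could feed a decision is to compare counts instead of stopping at the first Japanese character. The following is only a sketch under stated assumptions: `guessChineseOrJapanese` and `kanaThreshold` are hypothetical names, the 0.2 cutoff is arbitrary and untuned, and the script tables come from Go's standard `unicode` package, not from lingua-go.

```go
package main

import (
	"fmt"
	"unicode"
)

// kanaThreshold is an arbitrary, untuned cutoff chosen for illustration.
const kanaThreshold = 0.2

// guessChineseOrJapanese returns "Japanese" when the share of kana among all
// Han and kana runes exceeds kanaThreshold, "Chinese" when it does not,
// and "" when the text contains neither script.
func guessChineseOrJapanese(s string) string {
	var han, kana int
	for _, r := range s {
		switch {
		case unicode.In(r, unicode.Hiragana, unicode.Katakana):
			kana++
		case unicode.Is(unicode.Han, r):
			han++
		}
	}
	total := han + kana
	if total == 0 {
		return ""
	}
	if float64(kana)/float64(total) > kanaThreshold {
		return "Japanese"
	}
	return "Chinese"
}

func main() {
	fmt.Println(guessChineseOrJapanese("上海大学是一个好大学. わー!")) // Chinese
	fmt.Println(guessChineseOrJapanese("私は学生です"))          // Japanese
}
```

A ratio like this still cannot separate Chinese from Japanese text written entirely in kanji, so it would complement a statistical model and not replace it.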

lyricat · 2023-07-22T14:50:40Z

As a speaker of Chinese and Japanese, I vote for @71sprite's suggestion.
