Struggling to detect English #3
By the way, my hack is here: https://github.com/peterbe/langdetect
I noticed this while I was testing and I haven't found a solution yet. I added it to the TODO list in this commit.
For what it's worth, I tested something similar with a different project (https://pypi.python.org/pypi/guess_language-spirit) and it's based on a bunch of trigrams too. I can't remember the example but it suffered the same problem.
I just checked it out and tested it with the same phrases you used.
>>> guess_language("Le candidat socialiste à l’élection présidentielle.")
'fr'
Correct!
>>> guess_language("Mitt namn på svenska är Peter")
'sv'
Correct!
>>> guess_language("testing in english")
'UNKNOWN'
Not correct.
>>> guess_language("wondering still if it works in english")
'af'
Still not correct. Considering that the trigrams used in that project are not similar to the ones I used... wtf is wrong with English?
I tried franc too.
var franc = require('franc');
console.log(franc("Le candidat socialiste à l’élection présidentielle."));
console.log(franc("Mitt namn på svenska är Peter"));
console.log(franc("testing in english"));
console.log(franc("wondering still if it works in english"));
Output:
Short inputs like "happy" and "hello" also return incorrect results.
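To illustrate why short inputs are hard for trigram-based detectors, here is a minimal sketch that just counts the trigrams in a string. The `trigrams` helper is hypothetical and is not how whatlanggo builds its profiles internally; it only shows how little evidence a single word provides compared with a longer phrase.

```go
package main

import (
	"fmt"
	"strings"
)

// trigrams counts every 3-character window in text, after lowercasing
// and padding with one space on each side. Hypothetical helper for
// illustration only; real detectors compare ranked trigram profiles.
func trigrams(text string) map[string]int {
	runes := []rune(" " + strings.ToLower(strings.TrimSpace(text)) + " ")
	counts := make(map[string]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	return counts
}

func main() {
	for _, s := range []string{"hello", "testing in english"} {
		fmt.Printf("%q -> %d distinct trigrams\n", s, len(trigrams(s)))
	}
}
```

With only a handful of trigrams to match against, many language profiles can look about equally plausible, which fits the misdetections reported in this thread.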
Looks like the longer the string, the better the result? Hilarious:
func TestLangDetection(t *testing.T) {
	lang := whatlanggo.DetectLang("english english english english english english english")
	if lang != whatlanggo.Eng {
		t.Fatalf("Expected lang to be %v but was %v", whatlanggo.LangToString(whatlanggo.Eng), whatlanggo.LangToString(lang))
	}
}
--- FAIL: TestLangDetection (0.00s)
search_test.go:12: Expected lang to be eng but was uzb
FAIL
Providing a Whitelist via Algorithm:
Quote from the article:
So there are 2 ways to fix/mitigate this:
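The whitelist idea can be sketched roughly as follows. This is only an illustration under assumptions: `bestLang` is a hypothetical helper, and the scores are made up to mirror the `uzb` versus `eng` confusion above; they are not real output from any library.

```go
package main

import "fmt"

// bestLang returns the highest-scoring language code in scores.
// If whitelist is non-empty, only whitelisted codes are considered.
// Ties are broken alphabetically so the result is deterministic.
// It returns "" when no candidate qualifies.
func bestLang(scores map[string]float64, whitelist map[string]bool) string {
	best, bestScore, found := "", 0.0, false
	for lang, score := range scores {
		if len(whitelist) > 0 && !whitelist[lang] {
			continue // candidate not allowed by the caller
		}
		if !found || score > bestScore || (score == bestScore && lang < best) {
			best, bestScore, found = lang, score, true
		}
	}
	return best
}

func main() {
	// Made-up scores for illustration only.
	scores := map[string]float64{"uzb": 0.41, "eng": 0.39, "fra": 0.10}
	allowed := map[string]bool{"eng": true, "fra": true, "swe": true}

	fmt.Println("no whitelist:  ", bestLang(scores, nil))
	fmt.Println("with whitelist:", bestLang(scores, allowed))
}
```

If the application only expects a few languages, restricting the candidates this way removes unlikely winners such as Uzbek before the best match is picked.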
First, thanks for making this library available, it helped me to get rid of a C dependency from cld2. Just wanted to add a few additional data points concerning non-detection of English text:
Hi. I am the creator of the original library in Rust. There is not much you can do about this issue. Some possible solutions, however:
The Rust version at the moment provides
#15 Added confidence with the logic from the original library by @greyblake
I took your awesome lib and wrapped it in a little command line app. I also added a conversion table from ISO 639-3 to ISO 639-1.
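A conversion table like the one mentioned here could look something like this. It is a minimal sketch covering only a few codes that come up in this thread; `iso3to1` and `toISO6391` are hypothetical names, not part of the command line app or of whatlanggo.

```go
package main

import "fmt"

// iso3to1 maps a few ISO 639-3 codes to their ISO 639-1 equivalents.
// Deliberately incomplete; a real table would cover every supported language.
var iso3to1 = map[string]string{
	"eng": "en",
	"fra": "fr",
	"swe": "sv",
	"afr": "af",
	"uzb": "uz",
}

// toISO6391 returns the two-letter code for a three-letter code,
// and false if the code is not in the table.
func toISO6391(code string) (string, bool) {
	short, ok := iso3to1[code]
	return short, ok
}

func main() {
	for _, c := range []string{"eng", "swe", "xxx"} {
		if short, ok := toISO6391(c); ok {
			fmt.Println(c, "->", short)
		} else {
			fmt.Println(c, "-> not in this table")
		}
	}
}
```

Returning a boolean alongside the code matters because not every ISO 639-3 language has a two-letter equivalent.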
Correct!
Correct!
But....
Not right.
Not right either.
Also, would it be possible to output a list of probabilities? That way my app, where I hope to use this, could throw warnings if the probabilities "aren't certain enough".
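The kind of output being asked for might look like the sketch below: a ranked list of candidates with normalised probabilities, plus a check the calling app can use to warn. Everything here is an assumption for illustration (`Candidate`, `rank`, `uncertain`, and the raw scores); it is not an existing whatlanggo API.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate pairs a language code with its normalised probability.
type Candidate struct {
	Lang string
	Prob float64
}

// rank normalises non-negative raw scores so they sum to 1 and returns
// the candidates sorted from most to least likely (ties alphabetical).
func rank(scores map[string]float64) []Candidate {
	total := 0.0
	for _, s := range scores {
		total += s
	}
	out := make([]Candidate, 0, len(scores))
	if total == 0 {
		return out
	}
	for lang, s := range scores {
		out = append(out, Candidate{Lang: lang, Prob: s / total})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Prob != out[j].Prob {
			return out[i].Prob > out[j].Prob
		}
		return out[i].Lang < out[j].Lang
	})
	return out
}

// uncertain reports whether the top candidate falls below minProb.
func uncertain(ranked []Candidate, minProb float64) bool {
	return len(ranked) == 0 || ranked[0].Prob < minProb
}

func main() {
	// Made-up raw scores for illustration only.
	ranked := rank(map[string]float64{"eng": 3, "afr": 1})
	for _, c := range ranked {
		fmt.Printf("%s %.2f\n", c.Lang, c.Prob)
	}
	if uncertain(ranked, 0.8) {
		fmt.Println("warning: detection is not certain enough")
	}
}
```

The threshold is left to the caller, so an app processing short strings can be stricter than one processing whole documents.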