Struggling to detect English #3

peterbe · 2017-03-09T19:54:52Z

I took your awesome lib and wrapped it in a little command line app. I also added a conversion table from ISO 639-3 to ISO 639-1.

▶ ./langdetect "Le candidat socialiste à l’élection présidentielle"
Language: fra Script: Latin
fr

Correct!

▶ ./langdetect "Mitt namn på svenska är Peter"
Language: swe Script: Latin
sv

Correct!

But....

▶ ./langdetect "testing in english"
Language: uig Script: Latin
ug

Not right.

▶ ./langdetect "wondering if it still works in English"
Language: nld Script: Latin
nl

Not right either.

Also, would it be possible to output a list of probabilities? That way my app, where I hope to use this, could throw warnings if the probabilities "aren't certain enough".

peterbe · 2017-03-09T20:19:23Z

By the way, my hack is here: https://github.com/peterbe/langdetect
Careful, I haven't written Go in a long time.

abadojack · 2017-03-10T03:10:09Z

I noticed this while I was testing and I haven't found a solution yet. I added it to the TODO list in this commit.

peterbe · 2017-03-10T16:20:41Z

For what it's worth, I tested something similar with a different project (https://pypi.python.org/pypi/guess_language-spirit) and it's based on a bunch of trigrams too. I can't remember the example but it suffered the same problem.

abadojack · 2017-03-11T14:41:30Z

I just checked it out and tested it with the same phrases you used.

>>> guess_language("Le candidat socialiste à l’élection présidentielle.")
'fr'

Correct!

>>> guess_language("Mitt namn på svenska är Peter")
'sv'

Correct!

>>> guess_language("testing in english")
'UNKNOWN'

Not correct.

>>> guess_language("wondering still if it works in english")
'af'

Still not correct.

Considering the trigrams used in that project are not similar to the ones I used... wtf is wrong with English?

peterbe · 2017-03-12T18:02:20Z

I tried franc too.

var franc = require('franc');

console.log(franc("Le candidat socialiste à l’élection présidentielle."))
console.log(franc("Mitt namn på svenska är Peter"))
console.log(franc("testing in english"))
console.log(franc("wondering still if it works in english"))

Output:

▶ node test.js
fra
swe
uig
nld

azer · 2017-10-19T07:10:49Z

Short inputs like "happy" or "hello" also return incorrect results.

ernsheong · 2017-11-16T07:18:31Z

Looks like the longer the string, the better the result?

Hilarious:

func TestLangDetection(t *testing.T) {
	lang := whatlanggo.DetectLang("english english english english english english english")
	if lang != whatlanggo.Eng {
		t.Fatalf("Expected lang to be %v but was %v", whatlanggo.LangToString(whatlanggo.Eng), whatlanggo.LangToString(lang))
	}
}
 
--- FAIL: TestLangDetection (0.00s)
        search_test.go:12: Expected lang to be eng but was uzb
FAIL

Providing a Whitelist via DetectLangWithOptions helps.

Algorithm:

Quote from the article:

Disadvantages:

May provide falsy results for short texts (smaller than 200-300 letters). Whatlang tries to compensate this with is_reliable attribute.

So there are 2 ways to fix/mitigate this:

1. Provide a confidence value.
2. As the article says, for small amounts of text we could try dictionary checks instead.
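
The dictionary check mentioned above could look roughly like the following. This is only a minimal sketch: `wordLists`, `set` and `guessByDictionary` are hypothetical names, the word lists are tiny placeholders rather than real frequency lists, and none of this is part of whatlanggo. It counts how many words of the input appear in each language's list and refuses to answer on a tie.

```go
package main

import (
	"fmt"
	"strings"
)

// set builds a lookup set from a list of words.
func set(words ...string) map[string]bool {
	m := make(map[string]bool, len(words))
	for _, w := range words {
		m[w] = true
	}
	return m
}

// wordLists maps an ISO 639-3 code to a tiny, hypothetical set of common
// words. A usable list would need far more entries per language.
var wordLists = map[string]map[string]bool{
	"eng": set("the", "and", "is", "in", "if", "it", "still", "works", "hello", "happy"),
	"nld": set("de", "het", "een", "en", "niet", "in"),
	"swe": set("och", "är", "på", "mitt", "namn", "det"),
}

// guessByDictionary returns the language whose list contains the most
// words of text, together with the number of hits. It returns "" when
// nothing matches or when two languages share the best hit count.
// Punctuation is not stripped, so "english." would not match "english".
func guessByDictionary(text string) (lang string, hits int) {
	words := strings.Fields(strings.ToLower(text))
	tied := false
	for code, list := range wordLists {
		n := 0
		for _, w := range words {
			if list[w] {
				n++
			}
		}
		switch {
		case n > hits:
			lang, hits, tied = code, n, false
		case n == hits && n > 0:
			tied = true
		}
	}
	if tied {
		return "", hits
	}
	return lang, hits
}

func main() {
	for _, s := range []string{
		"wondering if it still works in English",
		"Mitt namn på svenska är Peter",
		"testing in english",
	} {
		lang, hits := guessByDictionary(s)
		fmt.Printf("%q -> lang=%q hits=%d\n", s, lang, hits)
	}
}
```

With lists this small, "testing in english" is still ambiguous, because "in" is a common word in both the English and the Dutch list. A dictionary fallback is only as good as its word lists.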

miku · 2018-05-05T12:29:24Z

First, thanks for making this library available, it helped me to get rid of a C dependency from cld2.

Just wanted to add a few additional data points concerning non-detection of English text:

"We report on 630 nm-band AIGalnP strained MQW laser diodes incorporating an MQB. The laser offer high-temperature operation over 60/spl deg/C and have been operating reliably for more than 1,000 h under 3 mW at 45/spl deg/C." -- dan
"Transverse-mode stabilized GaInP/AlGaInP strained multiquantum well lasers emitting at 638 nm were grown on a 15 degrees off" -- dan
"A convenient synthesis of vicinal methoxychlorides, methoxyiodides from alkenes using diphenyldiiodo-tetrachloride/methanol, iodine/diphenyldiiodotetra-chloride/methanol, iodine/3-carboxyphenyliod odichloride/methanol is described." -- spa
"We demonstrate ultra-wideband (850 to 1550 nm) WDM transmission in multi-mode fiber by using single-mode photonic crystal fiber (PCF) as center launching and mode-filtering devices." -- deu

greyblake · 2018-10-27T14:21:18Z

Hi. I am the creator of the original library in Rust.
I just want to say that you should not have high expectations if the input text is relatively small. The library is based on statistical profiles of languages (trigrams). The bigger the input text, the better it represents the statistical profile of its language.

There is nothing you can do about this issue.

Some possible workarounds, however:

- Use a whitelist / blacklist.
- Use another library (possibly in combination with this one) that is fundamentally different (e.g. based on vocabulary).

@miku

Just wanted to add a few additional data points concerning non-detection of English text:

The Rust version currently provides an is_reliable boolean in the result. When it is true, the language is guaranteed to be recognized correctly; otherwise you should not trust the result. For all of your text samples it returns is_reliable=false.
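
To illustrate the point about statistical profiles, here is a simplified sketch of trigram counting. It is not the extraction code used by either library: `trigramCounts` is a hypothetical helper, and the lower-casing and space padding are assumptions made for the example. It shows how few trigrams a three-word input offers for comparison against a language profile.

```go
package main

import (
	"fmt"
	"strings"
)

// trigramCounts returns how often each 3-rune window occurs in text.
// The text is lower-cased and padded with one space on each side so
// that the word boundaries at both ends are counted as well.
func trigramCounts(text string) map[string]int {
	runes := []rune(" " + strings.ToLower(text) + " ")
	counts := make(map[string]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	return counts
}

func main() {
	short := "testing in english"
	long := "Transverse-mode stabilized GaInP/AlGaInP strained multiquantum well lasers emitting at 638 nm were grown on a 15 degrees off"

	fmt.Printf("short input: %d distinct trigrams\n", len(trigramCounts(short)))
	fmt.Printf("long input:  %d distinct trigrams\n", len(trigramCounts(long)))
}
```

Every trigram in the short input occurs exactly once, so there is no frequency ranking to speak of, and a ranking is what a trigram profile comparison relies on.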

mmorells · 2019-02-24T18:47:30Z

#15 added confidence, using the logic from the original library by @greyblake.
It doesn't fix the problems with detection, but Info{} now has the confidence rating.
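
For the "throw warnings if the probabilities aren't certain enough" use case from the first comment, a caller could gate on the confidence value. The sketch below is self-contained and makes several assumptions: `detection` is a hypothetical stand-in and not the library's Info type, the confidence is assumed to lie in [0, 1], and both the 0.8 threshold and the sample values are made up for illustration.

```go
package main

import "fmt"

// detection is a hypothetical stand-in for a detector result. It is not
// the library's Info type; it only carries the two fields used here.
type detection struct {
	Lang       string  // ISO 639-3 code, e.g. "eng"
	Confidence float64 // assumed to lie in [0, 1]
}

// accept reports whether d names a language and meets the minimum
// confidence the caller is willing to trust.
func accept(d detection, minConfidence float64) bool {
	return d.Lang != "" && d.Confidence >= minConfidence
}

func main() {
	// Made-up values for illustration, not real detector output.
	d := detection{Lang: "uig", Confidence: 0.12}

	if accept(d, 0.8) {
		fmt.Printf("detected %s\n", d.Lang)
	} else {
		fmt.Printf("warning: low-confidence guess %q (%.2f)\n", d.Lang, d.Confidence)
	}
}
```

A reasonable threshold depends on the application and on how the detector computes its confidence, so it is worth tuning against real samples.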

abadojack self-assigned this Mar 10, 2017

abadojack added the enhancement label Mar 10, 2017
