False positives with gibberish #172

SkeletalDemise · 2023-04-17T02:13:12Z

Lingua produces false positives on gibberish input: it identifies the text as a language when it should return None.

Examples:
vszzc hvwg wg zcbu hslh
5HeQsKSTseGZrDvdCAUYr6DyxS5jy4953UWACh9bN2rUFkj2sDuY3BS
VGhpcyBpcyBhbiBleGFtcGxlIG9mIGJhc2U2NA==
KZDWQ4DDPFBHAY3ZIJUGE2KCNRSUORTUMNDXQ3CJI44W2SKHJJUGGMSVGJHECPJ5

The project I'm working on has a lot of gibberish. We need to distinguish between different languages and gibberish. I've been looking for solutions, but I'm not an expert at NLP.

I'd like your opinion on what the best solution for that use case would be.


pemistahl · 2023-04-21T13:19:30Z

Hi @SkeletalDemise, thank you for reaching out to me. Currently, Lingua is not able to identify gibberish. It sums up probabilities for letter sequences (= ngrams) learned from training data for each supported language. Even the ngrams in gibberish have a certain probability and Lingua simply returns the language with the highest probability. So it's not that easy to identify gibberish. But I will think about how to solve this as it is a pretty interesting problem.
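To illustrate the scoring described above, here is a minimal toy sketch in Python. It is not Lingua's actual implementation: the two tiny training strings, the `UNSEEN` floor probability, the use of trigrams only, and summing log-probabilities rather than raw probabilities are all assumptions made for the example. It only shows why picking the highest-scoring language always yields some language, even for gibberish.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Return all character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpus, n=3):
    """Build a relative-frequency table of n-grams from a training string."""
    counts = Counter(ngrams(corpus, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


# Tiny, made-up training corpora (hypothetical; real models are far larger).
MODELS = {
    "english": train("this is a long text and the weather is nice this morning"),
    "german": train("das ist ein langer text und das wetter ist heute morgen schön"),
}

# Floor probability for n-grams a model has never seen (an assumption here).
UNSEEN = 1e-6


def score(text, model):
    """Sum the log-probabilities of the text's n-grams under one model."""
    return sum(math.log(model.get(gram, UNSEEN)) for gram in ngrams(text))


def detect(text):
    """Return the highest-scoring language; this never returns None."""
    return max(MODELS, key=lambda lang: score(text, MODELS[lang]))


if __name__ == "__main__":
    for sample in ["this is nice", "das ist schön", "vszzc hvwg wg zcbu hslh"]:
        print(repr(sample), "->", detect(sample))
```

Because `detect` takes the maximum over the supported languages, the gibberish sample is still assigned one of them. Returning None would require some additional criterion on top of the scores, which is the open problem discussed here.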

pemistahl added the enhancement New feature or request label Apr 21, 2023

Mrodent mentioned this issue Nov 5, 2023

Improvements with multi-language detection? #266

Open
