False positives with gibberish #172

Open
SkeletalDemise opened this issue Apr 17, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@SkeletalDemise

Lingua produces false positives on gibberish input: it identifies these strings as languages when it should return None.

Examples:
vszzc hvwg wg zcbu hslh
5HeQsKSTseGZrDvdCAUYr6DyxS5jy4953UWACh9bN2rUFkj2sDuY3BS
VGhpcyBpcyBhbiBleGFtcGxlIG9mIGJhc2U2NA==
KZDWQ4DDPFBHAY3ZIJUGE2KCNRSUORTUMNDXQ3CJI44W2SKHJJUGGMSVGJHECPJ5

The project I'm working on handles a lot of gibberish, so we need to distinguish real languages from gibberish. I've been looking for solutions, but I'm not an expert in NLP.

I'd like your opinion on what the best solution for that use case would be.

@pemistahl
Owner

pemistahl commented Apr 21, 2023

Hi @SkeletalDemise, thank you for reaching out to me. Currently, Lingua is not able to identify gibberish. It sums up probabilities for letter sequences (= ngrams) learned from training data for each supported language. Even the ngrams in gibberish have a certain probability and Lingua simply returns the language with the highest probability. So it's not that easy to identify gibberish. But I will think about how to solve this as it is a pretty interesting problem.
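
To illustrate the mechanism described above, here is a minimal sketch in Python. It is not Lingua's implementation: the two tiny training sentences, the trigram-only models, the `UNSEEN` floor probability, and the function names are all assumptions made for illustration. It shows that summing (log) ngram probabilities and taking the maximum always yields some language, even for gibberish. The last function adds a naive cutoff on the average log-probability per ngram; this is a hypothetical heuristic, not something Lingua does, and the threshold would need tuning on real data.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def trigrams(text: str) -> List[str]:
    """Split lowercased text into overlapping character trigrams."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(corpus: str) -> Dict[str, float]:
    """Return the relative frequency of each trigram in the corpus."""
    counts = Counter(trigrams(corpus))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


# Toy training data (hypothetical; real models are trained on large corpora).
MODELS: Dict[str, Dict[str, float]] = {
    "english": train("this is an example of a short english text "
                     "that the model learns from"),
    "german": train("dies ist ein beispiel für einen kurzen deutschen text "
                    "aus dem das modell lernt"),
}

# Assumed floor probability for trigrams never seen in training.
UNSEEN = 1e-6


def score(text: str, model: Dict[str, float]) -> float:
    """Sum the log-probabilities of the text's trigrams under one model."""
    return sum(math.log(model.get(gram, UNSEEN)) for gram in trigrams(text))


def detect(text: str) -> str:
    """Return the language with the highest score; never returns None."""
    scores = {lang: score(text, model) for lang, model in MODELS.items()}
    return max(scores, key=scores.get)


def detect_or_none(text: str, min_avg_logprob: float = -10.0) -> Optional[str]:
    """Like detect(), but reject text whose best average log-probability
    per trigram is below a cutoff (hypothetical heuristic)."""
    grams = trigrams(text)
    if not grams:
        return None
    best = detect(text)
    average = score(text, MODELS[best]) / len(grams)
    return best if average >= min_avg_logprob else None


if __name__ == "__main__":
    gibberish = "vszzc hvwg wg zcbu hslh"
    print(detect("this is a text"))    # english
    print(detect(gibberish))           # still names a language
    print(detect_or_none(gibberish))   # None
```

In this toy setup every trigram of the gibberish string is unseen in both models, so the scores tie and `detect` still picks a language. A fixed cutoff like the one in `detect_or_none` is unlikely to be reliable on its own for short inputs or for base64-like strings that happen to share ngrams with real languages.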
