Improvements with multi-language detection? #266

Open
Mrodent opened this issue Nov 5, 2023 · 2 comments

Comments


Mrodent commented Nov 5, 2023

Firstly: THANK YOU! Not least for attempting the daunting task of multi-language detection. I have great hopes for this crate!

Ultimately, say you are creating an Elasticsearch index (my case), and a proportion of your documents contain mixed languages. Detecting the language of fragments of lines or paragraphs is critically important: if you try to index French text with an English stemmer analyser, you risk producing a very defective index, so in that case you are likely to opt not to use a stemmer analyser at all...

Here is an example of output from multi-language detection:

The text:

|Table of Contents
plex knowledge centre1
2022-01-09
plex knowledge centre
- various other locations for Plex info: IT Diary 2019-12, IT Diary 2022-01
- at the current time I was able to install Plex in W10 fairly easily. There is a bit of confusion over my Plex account: email is mrodent33@gmail.com. But on my phone (after installing \"Plex TV\") I entered forgon34 for the pwd. But I think I used f*******9f when setting up on Linux.
installing Plex on Linux (from IT Diary 2022-01)
- from here
chris@M17A:~$ sudo systemctl status plexmediaserver
[sudo] password for chris: [pwd]
|

These are the detection results (the candidate languages were English, French, German, Spanish, Latin, Irish):

DetectionResult { start_index: 0, end_index: 76, word_count: 2, language: French } |Table of Contents
plex knowledge centre1
2022-01-09
plex knowledge centre
- |

DetectionResult { start_index: 76, end_index: 277, word_count: 2, language: English } |various other locations for Plex info: IT Diary 2019-12, IT Diary 2022-01
- at the current time I was able to install Plex in W10 fairly easily. There is a bit of confusion over my Plex account: email |

DetectionResult { start_index: 277, end_index: 317, word_count: 2, language: Latin } |is mrodent33@gmail.com. But on my phone |

DetectionResult { start_index: 317, end_index: 423, word_count: 2, language: English } |(after installing \"Plex TV\") I entered forgon34 for the pwd. But I think I used f*******9f when setting |

DetectionResult { start_index: 423, end_index: 470, word_count: 2, language: French } |up on Linux.
installing Plex on Linux (from IT |

DetectionResult { start_index: 470, end_index: 497, word_count: 5, language: English } |Diary 2022-01)
- from here
|

DetectionResult { start_index: 497, end_index: 592, word_count: 2, language: Latin } |chris@M17A:~$ sudo systemctl status plexmediaserver
[sudo] password for chris: [pwd]
|

Obviously this is an analysis of what you might call "jottings", not proper sentences, and indeed "jottings of a technical jargony IT-language kind". But even so, I think there should be some way of "giving up the attempt" and saying "can't make head or tail of this part of your text: REJECT".

Perhaps more importantly, as it currently stands the multi-language detection part of your library doesn't deliver confidence levels in its DetectionResults. Without such levels it is difficult to put the results to any practical use. I can of course subject each DetectionResult text fragment to a second analysis for confidence, using the same LanguageDetector and the Language value from detection_result.language(). This actually makes the above results much more manageable: unsurprisingly, the supposed French and Latin fragments above turn out to have very low confidence, the English ones higher.
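Roughly what I have in mind, as a minimal sketch against lingua's current API (detect_multiple_languages_of plus compute_language_confidence; the 0.2 rejection threshold is just an illustrative guess):

```rust
use lingua::Language::{English, French, German, Irish, Latin, Spanish};
use lingua::LanguageDetectorBuilder;

fn main() {
    let detector = LanguageDetectorBuilder::from_languages(&[
        English, French, German, Spanish, Latin, Irish,
    ])
    .build();

    let text = "..."; // the mixed-language document shown above

    // First stage: split the text into (supposedly) single-language spans.
    for result in detector.detect_multiple_languages_of(text) {
        let fragment = &text[result.start_index()..result.end_index()];
        // Second stage: ask the detector how confident it is that the
        // fragment really is in the language reported for the span.
        let confidence =
            detector.compute_language_confidence(fragment, result.language());
        if confidence < 0.2 {
            // Illustrative threshold: treat very low confidence as a rejection.
            println!("REJECT {:?} ({:.2}): {:?}", result.language(), confidence, fragment);
        } else {
            println!("{:?} ({:.2}): {:?}", result.language(), confidence, fragment);
        }
    }
}
```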

So I think it'd be nice to incorporate a confidence rating into DetectionResult. I suspect it'd also help with the earlier issue of "False positives with gibberish": i.e. you feed in about 10 lines... the detector splits this into multiple language fragments, but all of them have a confidence of 0.2 or less, so the conclusion is: GIBBERISH detected!

Naturally all this is going to take CPU power. But by incorporating the confidence rating internally you could probably find some optimisations over what I'm doing above, which is a two-stage analysis...

pemistahl (Owner) commented

> Firstly: THANK YOU! Not least for attempting the daunting task of multi-language detection. I have great hopes for this crate!

And I thank you @Mrodent for your nice words. I'm happy that my library is obviously still useful to people. :)

You are right that it makes sense to include confidence values in the multi-language results. A two-stage analysis is simple to implement but costly, as you have already said. I will think about how to do it more efficiently.


YPares commented May 22, 2024

I've been able to improve the results of multi-language detection by doing a two-pass detection:

  • First, detect on the whole text with detect_multiple_languages_of.
  • Then, on each span produced by the above step, do a single-language detection. If a span is small (say, <80 chars), I extend it to the left and to the right until it reaches a minimal size (so that it includes a bit of its surrounding context) before doing that second detection; see the sketch below.

The second step is akin to a "low-pass filter", so this works quite well on docs where language changes don't happen every few words.

Notably, this allows me to fix proper nouns being incorrectly detected as being in a different language from their surrounding text.
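In outline, the scheme looks something like this (a minimal sketch assuming lingua's detect_multiple_languages_of and detect_language_of; MIN_SPAN_LEN and the byte-wise widening are illustrative choices):

```rust
use lingua::Language::{English, French, German, Spanish};
use lingua::LanguageDetectorBuilder;

const MIN_SPAN_LEN: usize = 80; // spans shorter than this get widened

/// Widen [start, end) symmetrically until it covers at least MIN_SPAN_LEN
/// bytes, clamped to the text and snapped to UTF-8 char boundaries.
fn widen(text: &str, mut start: usize, mut end: usize) -> (usize, usize) {
    while end - start < MIN_SPAN_LEN && (start > 0 || end < text.len()) {
        if start > 0 {
            start -= 1;
            while !text.is_char_boundary(start) {
                start -= 1;
            }
        }
        if end < text.len() {
            end += 1;
            while !text.is_char_boundary(end) {
                end += 1;
            }
        }
    }
    (start, end)
}

fn main() {
    let detector =
        LanguageDetectorBuilder::from_languages(&[English, French, German, Spanish]).build();
    let text = "..."; // mixed-language input

    // First pass: rough single-language spans.
    for span in detector.detect_multiple_languages_of(text) {
        // Second pass: re-detect on the (possibly widened) span so that
        // short fragments carry a bit of surrounding context.
        let (start, end) = widen(text, span.start_index(), span.end_index());
        let smoothed = detector.detect_language_of(&text[start..end]);
        println!(
            "{:?} -> {:?}: {:?}",
            span.language(),
            smoothed,
            &text[span.start_index()..span.end_index()]
        );
    }
}
```

The widening here is symmetric and byte-based; whether to extend by characters, words, or whole lines is a tuning choice.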
