Issue with fragment coordinates inside DetectionResult #268

Open
Mrodent opened this issue Nov 6, 2023 · 2 comments

Comments


Mrodent commented Nov 6, 2023

I've identified a slight problem with the DetectionResults produced by detect_multiple_languages_of. I do appreciate that this is an "experimental" feature, but it probably needs to be addressed.

To obtain the text in these fragments I'm doing this:

let mut output_str = format!(
    "document {} self.n_ldocs {} text:\n|{}|\n",
    self.path_str, self.n_ldocs, ldoc_text
);
for detection_result in detection_results {
    // this slice panics if either index falls inside a multi-byte character:
    let fragment: &str = &ldoc_text[detection_result.start_index()..detection_result.end_index()];
    let confidence = self.handling_framework.language_detector
        .compute_language_confidence(fragment, detection_result.language());
    if confidence > 0.25 {
        // ... probably a sensible language classification
    } else {
        // ... may be gibberish, or rather technical, or something like that
    }
}

Obviously these indices are byte offsets into the String.

With the above, VERY occasionally, I get a panic. Examples:
"thread '<unnamed>' panicked at 'byte index 654 is not a char boundary; it is inside 'ο' (bytes 653..655) of ``ἐυηνεμον: ἐυ \"good\" + ἀνεμος \"wind\". ..."
or
"thread '<unnamed>' panicked at 'byte index 65 is not a char boundary; it is inside '\u{f0fc}' (bytes 64..67) of `` 7.3 ..."

In processing about 2 million words I only get a handful of panics, and these appear to be on VERY exotic and obscure Unicode: accented Ancient Greek or U+F0FC "Private Use Character".

But the trouble is that, currently, I'm not too sure how to "catch" such a thing before it panics: slicing with [result.start_index()..result.end_index()] doesn't produce a Result when a char boundary is violated, it just panics! And this in turn means that the whole of the rest of the document I'm parsing never gets processed. (My documents are parsed in parallel threads, so other documents are unaffected.)

Naturally, I'm trying to find a way to test each proposed slice to find out whether or not it's "legal". But even if I do, obviously this would mean a lot more processing time.
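For what it's worth, the standard library already has a non-panicking way to test exactly this: str::get takes a range and returns an Option<&str>, yielding None instead of panicking whenever either end of the range falls inside a multi-byte character (str::is_char_boundary is the underlying check). A minimal self-contained sketch:

fn main() {
    let text = "ἀνεμος"; // 'ἀ' occupies bytes 0..3
    // byte index 1 is inside 'ἀ', so it is not a valid slice boundary:
    assert!(!text.is_char_boundary(1));
    assert_eq!(text.get(0..1), None);       // no panic, just None
    assert_eq!(text.get(0..3), Some("ἀ"));

    // Applied to the loop above, the panicking slice would become:
    // match ldoc_text.get(detection_result.start_index()..detection_result.end_index()) {
    //     Some(fragment) => { /* classify as before */ }
    //     None => { /* indices are misaligned; skip this fragment */ }
    // }
}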

So maybe you might want to think about providing indices on the chars rather than the bytes: users could then do this:

let text_vec = whole_string.chars().collect::<Vec<_>>();
let fragment = text_vec[detection_result.start_index_of_chars()..detection_result.end_index_of_chars()]
    .iter()
    .collect::<String>();
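A side note on cost: collecting the whole string into a Vec<char> allocates and copies the full document on every call. If char indices did land, a caller could instead translate them back into byte offsets with char_indices and slice the original &str in place. A sketch (start_index_of_chars/end_index_of_chars above are still hypothetical, so plain usize arguments stand in for them):

/// Convert a char-index range into a slice of `text`.
/// Returns None if the range runs past the end of the string.
fn slice_by_chars(text: &str, char_start: usize, char_end: usize) -> Option<&str> {
    // Byte offset of every char boundary, plus one past the final char.
    let mut boundaries = text
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()));
    let byte_start = boundaries.by_ref().nth(char_start)?;
    let byte_end = if char_end > char_start {
        boundaries.nth(char_end - char_start - 1)?
    } else {
        byte_start
    };
    Some(&text[byte_start..byte_end])
}

fn main() {
    let text = "ἀνεμος"; // 6 chars, 13 bytes
    assert_eq!(slice_by_chars(text, 1, 3), Some("νε"));
    assert_eq!(slice_by_chars(text, 0, 6), Some("ἀνεμος"));
    assert_eq!(slice_by_chars(text, 0, 99), None); // out of range, no panic
}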

Mrodent commented Nov 6, 2023

Ah, I have found a workaround: this is the first time I've heard of panic::catch_unwind. Here I'm assuming a confidence of 0.66 is good enough ...

let confidence_values = self.handling_framework.language_detector
    .compute_language_confidence_values(ldoc_text)
    .into_iter()
    .map(|(language, confidence)| (language, (confidence * 100.0).round() / 100.0))
    .collect::<Vec<_>>();
if confidence_values[0].1 > 0.66 {
    // the whole section of text gives a reasonable confidence that it's in one of the languages
    self.handling_framework.n_confident_language_ldocs.fetch_add(1, Ordering::Relaxed);
} else {
    // not confident enough... let's see whether multiple-language analysis makes any sense
    let detection_results: Vec<DetectionResult> =
        self.handling_framework.language_detector.detect_multiple_languages_of(ldoc_text);
    let mut output_str = format!(
        "document {} self.n_ldocs {} text:\n|{}|\n",
        self.path_str, self.n_ldocs, ldoc_text
    );
    for detection_result in detection_results {
        let start_index = detection_result.start_index();
        let end_index = detection_result.end_index();
        // catch_unwind turns the char-boundary panic into an Err,
        // so the rest of the document can still be processed
        let result = std::panic::catch_unwind(|| &ldoc_text[start_index..end_index]);
        let fragment = match result {
            Ok(fragment) => fragment,
            Err(e) => {
                error!("got Unicode char boundary error: {:#?}", e);
                output_str.push_str(&format!(
                    "[!!! {} bytes with problem Unicode]",
                    end_index - start_index
                ));
                continue;
            }
        };
        let confidence = self.handling_framework.language_detector
            .compute_language_confidence(fragment, detection_result.language());
        output_str.push_str(&format!(
            "{:?} fragment |{}|\n... confidence {}\n",
            detection_result, fragment, confidence
        ));
        if confidence > 0.66 {
            // ... (the fragment is marked and treated as being "probably language X")
        } else if confidence > 0.33 {
            // ... (the fragment is marked and treated as being "possibly language X")
        } else {
            // ... (the fragment is marked and treated as being "language unknown")
        }
    }
}
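A caveat on the catch_unwind route: the default panic hook still prints the "not a char boundary" message (and possibly a backtrace) to stderr for every caught panic, and catch_unwind doesn't work at all when the binary is compiled with panic = "abort". A sketch of temporarily silencing the hook, assuming nothing else in the process relies on it (the hook is process-global, so this affects all threads):

use std::panic;

fn main() {
    // Swap in a no-op hook so caught panics don't spam stderr.
    panic::set_hook(Box::new(|_info| {}));

    let text = String::from("ἀνεμος");
    // Byte index 1 is inside 'ἀ' (bytes 0..3): this panics, but quietly.
    let result = panic::catch_unwind(|| &text[0..1]);
    assert!(result.is_err());

    // Restore the default hook once the batch is done.
    let _ = panic::take_hook();
}

The str::get approach sketched earlier sidesteps the panic machinery entirely, which is likely the cheaper option when misaligned fragments are common.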

With my corpus of documents (.docx files, each one split up into overlapping 10-paragraph subsections, each subsection constituting an "LDoc", i.e. a Lucene document) totalling 1.75 million words, this generates 31,727 LDocs, of which 30,694 can be classified as a given language confidently enough that multiple-language detection isn't needed.

In terms of processing times, this actually works out at only about 50% more time than doing no language analysis at all, so pretty good really.

@pemistahl (Owner) commented

The indices are actually supposed to be character indices, not byte indices. It turns out that this is a bug in the regex crate which my library uses internally. This bug has been fixed since my last release. I'm going to create a new release soon which will use the updated regex crate including this bug fix. Then the errors you are getting now will hopefully be gone.
