
fix failure to OCR: general quality issue due to LSTM being fed noisy/crappy *original* image pixels instead of cleaned-up binarized pixels. #4111

Open · wants to merge 1 commit into main

Conversation

GerHobbelt (Contributor)

  • fix Bushnell OCR bug (failure to properly OCR the number "11"; see the message chain on the mailing list: https://groups.google.com/g/tesseract-ocr/c/5jrGvsrdqig/m/jvTG6L9zBgAJ, which includes the sample images, text output and context as originally reported by Astro/Nor):

    root cause:

    it turns out tesseract erroneously grabs the ORIGINAL image (instead of the THRESHOLDED/BINARIZED one!) when extracting the word box (Tesseract::GetRectImage()) that is fed into the LSTM neural net to OCR the detected text area.

    Ergo: this fix SHOULD improve OCR results generally, as this is a generic bug impacting ALL text bboxes found in a given input page image before they are pumped into the LSTM engine to obtain the OCR'ed text.

    This fix was verified to work in an otherwise patched/augmented tesseract rig (GerHobbelt/tesseract, commit series bb37cf3, ffc1997, 15d2952, 69416e5, f49826b, d53c1a2, 44f2f84), where I worked on removing the curious BestPix() API, which was SEEMINGLY meant for ScrollView et al. debug display purposes, but is (IMO) ill-named for that purpose.

  • remove the accompanying, now-obsolete comment

  • also remove the need for the BestPix() API in EquationDetect::PrintSpecialBlobsDensity() by invoking the API that delivers what is actually used there: the image height. BestPix() usage is (theoretically) also wrong here, as the sought-after height is that of the binarized image data, i.e. the cleaned-up-and-ready-for-OCR source image. (A rough sketch of both changes follows this list.)
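For illustration, this is roughly the shape of the change; treat it as a sketch, not the literal diff. The member/function names below (BestPix(), pix_binary_, ImageHeight()) are as I read them in the upstream sources and may differ per version:

```diff
 // Tesseract::GetRectImage(): crop the line/word image that is fed to the LSTM.
-  Image pix = BestPix();      // before: original (noisy) pixels
+  Image pix = pix_binary_;    // after: thresholded/binarized pixels

 // EquationDetect::PrintSpecialBlobsDensity(): only the page height is needed,
 // so the BestPix() call can be dropped entirely.
-  const int height = pixGetHeight(BestPix());
+  const int height = ImageHeight();
```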


Corollary of this bug

Anyone feeding tesseract monochrome (pre-thresholded/binarized) images from an external cleanup+binarization process SHOULD already be getting the best OCR results and SHOULD NOT be impacted by this bug, nor by this fix: in that case there is no difference between the 'original pix' and the 'binary pix' from tesseract's perspective.

stweil (Contributor) commented Aug 5, 2023

I tested this commit on a greyscale newspaper image, and the results are mixed: some lines were recognized better, others got worse. Generally, it is intentional to run the LSTM on original images. Ideally, the neural network has been trained to handle noise and even to work better with greyscale images than with binarized ones.

GerHobbelt (Contributor, Author) commented Aug 5, 2023 via email

GerHobbelt (Contributor, Author)

As I dig further into the catacombs of the tesseract sources (RTFC!), I realize that technically the LSTM engine is engineered to accept raw RGB input (3 channels) or, failing that, greyscale input (1 channel); so, from that perspective, the reported issue is NOT a bug but rather a tesseract 4/5 feature.

Meanwhile I find that the same source image is also fed through binarization, producing a pure black-and-white, 1 bit-per-pixel image. That binary image is what the segmentation logic works on when it clips (or 'extracts', in different jargon for the same cutting-out-lines-and-feeding-them-to-the-OCR-engine process) the boxes/segments of detected text lines that go to the LSTM engine. (See the sketch below.)
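To make the two parallel images concrete, here is a minimal, self-contained leptonica sketch; this is not actual tesseract code, and the file name, box coordinates and threshold tile sizes are placeholders:

```cpp
#include <leptonica/allheaders.h>

int main() {
  // Load the original page (any depth); "page.png" is a placeholder.
  PIX *original = pixRead("page.png");

  // Tesseract-style thresholding: produce a 1 bpp binary image.
  PIX *grey = pixConvertTo8(original, FALSE);
  PIX *binary = nullptr;
  pixOtsuAdaptiveThreshold(grey, 2000, 2000, 0, 0, 0.0f, nullptr, &binary);

  // Layout analysis / line segmentation runs on `binary`; here we simply
  // pretend it found a text line at this box.
  BOX *line_box = boxCreate(100, 200, 800, 40);

  // ...but the pixels handed to the LSTM are clipped from `original`,
  // which is exactly the behaviour discussed above.
  PIX *line_for_lstm = pixClipRectangle(original, line_box, nullptr);

  pixDestroy(&line_for_lstm);
  boxDestroy(&line_box);
  pixDestroy(&binary);
  pixDestroy(&grey);
  pixDestroy(&original);
  return 0;
}
```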

The LSTM engine may well have been trained to tolerate noise, but given the actual behaviour on, for example, the rather noisy (yet very legible) wildlife-camera text referenced earlier, I'd argue that the LSTM engine's noise tolerance is only 'good enough' when we feed tesseract very tightly preprocessed scans/images, such as can be expected from a professional book/paper scanning rig.

Anyhow, given my latest insights into tesseract, I feel the currently filed pull request is subpar in quality and should be closed.

Before I do so, I'd love some feedback from @stweil on the following two questions:

1. Procedure: would you like me to file subsequent (re)work on this subject under this same PR, or would you rather see a fresh PR that references this one?

2. The planned rework is to combine this effort with an improved greyscale preprocessing stage, in which the greyscale image is 'filtered' through the thresholding mask (after thickening/dilating/brick-closing it), so that the resulting greyscale image keeps noisy data only under the masked areas, i.e. under the (binarized) text mask. The cruddy OCR results from the wildlife-cam sample are due to nearly invisible pixel noise in the background, far removed from the actual characters; one could argue it is 'adversarial input', as it currently produces an OCR text result that is far off the mark, yet with seemingly sensible, high certainty values. Filtering the raw input through the (dilated) threshold mask would kill this and a lot of other 'weird OCR results' I got for the old and otherwise low-quality book scans I have been testing on.
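A minimal leptonica sketch of the masking idea, assuming an 8 bpp greyscale input and a 1 bpp threshold mask where text is foreground; the 5x5 brick size is an arbitrary starting point, to be tuned:

```cpp
#include <leptonica/allheaders.h>

// Keep greyscale detail only under the (dilated) text mask; push every
// background pixel to pure white so far-away noise never reaches the LSTM.
PIX *MaskFilteredGrey(PIX *grey /* 8 bpp */, PIX *mask /* 1 bpp, fg = text */) {
  // Thicken the text mask so a margin of grey around each glyph survives.
  PIX *dilated = pixDilateBrick(nullptr, mask, 5, 5);

  // Invert: foreground now marks everything *not* near text.
  PIX *background = pixInvert(nullptr, dilated);

  // Whiten all background pixels in a copy of the greyscale image.
  PIX *filtered = pixCopy(nullptr, grey);
  pixSetMasked(filtered, background, 255);

  pixDestroy(&background);
  pixDestroy(&dilated);
  return filtered;  // caller owns
}
```

Hence the questions: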

2a. Do you have a link where I can download some or all of the pages you are working on, so I can compare against your samples as well?

2b. Would you be interested in this work at all?
(I'm of two minds when I try to formulate what tesseract, as a product, wishes/plans to accept as 'input images': while there has been talk of noisy inputs over the years on the mailing list, etc., my current sense is that in actuality tesseract expects input that has been pre-cleaned to the best of one's ability, and that the included binarization and related techniques are rather a, ahem, stop-gap to make tesseract more palatable to 'the general public' at first glance. Lowering the barrier to entry and all that. Legacy code, etc., etc.)

Hm, what I'm looking for, I guess, is a project-lead policy answer... Does tesseract, as a general policy, wish to receive maximum-effort, cleaned-up, black-on-white greyscale scans¹ as source images, or is tesseract's strategy/long-term goal to include an industrial-quality preprocessing stage?

I'm asking because I know where I want to go, but I'm not clear on where tesseract wants to go. It may be my limited skill at comprehending the documentation, but I was unable to find where y'all want to be in X years with this. (And yes, open source is primarily a labor of love, so no harm, no foul when the goals take forever; I'm just lacking that vision from the tesseract core as it stands. 🙏) So @stweil: if you could give me a hint at where you want to take tesseract, I'd be much obliged. That way I'd be much better informed about whether the work I've done, and intend to do, is worth filing a PR for. 🙏😚


Thanks for bearing with me. Cheers,

Ger

Footnotes

  1. Doesn't anyone find it weird that the LSTM models are based on 3-channel RGB inputs, while the same result could be achieved with 16-bits-per-pixel greyscale inputs IFF one trains on, and expects as input, regular scans of black-on-white printed pages anyway? Or am I ignorant, or missing some other elephant in the room? Greyscale at 16 bpp is supported by leptonica AFAICT, and the model would have a third of the input nodes it has now, so it would be (much?) faster, as we don't really use the colour info anyway: all tesseract knows about is black text on white/white-ish backgrounds. Yes, when you feed Tess an RGB image, its pixels go straight into the LSTM (after line cropping), but... the same contrast (and then some) can be had at 16 bpp greyscale, unless I'm overlooking that elephant. 🤔
