
fix failure to OCR: general quality issue due to LSTM being fed noisy/crappy *original* image pixels instead of cleaned-up binarized pixels. #4111

Open · wants to merge 1 commit into main

Conversation

GerHobbelt (Contributor)

  • fix Bushnell OCR bug (failure to properly OCR the number "11"; see the message chain on the mailing list: https://groups.google.com/g/tesseract-ocr/c/5jrGvsrdqig/m/jvTG6L9zBgAJ, which includes the sample images, text output and context as originally reported by Astro/Nor):

    root cause:

    it turns out tesseract erroneously grabs the ORIGINAL image (instead of the THRESHOLDED/BINARIZED one!) when extracting the word box (Tesseract::GetRectImage()) that is fed into the LSTM neural net to OCR the detected text area.

    Ergo: this fix SHOULD improve OCR results generally, as this is a generic bug impacting ALL text bboxes found in a given input page image before they are pumped into the LSTM engine to obtain the OCR'ed text.

    This fix was verified to work in an otherwise patched/augmented tesseract rig (GerHobbelt/tesseract, commit series bb37cf3, ffc1997, 15d2952, 69416e5, f49826b, d53c1a2, 44f2f84), where I worked on removing the curious BestPix() API, which was SEEMINGLY meant for ScrollView et al. debug display purposes, but is (IMO) ill-named for that purpose.

  • remove the accompanying, now-obsolete comment

  • also remove the need for the BestPix() API in EquationDetect::PrintSpecialBlobsDensity() by invoking the API that delivers what is actually used there: the image height. BestPix() usage is (theoretically) also wrong here, as the sought-after height is that of the binarized image data, i.e. the cleaned-up-and-ready-for-OCR source image. (A rough sketch of both changes follows this list.)
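For illustration, this is roughly the shape of the change; treat it as a sketch, not the literal diff. The member/function names below (BestPix(), pix_binary_, ImageHeight()) are as I read them in the upstream sources and may differ per version:

```diff
 // Tesseract::GetRectImage(): crop the line/word image that is fed to the LSTM.
-  Image pix = BestPix();      // before: original (noisy) pixels
+  Image pix = pix_binary_;    // after: thresholded/binarized pixels

 // EquationDetect::PrintSpecialBlobsDensity(): only the page height is needed,
 // so the BestPix() call can be dropped entirely.
-  const int height = pixGetHeight(BestPix());
+  const int height = ImageHeight();
```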


Corollary of this bug

Anyone feeding tesseract monochrome (pre-thresholded/binarized) images from an external cleanup+binarization process SHOULD already be getting the best OCR results and SHOULD NOT be impacted by this bug, nor by this fix: in that case there is no difference between the 'original pix' and the 'binary pix' from tesseract's perspective.

stweil (Contributor) commented Aug 5, 2023

I tested this commit on a greyscale newspaper image, and the results are mixed: some lines were recognized better, others got worse. Generally, it is intentional to run the LSTM on original images. Ideally, the neural network has been trained to handle noise and even to work better with greyscale images than with binarized ones.

GerHobbelt (Contributor, Author) commented Aug 5, 2023 via email

GerHobbelt (Contributor, Author)

As I dig further into the catacombs of the tesseract sources (RTFC!), I realize that technically the LSTM engine is engineered to accept raw RGB input (3 channels) or, failing that, greyscale input (1 channel); so, from that perspective, the reported issue is NOT a bug but rather a tesseract 4/5 feature.

Meanwhile I find that the same source image is also fed through binarization, producing a pure black-and-white, 1 bit-per-pixel image. That binary image is what the segmentation logic works on when it clips (or 'extracts', in different jargon for the same cutting-out-lines-and-feeding-them-to-the-OCR-engine process) the boxes/segments of detected text lines that go to the LSTM engine. (See the sketch below.)
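To make the two parallel images concrete, here is a minimal, self-contained leptonica sketch; this is not actual tesseract code, and the file name, box coordinates and threshold tile sizes are placeholders:

```cpp
#include <leptonica/allheaders.h>

int main() {
  // Load the original page (any depth); "page.png" is a placeholder.
  PIX *original = pixRead("page.png");

  // Tesseract-style thresholding: produce a 1 bpp binary image.
  PIX *grey = pixConvertTo8(original, FALSE);
  PIX *binary = nullptr;
  pixOtsuAdaptiveThreshold(grey, 2000, 2000, 0, 0, 0.0f, nullptr, &binary);

  // Layout analysis / line segmentation runs on `binary`; here we simply
  // pretend it found a text line at this box.
  BOX *line_box = boxCreate(100, 200, 800, 40);

  // ...but the pixels handed to the LSTM are clipped from `original`,
  // which is exactly the behaviour discussed above.
  PIX *line_for_lstm = pixClipRectangle(original, line_box, nullptr);

  pixDestroy(&line_for_lstm);
  boxDestroy(&line_box);
  pixDestroy(&binary);
  pixDestroy(&grey);
  pixDestroy(&original);
  return 0;
}
```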

The LSTM engine may well have been trained to tolerate noise, but given the actual behaviour on, for example, the rather noisy (yet very legible) wildlife-camera text referenced earlier, I'd argue that the LSTM engine's noise tolerance is only 'good enough' when we feed tesseract very tightly preprocessed scans/images, such as can be expected from a professional book/paper scanning rig.

Anyhow, given my latest insights into tesseract, I feel the currently filed pull request is subpar in quality and should be closed.

Before I do so, I'd love some feedback from @stweil on the following two questions:

1. Procedure: would you like me to file subsequent (re)work on this subject under this same PR, or would you rather see a fresh PR that references this one?

2. The planned rework is to combine this effort with an improved greyscale preprocessing stage, in which the greyscale image is 'filtered' through the thresholding mask (after thickening/dilating/brick-closing it), so that the resulting greyscale image keeps noisy data only under the masked areas, i.e. under the (binarized) text mask. The cruddy OCR results from the wildlife-cam sample are due to nearly invisible pixel noise in the background, far removed from the actual characters; one could argue it is 'adversarial input', as it currently produces an OCR text result that is far off the mark, yet with seemingly sensible, high certainty values. Filtering the raw input through the (dilated) threshold mask would kill this and a lot of other 'weird OCR results' I got for the old and otherwise low-quality book scans I have been testing on.
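A minimal leptonica sketch of the masking idea, assuming an 8 bpp greyscale input and a 1 bpp threshold mask where text is foreground; the 5x5 brick size is an arbitrary starting point, to be tuned:

```cpp
#include <leptonica/allheaders.h>

// Keep greyscale detail only under the (dilated) text mask; push every
// background pixel to pure white so far-away noise never reaches the LSTM.
PIX *MaskFilteredGrey(PIX *grey /* 8 bpp */, PIX *mask /* 1 bpp, fg = text */) {
  // Thicken the text mask so a margin of grey around each glyph survives.
  PIX *dilated = pixDilateBrick(nullptr, mask, 5, 5);

  // Invert: foreground now marks everything *not* near text.
  PIX *background = pixInvert(nullptr, dilated);

  // Whiten all background pixels in a copy of the greyscale image.
  PIX *filtered = pixCopy(nullptr, grey);
  pixSetMasked(filtered, background, 255);

  pixDestroy(&background);
  pixDestroy(&dilated);
  return filtered;  // caller owns
}
```

Hence the questions: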

2a. Do you have a link where I can download some or all of the pages you are working on, so I can compare against your samples as well?

2b. Would you be interested in this work at all?
(I'm of two minds when I try to formulate what tesseract, as a product, wishes/plans to accept as 'input images': while there has been talk of noisy inputs over the years on the mailing list, etc., my current sense is that in actuality tesseract expects input that has been pre-cleaned to the best of one's ability, and that the included binarization and related techniques are rather a, ahem, stop-gap to make tesseract more palatable to 'the general public' at first glance. Lowering the barrier to entry and all that. Legacy code, etc., etc.)

Hm, what I'm looking for, I guess, is a project-lead policy answer... Does tesseract, as a general policy, wish to receive maximum-effort, cleaned-up, black-on-white greyscale scans¹ as source images, or is tesseract's strategy/long-term goal to include an industrial-quality preprocessing stage?

I'm asking because I know where I want to go, but I'm not clear on where tesseract wants to go. It may be my limited skill at comprehending the documentation, but I was unable to find where y'all want to be in X years with this. (And yes, open source is primarily a labor of love, so no harm, no foul when the goals take forever; I'm just lacking that vision from the tesseract core as it stands. 🙏) So @stweil: if you could give me a hint at where you want to take tesseract, I'd be much obliged. That way I'd be much better informed about whether the work I've done, and intend to do, is worth filing a PR for. 🙏😚


Thanks for bearing with me. Cheers,

Ger

Footnotes

  1. Doesn't anyone find it weird that the LSTM models are based on 3-channel RGB inputs, while the same result could be achieved with 16-bits-per-pixel greyscale inputs IFF one trains on, and expects as input, regular scans of black-on-white printed pages anyway? Or am I ignorant, or missing some other elephant in the room? Greyscale at 16 bpp is supported by leptonica AFAICT, and the model would have a third of the input nodes it has now, so it would be (much?) faster, as we don't really use the colour info anyway: all tesseract knows about is black text on white/white-ish backgrounds. Yes, when you feed Tess an RGB image, its pixels go straight into the LSTM (after line cropping), but... the same contrast (and then some) can be had at 16 bpp greyscale, unless I'm overlooking that elephant. 🤔
