Characters assigned to wrong RIL_WORD block, 0 % confidence. #4175

Open
MK-3PP opened this issue Jan 1, 2024 · 5 comments

Comments

@MK-3PP

MK-3PP commented Jan 1, 2024

Current Behavior

Upon recognition with the tessdata 4.1.0 eng.traineddata language model, characters that are spaced a bit too far apart sometimes get sorted into two words (false whitespace), depending on the ROI that was set with TessBaseAPI::SetRectangle.
This much is an expected edge case, since segmentation differs with the placement of the ROI. The thresholding method is Tiled Sauvola.
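For illustration, a minimal sketch of how such a ROI is applied (the coordinates are placeholders, not the values used here):

#include "leptonica/allheaders.h"
#include "tesseract/baseapi.h"

// Minimal sketch: restrict recognition to a region of interest.
// Segmentation (and thus the word splits described above) depends on
// the placement of this rectangle.
void recognize_roi(tesseract::TessBaseAPI& tess, Pix* image) {
    tess.SetImage(image);
    tess.SetRectangle(100, 50, 400, 120); // left, top, width, height (placeholders)
    tess.Recognize(nullptr);
}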

The recognized characters, however, do not seem to be assigned to the correct word blocks (RIL_WORD):
[screenshot: word-level recognition result]

  • Note how the recognized words are "20M" and "110210A" instead of the clearly boxed substrings "29M1" and "10210A" in the image (disregard the '9' being misread as '0').
  • Note that the confidence is 0.0 % for each of the words.

Below you can see the full image and the ROI (orange rectangle). The black area in the center is a customer logo that had to be redacted for uploading; during OCR, this area was not black but printed like the rest of the text.
[screenshot: full image with ROI]

Changing the ROI slightly (moving the bottom-right corner a bit more outward) removes the split of the "29M110210A" line, though the confidence is low (22.3 %).
[screenshot: result after slightly enlarged ROI]

Side note:
An additional "0" has suddenly sneaked in, making it "290M110210A"!
Layout analysis (TessBaseAPI::AnalyseLayout) shows a tiny fragment inside the "M". Is that the "0"?
[screenshot: layout analysis detail]
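For illustration, a sketch of how such layout fragments can be listed; iterating at word level is my own choice here, and an unusually small box should stand out:

#include "tesseract/baseapi.h"
#include <cstdio>
#include <memory>

// Sketch: dump the bounding boxes found by layout analysis to spot
// tiny fragments like the one inside the "M". Assumes an image has
// already been set on tess.
void dump_layout(tesseract::TessBaseAPI& tess) {
    std::unique_ptr<tesseract::PageIterator> it(tess.AnalyseLayout());
    if (nullptr == it)
        return;
    do {
        int l, t, r, b;
        if (it->BoundingBox(tesseract::PageIteratorLevel::RIL_WORD, &l, &t, &r, &b))
            std::printf("box %d x %d at (%d, %d)\n", r - l, b - t, l, t);
    } while (it->Next(tesseract::PageIteratorLevel::RIL_WORD));
}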

Changing the ROI yet a bit more (moving the bottom-right corner still further outward) then gives a good confidence (80.4 %).
[screenshot: result after further enlarged ROI]

I thought I might be observing an edge case where segmentation flips from OK to really bad, but that does not seem to be the case. TessBaseAPI::GetThresholdedImage gives these nearly identical images for the bad and the good case:
[screenshot: thresholded image, bad case]
[screenshot: thresholded image, good case]

Expected Behavior

This should read either "20M1" and "10210A" or "20M110210A" (again disregarding the '9' misread as '0'), and not have zero confidence.
[screenshot: expected result]

Suggested Fix

No response

tesseract -v

tesseract 5.3.2
leptonica-1.83.1 (Oct 27 2023, 14:15:52) [MSC v.1929 LIB Release x64]
(null)
Found AVX2
Found AVX
Found FMA
Found SSE4.1

Operating System

Windows 10

Other Operating System

No response

uname -a

No response

Compiler

MSVC 16.11.26

CPU

No response

Virtualization / Containers

No response

Other Information

C++ application using TessBaseAPI. Custom GDI visualization, hence only cropped screenshots with slightly differing alignment.

@zdenop
Contributor

zdenop commented Jan 13, 2024

Please provide input images and example C++ code that demonstrate your problem.

@MK-3PP
Author

MK-3PP commented Jan 30, 2024

Input image

[input image]

Code

#include "leptonica/allheaders.h"
#include "leptonica/pix_internal.h"
#include "tesseract/baseapi.h"
#include "opencv2/imgcodecs.hpp"
#include "opencv2/imgproc.hpp"
#include <memory>

int main() {
    cv::Mat in_img = cv::imread("./input.png", cv::ImreadModes::IMREAD_GRAYSCALE);
    tesseract::TessBaseAPI tess;

    // Set tesseract parameters.
    tess.Init(".", "eng");
    tess.SetVariable("thresholding_method", "2"); // Tiled Sauvola
    tess.SetPageSegMode(tesseract::PageSegMode::PSM_SINGLE_BLOCK);
    tess.SetImage(in_img.data, in_img.cols, in_img.rows, in_img.channels(), static_cast<int>(in_img.step1()));

    // Output thresholded image.
    std::unique_ptr<Pix, void(*)(Pix*)> thrs_pix(tess.GetThresholdedImage(), [](Pix* val) { pixDestroy(&val); });
    cv::Mat out_img(cv::Size(thrs_pix->w, thrs_pix->h), CV_8UC1);
    for (uint32_t y = 0; y < thrs_pix->h; ++y) {
        for (uint32_t x = 0; x < thrs_pix->w; ++x) {
            l_uint32 val;
            if (0 == pixGetPixel(thrs_pix.get(), x, y, &val)) {
                out_img.at<unsigned char>(y, x) = val ? 255 : 0;
            }
        }
    }
    cv::cvtColor(out_img, out_img, cv::COLOR_GRAY2BGR); // prepare colored output image

    // Perform recognition.
    if (0 != tess.Recognize(nullptr)) // Recognize() returns 0 on success.
        return 1;

    std::unique_ptr<tesseract::ResultIterator> res_iter(tess.GetIterator());

    if (nullptr == res_iter)
        return 2;

    // Extract image information. Generate output image for symbols and words.
    for (auto block_level : { tesseract::PageIteratorLevel::RIL_SYMBOL , tesseract::PageIteratorLevel::RIL_WORD }) {
        cv::Mat curr_img;
        cv::cvtColor(in_img, curr_img, cv::COLOR_GRAY2BGR); // prepare colored current image
        res_iter->Begin();

        do {
            // Only text blocks.
            if (PTIsTextType(res_iter->BlockType())) {
                cv::Point2i p1, p2;

                if (res_iter->BoundingBox(block_level, &p1.x, &p1.y, &p2.x, &p2.y)) {
                    // Draw bounding box.
                    cv::rectangle(curr_img, cv::Rect(p1, p2), cv::Scalar(0, 255, 0));

                    // Prepare text output.
                    const int font = cv::HersheyFonts::FONT_HERSHEY_PLAIN;
                    cv::Size text_size;

                    // Write confidence.
                    std::stringstream conf;
                    conf.precision(0);
                    conf << std::fixed << res_iter->Confidence(block_level) << '%';
                    text_size = cv::getTextSize(conf.str(), font, 1.0, 1, nullptr);
                    cv::putText(curr_img, conf.str(), cv::Point2i(p2.x - text_size.width - 2, p2.y - 2), font, 1.0, cv::Scalar(255, 100, 0));

                    // Write detected text (OpenCV only renders ASCII, but close enough).
                    std::unique_ptr<const char[]> raw_text(res_iter->GetUTF8Text(block_level));
                    if (raw_text != nullptr) {
                        text_size = cv::getTextSize(raw_text.get(), font, 1.0, 1, nullptr);
                        cv::putText(curr_img, raw_text.get(), cv::Point2i(p1.x + 2, p1.y + text_size.height + 2), font, 1, cv::Scalar(0, 0, 255));
                    }
                }
            }
        } while (res_iter->Next(block_level));

        // Stack current image on top of output image.
        cv::vconcat(curr_img, out_img, out_img);
    }

    cv::imwrite("./output.png", out_img);

    return 0;
}

Output

[output image]

Remarks

The program above reproduces the error shown in the original issue post, but as a self-contained program; hence coloring, fonts etc. differ.
The output consists of three stacked, augmented versions of the input image:

  • Recognized words
  • Recognized symbols
  • Thresholded image (visual proof of what Tesseract actually works on)

Each word or symbol comes with its bounding box (green), the recognized text (red), and the confidence (blue).

Dependencies

  • Tesseract
  • Leptonica
  • OpenCV

Setup

To execute the program, put the input image into the executable's current directory as "input.png".
You also need the English language model from here in the same folder.
The output will be saved as "output.png" in the same folder.

Discussion

As you can see in the output image provided, the word "29M1" is recognized as "29M" with 0 % confidence, even though it consists of the three characters '2', '9' and 'M' with above 90 % confidence each. The 'M' is a misdetection of the actually printed "M1".

Noticeably, the next character might be what breaks things: the first '1' of "10210A" is detected as three different symbols, '1', '1' and 'T', where the glitched '1' and 'T' seem to share the exact same location. They have a taller bounding box than the neighboring characters but are only 1 px wide. It seems those glitched symbols break the word "29M110210A", divide it into two parts, and consequently set both confidences to zero.
Detail shot from our customer application (I can zoom in there, but the boxes are drawn 0.5 pixels off; it is just a quick debug view):
[screenshot: detail of overlapping bounding boxes]

And just for funsies, on the left side the word "paper" is recognized from random cracks, with 16 % confidence, which is infinitely more than the 0 % for the second line of the actual printed text.
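To log this discrepancy directly, here is a small sketch (my addition, using the same iterator API as the program above) that prints each word's confidence next to its per-symbol confidences:

#include "tesseract/baseapi.h"
#include <cstdio>
#include <memory>

// Sketch: for each word, print the word confidence followed by the
// confidence of every symbol belonging to it. Assumes Recognize() has
// already run on tess.
void dump_word_vs_symbol(tesseract::TessBaseAPI& tess) {
    std::unique_ptr<tesseract::ResultIterator> it(tess.GetIterator());
    if (nullptr == it)
        return;
    do {
        std::unique_ptr<const char[]> word(it->GetUTF8Text(tesseract::PageIteratorLevel::RIL_WORD));
        std::printf("word '%s' %.1f%%:", word ? word.get() : "?", it->Confidence(tesseract::PageIteratorLevel::RIL_WORD));
        // Walk the symbols that belong to this word.
        do {
            std::unique_ptr<const char[]> sym(it->GetUTF8Text(tesseract::PageIteratorLevel::RIL_SYMBOL));
            std::printf(" '%s' %.1f%%", sym ? sym.get() : "?", it->Confidence(tesseract::PageIteratorLevel::RIL_SYMBOL));
        } while (!it->IsAtFinalElement(tesseract::PageIteratorLevel::RIL_WORD, tesseract::PageIteratorLevel::RIL_SYMBOL)
                 && it->Next(tesseract::PageIteratorLevel::RIL_SYMBOL));
        std::printf("\n");
    } while (it->Next(tesseract::PageIteratorLevel::RIL_WORD));
}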

@zdenop
Contributor

zdenop commented Feb 20, 2024

I just manually preprocessed the image based on the documentation:

[preprocessed input image]

and the result is:

tesseract input4175p.png -
9200795018 -
20M110210A

=>

  • tesseract is (usually) not suitable for text detection
  • tesseract is an OCR engine; for good output it needs a good input image.
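For anyone who wants to script that kind of manual preprocessing, a rough OpenCV sketch of the general idea; the crop rectangle is a placeholder, and this is only a guess at the preprocessing steps, not necessarily what was done here:

#include "opencv2/imgcodecs.hpp"
#include "opencv2/imgproc.hpp"

// Rough sketch: crop to the label, upscale, and binarize with Otsu so
// that Tesseract gets clean text on a uniform background.
int main() {
    cv::Mat img = cv::imread("./input.png", cv::ImreadModes::IMREAD_GRAYSCALE);
    cv::Mat label = img(cv::Rect(100, 100, 400, 150)).clone(); // placeholder crop
    cv::Mat big, bin;
    cv::resize(label, big, cv::Size(), 2.0, 2.0, cv::INTER_CUBIC); // upscale 2x
    cv::threshold(big, bin, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    cv::imwrite("./input4175p.png", bin);
    return 0;
}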

@MK-3PP
Author

MK-3PP commented Feb 20, 2024

Thank you. As you guessed, text detection is what we aimed for.

Just to re-emphasize: I was thrown off neither by the random junk detected outside the obvious text label nor by the inserted blank between '1' and '1'.

What caught my attention was that

  • "M1" became "M"
  • "1" became "11" (and this was not a '1' being carried over the blank, it was a coincidentally occuring actual '1' that was detected with a very deformed bounding box)
  • The confidence dropped to 0 %
  • and the broken overlapping bounding boxes left of the second '1' glyph in the second line.
    And all that while the same image rotated 1 ° or 2 ° to the left or right yielded OK results.

I think this is dangerous: there is a continuous sweep of angles over which the image can be rotated with good results, and then, amidst those, a discontinuity where obvious recognition artifacts ruin the result.
Even for non-optimal inputs, the results should not glitch out like that.

But I understand: there is machine learning behind the scenes, and those models tend to have that kind of discontinuity issue.

@MK-3PP
Author

MK-3PP commented Feb 20, 2024

One last question:

Do you have any educated guess on why this is happening?

[screenshot: 'a' recognized in a black area of the thresholded image]

As far as I understand the documentation, the image acquired by GetThresholdedImage() is the actual image presented to the OCR. How can a character, 'a', be recognized in a pitch-black area without a single white pixel?

To me this looks as if the character recognition model has not been trained with empty images as part of the rejection class(es).
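That suspicion can be checked programmatically; a quick sketch (box coordinates are placeholders for the reported 'a' box) that counts foreground pixels of the thresholded image inside a bounding box:

#include "leptonica/allheaders.h"
#include "tesseract/baseapi.h"

// Sketch: count foreground (text) pixels of the thresholded image
// inside a bounding box. In the 1 bpp output of GetThresholdedImage(),
// set bits are the foreground pixels rendered white above.
int count_foreground(tesseract::TessBaseAPI& tess, int left, int top, int right, int bottom) {
    Pix* pix = tess.GetThresholdedImage();
    int count = 0;
    for (int y = top; y < bottom; ++y) {
        for (int x = left; x < right; ++x) {
            l_uint32 val;
            if (0 == pixGetPixel(pix, x, y, &val) && val != 0)
                ++count;
        }
    }
    pixDestroy(&pix);
    return count; // 0 would confirm a truly empty region
}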
