Tesseract creates hOCR output without text results #4112

stweil · 2023-08-05T15:52:20Z

On some page images full of text, Tesseract does not detect any text when using the default settings. Typically it prints "Empty page!!" twice for such pages. See issue #3021 for details and examples.

In rare cases, Tesseract prints "Empty page!!" only once and finds text in a second pass. That text is written to the ALTO and text output, but the hOCR output does not contain it.

Example:

tesseract https://digi.bib.uni-mannheim.de/periodika/fileadmin/data/DeutReunP_856399094_19140210/max/856399094_1910_035_03.jpg 856399094_1910_035_03 alto hocr txt


stweil · 2023-08-06T14:49:15Z

Tesseract normally runs Recognize from TessBaseAPI::ProcessPage, but most Tesseract renderers also run Recognize themselves unless recognition was already done. The check whether a renderer needs to call Recognize is implemented in two different ways:

TessBaseAPI::GetAltoText, TessBaseAPI::GetTSVText, TessBaseAPI::GetHOCRText, TessBaseAPI::GetLSTMBoxText, TessBaseAPI::GetWordStrBoxText check page_res_ == nullptr.

TessBaseAPI::GetUTF8Text, TessBaseAPI::GetBoxText, TessBaseAPI::GetUNLVText, TessBaseAPI::AllWordConfidences check recognition_done_.

OCR on an "empty" page sets page_res_, but not recognition_done_. Therefore all renderers which check recognition_done_ will trigger an additional OCR pass. Example:

tesseract 'https://ub-backup.bib.uni-mannheim.de/reichsanzeiger/1879-10-01--1914-07-31---001-036/029-1907/0312.jp2' - txt makebox wordstrbox unlv

Output:

Empty page!!
Empty page!!

Empty page!!
Empty page!!

If, for example, TessBaseAPI::GetUTF8Text triggers a second OCR pass and that pass detects text, then all renderers which were processed earlier got no text, while the text renderer and all renderers processed after it output the text detected in the second pass.
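The mismatch between the two checks can be illustrated with a small toy model. This is only a sketch, not Tesseract code: ToyApi, PageRes, GetHocrLike and GetTextLike are hypothetical stand-ins, and the assumption that the first pass finds nothing while the second one finds text is hard-coded to mirror the rare case described above.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for the recognition result; not Tesseract's PAGE_RES.
struct PageRes {
  std::string text;
};

// Toy model of the two inconsistent "was recognition done?" checks.
class ToyApi {
 public:
  int passes = 0;  // number of recognition passes run so far

  // Models Recognize(). Assumption for this sketch: the first pass sees an
  // "empty" page, any later pass detects text.
  void Recognize() {
    ++passes;
    page_res_ = std::make_unique<PageRes>();
    if (passes == 1) {
      return;  // "Empty page!!": page_res_ is set, recognition_done_ is not
    }
    page_res_->text = "detected text";
    recognition_done_ = true;
  }

  // Models the renderers that check page_res_ == nullptr.
  std::string GetHocrLike() {
    if (page_res_ == nullptr) {
      Recognize();
    }
    return page_res_->text;
  }

  // Models the renderers that check recognition_done_.
  std::string GetTextLike() {
    if (!recognition_done_) {
      Recognize();
    }
    return page_res_->text;
  }

 private:
  std::unique_ptr<PageRes> page_res_;
  bool recognition_done_ = false;
};
```

In this model, calling Recognize() once and then GetHocrLike() yields an empty string, while a following GetTextLike() runs a second pass and returns text. A renderer that runs after that second pass sees the text as well.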

amitdo · 2023-08-13T08:21:52Z

"... but all Tesseract renderers also run Recognize conditionally ..."

The pdf renderer does not call Recognize().

stweil · 2023-08-13T09:14:26Z

That's correct, thank you. I updated my comment and replaced "all" with "most".

tfmorris · 2024-01-03T01:28:29Z

It seems odd both that the recognition is not deterministic and that recognition_done_ is not set for an empty page. Is recognition_done_ used in a way where it's important to be able to distinguish an empty page from a non-empty page? It seems like the statement

recognition_done_ = true;

(tesseract/src/api/baseapi.cpp, line 849 in 637be53) could just be moved up a few lines to fix the issue.

A small utility function that all renderers use, so that they perform the check in the same way, might be another improvement.
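Both suggestions can be sketched on a toy model. Nothing here is Tesseract code: FixedToyApi, EnsureRecognized, GetHocrLike and GetTextLike are hypothetical names, and the empty-page condition is passed in as a plain flag for illustration.

```cpp
#include <string>

// Toy model only; all names are hypothetical and not part of Tesseract.
class FixedToyApi {
 public:
  int passes = 0;  // number of recognition passes run so far

  explicit FixedToyApi(bool page_is_empty) : page_is_empty_(page_is_empty) {}

  // Models Recognize() with the flag set before the empty-page early return.
  void Recognize() {
    ++passes;
    recognition_done_ = true;  // moved up: also set for an empty page
    if (page_is_empty_) {
      return;  // "Empty page!!"
    }
    text_ = "detected text";
  }

  // Every getter goes through the same helper, so all of them agree on
  // whether another recognition pass is needed.
  std::string GetHocrLike() {
    EnsureRecognized();
    return text_;
  }

  std::string GetTextLike() {
    EnsureRecognized();
    return text_;
  }

 private:
  // Single shared check used by all getters.
  void EnsureRecognized() {
    if (!recognition_done_) {
      Recognize();
    }
  }

  bool page_is_empty_;
  bool recognition_done_ = false;
  std::string text_;
};
```

In this sketch an empty first pass is final, so every output format gets the same (empty) result and no renderer triggers an extra pass. Whether dropping the second pass is acceptable in the real code depends on the answer to the question above.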

amitdo added the labels "bug" and "output" (issues related to output formats) on Aug 6, 2023
