Tesseract creates hOCR output without text results #4112

Open
stweil opened this issue Aug 5, 2023 · 4 comments
Labels
bug output issues related output formats

Comments

@stweil
Contributor

stweil commented Aug 5, 2023

On some page images full of text, Tesseract does not detect any text when using the default settings. Typically it prints "Empty page!!" twice for such pages. See issue #3021 for details and examples.

In some rare cases Tesseract prints "Empty page!!" only once and finds text in a 2nd pass. That text is written to the ALTO and text output, but the hOCR output does not contain it.

Example:

tesseract https://digi.bib.uni-mannheim.de/periodika/fileadmin/data/DeutReunP_856399094_19140210/max/856399094_1910_035_03.jpg 856399094_1910_035_03 alto hocr txt
@stweil
Contributor Author

stweil commented Aug 6, 2023

Tesseract normally runs Recognize from TessBaseAPI::ProcessPage, but most Tesseract renderers also run Recognize themselves unless recognition was already done. The test whether the renderer should call Recognize is implemented in two different ways:

TessBaseAPI::GetAltoText, TessBaseAPI::GetTSVText, TessBaseAPI::GetHOCRText, TessBaseAPI::GetLSTMBoxText, TessBaseAPI::GetWordStrBoxText check page_res_ == nullptr.

TessBaseAPI::GetUTF8Text, TessBaseAPI::GetBoxText, TessBaseAPI::GetUNLVText, TessBaseAPI::AllWordConfidences check recognition_done_.

OCR on an "empty" page sets page_res_, but not recognition_done_. Therefore all renderers which check recognition_done_ will trigger an additional OCR pass. Example:

tesseract 'https://ub-backup.bib.uni-mannheim.de/reichsanzeiger/1879-10-01--1914-07-31---001-036/029-1907/0312.jp2' - txt makebox wordstrbox unlv

Output:

Empty page!!
Empty page!!

Empty page!!
Empty page!!

If, for example, TessBaseAPI::GetUTF8Text triggers a 2nd OCR pass and that pass detects text, then all renderers which were processed earlier get no text, while the text renderer and all renderers which are processed after it output the text detected in the 2nd pass.
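
The mismatch between the two checks can be reproduced in a toy model. The sketch below is not Tesseract source code: `MockApi`, `PageResults`, `RunRenderers` and the renderer order are hypothetical, and the "pass 1 finds nothing, pass 2 finds text" behaviour is hard-coded to mimic the rare case described above. Only the member names `page_res_` and `recognition_done_` are taken from the discussion.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Toy model only: none of this is Tesseract source code.
struct PageResults {
  std::string text;  // an empty string models an "Empty page!!" result
};

class MockApi {
 public:
  // Assumption: pass 1 finds nothing, pass 2 finds text.
  int Recognize() {
    ++passes_;
    page_res_ = std::make_unique<PageResults>();
    if (passes_ == 1) {
      return 0;  // "empty page": early return, recognition_done_ stays false
    }
    page_res_->text = "text from pass 2";
    recognition_done_ = true;
    return 0;
  }

  // Check style 1: run OCR only if page_res_ is still a null pointer.
  std::string RenderCheckingPageRes() {
    if (page_res_ == nullptr) Recognize();
    assert(page_res_ != nullptr);
    return page_res_->text;
  }

  // Check style 2: run OCR only if recognition_done_ is still false.
  std::string RenderCheckingDoneFlag() {
    if (!recognition_done_) Recognize();
    assert(page_res_ != nullptr);
    return page_res_->text;
  }

  int passes() const { return passes_; }

 private:
  std::unique_ptr<PageResults> page_res_;
  bool recognition_done_ = false;
  int passes_ = 0;
};

// Runs the initial OCR pass and then three renderers in a hypothetical order.
std::vector<std::string> RunRenderers(MockApi &api) {
  api.Recognize();  // initial pass, as done before the renderers run
  std::vector<std::string> out;
  out.push_back(api.RenderCheckingPageRes());   // sees the empty result
  out.push_back(api.RenderCheckingDoneFlag());  // triggers the 2nd pass
  out.push_back(api.RenderCheckingPageRes());   // sees the 2nd-pass result
  return out;
}
```

Calling `RunRenderers` on a fresh `MockApi` gives an empty string for the first renderer and the 2nd-pass text for the other two, which is the same kind of inconsistency as described above.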

@amitdo amitdo added bug output issues related output formats labels Aug 6, 2023
@amitdo
Collaborator

amitdo commented Aug 13, 2023

..but all Tesseract renderers also run Recognize conditionally...

The pdf renderer does not call Recognize().

@stweil
Contributor Author

stweil commented Aug 13, 2023

That's correct, thank you. I updated my comment and replaced "all" by "most".

@tfmorris
Contributor

tfmorris commented Jan 3, 2024

It seems odd both that recognition is not deterministic and that recognition_done_ is not set for an empty page. Is recognition_done_ used in a way where it's important to be able to distinguish an empty page from a non-empty page? It seems like

recognition_done_ = true;

could just be moved up a few lines to fix the issue.

A small utility function that the renderers can use, so that they all do the check in the same way, might be another improvement.
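
A rough sketch of how the two suggestions could fit together, again as a toy model and not as a patch against the Tesseract sources: `MockApiFixed`, `EnsureRecognized`, `RenderA` and `RenderB` are hypothetical names, and whether setting `recognition_done_` for empty pages has side effects elsewhere in Tesseract is exactly the open question raised above.

```cpp
#include <memory>
#include <string>

// Toy model only: names mirror the discussion, but this is not Tesseract code.
struct PageResults {
  std::string text;
};

class MockApiFixed {
 public:
  explicit MockApiFixed(bool page_is_empty) : page_is_empty_(page_is_empty) {}

  int Recognize() {
    ++passes_;
    page_res_ = std::make_unique<PageResults>();
    recognition_done_ = true;  // moved up: set before the empty-page return
    if (page_is_empty_) {
      return 0;  // empty page: no text, but recognition counts as done
    }
    page_res_->text = "recognized text";
    return 0;
  }

  // Hypothetical shared helper: the only place that decides whether to run OCR.
  bool EnsureRecognized() {
    if (recognition_done_) return true;
    return Recognize() >= 0;
  }

  // Two stand-in renderers; both rely on the same check.
  std::string RenderA() { return EnsureRecognized() ? page_res_->text : ""; }
  std::string RenderB() { return EnsureRecognized() ? page_res_->text : ""; }

  int passes() const { return passes_; }

 private:
  bool page_is_empty_;
  std::unique_ptr<PageResults> page_res_;
  bool recognition_done_ = false;
  int passes_ = 0;
};
```

In this model an empty page is recognized exactly once, no matter how many renderers run afterwards, so every renderer sees the same result.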
