Tesseract creates hOCR output without text results #4112
Comments
Tesseract normally runs
OCR on an "empty" page sets
Output:
If for example
The pdf renderer does not call
That's correct, thank you. I updated my comment and replaced "all" by "most".
It seems odd both that the recognition is not deterministic and that Line 849 in 637be53
could just be moved up a few lines to fix the issue. A small utility function that the renderers can use, so that they all do the check in the same way, might be another improvement.
On some page images full of text Tesseract does not detect any text when using the default settings. Typically it prints
Empty page!!
twice for such pages. See issue #3021 for details and examples.
In some rare cases Tesseract prints
Empty page!!
only once and finds text in a second pass. That text is written to ALTO and text output, but hOCR output does not show that text.
Example:
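Independent of the reporter's example, the mismatch described in this issue (text present in the plain text output but absent from hOCR) could be detected with a small script. The following is only a minimal sketch using the Python standard library. It assumes that recognized words appear in hOCR as elements whose class is `ocrx_word`, as in Tesseract's usual hOCR output. The names `HocrWordCollector`, `hocr_words` and `hocr_is_missing_text` are made up for this illustration and are not part of Tesseract.

```python
from html.parser import HTMLParser


class HocrWordCollector(HTMLParser):
    """Collect the text of every element whose class list contains ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0    # nesting depth inside the current ocrx_word element
        self._chunks = []  # text fragments of the word being collected

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup such as <strong> or <em> inside a word.
            # Assumes well-formed XHTML, i.e. every nested tag is closed.
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "ocrx_word" in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            word = "".join(self._chunks).strip()
            if word:
                self.words.append(word)

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def hocr_words(hocr):
    """Return the list of non-empty words found in an hOCR document string."""
    collector = HocrWordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


def hocr_is_missing_text(hocr, plain_text):
    """True when the plain text output has words but the hOCR output has none."""
    return bool(plain_text.split()) and not hocr_words(hocr)
```

To use it, one would generate both outputs for the same image (for example with the `txt` and `hocr` config files of the `tesseract` command line tool), read the two files, and pass their contents to `hocr_is_missing_text`. Because the recognition is reported to be non-deterministic here, both outputs should come from the same run.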