Feature request: Add optional input for alternate image to use when sandwiching OCR data #210

Closed
dhendrix opened this issue Feb 5, 2016 · 9 comments · May be fixed by #4171
Comments


dhendrix commented Feb 5, 2016

Hi,
I recently started using tesseract to help unclutter my desk at home, so forgive me if this is a n00b question/request.

I use textcleaner from Fred's ImageMagick Scripts to clean up my scanned images for better OCR accuracy. However, the images that are optimized for OCR do not necessarily look good from a human standpoint, and I would like the final OCR'd PDF to look visually identical to the original scan.

So here's my feature request: Add an optional argument to take a cleaned image. Example invocation: tesseract -l eng -psm 4 --cleaned-image ${SRC}_cleaned.pnm ${SRC}.pnm out pdf

It will use ${SRC}.pnm to generate the final PDF image but layout detection, character recognition, etc. will be done using the --cleaned-image argument for better accuracy. That way the user will be given a final PDF that looks like the original but searches as well as the cleaned-up image.

I'd be surprised if nobody has already thought of this, so maybe work is already underway or maybe it's not possible. Thoughts?

@dhendrix dhendrix changed the title Feature request: Add optional input for alternate image to use when sandwiching hOCR data Feature request: Add optional input for alternate image to use when sandwiching OCR data Feb 5, 2016
Collaborator

amitdo commented Feb 5, 2016

From #83, @zdenko's quote:

If you run:
tesseract OCR.tif ORIGINAL pdf
then ORIGINAL.tif is included in ORIGINAL.pdf WITHOUT any modification.

Contributor

zdenop commented Feb 5, 2016

This should be read as: OCR.tif is included in ORIGINAL.pdf WITHOUT any modification. ;-)
And OCR is run on OCR.tif.
IMO dhendrix wants something else: he wants to run OCR on image_b (improved for OCR), but include image_a (original) in the PDF...

Author

dhendrix commented Feb 5, 2016

zdenop: Yes, that is correct. I want to run OCR on image_b (improved for OCR), but include image_a (original) in the resulting PDF.

BTW - I just realized that there is a user forum (https://groups.google.com/forum/#!forum/tesseract-ocr). Maybe somebody has asked / answered my question there. My apologies for not looking at that forum earlier.

Contributor

jbreiden commented Feb 5, 2016

Please post an example of a cleaned vs uncleaned image where accuracy improves significantly. Or even better, point to some documentation that has some examples. In the long term, one would hope that OCR could improve such that having a separate cleaned image is unnecessary.

Regarding this feature request, I think it is probably better to use an outside utility that can replace the images in a Tesseract-produced PDF. The caller is already generating a separate set of clean images, and is therefore comfortable with pre/post processing. This approach lets us keep the design intent and implementation of PDF generation simple ('don't mess with the images'). I don't know if such a tool exists already, but based on my knowledge of the Tesseract PDF it shouldn't be too hard to write. Apologies, but I am not volunteering to write one unless I need it myself for something. The closest existing thing I know about is OverlayPDF from Apache PDFBox, which was previously mentioned here: https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg11853.html

Also, if you really want to hack Tesseract to do what you are asking for, the code is in api/pdfrenderer.cpp. You would have to replace both the pix and the filename. I'd just be reluctant to make this a general feature of Tesseract.

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L894

Author

dhendrix commented Feb 6, 2016

Hi jbreiden,
There are lots of tutorials online on how to clean up images for improved OCR accuracy; just use your favorite search engine. Here's a good one from 2014 (not too old): http://www.christophermchurch.com/my-struggles-with-ocr-and-microfilm-scans/ . Yes, it would be nice if OCR engines were perfect and didn't need cleaned-up images. But that's currently not a realistic expectation when papers get crumpled, have background images, have shaded regions, are "scanned" using a cell phone camera with poor lighting conditions, etc.

I don't think it would be wise to try to add all that clean-up functionality into tesseract, which is why I'm proposing a solution to take an image that has already been processed externally. The exact intent is to "not mess with the images."

An external tool that could replace the image layer would certainly be good, but I haven't found any (suggestions welcome!). I tried using hocr2pdf to take tesseract's .hocr data from my cleaned image and add it to a PDF with my original image, but ran into a showstopper: when searching for a word in the document, the PDF viewers I tried would highlight the wrong part of the document. Maybe there is a bug in hocr2pdf or I am using it incorrectly. Tesseract already knows how to make a PDF, so doing this inside Tesseract reduces the possibility of an external program interpreting the PDF or hOCR specs differently and ruining the output.

Anyway, thanks for the pointer to api/pdfrenderer.cpp! I might just add the feature locally to address my needs, or maybe try to start a new program based off of it as you suggest.

I don't mind closing this issue if others feel this feature is inappropriate or a better solution is made.

Collaborator

amitdo commented Feb 6, 2016

Maybe this tool could help.
https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf

Contributor

jbreiden commented Feb 7, 2016

Swapping images in a Tesseract PDF is pretty easy for a programmer if the destination images are JPEG or JPEG 2000. It is really just a matter of cutting and pasting the data, then cleaning up the results with qpdf. The hardest part is getting the courage to open up a PDF file and look inside it.
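
For anyone who wants to try the cut-and-paste step, here is a rough Python sketch of what it could look like. It is not taken from Tesseract or any existing tool. The function name swap_first_jpeg is made up, and the sketch assumes the first image in the file is a JPEG stored with /Filter /DCTDecode, that its dictionary carries a direct integer /Length, and that the replacement JPEG has the same pixel dimensions and color space as the one it replaces. Byte offsets in the cross-reference table go stale after the swap, which is the part left to qpdf.

```python
import re


def swap_first_jpeg(pdf_bytes: bytes, new_jpeg: bytes) -> bytes:
    """Replace the data of the first /DCTDecode stream in a PDF (sketch only)."""
    filter_pos = pdf_bytes.find(b"/DCTDecode")
    if filter_pos < 0:
        raise ValueError("no /DCTDecode image stream found")

    # Assumption: the image dictionary has no nested dictionaries, so the
    # nearest "<<" before the filter name opens it and the next "stream"
    # keyword after the filter name starts its data.
    dict_start = pdf_bytes.rfind(b"<<", 0, filter_pos)
    stream_kw = pdf_bytes.find(b"stream", filter_pos)
    if dict_start < 0 or stream_kw < 0:
        raise ValueError("could not locate the image dictionary and stream")

    header = pdf_bytes[dict_start:stream_kw]
    # Accept only a direct integer length, not an indirect one like "12 0 R".
    length = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R\b)", header)
    if length is None:
        raise ValueError("expected a direct integer /Length")

    # Stream data begins after the keyword and its end-of-line marker.
    data_start = stream_kw + len(b"stream")
    if pdf_bytes[data_start:data_start + 2] == b"\r\n":
        data_start += 2
    elif pdf_bytes[data_start:data_start + 1] == b"\n":
        data_start += 1
    else:
        raise ValueError("unexpected bytes after the stream keyword")

    data_end = data_start + int(length.group(1))
    tail = pdf_bytes[data_end:data_end + 16].lstrip(b"\r\n")
    if not tail.startswith(b"endstream"):
        raise ValueError("/Length does not match the stream data")

    # Rewrite /Length for the new data and splice the new JPEG in.
    new_header = (header[:length.start(1)]
                  + str(len(new_jpeg)).encode("ascii")
                  + header[length.end(1):])
    return (pdf_bytes[:dict_start] + new_header
            + pdf_bytes[stream_kw:data_start] + new_jpeg
            + pdf_bytes[data_end:])
```

The result is deliberately not a finished PDF: /Width, /Height and /ColorSpace are left untouched, and every object after the image has moved, so the output still needs the qpdf cleanup pass mentioned above before a viewer can be expected to open it reliably.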

Regarding hOCR and bounding boxes, make sure you have image resolution metadata set correctly everywhere. The hocr-pdf program mentioned above works okay, but is limited to Latin character sets and will also struggle with ligatures in English.
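
To illustrate why the resolution matters for search highlighting, here is a small Python sketch of the usual conversion from an hOCR bounding box to a PDF rectangle. hOCR boxes are in image pixels with the origin at the top left, while default PDF user space is 72 units per inch with the origin at the bottom left. The function name hocr_bbox_to_pdf_rect is made up for this example, and this is not code from Tesseract, hocr2pdf, or hocr-pdf.

```python
import re


def hocr_bbox_to_pdf_rect(title, image_height_px, dpi):
    """Map an hOCR 'bbox x0 y0 x1 y1' to (left, bottom, right, top) in points."""
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        raise ValueError("no bbox found in title: %r" % title)
    x0, y0, x1, y1 = (int(v) for v in match.groups())

    def to_points(pixels):
        # 72 PDF units per inch divided by pixels per inch.
        return pixels * 72.0 / dpi

    page_height = to_points(image_height_px)
    # Flip the vertical axis: hOCR counts down from the top edge,
    # PDF counts up from the bottom edge.
    return (to_points(x0), page_height - to_points(y1),
            to_points(x1), page_height - to_points(y0))


# A word box on a 3300-pixel-tall page scanned at 300 dpi:
rect = hocr_bbox_to_pdf_rect("bbox 300 600 900 660; x_wconf 93", 3300, 300)
# rect is approximately (72.0, 633.6, 216.0, 648.0)
```

If the text layer is built assuming one resolution while the image is placed using another, every rectangle is scaled by the ratio of the two, which would look like a viewer highlighting the wrong part of the page.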

dhendrix added a commit to dhendrix/tesseract that referenced this issue Feb 7, 2016
Fix issue tesseract-ocr#210.

This adds an optional command-line argument to set the image which
will be used when generating a PDF image.

This addresses a niche case where the user wishes to use an optimized
image for OCR but maintain the visual appearance of the original image
when generating a PDF.

Signed-off-by: David Hendricks <davidhendricks@gmail.com>
Author

dhendrix commented Feb 7, 2016

Heh, yeah, I opened up a PDF and saw the stream content for the image and was thinking of how to replace it, but it seemed like there'd be a non-trivial amount of work to get right for a PDF n00b such as myself.

As far as I could tell the image resolution metadata was correct, or at least consistent. Couldn't get hocr-pdf working unfortunately due to some python module dependency that I couldn't find (I installed PyXML but no dice).

In the end I just went ahead and hacked my feature into Tesseract and it works pretty well :-) Here it is if you're interested, though be warned that it's a bit of a kludge in its current state: dhendrix@6cc206f

Thanks for the helpful pointers! Feel free to close if this feature is not desired for upstream, it can live on in my github account.

Contributor

jbreiden commented Feb 7, 2016

The python dependency is reportlab.

@jbreiden jbreiden closed this as completed Feb 7, 2016
@amitdo amitdo added the PDF label May 30, 2016
dhendrix added a commit to dhendrix/tesseract that referenced this issue Sep 4, 2016
Fix issue tesseract-ocr#210.

This adds an optional command-line argument to set the image which
will be used when generating a PDF image.

This addresses a niche case where the user wishes to use an optimized
image for OCR but maintain the visual appearance of the original image
when generating a PDF.

Signed-off-by: David Hendricks <davidhendricks@gmail.com>
dhendrix added a commit to dhendrix/tesseract that referenced this issue Jan 19, 2020
Fix issue tesseract-ocr#210.

This adds an optional command-line argument to set the image which
will be used when generating a PDF image.

This addresses a niche case where the user wishes to use an optimized
image for OCR but maintain the visual appearance of the original image
when generating a PDF.

Signed-off-by: David Hendricks <david.hendricks@gmail.com>
phymbert added a commit to phymbert/tesseract that referenced this issue Dec 18, 2023
Add pdf renderer tests.
Install pdf font in cmake tool chain.

resolves tesseract-ocr#210
resolves tesseract-ocr#3798
phymbert added a commit to phymbert/tesseract that referenced this issue Dec 18, 2023
…ammatically.

Support new rendering_dpi api params.
Add pdf renderer tests.
Install pdf font in cmake tool chain.

resolves tesseract-ocr#210
resolves tesseract-ocr#3798
phymbert added a commit to phymbert/tesseract that referenced this issue Dec 19, 2023
…ammatically.

Support new rendering_dpi api params.
Add pdf renderer tests.
Install pdf font in cmake tool chain.

resolves tesseract-ocr#210
resolves tesseract-ocr#3798
stweil pushed a commit to phymbert/tesseract that referenced this issue Apr 19, 2024
…ammatically

Support new rendering_dpi api params.
Add pdf renderer tests.
Install pdf font in cmake tool chain.

resolves tesseract-ocr#210
resolves tesseract-ocr#3798