Feature request: Add optional input for alternate image to use when sandwiching OCR data #210

dhendrix · 2016-02-05T07:28:02Z

Hi,
I recently started using tesseract to help unclutter my desk at home, so forgive me if this is a n00b question/request.

I use textcleaner from Fred's ImageMagick Scripts to clean up my scanned images for better OCR accuracy. However, the images that are optimized for OCR do not necessarily look good from a human standpoint, and I would like the final OCR'd PDF to look visually identical to the original scan.

So here's my feature request: Add an optional argument to take a cleaned image. Example invocation:
tesseract -l eng -psm 4 --cleaned-image ${SRC}_cleaned.pnm ${SRC}.pnm out pdf

It will use ${SRC}.pnm to generate the final PDF image but layout detection, character recognition, etc. will be done using the --cleaned-image argument for better accuracy. That way the user will be given a final PDF that looks like the original but searches as well as the cleaned-up image.

I'd be surprised if nobody has already thought of this, so maybe work is already underway or maybe it's not possible. Thoughts?

amitdo · 2016-02-05T08:55:58Z

~~From #83, @zdenko's quote:~~

If you run:
tesseract OCR.tif ORIGINAL pdf
then ORIGINAL.tif is included in ORIGINAL.pdf WITHOUT any modification.

zdenop · 2016-02-05T09:14:16Z

should be read as: OCR.tif is included in ORIGINAL.pdf WITHOUT any modification. ;-)
And OCR is run on OCR.tif.
IMO dhendrix wants something else: he wants to run OCR on image_b (improved for OCR), but include image_a (original) in the PDF...

dhendrix · 2016-02-05T09:16:24Z

zdenop: Yes, that is correct. I want to run OCR on image_b (improved for OCR), but include image_a (original) in the resulting PDF.

BTW - I just realized that there is a user forum (https://groups.google.com/forum/#!forum/tesseract-ocr). Maybe somebody has asked / answered my question there. My apologies for not looking at that forum earlier.

jbreiden · 2016-02-05T19:16:19Z

Please post an example of a cleaned vs uncleaned image where accuracy improves significantly. Or even better, point to some documentation that has some examples. In the long term, one would hope that OCR could improve such that having a separate cleaned image is unnecessary.

Regarding this feature request, I think it is probably better to use an outside utility that can replace the images in a Tesseract-produced PDF. The caller is already generating a separate set of clean images, so is presumably comfortable with pre/post processing. This approach lets us keep the design intent and implementation of PDF generation simple ('don't mess with the images'). I don't know if such a tool exists already, but based on my knowledge of the Tesseract PDF it shouldn't be too hard to write. Apologies, but I am not volunteering to write one unless I need it myself for something. The closest existing thing I know about is OverlayPDF from Apache PDFBox, previously mentioned here: https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg11853.html

Also, if you really want to hack Tesseract to do what you are asking for, the code is in api/pdfrenderer.cpp. You would have to replace both the pix and the filename. I'd just be reluctant to make this a general feature of Tesseract.

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L894

dhendrix · 2016-02-06T10:11:25Z

Hi jbreiden,
There are lots of tutorials online on how to clean up images for improved OCR accuracy; just use your favorite search engine. Here's a good one from 2014 (not too old): http://www.christophermchurch.com/my-struggles-with-ocr-and-microfilm-scans/ . Yes, it would be nice if OCR engines were perfect and didn't need cleaned-up images. But that's currently not a realistic expectation when papers get crumpled, have background images, have shaded regions, are "scanned" using a cell phone camera with poor lighting conditions, etc.

I don't think it would be wise to try to add all that clean-up functionality into tesseract, which is why I'm proposing a solution to take an image that has already been processed externally. The exact intent is to "not mess with the images."

An external tool that could replace the image layer would certainly be good, but I haven't found any (suggestions welcome!). I tried to use hocr2pdf to take tesseract's .hocr data from my cleaned image and add it to a PDF with my original image, but ran into a showstopper: when searching for a word in the document, the PDF viewers I tried would highlight the wrong part of the document. Maybe there is a bug in hocr2pdf or I am using it incorrectly. Tesseract already knows how to make a PDF, so doing this inside Tesseract reduces the possibility of an external program interpreting the PDF or hOCR specs differently and ruining the output.

Anyway, thanks for the pointer to api/pdfrenderer.cpp! I might just add the feature locally to address my needs, or maybe try to start a new program based off of it as you suggest.

I don't mind closing this issue if others feel this feature is inappropriate or a better solution is made.

amitdo · 2016-02-06T12:26:25Z

Maybe this tool could help.
https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf

jbreiden · 2016-02-07T02:14:07Z

Swapping images in a Tesseract PDF is pretty easy for a programmer if the destination images are JPEG or JPEG 2000. It is really just a matter of cutting and pasting the data, then cleaning up the results with qpdf. The hardest part is getting the courage to open up a PDF file and look inside it.
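As a rough illustration of the "cutting and pasting" step, here is a minimal Python sketch using only the standard library. It is not part of Tesseract or qpdf, and the function name swap_first_jpeg is made up for this example. It assumes the image object's dictionary is flat (no nested dictionaries before the filter entry), that /Length is a direct integer rather than an indirect reference, and that the replacement JPEG has the same pixel dimensions and color space as the one it replaces.

```python
import re


def swap_first_jpeg(pdf_bytes: bytes, new_jpeg: bytes) -> bytes:
    """Replace the data of the first /DCTDecode (JPEG) stream in a PDF.

    Hypothetical helper for illustration only. The byte offsets in the
    cross-reference table are NOT updated here.
    """
    # Locate the first stream dictionary that declares JPEG data.
    filt = re.search(rb"/Filter\s*/DCTDecode", pdf_bytes)
    if filt is None:
        raise ValueError("no /DCTDecode image stream found")

    # Assumption: the dictionary is flat, so the nearest "<<" opens it.
    dict_start = pdf_bytes.rfind(b"<<", 0, filt.start())
    keyword = re.compile(rb"stream\r?\n").search(pdf_bytes, filt.end())
    if dict_start < 0 or keyword is None:
        raise ValueError("could not locate the stream for the image object")

    header = pdf_bytes[dict_start:keyword.start()]
    if re.search(rb"/Length\s+\d+\s+\d+\s+R", header):
        raise ValueError("indirect /Length is not handled by this sketch")
    length = re.search(rb"/Length\s+(\d+)", header)
    if length is None:
        raise ValueError("image dictionary has no /Length entry")

    # Use the old /Length to find where the old image data ends.
    data_start = keyword.end()
    data_end = data_start + int(length.group(1))

    # Rewrite /Length to match the replacement data.
    new_header = re.sub(
        rb"/Length\s+\d+", b"/Length %d" % len(new_jpeg), header, count=1
    )
    return (
        pdf_bytes[:dict_start]
        + new_header
        + pdf_bytes[keyword.start():data_start]
        + new_jpeg
        + pdf_bytes[data_end:]
    )
```

The output of a function like this has stale offsets in its cross-reference table, which is where the qpdf cleanup comes in: running something like `qpdf swapped.pdf fixed.pdf` should let qpdf attempt to recover and rewrite the file, though it is worth checking the result in a viewer.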

Regarding hOCR and bounding boxes, make sure you have image resolution metadata set correctly everywhere. The hocr-pdf program mentioned above works okay, but is limited to Latin character sets and will also struggle with ligatures in English.
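To show why the resolution metadata matters, here is a small Python sketch of the usual pixel-to-point conversion. hOCR bounding boxes are in image pixels with the origin at the top-left, while default PDF user space uses 1/72 inch units with the origin at the bottom-left. The function name and the sample numbers are made up for illustration; this is not code from hocr-pdf, hocr2pdf or Tesseract.

```python
def hocr_bbox_to_pdf_rect(bbox, image_height_px, dpi):
    """Convert an hOCR bbox (x0, y0, x1, y1 in pixels, top-left origin)
    to a PDF rectangle (left, bottom, right, top in points, bottom-left origin).
    """
    x0, y0, x1, y1 = bbox
    left = x0 * 72.0 / dpi
    right = x1 * 72.0 / dpi
    # Flip the vertical axis: hOCR y grows downward, PDF y grows upward.
    top = (image_height_px - y0) * 72.0 / dpi
    bottom = (image_height_px - y1) * 72.0 / dpi
    return (left, bottom, right, top)


# A word box on a 3300-pixel-tall scan (US Letter at 300 dpi).
word = (300, 600, 900, 660)
correct = hocr_bbox_to_pdf_rect(word, 3300, dpi=300)
# Same box, but the tool believes the image is 72 dpi.
wrong = hocr_bbox_to_pdf_rect(word, 3300, dpi=72)
```

With the right resolution the box lands inside a 612 x 792 point Letter page. With the wrong one the same pixels map to coordinates several times too large, which is one plausible way to end up with search highlights in the wrong place.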

Fix issue tesseract-ocr#210. This adds an optional command-line argument to set the image which will be used when generating a PDF image. This addresses a niche case where the user wishes to use an optimized image for OCR but maintain the visual appearance of the original image when generating a PDF. Signed-off-by: David Hendricks <davidhendricks@gmail.com>

dhendrix · 2016-02-07T04:14:50Z

Heh, yeah, I opened up a PDF and saw the stream content for the image and was thinking of how to replace it, but it seemed like there'd be a non-trivial amount of work to get right for a PDF n00b such as myself.

As far as I could tell the image resolution metadata was correct, or at least consistent. Couldn't get hocr-pdf working unfortunately due to some Python module dependency that I couldn't find (I installed PyXML but no dice).

In the end I just went ahead and hacked my feature into Tesseract and it works pretty well :-) Here it is if you're interested, though be warned that it's a bit of a kludge in its current state: dhendrix@6cc206f

Thanks for the helpful pointers! Feel free to close if this feature is not desired for upstream, it can live on in my github account.

jbreiden · 2016-02-07T23:00:09Z

The python dependency is reportlab.

…ammatically. Support new rendering_dpi api params. Add pdf renderer tests. Install pdf font in cmake tool chain. resolves tesseract-ocr#210 resolves tesseract-ocr#3798

dhendrix changed the title ~~Feature request: Add optional input for alternate image to use when sandwiching hOCR data~~ Feature request: Add optional input for alternate image to use when sandwiching OCR data Feb 5, 2016

jbreiden closed this as completed Feb 7, 2016

amitdo added the PDF label May 30, 2016

amitdo added the feature request label Apr 27, 2022

amitdo mentioned this issue Apr 27, 2022

Add option to use demo.processed.tif image to create the demo.pdf #3798

Open

phymbert added a commit to phymbert/tesseract that referenced this issue Dec 18, 2023

PDF Renderer: allow caller to specify an alternate image or resolution.

5f1e7d5

Add pdf renderer tests. Install pdf font in cmake tool chain. resolves tesseract-ocr#210 resolves tesseract-ocr#3798

phymbert mentioned this issue Dec 19, 2023

PDF Renderer: allow to specify an alternate image or a custom resolution. #4171

Open

2 tasks
