Disabling page auto-rotations #57

Open
ajab21 opened this issue Oct 16, 2018 · 9 comments

Comments

@ajab21

ajab21 commented Oct 16, 2018

Is there a way to still run Alfresco Simple OCR (with pdfsandwich) on each new document version (so text can continue to be found on the pages) yet disable the auto-rotation portion of the process for versions of the document after 1.0? The business scenario here is to keep manual page rotations (i.e. corrections to improper automatic orientation) from being repeatedly overridden by the automatic processing. Our idea for resolving this is to write logic that checks the document version in order to decide whether to apply auto-rotation. In other words, apply automatic page rotation to the very first version, 1.0, but not to any subsequent version, where manual changes or corrections could have been made. Of course, this depends on whether we’re able to pass a command to Simple OCR and/or pdfsandwich to conditionally disable the auto-rotation portion of the process. Is this possible? If so, do you know the code or command we need in order to achieve this?

Stepping back, we're just wondering if you’ve heard of this problem before and whether you know of any other approaches we may want to consider (instead of the idea described above) to overcome it.

Thank you!


Here's more background:

There are anomalies with some kinds of scanned documents being uploaded where the automation logic is not able to determine the page rotation correctly. Auto-rotation is based on what the process finds on the page and how it believes the text direction should flow. But there are times when pages have text flowing in conflicting directions (i.e. one block of text goes one way and another block goes a different way, not to mention times when the text is handwritten rather than computer-generated). So, when the auto-rotation ends up being incorrect for understandable reasons, the user manually rotates the page and saves the change before adding annotations (via another third-party tool). This results in a new document version in Alfresco, which triggers Simple OCR / pdfsandwich to run once again against the new version. The automatic process then reverses the user’s manual correction and auto-rotates the page back to the incorrect orientation. The next time a user views the document, they see the incorrect rotation again, plus an annotation layer that no longer corresponds to the proper coordinates of the page. At this point, manually rotating the page in the UI document viewer results in the annotation being rotated incorrectly, often to the point of being illegible. The problem is recursive in nature, and any annotations added (as they often will be) make the problem that much worse.
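The version check proposed above could be sketched roughly as follows. This is only an illustration of the decision logic, not part of Alfresco Simple OCR or pdfsandwich: the function name `should_auto_rotate` is hypothetical, and it assumes version labels of the form "major.minor" (such as 1.0 or 1.1). How the result would be passed on to the OCR tool is left open, since that depends on the tool offering a way to switch rotation off.

```python
def should_auto_rotate(version_label: str) -> bool:
    """Return True only for the initial version ("1.0") of a document.

    Hypothetical helper: assumes labels look like "major.minor".
    """
    parts = version_label.strip().split(".")
    try:
        # Unpacking fails with ValueError if there are not exactly two
        # parts, and int() fails with ValueError on non-numeric parts.
        major, minor = (int(p) for p in parts)
    except ValueError:
        # Unrecognised label: be conservative and leave the page
        # orientation alone so a manual correction is never overridden.
        return False
    return (major, minor) == (1, 0)


if __name__ == "__main__":
    for label in ("1.0", "1.1", "2.0"):
        print(label, should_auto_rotate(label))
```

Defaulting to "no rotation" for anything other than 1.0 errs on the side of preserving a user's manual fix.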

@angelborroy-ks
Contributor

Adding the -nopreproc option to the ocr.extra.commands parameter could solve your issue.
Detailed information on pdfsandwich options is available at http://www.tobias-elze.de/pdfsandwich/

@ajab21
Author

ajab21 commented Oct 17, 2018

@angelborroy-ks thanks for your quick reply back.

Unfortunately, -nopreproc is not disabling auto-rotation as expected. We've reached out to Tobias for more insight into the pdfsandwich options, so we hope we can resolve this with his assistance.

If it turns out we need to explore other options, do you have any recommendations on other tools to use for a searchable image layer? As background, we're using tesseract for base OCR at the document metadata level, so the gap we need to fill is allowing users to search for words and jump to the matching pages within a document, with the text layer sitting on top of the page image layer.

Many thanks in advance for your continued guidance!! We're about 1-2 weeks from first production launch of Alfresco, and this issue (plus another issue related to serious pixel loss after pdfsandwich runs) is causing showstopper concern for the project. So, we're scrambling for ideas on how to resolve.

@angelborroy-ks
Contributor

Did you try ocrmypdf? This software includes many different options to deal with Tesseract parameters.

@ajab21
Author

ajab21 commented Oct 17, 2018

No, I haven't, but I was just looking into it in fact. Based on your experience, would you say OCRmyPDF is a more sophisticated tool and may be better suited for our needs given the problem at hand? We saw you mentioned pdfsandwich first in the list in your write-up, so maybe we assumed incorrectly that's the one you had more preference for.

@angelborroy-ks
Contributor

Yes, I suggested OCRmyPDF for your use case because it's more customisable than pdfsandwich. We could say that pdfsandwich is a good basic tool (enough for many users), while OCRmyPDF is an expert tool (which requires more tuning and expertise).

Let me know how it goes.

@ajab21
Author

ajab21 commented Oct 17, 2018

Thanks! Will do.

@ajab21
Author

ajab21 commented Oct 29, 2018

FYI, just an update that OCRmyPDF is working out much better with the addt'l options. Thanks again!

@ajab21 ajab21 closed this as completed Oct 29, 2018
@DEEPAK-KESWANI

DEEPAK-KESWANI commented Nov 26, 2018

Hi,

Please see the attached image, which shows the output PDF becoming more distorted with each ocrmypdf run.

[Attached image: distorted_from_v1 0_to_v1 4]

FYI, we are using the auto-rotate options (--rotate-pages --rotate-pages-threshold 1) only for the first version; for the remaining versions of the PDF, we are not using the auto-rotate options.

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --rotate-pages --rotate-pages-threshold 1 v_1.0.pdf v_1.1.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.1.pdf v_1.2.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.2.pdf v_1.3.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.3.pdf v_1.4.pdf

NOTE:
OCRMyPDF version: 7.0.0

Could you please help me with this?

Also, if I add the --oversample 600 option to the command for each version, it works fine, but the output PDF size increases.

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 --rotate-pages --rotate-pages-threshold 1 v_2.0.pdf v_2.1.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.1.pdf v_2.2.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.2.pdf v_2.3.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.3.pdf v_2.4.pdf
 

Thanks.
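For readers who want to script the command sequence from this comment, here is a minimal sketch that assembles the same ocrmypdf arguments, adding the rotation flags only on the first pass. The function name `build_ocrmypdf_args` and its parameters are hypothetical; the flags are copied from the commands quoted in the comment. The sketch only builds and prints the argument lists and does not run ocrmypdf.

```python
from typing import List, Optional


def build_ocrmypdf_args(
    src: str,
    dst: str,
    auto_rotate: bool,
    oversample_dpi: Optional[int] = None,
) -> List[str]:
    """Assemble an ocrmypdf argument list (hypothetical helper).

    Flags mirror the commands quoted in the thread; nothing is executed.
    """
    args = ["ocrmypdf", "--verbose", "1", "--force-ocr",
            "-l", "eng", "--output-type", "pdf"]
    if oversample_dpi is not None:
        # Optional oversampling, as in the second set of commands.
        args += ["--oversample", str(oversample_dpi)]
    if auto_rotate:
        # Rotation flags are added only when the caller asks for them,
        # i.e. for the first version in the thread's scenario.
        args += ["--rotate-pages", "--rotate-pages-threshold", "1"]
    args += [src, dst]
    return args


if __name__ == "__main__":
    versions = ["v_1.0.pdf", "v_1.1.pdf", "v_1.2.pdf", "v_1.3.pdf", "v_1.4.pdf"]
    for i in range(len(versions) - 1):
        cmd = build_ocrmypdf_args(versions[i], versions[i + 1],
                                  auto_rotate=(i == 0))
        print(" ".join(cmd))
```

Returning a list rather than a single string means the result could be handed directly to `subprocess.run` without shell quoting concerns, should the commands actually be executed.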

@ajab21 ajab21 reopened this Nov 26, 2018
@angelborroy-ks
Contributor

I'm not an OCR expert. You'll probably get better answers from the OCRmyPDF project.
