How to restrict OCR (PDFSandwich) for Searchable Documents (PDF)? #54

Open
DEEPAK-KESWANI opened this issue Oct 15, 2018 · 2 comments

Comments

@DEEPAK-KESWANI

DEEPAK-KESWANI commented Oct 15, 2018

BUG: OCR (PDFSandwich) runs on searchable PDF documents as well.

Expected behavior: OCR should skip documents that already contain text, i.e. searchable files.

Actual behavior: OCR also runs on searchable documents.

Steps to reproduce the behavior: Upload PDF files that already contain text; they are still processed by OCR.

Please help me with this.

Tell us about your environment: Linux

@DEEPAK-KESWANI DEEPAK-KESWANI changed the title How to restrict OCR (PDFSandwich) for Searchable Documents? How to restrict OCR (PDFSandwich) for Searchable Documents (PDF)? Oct 15, 2018
@angelborroy-ks
Contributor

There is no way to be sure whether a PDF document is scanned or searchable. As far as the PDF format is concerned, both are documents and both have text inside.

If you can provide an algorithm or technique to identify a scanned PDF document, we'll include this feature in the addon.

@Manucciu

Hello,
There is a simple JavaScript check to tell whether a PDF already contains OCR text or not 👍

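var transformedPdfFolder = space.createFolder("temp_txt_folder");
var transformedPdfFile = document.transformDocument("text/plain", transformedPdfFolder);

// Any non-whitespace character in the extracted text means the PDF is already searchable.
if (transformedPdfFile && /\S/.test(transformedPdfFile.content)) {
    // already searchable: skip OCR
} else {
    // no extractable text: run OCR
}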
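
A check like the one above can be factored into a small function that ignores whitespace and requires a minimum amount of text, since text extraction from a scanned page may return blank lines or a few stray characters. This is only a sketch: the name `looksSearchable` and the `minChars` threshold are assumptions made for illustration, not part of the addon or of the Alfresco API.

```javascript
// Hypothetical helper: decides from extracted text whether a PDF already has a text layer.
// minChars is an assumed threshold (default 1), not a value taken from the addon.
function looksSearchable(extractedText, minChars) {
    var threshold = (minChars === undefined) ? 1 : minChars;
    if (extractedText === null || extractedText === undefined) {
        return false; // transformation failed or produced nothing
    }
    // Count non-whitespace characters only.
    var visible = String(extractedText).replace(/\s+/g, "");
    return visible.length >= threshold;
}
```

In a script like the one above, it could be called as `looksSearchable(transformedPdfFile.content)`, possibly with a higher threshold if scanned files turn out to yield a few junk characters.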

I would like to do this at the folder level. At the moment, I run the JavaScript and, if the document has no OCR text, move it to another folder, run the OCR there, and then move the document back to the first folder.
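
One possible way to avoid moving documents back and forth is to split a folder's documents into two lists first and only send the ones without text to OCR. The sketch below is hypothetical: `splitByOcrNeed` and the `extractText` callback are names invented here, and the sample objects stand in for real repository nodes (in an Alfresco script the list might come from `space.children` and the text from a plain-text transformation).

```javascript
// Hypothetical helper: splits documents into those that still need OCR and those that do not.
// extractText is a caller-supplied function returning a document's plain text (or null).
function splitByOcrNeed(docs, extractText) {
    var result = { needsOcr: [], searchable: [] };
    for (var i = 0; i < docs.length; i++) {
        var text = extractText(docs[i]);
        // A document counts as searchable if its text has any non-whitespace character.
        var hasText = text !== null && text !== undefined && /\S/.test(String(text));
        if (hasText) {
            result.searchable.push(docs[i]);
        } else {
            result.needsOcr.push(docs[i]);
        }
    }
    return result;
}

// Stand-in data for illustration only.
var sample = [
    { name: "report.pdf", text: "Quarterly results" },
    { name: "scan.pdf", text: "   \n" }
];
var split = splitByOcrNeed(sample, function (d) { return d.text; });
var names = split.needsOcr.map(function (d) { return d.name; });
console.log(names.join(", ")); // prints "scan.pdf"
```

Only the documents in `needsOcr` would then be passed to the OCR action; the rest stay untouched.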

It's not perfect, so if you have a better solution, please share it.

Cheers
