Can't OCR a pdf file #73

Open
bit-man opened this issue May 1, 2024 · 2 comments

@bit-man
Contributor

bit-man commented May 1, 2024

Uploading a PDF file and trying to OCR it (method: simple, format: txt) by pressing the Convert into Document button opens a new tab with the error Not Found, and no file is downloaded.

In the Docker console, the error shown is:

172.17.0.1 - - [01/May/2024:20:26:37 +0000] "GET /HRProprietary/HRConvert2/DATA/856ca1146d63/7f10275ffce6/m1m2.txt HTTP/1.1" 404 489 "http://localhost:8080/HRProprietary/HRConvert2/convertCore.php?showFiles=1&gui=Default&language=en&color=blue" "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"

Tailing the txt log in the Logs folder shows:

Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Initiating Converter.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: User selected to perform OCR on file m1m2.pdf.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Copying file m1m2.pdf to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/b0806464b510/m1m2.pdf.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Copied file m1m2.pdf.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Verified file /DATA/HRConvert2/856ca1146d63/b0806464b510/m1m2.txt.
ERROR!!! May 1, 2024, 8:36 pm, HRConvert2-22, 856ca1146d63/b0806464b510: OCR Operation Failed!
@bit-man
Contributor Author

bit-man commented May 1, 2024

I tried to follow the code in convertCore.php, and the failing code seems to be at if (!in_array(strtolower($oldExtension), $pdf1array)) . This condition evaluates to false, so no conversion is attempted. That makes no sense to me, because this is supposed to be the code that converts a PDF to a document, as stated by the comment on the previous line.

I stripped the negation, and now a file is downloaded, but it is empty 😢 . Still not working.
The log output follows:

Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Initiating Converter.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: User selected to perform OCR on file m1m2.pdf.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Copying file m1m2.pdf to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.pdf.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Copied file m1m2.pdf.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Verified file /DATA/HRConvert2/856ca1146d63/1029442e5485/m1m2.txt.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Performing OCR intermediate operation using method 0.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Converted file /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.jpg to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.txt.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Performing OCR final using method 0.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Renamed file /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.pdf to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.txt.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Created a file at /DATA/HRConvert2/856ca1146d63/1029442e5485/m1m2.txt.

No time today to follow up. I will try over the weekend or later. Happy if anyone else can continue from here.
I added this change to https://github.com/bit-man/HRConvert2 in case anyone wants to try a fix.
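
For anyone continuing from here, it may help to tell the two symptoms in this thread apart from inside the container: a 404 suggests the output file was never created, while an empty download suggests it was created with no content. The following is only a minimal sketch; `check_output` is a made-up helper name, not part of HRConvert2.

```shell
#!/bin/sh
# Hypothetical helper (not part of HRConvert2): report whether a
# converter output file is missing, empty, or has content.
check_output() {
    if [ ! -e "$1" ]; then
        echo "missing"   # matches the "Not Found" symptom
    elif [ ! -s "$1" ]; then
        echo "empty"     # matches the "empty download" symptom
    else
        echo "ok"
    fi
}
```

It could be called with one of the paths from the log, for example check_output /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.txt, keeping in mind that the session directories change on every run.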

@zelon88
Owner

zelon88 commented May 22, 2024

Sorry for the delayed response. Can you try the following.....

sudo leafpad /etc/ImageMagick-6/policy.xml

Find and edit the following line.....

<policy domain="coder" rights="none" pattern="PDF" />

.....To.....

<policy domain="coder" rights="read|write" pattern="PDF" />

And let me know the result.
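
Since leafpad is a graphical editor and may not be available inside a Docker container, the same change can probably be made from the command line. This is a sketch only: it assumes the policy line is formatted exactly as quoted above, and the function names `show_pdf_policy` and `allow_pdf_policy` are illustrative. The file path is passed as an argument so it can be tried on a copy first.

```shell
#!/bin/sh
# Print the PDF coder policy line(s) from an ImageMagick policy file.
show_pdf_policy() {
    grep 'pattern="PDF"' "$1"
}

# On lines mentioning pattern="PDF" only, switch rights="none" to
# rights="read|write". A backup of the original is kept as <file>.bak.
# Assumes the line looks exactly like the one quoted in this thread.
allow_pdf_policy() {
    sed -i.bak '/pattern="PDF"/ s/rights="none"/rights="read|write"/' "$1"
}
```

Running show_pdf_policy /etc/ImageMagick-6/policy.xml before and after allow_pdf_policy on the same path (as root) should show whether the line changed. If the file lives elsewhere in your image, adjust the path accordingly.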
