Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stuck on find /to-ocr -name '*.pdf' -type f #23

Open
g-simmons opened this issue Jul 11, 2023 · 1 comment
Open

Stuck on find /to-ocr -name '*.pdf' -type f #23

g-simmons opened this issue Jul 11, 2023 · 1 comment

Comments

@g-simmons
Copy link

Hey there, I'm running pd3f on a workstation and accessing via SSH tunnel to my local machine.

I'm using the browser GUI to (hopefully 馃) OCR a book scan with several hundred pages. The book scan is already in PDF format - the file size is around 30MB.

In the log output of the web GUI I am seeing:

INFO:root:setting up ocr
INFO:root:ocr finished successfully
INFO:pd3f.parsr_wrapper:sending PDF to Parsr
INFO:pd3f.parsr_wrapper:got response from Parsr
INFO:pd3f.doc_info:media line width: 174.0
INFO:pd3f.doc_info:median line height: 9.0
INFO:pd3f.doc_info:median line space: 4.159999999999968
INFO:pd3f.doc_info:counter width: [(409.44, 1036), (8, 1036), (409.68, 1014), (410.16, 982), (409.2, 974)]
INFO:pd3f.doc_info:counter height: [(10, 19582), (9, 11277), (8, 10001), (7, 1238), (9.24, 1180)]
INFO:pd3f.doc_info:counter lineheight: [(4.159999999999968, 3830), (4.160000000000025, 2457), (4.159999999999997, 2251), (4.399999999999977, 2118), (2.759999999999991, 1806)]
INFO:pd3f.export:export page #0

It's been at least 20 minute since I started the conversion, so I'm surprised to see the tool is still on page #0.

In the terminal I'm seeing the following:

ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:00] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:01] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:02] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:03] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:04] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:05] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:06] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:07] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:08] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1

I'm new to pd3f but it looks like the ocr worker is stuck in a loop waiting to receive a file?

Any suggestions for troubleshooting are much appreciated.

@rahulkrprajapati
Copy link

I'm facing the same issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants