Add parallel processing to OCR text extraction of full documents #124

ntodd · 2014-12-18T22:51:55Z

Leverage the GNU Parallel tool to OCR multiple pages in parallel. If Parallel is installed, a full document extraction will generate an image for each page and then spawn a tesseract process for each available core. If Parallel is not installed, or only a subset of pages is requested, the old behavior will be used. This speeds up OCR processing significantly on multi-core machines.

With a bit more work, this could be leveraged by the other OCR code paths.

Use GNU Parallel if installed to parallelize tesseract OCR on full document text extraction. If Parallel is not installed, use previous behavior.
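
For readers who want a feel for the control flow described above, here is a minimal Python sketch. It is not the code in this pull request. The function names `plan_ocr` and `run_ocr`, the `.tif` page images, and the assumption that a `parallel` binary found on `PATH` is GNU Parallel are all illustrative. It relies only on GNU Parallel's `{}` and `{.}` replacement strings, the `:::` argument separator, and the `tesseract <image> <outputbase>` calling convention.

```python
import os
import shutil
import subprocess
from typing import List, Optional, Sequence


def plan_ocr(
    page_images: Sequence[str],
    pages: Optional[Sequence[int]] = None,
    parallel_path: Optional[str] = None,
) -> List[List[str]]:
    """Return the command lines needed to OCR the given page images.

    Hypothetical helper: `pages` is None for a full-document extraction,
    and `parallel_path` is the location of GNU Parallel, or None if absent.
    """
    if parallel_path and pages is None and page_images:
        # Full document and Parallel available: one invocation fans the
        # pages out across cores (GNU Parallel defaults to one job per core).
        # {} is the input file; {.} is the input file without its extension,
        # which tesseract uses as the output base name.
        return [[parallel_path, "tesseract", "{}", "{.}", ":::", *page_images]]

    # Fallback (previous behavior): one tesseract command per page, run in order.
    return [
        ["tesseract", image, os.path.splitext(image)[0]]
        for image in page_images
    ]


def run_ocr(page_images: Sequence[str], pages: Optional[Sequence[int]] = None) -> None:
    """Execute the planned commands, detecting Parallel on PATH.

    Assumes that a `parallel` executable on PATH is GNU Parallel.
    """
    for command in plan_ocr(page_images, pages, shutil.which("parallel")):
        subprocess.run(command, check=True)
```

Separating planning from execution keeps the branch between the parallel path and the fallback easy to inspect without tesseract or Parallel being installed.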

deuxshaish · 2014-12-21T11:08:12Z

I like this a lot. Will test and observe. Thanks for the commit.

pickhardt · 2023-05-06T01:52:27Z

This is a great idea.

Nate Todd added 2 commits December 18, 2014 17:20

Add parallel processing to OCR text extraction

1f1ec93

Use GNU Parallel if installed to parallelize tesseract OCR on full document text extraction. If Parallel is not installed, use previous behavior.

Add Parallel installation to documentation

7427d08
