PDF OCR Tool

PDF OCR Tool is a Python-based tool for adding an OCR text layer to PDFs. It builds on the work done in OCRmyPDF and pdftopng and acts as a wrapper for them on Windows. It converts the incoming PDF into images and back into a PDF to strip any existing data or text, then adds the OCR layer on top.

Installation

Download the appropriate ZIP file from releases and unzip it.

The downloaded ZIP file contains two batch scripts:

setup.bat
install-lib.bat

Run both scripts, in the order listed, with administrative privileges. They install the prerequisites (Chocolatey as the package manager, Python 3.8, the Tesseract OCR engine, and Ghostscript) and the Python packages ocrmypdf, pdftopng, and PyPDF2.

OR

Clone the repository

git clone git@github.com:Shubham-272/PDFOCRtool.git

Ensure you have Python 3 installed (tested on Python 3.8).
Install the dependencies and libraries (preferably via Chocolatey):

choco install --pre tesseract -y
choco install ghostscript -y
pip install ocrmypdf pdftopng PyPDF2
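After a manual install, it can help to confirm that everything is actually reachable before running the tool. The following is a minimal sketch, not part of the repository: the function name check_prerequisites is hypothetical, and the Ghostscript executable names (gswin64c, gswin32c, gs) are assumptions about a typical install.

```python
import importlib.util
import shutil

# Executable names are assumptions: Tesseract ships "tesseract"; Ghostscript on
# Windows usually ships "gswin64c" (or "gswin32c"), and "gs" elsewhere.
EXECUTABLES = {
    "tesseract": ["tesseract"],
    "ghostscript": ["gswin64c", "gswin32c", "gs"],
}
# Import names of the pip packages installed above.
MODULES = ["ocrmypdf", "pdftopng", "PyPDF2"]


def check_prerequisites():
    """Return a dict mapping each prerequisite to True if it was found."""
    found = {}
    for name, candidates in EXECUTABLES.items():
        # An executable counts as found if any candidate name is on PATH.
        found[name] = any(shutil.which(exe) is not None for exe in candidates)
    for module in MODULES:
        # find_spec returns None when a top-level module is not installed.
        found[module] = importlib.util.find_spec(module) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If an executable is reported missing even though it is installed, its folder is probably not on PATH.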

The src folder contains the Python scripts for the individual operations as well as for the overall conversion.

Usage

Run PDF OCR Tool.exe to start the application. It takes two inputs: the input PDF file and the folder where the output PDF should be saved. The output file is named paper.pdf.

If you cloned the repository via git, the script pdfOCRtool.py handles the overall operation. You can run it from a terminal:

python pdfOCRtool.py

Conversion takes time and the progress is shown in the console that opens alongside. Once the conversion is complete, the UI shows a message in green saying "File converted and saved successfully!"

In case of errors, the message is printed in the UI with a red background. Contact the maintainers with the error message for resolution.

Additional Language Support

PDF OCR Tool is based on the Tesseract OCR engine. Tesseract supports a wide range of languages (see the Tesseract documentation for the list).

PDF OCR Tool installs only the English language pack by default. To add support for another language, download the corresponding language pack (.traineddata file) from the Tesseract tessdata repository and place it in C:\Program Files\Tesseract-OCR\tessdata (or wherever Tesseract OCR is installed).
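To check which language packs are in place, you can list the .traineddata files in the tessdata folder. This is a minimal sketch under the assumption that Tesseract lives at the default path given above; the function name installed_languages is hypothetical and not part of the tool.

```python
from pathlib import Path

# Default location from this README; pass a different folder if Tesseract
# is installed elsewhere.
DEFAULT_TESSDATA = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def installed_languages(tessdata_dir=DEFAULT_TESSDATA):
    """Return the sorted names of all .traineddata files in tessdata_dir."""
    tessdata_dir = Path(tessdata_dir)
    if not tessdata_dir.is_dir():
        return []
    # "eng.traineddata" -> "eng", "chi_sim.traineddata" -> "chi_sim"
    return sorted(p.stem for p in tessdata_dir.glob("*.traineddata"))


if __name__ == "__main__":
    print(installed_languages())
```

Note that the folder may also contain osd.traineddata, which Tesseract uses for orientation and script detection rather than for a specific language.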

To perform OCR on a PDF in a language other than English, specify the language(s) to use at run time as a comma-separated list.
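The README does not show how the tool passes the list on, so the following is only a sketch of one way a comma-separated list could be mapped onto the ocrmypdf command line, which takes languages joined with "+" via its -l option. The helper names parse_languages and build_ocrmypdf_command, the fallback to eng, and the file paths are all assumptions made for illustration.

```python
def parse_languages(text):
    """Split 'eng, deu' into ['eng', 'deu'], dropping blanks and duplicates."""
    codes = []
    for part in text.split(","):
        code = part.strip()
        if code and code not in codes:
            codes.append(code)
    # Fall back to English (assumed default) when nothing usable was given.
    return codes or ["eng"]


def build_ocrmypdf_command(input_pdf, output_pdf, languages="eng"):
    """Build the argument list for an ocrmypdf call without running it."""
    codes = parse_languages(languages)
    # ocrmypdf expects multiple languages joined with '+', e.g. "eng+deu".
    return ["ocrmypdf", "-l", "+".join(codes), str(input_pdf), str(output_pdf)]


if __name__ == "__main__":
    print(build_ocrmypdf_command("input.pdf", "paper.pdf", "eng, deu"))
```

The resulting list can be handed to subprocess.run once the language packs named in it are present in the tessdata folder.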

Changelog

v1.0.0 - Initial release, supports English OCR

v1.1.0 - Added language support via arguments

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

GNU General Public License v3.0
