PyPDFOCR on Docker
get rid of your paperwork...
PyPDFOCR converts a scanned PDF into an OCR'ed PDF using Tesseract-OCR and Ghostscript.
This Docker image is based on the official Ubuntu base image.
It incorporates a patch for issue #41 of pypdfocr 0.9.0, which is likely to be fixed in 0.9.1.
docker run --rm mmatiaschek/pypdfocr [-h] [-d] [-v] [-m] [-l LANG] [--preprocess]
[--skip-preprocess] [-w WATCH_DIR] [-f] [-c CONFIGFILE] [-e]
[-n]
[pdf_filename]
docker run -v ~/:/media --rm mmatiaschek/pypdfocr /media/filename.pdf
--> reads filename.pdf from your home directory and generates filename_ocr.pdf
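The call above mounts the whole home directory into the container. A minimal sketch of a wrapper that mounts only the directory containing the PDF could look like the following. The function name ocr_pdf and the DOCKER override variable are hypothetical helpers and not part of the image; the image name mmatiaschek/pypdfocr is taken from the usage line.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the image): OCR a single PDF by
# mounting only the directory that contains it at /media.
ocr_pdf() {
    pdf_path=$1
    # Docker bind mounts need an absolute host path.
    host_dir=$(cd "$(dirname "$pdf_path")" && pwd) || return 1
    file_name=$(basename "$pdf_path")
    # DOCKER is an assumed override so the command can be previewed,
    # e.g. DOCKER=echo; it defaults to the real docker binary.
    ${DOCKER:-docker} run --rm -v "$host_dir:/media" \
        mmatiaschek/pypdfocr "/media/$file_name"
}
```

For example, ocr_pdf ~/Documents/Paper/invoice.pdf runs the same conversion with only ~/Documents/Paper visible to the container.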
docker run -v ~/Documents/Paper:/media --rm mmatiaschek/pypdfocr -w /media -f -c /media/config.yaml
For a sample config, see config.yaml or the pypdfocr author's repository.
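As a rough sketch of what such a config might contain, the helper below writes a starter config.yaml into the watched folder. The function name write_sample_config is hypothetical, and the key names (target_folder, default_folder, original_move_folder, folders) are assumed from the filing example in the upstream pypdfocr README, so check them against the sample config.yaml before relying on them. The folder names and keywords are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: write a starter config.yaml into the folder that
# will be mounted at /media. Key names are an assumption based on the
# upstream pypdfocr filing example; verify them against config.yaml.
write_sample_config() {
    watch_dir=$1
    cat > "$watch_dir/config.yaml" <<'EOF'
# Paths are as seen from inside the container, where the host folder
# is mounted at /media.
target_folder: "/media/filed"
default_folder: "/media/filed/unsorted"
original_move_folder: "/media/originals"

# Folder name followed by keywords; folder names and keywords here
# are placeholders to replace with your own.
folders:
    finances:
        - invoice
        - bank statement
    receipts:
        - receipt
EOF
}
```

After write_sample_config ~/Documents/Paper, the watch command above finds the file as /media/config.yaml.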
Interactive Shell
docker run --entrypoint=/bin/bash -t -i mmatiaschek/pypdfocr
- I use Scanner Pro on iOS (Scanbot on Android) to scan documents and upload them, without OCR, to a WebDAV folder
- The WebDAV folder is hosted on my Synology DiskStation NAS via HTTPS and shared between devices with CloudStation
- I run this PyPDFOCR Docker image either manually on Mac OS X or on a local server
This way my personal documents never have to leave my own hardware or network, i.e. my personal cloud.