OCRmyPDF with img2pdf docker image (minidocks/ocrmypdf)

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.

Utilities

  • img2pdf is a tool for lossless conversion of raster images to PDF.
  • pdfminer is a tool for extracting information from PDF documents.
  • hocr tools

Usage

OCRmyPDF requires ghostscript, tesseract and unpaper. In this setup each tool runs in its own container, and the ocrmypdf container calls them over SSH. The easiest way to wire this up is Docker Compose.

Create a file named compose.yaml with the following content:

x-base: &base
  volumes:
    - .:/app
    - ./tmp:/tmp
  working_dir: /app
  command: sshd

services:
  ocrmypdf:
    <<: *base
    image: minidocks/ocrmypdf
    links:
      - tesseract
      - unpaper
      - gs
    environment:
      ALIAS_TESSERACT: ssh tesseract tesseract
      ALIAS_UNPAPER: ssh unpaper unpaper
      ALIAS_GS: ssh gs gs

  gs:
    <<: *base
    image: minidocks/ghostscript

  tesseract:
    <<: *base
    image: minidocks/tesseract:4-eng
    environment:
      OMP_THREAD_LIMIT: 1

  unpaper:
    <<: *base
    image: minidocks/unpaper

Then, in the same directory, run:

docker compose run --rm ocrmypdf -j 2 -l eng --tesseract-pagesegmode 3 input.pdf output.pdf
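
To process several files, the command above can be wrapped in a small shell helper. This is only a sketch: the function names `ocr_one` and `ocr_all` and the `DRY_RUN` variable are hypothetical and not part of the image. It assumes the scans live inside the directory that holds compose.yaml, since that directory is what gets mounted at /app.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the image): OCR every PDF in a
# directory through the compose setup described above.
# Set DRY_RUN=1 to print the commands instead of running them.

ocr_one() {
    # $1 = input PDF, $2 = output PDF, both relative to the compose directory
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker compose run --rm ocrmypdf -l eng $1 $2"
    else
        docker compose run --rm ocrmypdf -l eng "$1" "$2"
    fi
}

ocr_all() {
    # $1 = directory with scanned PDFs, $2 = directory for the results
    src=$1
    dst=$2
    mkdir -p "$dst"
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue   # the glob matched nothing
        ocr_one "$f" "$dst/$(basename "$f")"
    done
}
```

For example, after sourcing the file, `ocr_all scans out` would write an OCRed copy of each `scans/*.pdf` into `out/`. Running it first with `DRY_RUN=1` shows the commands without starting any container.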

Tags

Tag Size
latest

Related images
