kent-state-university-libraries/PDFCreate

PDF Create

Omeka plugin that creates OCR'd PDFs from TIFFs. If an item has multiple TIFFs, the plugin provides an easy way to aggregate them into a single file for viewing and downloading.

Generates OCR via Tesseract.

Stores the OCR'd text in the PdfText plugin's metadata element so that it is available to site search.

Aggregates multiple TIFFs for one item into a single OCR'd PDF/A-1b file via Ghostscript. Once the aggregated PDF is created, it can be found at http://example.com/path/to/your/files/directory/pdfs/ITEM_ID.pdf
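The flow described above (OCR each TIFF with Tesseract, then merge the per-page PDFs with Ghostscript) can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the function names, the `tmp` scratch directory, and the exact command-line options are assumptions. It only builds the command lists and does not run them.

```python
from pathlib import PurePosixPath
from typing import List


def ocr_command(tiff: PurePosixPath, out_dir: PurePosixPath) -> List[str]:
    """Tesseract call that writes <out_dir>/<tiff stem>.pdf with a text layer."""
    # "pdf" is Tesseract's built-in config for searchable PDF output;
    # Tesseract appends the ".pdf" extension to the output base itself.
    return ["tesseract", str(tiff), str(out_dir / tiff.stem), "-l", "eng", "pdf"]


def merge_command(page_pdfs: List[PurePosixPath], merged: PurePosixPath) -> List[str]:
    """Ghostscript call that concatenates per-page PDFs in the given order."""
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={merged}",
        *[str(p) for p in page_pdfs],
    ]


def plan_item_pdf(
    item_id: int, tiffs: List[PurePosixPath], files_dir: PurePosixPath
) -> List[List[str]]:
    """Return every command needed to build <files_dir>/pdfs/<item_id>.pdf."""
    work_dir = files_dir / "pdfs" / "tmp"  # hypothetical scratch directory
    commands = [ocr_command(t, work_dir) for t in tiffs]
    pages = [work_dir / f"{t.stem}.pdf" for t in tiffs]
    commands.append(merge_command(pages, files_dir / "pdfs" / f"{item_id}.pdf"))
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order. Producing a PDF/A-1b file requires additional Ghostscript settings that this sketch leaves out.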

Install

This plugin requires the PdfText plugin.

The server-side software needed to perform the OCR extraction is Ghostscript and Tesseract. These are the exact versions of the required software verified to work with this plugin (running on Red Hat Enterprise Linux 7):

  • GPL Ghostscript 9.07 (2013-02-14)
  • Tesseract 3.04.01
    • leptonica 1.73
      • libjpeg 6b (libjpeg-turbo 1.2.90)
        • libpng 1.5.13
        • libtiff 4.0.3
        • zlib 1.2.7
  • Download the tessdata 3.04.00 tarball
    • mv all eng.* files to /usr/local/share/tessdata/
  • Download the file "pdf.ttf" found here to /usr/local/share/tessdata/
    • Without this updated pdf.ttf, when two or more PDFs are aggregated into a single PDF via Ghostscript, the resulting OCR text will have spaces between every letter, which ruins the OCR. The Tesseract and Ghostscript fonts don't map perfectly, and this file fixes that.
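After following the steps above, a quick check can confirm that the language data and the replacement font are in place. The helper below is a hypothetical sketch, not part of the plugin; it assumes the /usr/local/share/tessdata/ location named in the list and only looks for the presence of files, not their versions.

```python
from pathlib import Path
from typing import List


def missing_tessdata(tessdata_dir: str = "/usr/local/share/tessdata") -> List[str]:
    """Return the names of required tessdata entries that are not present."""
    root = Path(tessdata_dir)
    missing = []
    # At least one English data file (eng.*) must have been moved in.
    if not any(root.glob("eng.*")):
        missing.append("eng.*")
    # The updated pdf.ttf avoids the letter-spacing problem after merging.
    if not (root / "pdf.ttf").is_file():
        missing.append("pdf.ttf")
    return missing
```

An empty list means both requirements were found; anything else names what still needs to be copied into the directory.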