@Early-Modern-OCR

Early Modern OCR Project

Open source tools and training for OCR'ing 15th-18th Century printed documents with Tesseract.

Popular repositories

  1. TesseractTraining

    Training files produced for and by the Tesseract OCR engine for work on the Early Modern OCR Project (eMOP)

    35 7

  2. FrankenPlus

    Part of eMOP: the Franken+ tool for creating font training for the Tesseract OCR engine from page images.

    C# 24 7

  3. hOCR-De-Noising

    Code to remove "noise" from the hOCR output of Tesseract OCR.

    Python 14 6

  4. RETAS

    Part of eMOP: the Recursive Text Alignment Tool compares OCR text results to ground truth character by character and computes a score.

    Java 11 4

  5. page-evaluator

    Java code to examine the output of Tesseract OCR and generate scores for general page quality and correctability (see the page-corrector repo).

    Java 8 1

  6. page-corrector

    Scala code to correct Tesseract OCR output and generate ALTO XML and text files. Uses dictionary files, rules, and a google-3gram DB to make corrections.

    Scala 7
