Early Modern OCR Project

TesseractTraining Public

Training files produced for and by the Tesseract OCR engine for work on the Early Modern OCR Project (eMOP)

35 7

FrankenPlus Public

Part of eMOP: Franken+ tool for creating font training for the Tesseract OCR engine from page images.

C# 24 7

hOCR-De-Noising Public

Code to remove "noise" from the hOCR output of Tesseract OCR.

Python 14 6

RETAS Public

Part of eMOP: the Recursive Text Alignment Tool compares OCR text results to groundtruth by character and computes a score.

Java 11 4

page-evaluator Public

Java code to examine the output of Tesseract OCR and generate scores for general page quality and correctability (see page-corrector repo).

Java 8 1

page-corrector Public

Scala code to correct Tesseract OCR output and generate ALTO XML and text files. Uses dictionary files, rules and a google-3gram DB to make corrections.

Scala 7
