OCR_POST_DE

OCR post-correction for historical German corpora. More details can be found in our paper (https://arxiv.org/abs/2102.00583).

Libraries: python 3.7, keras 2.4.3, tensorflow 2.3.1, pytorch 1.4.0

Other packages: NLTK, numpy, gensim, datasketch, Bio.pairwise2, entmax

create_data:

Download an OCRed book from ÖNB (https://iiif.onb.ac.at/gui/manifest.html) by its unique barcode; see dataScrapy.py.
Clean the downloaded raw text; see parseText.py.
Given a downloaded OCRed book and the corresponding transcription from DTA, align them into sentence pairs; see sentenceAlignment.py. The already generated sentence pairs (ocr_seq, trans_seq) and the original books are under PKL/.
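To make the idea of an (ocr_seq, trans_seq) pair concrete, here is a minimal sketch of matching an OCR sentence to its most similar transcription sentence and listing the character-level differences. It is not the logic of sentenceAlignment.py: the function names, the 0.6 threshold, and the example strings are hypothetical, and the standard library's difflib is swapped in for the alignment packages listed above so that the example runs without extra dependencies.

```python
from difflib import SequenceMatcher


def similarity(ocr_seq, trans_seq):
    """Similarity in [0, 1] between an OCR sentence and a transcription sentence."""
    return SequenceMatcher(None, ocr_seq, trans_seq, autojunk=False).ratio()


def best_match(ocr_seq, candidates, threshold=0.6):
    """Return the most similar candidate, or None if none reaches the threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda cand: similarity(ocr_seq, cand))
    return best if similarity(ocr_seq, best) >= threshold else None


def char_edits(ocr_seq, trans_seq):
    """List (operation, ocr_fragment, trans_fragment) for every span that differs."""
    matcher = SequenceMatcher(None, ocr_seq, trans_seq, autojunk=False)
    return [
        (op, ocr_seq[i1:i2], trans_seq[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    ocr = "Hans gieng"                      # hypothetical OCR output
    transcriptions = ["Der Hund", "Hans ging"]
    match = best_match(ocr, transcriptions)
    print(match)
    print(char_edits(ocr, match))
```

A quadratic matcher like this is only practical for short texts; for whole books a cheaper candidate filter before the character-level comparison would likely be needed.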

CRF (conditional random field):

In many cases the OCR quality is acceptable (e.g., books from the 18th century and later), and most errors come from segmentation rather than character misrecognition. For these cases we provide a tagger trained on the German Wikipedia corpus, which uses a CRF to correct segmentation errors only. See word_segment.py for details. Due to space limitations, the trained tagger is not included here: you can either train it yourself with the source code and your own data, or download the trained tagger (dewiki_segmentation.crfsuite) from https://drive.google.com/file/d/1h7mwsXERKrymGnVNfYuDcOGf25DbohEp/view?usp=sharing
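Segmentation correction can be framed as character tagging, which is what makes a CRF applicable. The sketch below shows one possible framing, under stated assumptions: the B/I label scheme (B = first character of a word, I = any other character), the feature names, and the window size are all hypothetical and may differ from what word_segment.py actually uses. It only prepares and decodes label sequences; it does not train or load a CRF.

```python
def to_labels(sentence):
    """Turn a correctly segmented sentence into (chars, labels) training data."""
    chars, labels = [], []
    for word in sentence.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def from_labels(chars, labels):
    """Rebuild a segmented sentence from characters and predicted B/I labels."""
    words = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)       # start a new word
        else:
            words[-1] += ch        # extend the current word
    return " ".join(words)


def char_features(chars, i, window=2):
    """Feature dict for the character at position i (hypothetical feature set)."""
    feats = {"bias": 1.0, "char": chars[i], "is_upper": chars[i].isupper()}
    for off in range(1, window + 1):
        feats[f"char-{off}"] = chars[i - off] if i - off >= 0 else "<s>"
        feats[f"char+{off}"] = chars[i + off] if i + off < len(chars) else "</s>"
    return feats


if __name__ == "__main__":
    chars, labels = to_labels("der Hund")
    print(labels)                      # -> ['B', 'I', 'I', 'B', 'I', 'I', 'I']
    print(from_labels(chars, labels))  # -> der Hund
```

At correction time, the spaces of a badly segmented OCR line would be stripped, each character turned into a feature dict, and the tagger's predicted labels passed to a function like `from_labels`.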

keras_implement:

The original implementation is based on keras; see networks.py for all model definitions. ocr_corrector.py contains all functions to train, evaluate, and generate output, either for a single sentence or at batch level.

torch_implement:

A torch implementation is also in progress; for now we provide a standard attention-based encoder-decoder model. The only differences are: 1. training with random teacher forcing; 2. entmax instead of softmax. These two changes further improved performance while keeping the model simple. See Model.py for the model definition and seq2seq.py for training, evaluation, generation, and other utilities. A model trained on all instances, together with the data pairs, can be found at https://drive.google.com/drive/folders/1qBI-2IhYPBGtMcGVWGb19lJCu0jV7QId?usp=sharing
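The two changes can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions, not the code in Model.py or seq2seq.py: it uses sparsemax, the alpha = 2 member of the entmax family, because it has a short closed form (the README does not say which alpha the model uses), and it assumes teacher forcing is decided independently at each decoding step. All function and parameter names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Dense baseline: every entry receives a strictly positive probability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparsemax (entmax with alpha = 2): low scores get exactly zero probability."""
    ordered = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z in enumerate(ordered, start=1):
        cumsum += z
        if 1 + k * z > cumsum:        # position k is still inside the support
            tau = (cumsum - 1) / k    # threshold for a support of size k
        else:
            break
    return [max(s - tau, 0.0) for s in scores]


def next_decoder_input(gold_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Feed the gold token with the given probability, else the model's own prediction."""
    return gold_token if rng.random() < teacher_forcing_ratio else predicted_token


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.1]
    print(softmax(scores))     # all three entries are positive
    print(sparsemax(scores))   # -> [1.0, 0.0, 0.0]
    print(next_decoder_input("a", "b", teacher_forcing_ratio=0.5))
```

With sparse attention weights the decoder can ignore irrelevant source characters entirely, and mixing gold and predicted inputs during training exposes the model to its own mistakes, which is one common motivation for both choices.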

If you have further questions about the code or data, please contact us. Please cite our paper if you find it useful.

@article{lyu2021neural,
  title={Neural OCR post-hoc correction of historical corpora},
  author={Lyu, Lijun and Koutraki, Maria and Krickl, Martin and Fetahu, Besnik},
  journal={Transactions of the Association for Computational Linguistics},
  volume={9},
  pages={479--493},
  year={2021},
  publisher={MIT Press}
}
