formatting and integrating the Deutsches Textarchiv dictionary into various applications

bertsky/dta-lexdb-applications

dta-lexdb-applications

Deutsches Textarchiv (DTA) is a large collection of curated and manually corrected reference corpora in New High German from the 17th to the 20th century.

LexDB is a collection of lexical databases (i.e. dictionaries) distilled from the DTA by the BBAW. They include the full form, lemmatization, normalized orthography and part of speech.

This repository provides scripts to extract and re-format these dictionaries for re-use in other applications. The results will be available as GitHub release assets.

Tesseract OCR models with added language model

Tesseract models can be amended with a simple language model by providing dictionaries/grammars for punctuation, numbers and words. This applies both to the originally provided models, which were trained on synthetic data, and to the community-generated ones, which were fine-tuned on annotated scan data or trained from scratch.
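As a rough illustration of what such an amendment can involve, the following sketch swaps the word list of a model using Tesseract's training tools `combine_tessdata` and `wordlist2dawg`. It assumes an LSTM model, whose components are named `lstm-unicharset` and `lstm-word-dawg`. The function names `run` and `amend_model`, the `DRY_RUN` switch and all file names are hypothetical, and the rules in this repository may proceed differently.

```shell
#!/bin/sh
# run CMD...: print the command, then execute it unless DRY_RUN=1 (hypothetical helper)
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# amend_model MODEL WORDLIST: replace the word list inside an LSTM model, in place
amend_model() {
    model=$1       # e.g. frak2021.traineddata
    wordlist=$2    # plain text, one word per line
    prefix=${model%.traineddata}.
    # unpack all components of the model into files named PREFIX*
    run combine_tessdata -u "$model" "$prefix" &&
    # compile the word list into a DAWG over the model's own character set
    run wordlist2dawg "$wordlist" "${prefix}lstm-word-dawg" "${prefix}lstm-unicharset" &&
    # overwrite the matching component inside the model file
    run combine_tessdata -o "$model" "${prefix}lstm-word-dawg"
}
```

Calling `amend_model frak2021.traineddata words.txt` with `DRY_RUN=1` set only prints the three commands, which is a cheap way to check the file names before touching a model. The punctuation and number components (`lstm-punc-dawg`, `lstm-number-dawg`) could presumably be replaced in the same manner.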

We will pick publicly available models for German Antiqua and Fraktur prints, as well as for handwriting, and republish them with the DTA dictionary as language model.

The currently selected models and their download sources are given by the following Makefile rules:

TESS_MODELS := frak2021 GT4HistOCR ONB Fraktur_5000000 german_print frk Fraktur

GT4HistOCR.traineddata:
	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/GT4HistOCR/tessdata_best/GT4HistOCR.traineddata
frak2021.traineddata:
	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata
Fraktur_5000000.traineddata:
	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/Fraktur_50000000.334_450937.traineddata
ONB.traineddata:
	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata
german_print.traineddata:
	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/german_print/german_print.traineddata
frk.traineddata:
	wget -O $@ https://github.com/tesseract-ocr/tessdata_fast/raw/main/frk.traineddata
Fraktur.traineddata:
	wget -O $@ https://github.com/tesseract-ocr/tessdata_fast/raw/main/script/Fraktur.traineddata

Hunspell

Hunspell is a widely used dictionary-based, morphology-aware spell checker.

We will produce a DTA dictionary for it.

The currently selected rules are:

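de-dta.dic: dta_lexdb_10.words
	# first line of a .dic file: approximate number of entries (counted before filtering)
	wc -l < $< > $@
	# to do: combine DTA lemmatization and contemporary affixation to a historic affixation system (instead of fixed word list)
	# drop entries that start with punctuation or consist only of digits and punctuation, then deduplicate
	grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$$' $< | sort -u >> $@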
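To try the same filtering outside of make, the recipe can be written as a small shell function. This is a sketch rather than code from this repository: the function name `build_dic` and the file names in the usage line are hypothetical, and the `$$` of the Makefile becomes a single `$` in plain shell. Two small deviations are assumptions made for portability: `tr -d ' '` strips the padding some `wc` implementations emit, and `LC_ALL=C` makes the sort order byte-wise and reproducible.

```shell
#!/bin/sh
# build_dic WORDLIST DICFILE: write a Hunspell .dic word list (hypothetical helper)
build_dic() {
    words=$1
    dic=$2
    # first line: approximate entry count, taken from the unfiltered input
    wc -l < "$words" | tr -d ' ' > "$dic"
    # drop entries that start with punctuation or consist only of digits
    # and punctuation (this also drops empty lines), then deduplicate
    grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$' "$words" |
        LC_ALL=C sort -u >> "$dic"
}
```

A call such as `build_dic dta_lexdb_10.words de-dta.dic` would then produce the word list. Hunspell treats the count in the first line only as a size hint, so it does not need to match the filtered list exactly. Note that Hunspell also expects an affix file (`.aff`) next to the `.dic` file.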

...

Others to come. Please raise an issue if you have ideas!
