ocr-arabic-script

Experiments in OCR for historical texts written in Arabic script.

Prerequisites

  • GNU Make
  • GNU gawk
  • mmv utility
  • xmllint, part of the libxml2-utils package
  • A working Python 3 environment
  • pip, updated to the latest version
  • GNU parallel, for running kraken operations in parallel, which may be somewhat faster than kraken batched operations on multicore machines with lower core counts and no GPU.
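A quick way to check that the prerequisites above are installed is sketched below. The function name missing_tools is hypothetical, and the command names (in particular pip and python3, which some systems expose under other names) are assumptions about how each tool is installed on your machine.

```shell
#!/bin/sh
# Print the name of each command that cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of this repository.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names assumed for the prerequisites listed above.
missing_tools make gawk mmv xmllint python3 pip parallel
```

If the script prints nothing, every listed command was found.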

Installation

make deps

Configuration

The system is configured via environment variables set in a local, non-versioned file, ./config.

PyTorch device

To use a GPU, set, for example:

DEVICE=cuda:0

The default device is cpu.

Number of threads for OCR step

This parameter is passed to kraken's ocr command. For a 4-core system, set:

NUM_THREADS=4

The default is 1.
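Putting the two settings together, a ./config file for a 4-core machine with one CUDA GPU might look like the sketch below. Only DEVICE and NUM_THREADS are documented here; the comment lines assume the file is read in a way that tolerates # comments, which is not confirmed by this README.

```shell
# ./config (local, not under version control)
# PyTorch device; omit this line to fall back to the default, cpu.
DEVICE=cuda:0
# Threads passed to kraken's ocr command; the default is 1.
NUM_THREADS=4
```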

Test Runs

Binarization

make binarize-all

This binarizes all the images in data/fas, producing image files whose names end in -bin.png.

Optionally, use the parallelized version of this target:

make binarize-all-par

Segmentation

make segment-all

This segments all the binarized images in data/fas, producing ALTO XML files whose names end in -seg.xml.

Optionally, use the parallelized version of this target:

make segment-all-par

Because the parallelized version runs multiple processes, the overhead of the initial load of the neural model is multiplied by the number of cores available on the machine (the parallel default). Experiment to determine whether parallelization is beneficial on your hardware. On a MacBook Pro (2019) the speedup is considerable.
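One way to run that experiment is to time both targets, as in the rough sketch below. The helper elapsed_seconds is hypothetical and measures whole seconds only. It is also an assumption that each target redoes its work on every run; if the Makefile skips outputs that already exist, remove the generated -seg.xml files between the two measurements.

```shell
#!/bin/sh
# Run a command, discard its output, and print the elapsed wall-clock seconds.
# elapsed_seconds is a hypothetical helper, not part of this repository.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example, to be run from the repository root:
# echo "serial:   $(elapsed_seconds make segment-all) s"
# echo "parallel: $(elapsed_seconds make segment-all-par) s"
```

The same comparison applies to the binarize-all and ocr-all targets and their -par variants.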

Recognition

make ocr-all

This target runs kraken's OCR over the segmented images, again producing ALTO XML files, this time containing the recognized text (in ALTO, the CONTENT attribute of String elements). The output filenames end in -rec.xml.

Optionally, use the parallelized version of this target:

make ocr-all-par

The same caveats about model-loading overhead apply.

Evaluation

make extract-gold-all
make create-eval-dirs
make eval-all

These final steps construct the evaluation datasets and run the programs in ./bin that produce a character accuracy report in report.txt.

Everything

To run the entire sequence, including installation of dependencies:

make go

And wait.
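If the dependencies are already installed, the individual targets described above can be chained by hand instead. The sketch below runs them in the order this README presents them and stops at the first failure. The function run_pipeline and the MAKE_CMD override are hypothetical conveniences, not part of the repository, and the sketch does not claim to reproduce exactly what make go does.

```shell
#!/bin/sh
# Run the documented targets in README order, stopping at the first failure.
# MAKE_CMD is a hypothetical override, e.g. MAKE_CMD=echo lists the targets
# without running them.
run_pipeline() {
    for target in binarize-all segment-all ocr-all \
                  extract-gold-all create-eval-dirs eval-all; do
        "${MAKE_CMD:-make}" "$target" || return 1
    done
}
```

Call run_pipeline from the repository root. Running MAKE_CMD=echo run_pipeline first prints the target names, one per line, as a dry listing.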
