Skip to content

Hrazhan/kurdish-ocr

Repository files navigation

Kurdish OCR

Banner Image

🌐 Live Demo | 📦 Base Model | ✍️ Handwritten Model | 🗃️ Data

This project is an implementation of an Optical Character Recognition (OCR) system for the Central Kurdish language. But it can be easily extended to other languages since its data is generated synthetically. The architecture is a Vision Encoder-Decoder where the encoder can be any transformer vision model and the decoder can be any pretrained language model. The cross attention layers of the final model will be added to the decoder. The model supports single line by default to extend it to multi-line you can either train your own text dection or use something like CRAFT (which in my experience is not so good with perso-arabic scripts)

Usage

You can install the requirements using pip:

pip install -r requirements.txt

You will need a single line text corpus and fonts for your language of choice

  • gen_vocab.py Generate the vocab from the OSCAR corpus and wikipedia. Modify --chars to the characters you want to keep in the final corpus
  • gen_ocr_data.py Generates the final dataset with various filters and distortions. Modify the number of lines in this script according to your corpus and have the fonts in data/fonts directory.
  • 'init_model.py` Initialize the model
  • accelerate_train.py or `train.py is used to train the model.
  • inference.py runs the model through command line.
  • app.py Is a UI where can run the model with CRAFT for multi-line text recognition.

Note: If you wanna train on handwritten Kurdish data, download the the dataset from here and delete the .DS_Store file. Pass --handwritten_dataset to train.py a class for that dataset is implement in dataset.py.

License

This project is open-source and available under the GNU General Public License v3.0