
Sinhala Script Optical Character Recognition

Code by TomMakesThings

June 2020


About

The aim of this project was to experiment with creating a basic optical character recognition (OCR) system that takes images of printed Sinhalese characters (Sinhala being the national language of Sri Lanka) and converts them to machine-readable text using a KNN classifier.


Training and Testing Data

To generate training and testing data, I created PDFs of Sinhalese characters in different font weights, along with matching text files storing the ground truth. After the PDFs are read into the notebook, each page is converted to an image.
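As a rough illustration, the pages of a PDF could be rasterised with the pdf2image package (an assumption; the notebook may use a different library, and the file name below is hypothetical):

```python
from pdf2image import convert_from_path  # assumes pdf2image and poppler are installed
import numpy as np

# Render each PDF page as a PIL image
pages = convert_from_path("sinhala_training.pdf", dpi=300)

# Convert the pages to NumPy arrays for OpenCV-style processing
page_images = [np.array(page) for page in pages]
```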

For each image, padding is added to increase the page margin and a Gaussian blur is applied to merge parts of characters that lie close together. The image is then converted to grayscale and thresholded using Otsu's method to separate the background from the foreground (see example with training data below).
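A minimal OpenCV sketch of this preprocessing step (the padding size and blur kernel below are illustrative assumptions, not the notebook's actual values):

```python
import cv2

def preprocess(image, pad=50):
    # Pad the page with a white border to enlarge the margins
    padded = cv2.copyMakeBorder(image, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=(255, 255, 255))
    # Blur slightly so nearby parts of a character merge into one blob
    blurred = cv2.GaussianBlur(padded, (5, 5), 0)
    # Grayscale, then Otsu's method to separate foreground from background
    gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return thresh
```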

Next, the image is converted to binary and regions of interest (ROIs) are identified. The ROIs are first sorted by their y coordinates to group characters on the same line, and then by x to determine their order on the page.

Finally, each ROI is resized to 20 x 20 pixels, since the classifier requires all input images to be the same size.
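Putting these steps together, one possible sketch using OpenCV contours (the row-grouping tolerance is an assumption made for illustration):

```python
import cv2
import numpy as np

def extract_rois(thresh, size=20, row_tolerance=10):
    # Bounding boxes of the connected components in the binary image
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]  # (x, y, w, h)

    # Sort by y to group boxes into lines, then by x within each line
    boxes.sort(key=lambda b: b[1])
    lines, current = [], [boxes[0]]
    for box in boxes[1:]:
        if abs(box[1] - current[-1][1]) <= row_tolerance:
            current.append(box)
        else:
            lines.append(sorted(current, key=lambda b: b[0]))
            current = [box]
    lines.append(sorted(current, key=lambda b: b[0]))

    # Crop each character and resize to 20 x 20 so the KNN sees equal-length vectors
    rois = []
    for line in lines:
        for x, y, w, h in line:
            char = thresh[y:y + h, x:x + w]
            rois.append(cv2.resize(char, (size, size)))
    return np.array(rois)
```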


K-Nearest Neighbors (KNN)

KNN is a type of classifier that uses majority voting to assign a class to each data point based on its k closest points. After the extracted ROIs are matched to their labels, the KNN classifier is fit to the training data. The KNN is then used to make a prediction on the test data, and the predicted and expected text are compared.
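With scikit-learn, fitting and predicting could look something like this (the value of k and the variable names are assumptions for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Flatten each 20 x 20 ROI into a 400-dimensional feature vector
X_train = train_rois.reshape(len(train_rois), -1)
X_test = test_rois.reshape(len(test_rois), -1)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, train_labels)  # train_labels: one character per training ROI

predicted_text = "".join(knn.predict(X_test))
```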

Prediction: ථළභඦඹශඝඖරඨෆඦඤදධඣඏඋඏධටඝෆඵයවෂෂඉඅටකචඤ
Actual:     ථළභඦඹශඝඖරඨෆඦඤදධඣඏඋඏධටඝෆඵයවෂෂඉආකචඤඏ

With OCR, shifts between images and labels are a common occurrence. For example, at the end අ ට ක ච ඤ was predicted when the expected text was ආ ක ච ඤ ඏ. In this case the label at index 1 would no longer match image 1, label 2 would not match image 2, and so on. This means a measure such as accuracy is not suitable for evaluating performance.

Instead, the Levenshtein distance and Jaro distance are calculated. The Levenshtein distance counts the number of edits (insertions, deletions and substitutions) required to convert the prediction into the actual text, while the Jaro distance is a score in the range [0, 1], where 0 represents completely dissimilar strings and 1 represents identical strings.

Levenshtein distance: 3
Jaro distance: 0.961
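Both measures can be computed with short self-contained functions; the sketch below follows the standard definitions rather than the notebook's exact implementation, and the variable names in the final lines are assumed:

```python
def levenshtein(a, b):
    # Dynamic programming over edit operations (insert, delete, substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaro(a, b):
    if a == b:
        return 1.0
    window = max(len(a), len(b)) // 2 - 1
    a_match, b_match = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(i + window + 1, len(b))):
            if not b_match[j] and ca == b[j]:
                a_match[i] = b_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters
    k = transpositions = 0
    for i, ca in enumerate(a):
        if a_match[i]:
            while not b_match[k]:
                k += 1
            if ca != b[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

# predicted_text and actual_text as produced/loaded earlier (names assumed)
print(levenshtein(predicted_text, actual_text))
print(round(jaro(predicted_text, actual_text), 3))
```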

To evaluate the quality of the classifier's output, a confusion matrix is plotted. This reveals how good the classifier is at predicting each character (true positives lie on the diagonal) and, for those it gets wrong, which characters it predicts instead.
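A confusion matrix of this kind can be produced with scikit-learn, for instance (the label variables are assumed, and predictions are assumed to already be aligned with the ground truth):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(actual_labels, predicted_labels,
                                        xticks_rotation="vertical")
plt.show()
```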
