Is it possible to add few pre-1918 Russian characters to RUS language files? #3

Open
alexei-kouprianov opened this issue Nov 10, 2018 · 4 comments


@alexei-kouprianov

alexei-kouprianov commented Nov 10, 2018

In 1917-1918, Russian orthography was reformed in many ways, including but not limited to the elimination of four letters: I-decimal (now known as "Byelorussian-Ukrainian I"), Yat, Fita, and Izhitsa. Scholars widely recognise the need to OCR texts published in Russia from 1708 through 1918 (and somewhat later), but they are largely unfamiliar with how tesseract can be trained to recognise these missing characters (and, I have to confess, the vast majority of ordinary people will be unable to train tesseract even after reading the instructions: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ). See also: https://en.wikipedia.org/wiki/Russian_alphabet#Letters_eliminated_in_1918

Would it be possible to add the following glyphs to the desired characters list for Russian ( langdata_lstm/rus/desired_characters )?

§ : Section sign ; Unicode number: U+00A7

І : Cyrillic Capital Letter Byelorussian-Ukrainian I ; Unicode number: U+0406
і : Cyrillic Small Letter Byelorussian-Ukrainian I ; Unicode number: U+0456
Ѣ : Cyrillic Capital Letter Yat ; Unicode number: U+0462
ѣ : Cyrillic Small Letter Yat ; Unicode number: U+0463
Ѳ : Cyrillic Capital Letter Fita ; Unicode number: U+0472
ѳ : Cyrillic Small Letter Fita ; Unicode number: U+0473
Ѵ : Cyrillic Capital Letter Izhitsa ; Unicode number: U+0474
ѵ : Cyrillic Small Letter Izhitsa ; Unicode number: U+0475
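
A minimal sketch of how one might check which of the glyphs listed above are still absent from a character list. It assumes the contents of langdata_lstm/rus/desired_characters have already been read into a string; the names `REQUESTED`, `missing_characters` and `describe` are hypothetical and exist only for this illustration.

```python
import unicodedata

# Code points requested in this issue (section sign plus the eight
# pre-1918 letter forms). Written as numbers to avoid look-alike glyphs.
REQUESTED = [0x00A7, 0x0406, 0x0456, 0x0462, 0x0463,
             0x0472, 0x0473, 0x0474, 0x0475]


def missing_characters(desired_text):
    """Return the requested characters that do not occur in desired_text."""
    present = set(desired_text)
    return [chr(cp) for cp in REQUESTED if chr(cp) not in present]


def describe(ch):
    """Format a character as 'U+XXXX OFFICIAL UNICODE NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # Assumption: this sample string stands in for the real file contents.
    sample = "абвгд\u00a7"
    for ch in missing_characters(sample):
        print(describe(ch))
```

The check is a plain membership test, so it does not depend on how the file separates its entries.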

What else should be provided to add these few characters? A list of words containing these letters? How long should that list be? I am currently working on a project that processes many geographic names in pre-1918 Russian (and some other texts), so I can provide at least a word list of considerable length. For now, I have to OCR pre-1918 text as if it were post-1918 and insert the four missing characters manually (mostly two of them, as Fita and especially Izhitsa were much less frequent).
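
As an illustration of how such a word list could be prepared, the sketch below keeps only the words that contain at least one of the eliminated letters and counts how many words each letter appears in. It is not part of any Tesseract tooling; `OLD_LETTERS` and `select_training_words` are hypothetical names, and the sample words are only examples.

```python
from collections import Counter

# The eight pre-1918 letter forms from the list above:
# I-decimal, Yat, Fita, Izhitsa (capital and small).
OLD_LETTERS = set("\u0406\u0456\u0462\u0463\u0472\u0473\u0474\u0475")


def select_training_words(words):
    """Return (selected_words, counts).

    selected_words keeps the input order and holds every word that
    contains at least one old letter. counts maps each old letter to
    the number of selected words in which it occurs.
    """
    selected = []
    counts = Counter()
    for word in words:
        found = OLD_LETTERS.intersection(word)
        if found:
            selected.append(word)
            counts.update(found)
    return selected, counts


if __name__ == "__main__":
    # Two pre-reform spellings and one modern word.
    sample = ["\u041a\u0456\u0435\u0432\u044a",   # Кіевъ
              "\u0445\u043b\u0463\u0431\u044a",   # хлѣбъ
              "Москва"]
    words, counts = select_training_words(sample)
    print(words)
    print(counts)
```

The per-letter counts make it easy to see whether rare letters such as Fita and Izhitsa are represented at all in the collected list.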

Or would this rather require a much larger effort, such as creating a special rus-old model?

@stweil
Contributor

stweil commented Nov 11, 2018

Yes, this is possible. I think the resulting model should not replace rus, but be a new rus_old, because otherwise Tesseract might "recognize" the old characters in modern texts, too.

I assume that the missing section sign will be needed for both rus and rus_old. The Tesseract wiki explains how fixed or new models can be created from the existing model.

@amitdo

amitdo commented Nov 11, 2018

Your first step should be finding/making ground truth text from images of pre-1918 Russian books and/or newspapers.

@stweil
Contributor

stweil commented Nov 11, 2018

The Byelorussian-Ukrainian I (upper and lower case) is included in scripts/Cyrillic.traineddata: I see it in the unicharset file.
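
A rough sketch of how such a check could be scripted. It assumes the unicharset has already been extracted from the traineddata file into plain text (for example with `combine_tessdata -u`), and that the file has an entry count on its first line followed by one entry per line whose first space-separated field is the character. The function names are hypothetical.

```python
def unicharset_symbols(unicharset_text):
    """Return the set of symbols listed in a unicharset file's text.

    Assumed layout: the first line holds the number of entries; each
    following line starts with the symbol, then a space, then properties.
    """
    lines = unicharset_text.splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def characters_present(unicharset_text, wanted):
    """Map each wanted character to True/False depending on presence."""
    symbols = unicharset_symbols(unicharset_text)
    return {ch: ch in symbols for ch in wanted}


if __name__ == "__main__":
    # Assumption: a tiny hand-made sample, not real model data.
    sample = (
        "3\n"
        "NULL 0 Common 0\n"
        "\u0406 5 Cyrillic 1\n"
        "\u0456 3 Cyrillic 2\n"
    )
    print(characters_present(sample, ["\u0406", "\u0456", "\u0463"]))
```

Special entries such as the initial NULL line are treated like any other symbol here, which is harmless for a presence check on single characters.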

@alexei-kouprianov
Author

@stweil and @amitdo, thank you for the comments. I now see that a new rus_old model is the better solution. I shall try to prepare a set of words in pre-1918 Russian for training and come back to this issue afterwards. I am not sure I will be able to decipher the training instructions on my own, but they are of no use anyway without a good deal of text to train on.
