Telugu unicode ambiguities #32

metaforte · 2018-09-13T03:23:29Z

Hi,
I created a test text file consisting mostly of made-up individual characters (see attachment) and converted it to a TIFF file using 'jTessBoxEditorFX' with the font 'Noto Sans Telugu' at 8pt. I then ran Tesseract on it using the tessdata_best Telugu trained data.
I noticed a few recognition errors. I believe these are due to ambiguous glyphs.

Ambiguity 1: Telugu has three vowels that look similar to a consonant (and there is a second consonant that looks close enough)
vowel 1) ఒ (pronounced as 'o' in 'so')
vowel 2) ఓ (pronounced as 'oa' in 'goal' )
vowel 3) ఔ (pronounced as 'ou' in 'ounce' or 'pound')

similar looking consonant 1) బ (pronounced as 'bu' in 'bus')
consonant 2) భ (this is same as above but uttered with stress and aspiration. Imagine saying 'bus' as 'bhus')

Ambiguity 2: Consonant చ (pronounced as 'ch' as in 'church') is similar to another rarely used consonant ౘ (closest transliteration 'tsa')

Ambiguity 3: Consonant ర (pronounced as 'ru' as in 'run') is similar to another consonant ఠ ( hard 't' - close to the 't' in 'stone')

Ambiguity 4: Consonant జ (pronounced as 'ju' as in 'justice') is similar to another rarely used consonant ౙ (closest transliteration 'za') and also similar to ఙ ('jna')

Ambiguity 5: Consonant ఝ (pronounced as 'jha', a hard జ with aspiration) was interpreted as 'రు' (pronounced as 'ru' in 'rupee'), which is a combination of consonant ర ('ru') + vowel ఉ (pronounced as 'u' in 'push')

Ambiguity 6: vowel ఇ ( pronounced as 'i' in 'ink') is close to consonant ఞ (pronounced as 'inya'). The 'inya' did not get recognized at all in my test data.

Ambiguity 7: కౄ ( 'cru' as in 'cruel') and గౄ ('grue' as in 'gruesome') were converted to క్య ('kya') and 'గ్యూ' (gyoo).

Ambiguity 8: ౠ ('rroo') became బూ ('boo')

I guess some of these could be due to my poor-quality TIFF, but I think some of the ambiguities are genuine and need to be handled.

Please help resolve these ambiguities.
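For anyone reproducing this, it may help to look at the exact code points involved, since several of the reported outputs are multi-code-point sequences rather than single letters. The snippet below is only a sketch using Python's standard `unicodedata` module; the `describe` helper and the `PAIRS` list are made up for illustration, and the pairs are copied from the ambiguities listed above.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each code point in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# (expected, look-alike or reported output) pairs taken from the list above.
# This list is illustrative, not exhaustive.
PAIRS = [
    ("ఒ", "బ"),
    ("చ", "ౘ"),
    ("ర", "ఠ"),
    ("జ", "ౙ"),
    ("ఝ", "రు"),
    ("కౄ", "క్య"),
    ("ౠ", "బూ"),
]

for expected, other in PAIRS:
    print(describe(expected))
    print("  vs", describe(other))
```

The names printed depend on the Unicode database bundled with the Python version in use.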

tesseract-telugu.txt

Shreeshrii · 2018-09-13T06:52:25Z

1. Please also test with tessdata_fast.
2. Check tel.lstm-unicharset in both tessdata_best and tessdata_fast to ensure that rarely used letters are included.
3. Take a look at the training source files in langdata_lstm repo under tel.
4. Verify that the indic/telugu validation rules are correct.
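The unicharset check in item 2 could be scripted roughly as follows. This is a sketch under assumptions: it assumes the unicharset has already been extracted to a text file (for example by unpacking the traineddata with `combine_tessdata -u`), that the first line holds the entry count, and that each later line starts with the unichar followed by a space and its properties. The function name `missing_from_unicharset` is hypothetical.

```python
def missing_from_unicharset(lines, wanted):
    """Return the characters in `wanted` that appear in no unicharset entry.

    `lines` is the content of a unicharset file split into lines. The first
    line (the entry count) is skipped; the first space-separated field of
    every other line is treated as the unichar.
    """
    entries = []
    for line in lines[1:]:
        fields = line.split(" ")
        if fields and fields[0]:
            entries.append(fields[0])
    # A character counts as present if it occurs inside any entry, since an
    # entry may hold more than one code point.
    return [ch for ch in wanted if not any(ch in entry for entry in entries)]


# Tiny synthetic example; real property columns are longer than shown here.
sample = [
    "3",
    "NULL 0 Common 0",
    "\u0c2c 1 Telugu 1",
    "\u0c12 1 Telugu 2",
]
print(missing_from_unicharset(sample, ["\u0c12", "\u0c58", "\u0c59"]))
```

With a real file, the lines would come from something like `open(path, encoding="utf-8").read().splitlines()`, where `path` points at the extracted tel.lstm-unicharset.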


Shreeshrii · 2018-09-13T06:53:25Z

Please test with real text, not just syllables.


metaforte · 2018-09-13T08:29:17Z

Thank you. I will try and update

metaforte · 2018-09-17T14:38:03Z

I created a Word document with valid text, converted it to PDF and then to TIFF using ImageMagick, and ran Tesseract with the tessdata_fast trained data. Recognition was mostly okay. A newspaper clipping had some errors, but that's fine.

That said, the ambiguities described in items 1 and 7 are still a problem.
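To pin down which substitutions remain, one option is to diff the ground-truth text against the OCR output. The sketch below uses Python's standard `difflib`; the helper names `confusions` and `run_tesseract` are hypothetical, and the file paths in the usage comment are placeholders. It assumes the `tesseract` command is on the PATH and is invoked with the `-l` and `--tessdata-dir` options.

```python
import difflib
import subprocess


def confusions(expected, actual):
    """Return (expected_span, actual_span) pairs where the two texts differ."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def run_tesseract(image_path, tessdata_dir, lang="tel"):
    """Run the tesseract CLI on one image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang,
         "--tessdata-dir", tessdata_dir],
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout


# Example usage (placeholder paths, not run here):
#   truth = open("truth.txt", encoding="utf-8").read()
#   ocr = run_tesseract("page.tiff", "/path/to/tessdata_fast")
#   for want, got in confusions(truth, ocr):
#       print(repr(want), "->", repr(got))
```

Running the same comparison against both tessdata_best and tessdata_fast would show whether the item 1 and item 7 substitutions occur with both models.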

metaforte · 2018-09-17T14:38:34Z

I will do more testing and update here
