Telugu unicode ambiguities #32

Open
metaforte opened this issue Sep 13, 2018 · 5 comments

metaforte commented Sep 13, 2018

Hi,
I created a test text file, mostly made up of individual characters (see attachment), and converted it to a TIFF file using 'jTessBoxEditorFX' with the font 'Noto Sans Telugu' at 8pt. I then ran Tesseract on it using the tessdata_best Telugu trained data.
I noticed a few recognition errors. I believe these are due to ambiguous glyphs.
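For anyone who wants to reproduce the run, here is a minimal sketch of calling the Tesseract command line from Python. It assumes the `tesseract` binary is on the PATH and that `tel.traineddata` sits in a local directory; the file and directory names in the usage comment are hypothetical placeholders, not the ones used in this report.

```python
import subprocess


def build_tesseract_command(image_path, tessdata_dir, lang="tel"):
    """Build the argument list for one Tesseract run that prints text to stdout."""
    return [
        "tesseract",
        str(image_path),
        "stdout",                      # output base "stdout" prints the recognized text
        "-l", lang,                    # "tel" is the Telugu language code
        "--tessdata-dir", str(tessdata_dir),
    ]


def run_ocr(image_path, tessdata_dir, lang="tel"):
    """Run Tesseract and return the recognized text (raises if Tesseract fails)."""
    cmd = build_tesseract_command(image_path, tessdata_dir, lang)
    result = subprocess.run(cmd, capture_output=True, encoding="utf-8", check=True)
    return result.stdout


# Hypothetical usage (paths are placeholders):
# text = run_ocr("telugu_test.tif", "/path/to/tessdata_best")
```

Pointing `tessdata_dir` at a different directory is one way to compare the best and fast models on the same image.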

Ambiguity 1: Telugu has three vowels that look similar to a consonant (and a second consonant looks close enough to be confused with them as well).
vowel 1) ఒ (pronounced as 'o' in 'so')
vowel 2) ఓ (pronounced as 'oa' in 'goal')
vowel 3) ఔ (pronounced as 'ou' in 'ounce' or 'pound')

similar-looking consonant 1) బ (pronounced as 'bu' in 'bus')
consonant 2) భ (the same as above but uttered with stress and aspiration; imagine saying 'bus' as 'bhus')

Ambiguity 2: Consonant చ (pronounced as 'ch' as in 'church') is similar to another rarely used consonant ౘ (closest transliteration 'tsa')

Ambiguity 3: Consonant ర (pronounced as 'ru' as in 'run') is similar to another consonant, ఠ (a hard 't', close to the 't' in 'stone').

Ambiguity 4: Consonant జ (pronounced as 'ju' as in 'justice') is similar to another rarely used consonant, ౙ (closest transliteration 'za'), and also to ఙ ('jna').

Ambiguity 5: Consonant ఝ (pronounced as 'jha', a hard జ with aspiration) was interpreted as రు (pronounced as 'ru' in 'rupee'), which is a combination of the consonant ర ('ru') and the vowel ఉ (pronounced as 'u' in 'push').

Ambiguity 6: Vowel ఇ (pronounced as 'i' in 'ink') is close to consonant ఞ (pronounced as 'inya'). The 'inya' was not recognized at all in my test data.

Ambiguity 7: కౄ ('cru' as in 'cruel') and గౄ ('grue' as in 'gruesome') were converted to క్య ('kya') and గ్యూ ('gyoo').

Ambiguity 8: ౠ ('rroo') became బూ ('boo')
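Since several of the items above are multi-code-point sequences rather than single letters, it may help to see exactly which Unicode code points are involved. The following is a small sketch using Python's standard `unicodedata` module; the grouping dictionary is only my transcription of the eight ambiguities listed above, and the helper name `describe` is made up for illustration.

```python
import unicodedata

# Characters and sequences transcribed from the eight ambiguities above.
CONFUSABLE_GROUPS = {
    "Ambiguity 1": ["ఒ", "ఓ", "ఔ", "బ", "భ"],
    "Ambiguity 2": ["చ", "ౘ"],
    "Ambiguity 3": ["ర", "ఠ"],
    "Ambiguity 4": ["జ", "ౙ", "ఙ"],
    "Ambiguity 5": ["ఝ", "రు"],
    "Ambiguity 6": ["ఇ", "ఞ"],
    "Ambiguity 7": ["కౄ", "క్య", "గౄ", "గ్యూ"],
    "Ambiguity 8": ["ౠ", "బూ"],
}


def describe(text):
    """Return (code point, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


for label, items in CONFUSABLE_GROUPS.items():
    print(label)
    for item in items:
        parts = ", ".join(f"{cp} {name}" for cp, name in describe(item))
        print(f"  {item}: {parts}")
```

The printed names are the official Unicode character names, which do not always match the informal transliterations used in this report.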

I guess some of them could be due to my poor-quality TIFF. But I think some of the ambiguities are genuine and need to be handled.

Please help resolve these ambiguities.

tesseract-telugu.txt

Shreeshrii commented Sep 13, 2018 via email

Shreeshrii commented Sep 13, 2018 via email

metaforte commented

Thank you. I will try and update

metaforte commented Sep 17, 2018

I created a Word document with valid text, converted it to PDF and then to TIFF using ImageMagick, and ran Tesseract with the fast trained data. The scan came out mostly okay. A newspaper clipping had some errors, but that's fine.

That said, Ambiguities 1 and 7 are still a problem.
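To pin down which substitutions remain, one option is to diff the ground-truth text against the OCR output and count the mismatched spans. This is only a rough sketch using Python's standard `difflib`; the function name `confusions` is hypothetical, and the comparison works on code points rather than on whole Telugu syllables.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusions(truth, ocr):
    """Count (expected, recognized) spans where the OCR output differs from the truth."""
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions have an empty expected span; deletions an empty recognized span.
            counts[(truth[i1:i2], ocr[j1:j2])] += 1
    return counts


# Hypothetical usage with the ground-truth text and the text Tesseract produced:
# for (expected, got), n in confusions(truth_text, ocr_text).most_common(10):
#     print(repr(expected), "->", repr(got), n)
```

Because the diff is per code point, a misread conjunct may show up as a mismatch on only part of the syllable, so the counts are a starting point rather than a precise error rate.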

metaforte commented

I will do more testing and update here.
