Normalization failed / Invalid start of grapheme sequence Error While training the tesseract model #345

Open
Sanketnarkhede-10 opened this issue Jun 7, 2023 · 1 comment


Sanketnarkhede-10 commented Jun 7, 2023

Normalization failed for string 'ଜୀବନକୁ ନିବିଡ଼ ଭାବେ ଏକନ୍ୱିତ କରିଛନ୍ତି'
Invalid start of grapheme sequence:D=0xb71
Normalization failed for string 'ପରମ୍ପରାକୁ ଅବଲମ୍ୱନ କରିଛନ୍ତି, ସେତିକି ମଧ୍ୟ'
Invalid start of grapheme sequence:M=0xb48
Normalization failed for string 'ଦ୍ୱୈତ ରୂପରେ ଦେଖିଥିଲେ, ଏଠାରେ ପୁରୁଷ'
Invalid start of grapheme sequence:M=0xb47
Normalization failed for string 'ତାଙ୍କ ହୃଦୟ ବିଭୋର ହୋଇଛି ସମ୍ୱେଦନଶୀଳତାରେ;'
Invalid start of grapheme sequence:D=0xb71

I'm getting these errors while training a Tesseract OCR model for the Oriya language. Please help me resolve this issue.
I'm attaching the ground truth files.

Training environment:
tesseract 4.1.1
leptonica-1.82.0

ocr_training.zip

Collaborator

stweil commented Jun 7, 2023

Try to shorten those strings in your training data until the error messages disappear, then check what was wrong with them.

And please use the latest Tesseract version 5.3.1 instead of 4.1.1.
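
For the "check what was wrong with them" step, it can help to look at the code points around the one named in the error message (for example `D=0xb71`). The following is only a minimal sketch using Python's standard `unicodedata` module; it does not call Tesseract or reproduce its validator. The helper names `describe` and `locate` are made up for illustration, and the sample string is a hand-built sequence of escapes, not a line copied from the attached ground truth.

```python
import unicodedata


def describe(text):
    """Return one (hex code point, Unicode name) pair per character of text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def locate(text, codepoint, context=2):
    """Return (index, neighbours) for every occurrence of codepoint in text.

    neighbours describes up to `context` characters on either side of the
    hit, plus the hit itself, so the preceding virama or vowel sign is visible.
    """
    hits = []
    for i, ch in enumerate(text):
        if ord(ch) == codepoint:
            start = max(0, i - context)
            hits.append((i, describe(text[start:i + context + 1])))
    return hits


if __name__ == "__main__":
    # Hypothetical sample: U+0B28, U+0B4D, U+0B71, U+0B3F (assumed for illustration).
    sample = "\u0B28\u0B4D\u0B71\u0B3F"
    for index, neighbours in locate(sample, 0x0B71):
        print(f"0xb71 found at index {index}")
        for code, name in neighbours:
            print(f"  {code}  {name}")
```

Running this on each failing ground truth line with the code point from its error message shows which characters come directly before the rejected one, which is usually the quickest way to see what the shortened string still has in common with the original.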
