What should be the norm_mode for different languages? #99

Open
girikum opened this issue Nov 29, 2022 · 0 comments

girikum commented Nov 29, 2022

I see that norm_mode is defined with the following values in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/unicharset_extractor.cpp#L103:

1 - combine graphemes (use for Latin and other simple scripts)
2 - split graphemes (use for Indic/Khmer/Myanmar)
3 - pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
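
To make the question concrete, here is a minimal sketch of the mapping exactly as the help text above states it. The names `NORM_MODE_BY_SCRIPT` and `suggested_norm_mode` are hypothetical and exist only for illustration; this restates the three quoted lines and does not reflect what the tesstrain Makefile actually does.

```python
# Hypothetical lookup table: a direct restatement of the quoted help text,
# not the logic used by tesstrain or tesseract.
NORM_MODE_BY_SCRIPT = {
    "Latin": 1,    # combine graphemes
    "Indic": 2,    # split graphemes
    "Khmer": 2,
    "Myanmar": 2,
    "Arabic": 3,   # pure unicode
    "Hebrew": 3,
    "Thai": 3,
    "Tibetan": 3,
}


def suggested_norm_mode(script: str) -> int:
    """Return the norm_mode the help text suggests for a script name.

    Raises KeyError for scripts the help text does not mention, since
    the correct value for those is exactly what this issue asks about.
    """
    return NORM_MODE_BY_SCRIPT[script]
```

Under this reading, English (Latin script) would get 1, which is why the Makefile defaults look surprising.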

Can someone clarify in the documentation the exact mapping for all the available languages in the tessdata repos?

It is pretty confusing to me that the NORM_MODE defined in the tesstrain Makefile almost never uses the value meant for Latin languages: https://github.com/tesseract-ocr/tesstrain/blob/main/Makefile#L86-L101

Should norm_mode be 2 even for English according to the Makefile?
