What should be the norm_mode for different languages? #99

girikum · 2022-11-29T21:08:47Z

I see that norm_mode is defined with the following values in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/unicharset_extractor.cpp#L103

1 - combine graphemes (use for Latin and other simple scripts)
2 - split graphemes (use for Indic/Khmer/Myanmar)
3 - pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
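
As a reading aid, the help text above can be restated as a small lookup. This is only a sketch: the names `NORM_MODE_BY_SCRIPT` and `suggested_norm_mode` are hypothetical and not part of Tesseract or tesstrain, and the entries simply mirror the three lines quoted above. "Latin" stands in for "Latin and other simple scripts", which the help text does not enumerate.

```python
# Hypothetical lookup that restates the unicharset_extractor help text.
# Nothing here is a Tesseract API; the names are made up for illustration.
NORM_MODE_DESCRIPTIONS = {
    1: "combine graphemes",
    2: "split graphemes",
    3: "pure unicode",
}

# Only the scripts named explicitly in the help text are listed.
NORM_MODE_BY_SCRIPT = {
    "Latin": 1,
    "Indic": 2,
    "Khmer": 2,
    "Myanmar": 2,
    "Arabic": 3,
    "Hebrew": 3,
    "Thai": 3,
    "Tibetan": 3,
}


def suggested_norm_mode(script):
    """Return the norm_mode the help text suggests, or None if the script is not listed."""
    return NORM_MODE_BY_SCRIPT.get(script)


if __name__ == "__main__":
    for name in ("Latin", "Khmer", "Thai", "Cyrillic"):
        mode = suggested_norm_mode(name)
        label = NORM_MODE_DESCRIPTIONS.get(mode, "not covered by the help text")
        print(f"{name}: {mode} ({label})")
```

Any script that the help text does not name comes back as `None`, which is the gap this issue is asking the documentation to fill.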

Can someone clarify in the documentation the exact mapping for all the available languages in the tessdata repos?

It is confusing to me that the NORM_MODE defined in the tesstrain Makefile almost never takes the value recommended for Latin scripts (1). https://github.com/tesseract-ocr/tesstrain/blob/main/Makefile#L86-L101

According to the Makefile, should norm_mode be 2 even for English?
