
Tesseract fails to recognize % sign in Hungarian language texts #40

Open
Googulator opened this issue Sep 18, 2019 · 8 comments
@Googulator

When the language is set to "hun" (Hungarian), Tesseract is unable to recognize the % sign. This sign is very commonly used in Hungarian to represent percentages, the same way as in English. Tesseract instead sees various letters and digits - most commonly "96", sometimes "9", "69", "0", "S", "Z", or even nothing at all.

Even if I feed a generated image containing a % sign in large black type on a pure white background, I still can't get Tesseract to output the % sign, as long as the language is set to Hungarian.

Both the "fast" and "best" models suffer from this problem.

If I instead set the language to English, % signs are recognized without issue.
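The behaviour described above can be reproduced with a sketch along these lines (file names and the rendered text are illustrative; assumes ImageMagick and tesseract are installed and the hun/eng models are available):

```shell
# Render a percent sign in large black type on a white background,
# then OCR the same image with the Hungarian and English models.
convert -background white -fill black -pointsize 96 label:"42%" percent.png

tesseract percent.png stdout -l hun   # per the report, '%' comes back wrong
tesseract percent.png stdout -l eng   # '%' is recognized
```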

@stweil
Contributor

stweil commented Sep 18, 2019

See hun.unicharset which shows all known characters.

The percent sign was not part of the training data set, so Tesseract simply does not know that character with hun.

This can only be solved by new training, either from scratch or by fine-tuning the existing, incomplete model (fine-tuning can add new characters).
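A fine-tuning run to add a character would look roughly like the sketch below, using the standard lstmtraining tool. All paths, file names, and the iteration count are placeholders, and generating the .lstmf training data listed in the train_listfile is a separate step not shown here:

```shell
# Extract the LSTM model from the existing traineddata to continue from.
combine_tessdata -e hun.traineddata hun.lstm

# Fine-tune it on new material that includes the '%' character.
# --old_traineddata lets the trainer map the old charset onto the new one.
lstmtraining \
  --model_output output/hun_plus_percent \
  --continue_from hun.lstm \
  --traineddata hun_new/hun.traineddata \
  --old_traineddata hun.traineddata \
  --train_listfile train/hun.training_files.txt \
  --max_iterations 3600
```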

An alternative would be to use the script/Latin model, which supports all Western European languages based on Latin script.
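Assuming script/Latin.traineddata has been placed under the tessdata directory, it is selected like any other language model (file name as distributed in the tessdata repositories; the image name is illustrative):

```shell
# Use the script-level Latin model instead of the hun language model.
tesseract percent.png stdout -l script/Latin
```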

@Googulator
Author

A model "which supports all Western European languages" is not an option for Hungarian because of ő and ű, which do not occur in any Western European language.

@Shreeshrii
Contributor

Have you tried with -l hun+eng?
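Combining models is done by joining the language codes with +, which makes Tesseract consider both character sets (image name illustrative):

```shell
# Run with the Hungarian and English models combined.
tesseract percent.png stdout -l hun+eng
```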

@stweil
Contributor

stweil commented Sep 20, 2019

Latin.unicharset includes both characters, so I suggest trying it. I have updated my previous comment.

@stweil stweil added the bug label Sep 20, 2019
@Shreeshrii
Contributor

@stweil, the more relevant unicharset would be the lstm-unicharset extracted from the script/Latin traineddata file. Latin.unicharset may be a superset.

@stweil
Contributor

stweil commented Sep 23, 2019

You are right, it is not identical, but that one also includes both characters.

@Googulator
Author

How do I test Latin.unicharset? Do I need to train a new model?

@stweil
Contributor

stweil commented Oct 24, 2019

Just get Latin.traineddata and extract Latin.unicharset using combine_tessdata. Then load Latin.unicharset in your editor and check whether it contains all relevant characters.
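The steps above can be sketched as follows. combine_tessdata's -e mode extracts the components named by the output file extension; each line of a unicharset file starts with the glyph itself, so a simple grep serves as the check (file paths are illustrative):

```shell
# Extract the LSTM unicharset from the packed traineddata file.
combine_tessdata -e Latin.traineddata Latin.lstm-unicharset

# Check that the characters in question are listed (one glyph per line,
# at the start of the line). Prints the matching entries.
grep -e '^%' -e '^ő' -e '^ű' Latin.lstm-unicharset
```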
