Tesseract fails to detect letters Å and å in Finnish language. #31

jmokoistinen · 2019-11-13T16:07:48Z

I am testing Tesseract on Finnish texts containing the "Swedish o" (å). It does not seem to recognize Å and å correctly. I have also tried the fin+swe model, but usually the fin model's version of the text is selected.

Are the previous training files available somewhere? Probably the training data does not contain enough Åå cases, or the letter is not included at all even though it is an official letter of the alphabet.

stweil · 2019-12-17T18:14:38Z

See the list of known characters (unicharset). The data for fin in langdata_lstm needs to be fixed. Do you want to send a fix (pull request)?

I'm moving the issue to langdata_lstm.

jmokoistinen · 2020-02-12T10:52:18Z

Yes, what should I do to make it happen? Should I collect some data and box it with some tool? Where can I get the current data? I cannot see any images here: https://github.com/tesseract-ocr/langdata_lstm/tree/master/fin

I guess the training is done with synthetic text generated from those files? How many examples of å and Å should there be? Does anything else need to be modified, or just the training_text, singles_text and desired characters? (Are there any rules on how exactly?)

jmokoistinen · 2020-03-02T12:37:07Z

The letters Q and q also seem to be missing from the data. At a minimum, all of these characters should be included:
abcdefghijklmnopqrstuvwxyzåäö
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
1234567890
How can these be fixed?

I checked all of the characters; only Åå and Qq are missing. Is it enough to modify fin.training_text so that it contains some number N of each missing letter, or do I need to modify something else as well?
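
A check like the one described above can be sketched in Python. This is only an illustration, not part of the Tesseract tooling: the function names and the `minimum` threshold are hypothetical, and the text is assumed to have been read from a file such as fin.training_text as UTF-8.

```python
import unicodedata
from collections import Counter

# Characters listed in the comment above (letters and digits).
REQUIRED = (
    "abcdefghijklmnopqrstuvwxyzåäö"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"
    "1234567890"
)


def count_required(text, required=REQUIRED):
    """Return {character: number of occurrences in text} for each required character."""
    # Normalize to NFC so that "A" + combining ring is counted as "Å".
    counts = Counter(unicodedata.normalize("NFC", text))
    return {ch: counts[ch] for ch in required}


def underrepresented(text, minimum=1, required=REQUIRED):
    """Return the required characters that occur fewer than `minimum` times."""
    return [ch for ch, n in count_required(text, required).items() if n < minimum]


if __name__ == "__main__":
    sample = "Åbo on Turun ruotsinkielinen nimi."
    print(underrepresented(sample))
```

To run it on real data, read the training text with `open(path, encoding="utf-8").read()` and pass the result to `underrepresented`, choosing `minimum` to match however many examples turn out to be needed.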

stweil · 2020-03-02T21:54:20Z

I'd add all desired characters to desired_characters, ideally sorted with LANG=C.UTF-8 sort. Then we at least have a list of those characters and can try to find training texts which include them sufficiently often.
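
As a rough sketch of that step in Python: the helper below merges new characters into an existing list and sorts the result by code point, which gives the same order as byte-wise sorting of UTF-8 text. The function name is hypothetical, and the one-character-per-line layout is an assumption about the file, not something confirmed in this thread.

```python
def merge_desired(existing_lines, extra_chars):
    """Merge extra characters into an existing list and sort by code point.

    Assumes one character per line (an assumption about the file layout).
    """
    chars = {line.strip() for line in existing_lines if line.strip()}
    chars.update(extra_chars)
    # Python compares str values by code point, which matches the
    # byte order of their UTF-8 encodings.
    return sorted(chars)


if __name__ == "__main__":
    current = ["a", "ä", "b"]
    for ch in merge_desired(current, "ÅåQq"):
        print(ch)
```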

To fix the problem, we still have to run new training ...

stweil transferred this issue from tesseract-ocr/tessdata_best Dec 17, 2019
