Missing many special characters in desired_characters file (Swedish) #4

Open
aslamy opened this issue Nov 24, 2018 · 9 comments

Comments

@aslamy
Contributor

aslamy commented Nov 24, 2018

The desired_characters file does not contain many important special characters, such as "@".
All special characters used in English are also important for Swedish.
Legal documents contain the section sign (§). Please add this as well.

@stweil
Contributor

stweil commented Nov 26, 2018

From tesseract-ocr/tesseract#2075:

It's also possible to use script/Latin for Swedish. That should contain all characters.

@stweil
Contributor

stweil commented Nov 26, 2018

Only symbols included in swe.unicharset will be detected during OCR. If a symbol is missing, it can be added by fine-tuning the model.

Adding symbols to the desired_characters files helps future training runs, so the symbols won't be missed then, but it does not change existing models.
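
To check which symbols an existing model actually lacks, the unicharset can be inspected directly. The following is a minimal sketch, not part of the Tesseract tooling: it assumes the unicharset has already been extracted from the traineddata file (for example with `combine_tessdata -u`), that its first line is the entry count, and that every following line starts with the symbol followed by space-separated properties. The function names and the file name in the usage note are hypothetical.

```python
from pathlib import Path


def read_unicharset(path):
    """Return the set of symbols listed in a Tesseract unicharset file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    symbols = set()
    # Assumed layout: the first line holds the number of entries; each
    # following line starts with the symbol, then its properties.
    for line in lines[1:]:
        if not line.strip():
            continue
        symbol = line.split(" ", 1)[0]
        # "NULL" is the placeholder entry that stands for the space character.
        symbols.add(" " if symbol == "NULL" else symbol)
    return symbols


def missing_symbols(unicharset_path, wanted):
    """Return the wanted symbols that the unicharset does not contain, sorted."""
    return sorted(set(wanted) - read_unicharset(unicharset_path))
```

Calling `missing_symbols("swe.lstm-unicharset", "@§")` would then list whichever of the two symbols the extracted Swedish unicharset does not contain.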

@amitdo

amitdo commented Nov 26, 2018

The desired_characters file is used for the training done by Google. The Tesseract training tools available at https://github.com/tesseract-ocr/tesseract do not use it.

@Kalle12345

@amitdo should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ? Is there any easier way? A training GUI for tesseract 4?

@amitdo

amitdo commented Nov 28, 2018

should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ?

That's supposed to be the way...
but it's not so easy.

Is there any easier way? A training GUI for tesseract 4?

I don't know.

@poizan42
Contributor

poizan42 commented Jan 10, 2020

The current Danish traineddata has the same issue. Really, the Danish character set should be exactly the same as the Swedish one except for ö->ø and ä->æ (I see that specifically '@' was added recently to desired_characters, but no new training data was generated).
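
The ö->ø and ä->æ correspondence can be sketched as a small translation step that derives a Danish character list from a Swedish one. This is only an illustration: the function name is hypothetical, the input is assumed to be an iterable of single characters, and the uppercase pairs are an added assumption not stated in the comment above.

```python
# Lowercase pairs come from the comment above (ö->ø, ä->æ).
# The uppercase pairs are an added assumption.
SWE_TO_DAN = str.maketrans({"ö": "ø", "ä": "æ", "Ö": "Ø", "Ä": "Æ"})


def danish_from_swedish(swedish_chars):
    """Map a Swedish character list to Danish, keeping order and dropping duplicates."""
    translated = (ch.translate(SWE_TO_DAN) for ch in swedish_chars)
    # dict.fromkeys preserves first-seen order while removing repeats.
    return list(dict.fromkeys(translated))
```

For example, `danish_from_swedish(["å", "ä", "ö", "@", "§"])` keeps å, @ and § unchanged and swaps only the two letters that differ.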

@stweil
Contributor

stweil commented Jan 10, 2020

@poizan42, I suggest creating a pull request which adds the missing characters to the list of desired characters.

You can try the script/Latin model which should already support all Danish characters, or you could enhance the existing dan.traineddata, either by fine-tuning (see link above) or by using tesstrain. I prefer tesstrain because I found it easier to use.
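
Trying the script/Latin model amounts to passing it as the language on the command line. Below is a minimal sketch that builds and runs such a call from Python. It assumes the `tesseract` executable is on the PATH and that `script/Latin.traineddata` is present in the tessdata directory; the helper names and the file names in the usage note are hypothetical.

```python
import subprocess


def tesseract_command(image, output_base, lang="script/Latin", tessdata_dir=None):
    """Build the argument list for a tesseract run with the given model."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Point tesseract at a directory that contains script/Latin.traineddata.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image, output_base, **kwargs):
    """Run tesseract and raise CalledProcessError if it exits with an error."""
    subprocess.run(tesseract_command(image, output_base, **kwargs), check=True)
```

A call such as `run_ocr("page.png", "page")` would write the recognized text to `page.txt`; passing `lang="dan"` instead allows a side-by-side comparison with the language-specific model.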

@poizan42
Contributor

@stweil, I have created a PR in #34

@stweil
Contributor

stweil commented Jan 12, 2020

I merged that PR now, thanks. Please note that we cannot expect new training done by Google, so it is up to the Open Source community (= you, me, ...) to use the fixed information and train new models.


5 participants