Danish traineddata file doesn't include the "@" character #29

Open
Furtifk opened this issue Nov 29, 2019 · 9 comments
Labels
bug, enhancement, help wanted

Furtifk commented Nov 29, 2019

Environment

Current Behavior: Danish traineddata file doesn't include the "@" character

Expected Behavior: Danish traineddata file should include the "@" character

Suggested Fix: Danish traineddata file should include the "@" character

File to run OCR on:
Screenshot_572

To reproduce, I have a zip file I can send so that you can run a VERY basic test which displays both results, comparing the eng and dan traineddata output. Whoever looks into the issue, please contact me to receive it.

This is quite a pressing issue, so any response is appreciated.

stweil commented Nov 29, 2019

That's a problem of the model (traineddata), not of Tesseract. See dan.unicharset for a list of supported characters.

If you want, you can send a pull request which fixes the list of desired characters.
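
For anyone who wants to check this locally, here is a minimal sketch of how the character list could be inspected. It assumes the unicharset has already been extracted from the model as a text file (for example with `combine_tessdata -u`), and that the file follows the usual layout: an entry count on the first line, then one entry per line starting with the character, with `NULL` standing in for the space. The `SAMPLE` content and the function names are hypothetical and only illustrate the idea.

```python
def unicharset_chars(text):
    """Return the set of characters listed in the content of a unicharset file."""
    chars = set()
    # The first line holds the number of entries; every later line
    # starts with the character, followed by its properties.
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        unichar = line.split(" ", 1)[0]
        # Assumption: the entry "NULL" represents the space character.
        chars.add(" " if unichar == "NULL" else unichar)
    return chars


def missing_chars(text, wanted):
    """List the characters from `wanted` that the unicharset does not contain."""
    present = unicharset_chars(text)
    return [c for c in wanted if c not in present]


# Hypothetical, heavily shortened unicharset content for illustration only.
SAMPLE = """4
NULL 0 Common 0
a 3 Latin 1
ø 3 Latin 2
. 10 Common 3
"""

print(missing_chars(SAMPLE, "@§ø"))  # → ['@', '§']
```

With a real file, the content would be read with something like `open(path, encoding="utf-8").read()` and passed to `missing_chars` together with the characters the documents need.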

stweil transferred this issue from tesseract-ocr/tesseract Nov 29, 2019
stweil commented Nov 29, 2019

There won't be a fixed dan.traineddata soon. I suggest trying Latin.traineddata for your case.

stweil added the bug, enhancement, and help wanted labels Nov 29, 2019
Furtifk commented Nov 29, 2019

@stweil Thanks for the response.
I will try the Latin traineddata, although the document I need to read does not yield correct results with a combination of the eng + dan traineddata files, so I'm not confident this will work. Getting good OCR results for Danish documents seems to be a hassle when not using the Danish dictionary file.
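
For comparing the options discussed so far, a small sketch of how the Tesseract command line could be built from Python may help. It relies only on the standard `tesseract IMAGE stdout -l LANGS` invocation and the `--tessdata-dir` option; the image name `scan.png` is hypothetical, and the language name to pass for the Latin script model depends on where `Latin.traineddata` is installed (`Latin` if it sits directly in the tessdata directory, `script/Latin` if it sits in a `script` subdirectory), which is an assumption about the local setup.

```python
import shutil
import subprocess


def tesseract_command(image_path, languages, tessdata_dir=None):
    """Build the argument list for one OCR run that prints text to stdout."""
    # Several models are combined by joining their names with "+".
    cmd = ["tesseract", image_path, "stdout", "-l", "+".join(languages)]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path, languages, tessdata_dir=None):
    """Run Tesseract and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, languages, tessdata_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical image name; shows the command for the eng + dan combination.
print(tesseract_command("scan.png", ["dan", "eng"]))
```

Calling `run_ocr("scan.png", ["Latin"])` and `run_ocr("scan.png", ["dan", "eng"])` on the same image and searching both outputs for "@" gives a quick side-by-side check.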

stweil commented Nov 29, 2019

It is possible to enhance the existing dan.traineddata with missing characters by additional training, so you could try to fix it yourself. Here is a description of how this was done for Fraktur. You'll need pairs of line images and text files with a transcription.
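
Before starting such a training run, it can be worth checking that the ground truth is complete and actually contains the missing characters. The sketch below assumes the naming convention used by the tesstrain project, where each line image (`.png` or `.tif`) is paired with a transcription named `<same stem>.gt.txt`; the function name and the directory layout are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".tif"}


def summarize_ground_truth(directory, wanted_chars):
    """Return (images without transcription, per-character line counts)."""
    unpaired = []
    counts = {c: 0 for c in wanted_chars}
    for image in sorted(Path(directory).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Assumed convention: line1.png is transcribed in line1.gt.txt.
        transcription = image.parent / (image.stem + ".gt.txt")
        if not transcription.exists():
            unpaired.append(image.name)
            continue
        text = transcription.read_text(encoding="utf-8")
        # Count how many transcribed lines contain each wanted character.
        for c in wanted_chars:
            if c in text:
                counts[c] += 1
    return unpaired, counts
```

A call such as `summarize_ground_truth("ground-truth", "@§")` then shows whether any line image lacks a transcription and how many lines would teach the model each new character.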

Furtifk commented Dec 3, 2019

> It is possible to enhance the existing dan.traineddata with missing characters by additional training, so you could try to fix it yourself. Here is a description of how this was done for Fraktur. You'll need pairs of line images and text files with a transcription.

Thank you for your response. I do not think this is a viable option for me but thanks for your reply and for the information!

poizan42 commented

It lacks '§' as well which is used in every single legal document in existence...

stweil commented Jan 10, 2020

@Furtifk, @poizan42, especially for older Danish texts you could also try one of the models which I trained recently, for example Fraktur_50000000.502_198857.traineddata.

It was trained based on script/Fraktur with lots of historic documents, and in my experience it works well although I did not add a dictionary. You will therefore get a warning at runtime, but you could add a Danish dictionary if needed.

Furtifk commented Mar 12, 2020

Have there been any improvements recently with the Danish dictionary?

stweil commented Mar 18, 2020

No, and I am afraid there won't be an improvement unless someone works on it.
