Danish traineddata file doesn't include the "@" character #29

Furtifk · 2019-11-29T15:09:16Z

Environment

Tesseract Version: 5.00
Downloaded from: https://github.com/UB-Mannheim/tesseract/wiki
Platform: Windows 10 64bit

Current Behavior: Danish traineddata file doesn't include the "@" character

Expected Behavior: Danish traineddata file should include the "@" character

Suggested Fix: Danish traineddata file should include the "@" character

File to run OCR on:

To reproduce this, I have a zip file I can send so you can run a very basic test that shows the eng and dan traineddata results side by side. Whoever looks into the issue, please contact me to receive it.

This is quite a pressing issue, so any response is appreciated.

stweil · 2019-11-29T15:52:21Z

That's a problem of the model (traineddata), not of Tesseract. See dan.unicharset for a list of supported characters.

If you want, you can send a pull request which fixes the list of desired characters.
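
For anyone who wants to check this locally, here is a minimal sketch of such a lookup. It assumes the unicharset has already been unpacked from the traineddata into a text file (typically with `combine_tessdata -u`), and that the file follows the documented layout: an entry count on the first line, then one line per unichar starting with the character itself. The function names and the file path are hypothetical.

```python
from pathlib import Path


def load_unichars(unicharset_path):
    """Return the set of unichars listed in a Tesseract unicharset text file."""
    lines = Path(unicharset_path).read_text(encoding="utf-8").splitlines()
    chars = set()
    # The first line holds the entry count; each later line starts with the
    # unichar, followed by a space and its properties.
    for line in lines[1:]:
        if line.strip():
            chars.add(line.split(" ", 1)[0])
    return chars


def missing_chars(unicharset_path, wanted):
    """Return the characters in `wanted` that the unicharset does not list."""
    available = load_unichars(unicharset_path)
    return [ch for ch in wanted if ch not in available]
```

Calling `missing_chars("dan.unicharset", "@§")` on the unpacked file would list whichever of the two characters the model cannot output.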

stweil · 2019-11-29T15:54:11Z

There won't be a fixed dan.traineddata soon. I suggest trying Latin.traineddata for your case.

Furtifk · 2019-11-29T19:47:06Z

@stweil Thanks for the response.
I will try the Latin traineddata, although the document I need to read does not yield correct results when I use a combination of the eng and dan traineddata files, so I'm not confident this will work. Getting good OCR results for Danish documents seems to be a hassle when not using the Danish dictionary file.
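
For reference, a rough sketch of how the two setups can be compared from a script. It relies on the `tesseract` command line accepting several models joined with `+` after `-l` and writing to standard output when the output base is `stdout`. The helper names are hypothetical, and whether the script model is addressed as `Latin` or `script/Latin` depends on where the file sits in the local tessdata directory.

```python
import subprocess


def build_tesseract_command(image_path, languages):
    """Build a tesseract CLI call that writes the recognized text to stdout."""
    # Several models can be combined by joining their names with "+".
    return ["tesseract", str(image_path), "stdout", "-l", "+".join(languages)]


def run_ocr(image_path, languages):
    """Run tesseract and return the recognized text (needs tesseract on PATH)."""
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Running `run_ocr("page.png", ["dan", "eng"])` and `run_ocr("page.png", ["Latin"])` on the same image makes it easy to see which setup recovers "@".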

stweil · 2019-11-29T20:56:55Z

It is possible to enhance the existing dan.traineddata with missing characters by additional training, so you could try to fix it yourself. Here is a description of how this was done for Fraktur. You'll need pairs of line images and text files with a transcription.
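
Before starting such a training run, it can help to verify that every line image has a transcription and that the missing characters actually occur in the transcriptions. The sketch below assumes the pairs live in one directory and use the `name.png` / `name.gt.txt` naming that the tesstrain tooling expects. The function name and the default suffix are hypothetical.

```python
from pathlib import Path


def check_training_pairs(directory, required_chars, image_suffix=".png"):
    """Return (images lacking a transcription, transcription count per character)."""
    directory = Path(directory)
    missing_text = []
    counts = {ch: 0 for ch in required_chars}
    for image in sorted(directory.glob(f"*{image_suffix}")):
        # Assumed convention: line1.png is transcribed in line1.gt.txt.
        text_file = image.parent / (image.stem + ".gt.txt")
        if not text_file.exists():
            missing_text.append(image.name)
            continue
        text = text_file.read_text(encoding="utf-8")
        for ch in required_chars:
            if ch in text:
                counts[ch] += 1
    return missing_text, counts
```

A character with a count of zero would not be learned, however long the training runs.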

Furtifk · 2019-12-03T14:40:20Z

> It is possible to enhance the existing dan.traineddata with missing characters by additional training, so you could try to fix it yourself. Here is a description of how this was done for Fraktur. You'll need pairs of line images and text files with a transcription.

Thank you for your response. I do not think this is a viable option for me but thanks for your reply and for the information!

poizan42 · 2020-01-10T12:17:13Z

It lacks '§' as well which is used in every single legal document in existence...

stweil · 2020-01-10T12:27:23Z

@Furtifk, @poizan42, especially for older Danish texts you could also try one of the models which I trained recently, for example Fraktur_50000000.502_198857.traineddata.

It was trained based on script/Fraktur with lots of historic documents, and in my experience it works well although I did not add a dictionary. You will therefore get a warning at runtime, but you could add a Danish dictionary if needed.

Furtifk · 2020-03-12T12:32:27Z

Have there been any improvements recently with the Danish dictionary?

stweil · 2020-03-18T20:49:41Z

No, and I am afraid there won't be an improvement unless someone works on it.

stweil transferred this issue from tesseract-ocr/tesseract Nov 29, 2019

stweil added the bug, enhancement, and help wanted labels Nov 29, 2019
