English traineddata file does not contain the '±' character? #48

Furtifk · 2022-10-26T14:54:09Z

Environment
Tesseract Version: 5.00
Downloaded from: https://github.com/UB-Mannheim/tesseract/wiki
Platform: Windows 10 64bit

I am trying to OCR using the English traineddata file found at:
https://tesseract-ocr.github.io/tessdoc/Data-Files
I notice the character is not included in the unicharset either:
https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/eng.unicharset

Are there any plans to add it? Are there any language files that contain this character and can successfully OCR it?

Many thanks to whoever can assist here. The file I used to test this character is attached: https://github.com/tesseract-ocr/langdata_lstm/files/9870674/Special.Symbols.pdf
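For anyone who wants to repeat the unicharset check above without scanning the file by eye, here is a minimal Python sketch. It assumes the usual unicharset layout (first line is the entry count, every later line starts with the character followed by a space and its properties). The function names and the sample text are made up for illustration; the sample is simplified and is not the content of the real eng.unicharset.

```python
def unicharset_chars(text):
    """Collect the first field of every entry line in unicharset text."""
    lines = text.splitlines()
    chars = set()
    # Assumption: line 0 is the entry count, so entries start at line 1.
    for line in lines[1:]:
        if not line.strip():
            continue
        # Assumption: the character is everything before the first space.
        chars.add(line.split(" ", 1)[0])
    return chars


def has_char(text, char):
    """True if `char` appears as an entry in the unicharset text."""
    return char in unicharset_chars(text)


# Hypothetical, simplified sample; real entries carry more properties.
SAMPLE = "3\nNULL 0 Common 0\na 3 Latin 1\n+ 0 Common 2\n"

if __name__ == "__main__":
    for ch in ("+", "±"):
        print(repr(ch), "present" if has_char(SAMPLE, ch) else "missing")
```

To check a downloaded file, read it with `open(path, encoding="utf-8").read()` and pass the text to `has_char`.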

amitdo · 2022-10-26T15:14:49Z

> Are there any plans to add it?

The best/fast models were uploaded 5 years ago. AFAIK, no one is working on updating them.

Furtifk · 2022-10-26T15:31:12Z

Thanks for the information and the fast reply. Would you know of any workaround I could use to OCR this character?

Many thanks ahead of time ^^

stweil · 2022-10-26T15:49:28Z

The official script/Latin model includes ±. You could also try any of my models from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/, for example https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021_09/tessdata_fast/frak2021-09.traineddata.

Furtifk · 2022-10-26T16:19:07Z

> The official script/Latin model includes ±. You could also try any of my models from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/, for example https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021_09/tessdata_fast/frak2021-09.traineddata.

Thanks a lot. I will try this and let you know here if it does indeed work for us going forward.

Furtifk · 2022-10-27T08:14:30Z

After further testing, it appears that both lat.traineddata (https://tesseract-ocr.github.io/tessdoc/Data-Files) and your own model struggle to recognize this character in my example.
Is lat.traineddata, as linked above, the Latin model you meant? If not, where can I find and download the right one to try?

Many thanks!

stweil · 2022-10-27T08:48:34Z

lat.traineddata is a different model. script/Latin is in https://github.com/tesseract-ocr/tessdata_fast/tree/main/script. Or simply re-run the installer and select it there for installation.
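To make the difference concrete: lat is the Latin language model, while script/Latin is the Latin script model and is selected by its path under the tessdata directory. Below is a small Python sketch that only builds the command line. The helper name, the image file name, and the tessdata path are placeholders; `-l` and `--tessdata-dir` are the standard Tesseract CLI options. It assumes Latin.traineddata sits in a `script` subfolder of the tessdata directory.

```python
import shlex


def build_tesseract_command(image_path, lang="script/Latin", tessdata_dir=None):
    """Build (but do not run) a tesseract command that prints text to stdout."""
    # Usage pattern: tesseract <image> <outputbase> [options]
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Assumption: <tessdata_dir>/script/Latin.traineddata exists.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # "page.png" is a placeholder image file name.
    print(shlex.join(build_tesseract_command("page.png")))
    print(shlex.join(build_tesseract_command("page.png", lang="lat")))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` and the output checked with `"±" in result.stdout`.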

Furtifk · 2022-10-27T09:18:20Z

Thanks for the link. I have tried the Latin.traineddata model, but I'm still not having much luck getting this character from either the test file or our internal files.
I'm guessing there's not much else that can be done here? Thanks for the help and suggestions nonetheless.
