Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training data should include bullet-like characters #45

Open
wollmers opened this issue Aug 17, 2021 · 0 comments
Open

Training data should include bullet-like characters #45

wollmers opened this issue Aug 17, 2021 · 0 comments

Comments

@wollmers
Copy link

Modern texts especially business documents contain bullet-like symbols e. g. for lists. Also middle dot is used with some frequency. While the recognition results for eng and deu are nearly perfect, the results for these symbols are "random".

For a next release of trained models the training data should be improved in this direction and maybe other symbols as well.

Test image:

bullets

Tesseract result with -l eng:

List of vehicles:
* Trucks
* vans
* bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrrader

Result with -l deu:

List of vehicles:
« Trucks
« vans
+ bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrräder

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant