Trailing spaces on line 27 of eng.punc #28

Open
juliangilbey opened this issue Nov 9, 2019 · 4 comments
Labels: question (Further information is requested)

Comments

@juliangilbey

I've not yet worked out whether eng.punc is used by the LSTM mode of tesseract, but I discovered that there are two trailing spaces on line 27 of this file, which might cause the occasional problem.

@stweil
Contributor

stweil commented Nov 10, 2019

Which occasional problem are you referring to? If there is a problem, you can create a new traineddata file without those spaces and see whether that fixes the problem.

@stweil
Contributor

stweil commented Nov 10, 2019

Link to line 27 in file eng.punc. The trailing spaces are also present in eng.traineddata, where they occur on 17 lines. Other languages appear to have them, too.
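A count like this can be reproduced on any extracted pattern list by scanning for lines that end in whitespace. The sketch below is only an illustration: the function name `lines_with_trailing_whitespace` and the sample text are made up for this example, and it assumes the punc patterns are available as plain text with one pattern per line.

```python
def lines_with_trailing_whitespace(text):
    """Return 1-based numbers of lines whose content ends in a space or tab."""
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        # Drop only a carriage return so CRLF files are not flagged wholesale.
        content = line.rstrip("\r")
        if content.endswith((" ", "\t")):
            hits.append(number)
    return hits


# Hypothetical sample, not the real eng.punc contents.
sample = "a\nb \n(c\t\nd"
print(lines_with_trailing_whitespace(sample))
```

Note that this flags every line ending in whitespace, so it does not by itself distinguish an intentional single trailing space from an accidental extra one.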

@stweil
Contributor

stweil commented Nov 10, 2019

LSTM and legacy mode use different punc components from the traineddata file, but both have the trailing spaces.

stweil added the question label (Further information is requested) on Nov 10, 2019
@juliangilbey
Author

AFAICT, the space on each line indicates where "word characters" ("alphanumerics" for lack of a better term right now - non-punctuation symbols) are expected to appear. So line 1 has a single space, indicating a sequence of [A-Z...] with no punctuation, and other lines have a trailing space to indicate initial punctuation followed by word characters. Except for line 27, every line has precisely one space. I hope that makes sense.

I haven't detected an actual problem yet, but any such problem would likely be very subtle.
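The "precisely one space per line" observation above can be checked mechanically. This is a minimal sketch under that assumption (the rule is inferred from the file, not taken from tesseract documentation); the function name `lines_breaking_one_space_rule` and the sample patterns are hypothetical.

```python
def lines_breaking_one_space_rule(text):
    """Return (line_number, space_count) for each line without exactly one space."""
    lines = text.split("\n")
    # A file ending in a newline yields one empty final element; ignore it.
    if lines and lines[-1] == "":
        lines.pop()
    offenders = []
    for number, line in enumerate(lines, start=1):
        count = line.count(" ")
        if count != 1:
            offenders.append((number, count))
    return offenders


# Hypothetical patterns: the third line carries two trailing spaces.
sample = ' \n( \n"  \n'
print(lines_breaking_one_space_rule(sample))
```

Running it over a copy of the pattern file should list only the lines that deviate from the convention, such as line 27 in the case reported here.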
