Running Tesseract 5 training and how I solved the issues I found #341

Open
mvfpoa opened this issue Apr 11, 2023 · 1 comment
Labels
stale Issues which require input by the reporter which is not provided

Comments


mvfpoa commented Apr 11, 2023

Hi there.

I just want to share how I managed to run Tesseract training with tesstrain on version 5. It might help others, and I hope it can be used to improve the documentation.

This was my first attempt at Tesseract training; I had never done it before.

I cloned tesseract from git at tag 5.3 and was able to build it exactly as documented here: https://github.com/tesseract-ocr/tessdoc/blob/main/Compiling-–-GitInstallation.md

I performed the installation on Ubuntu running on WSL.

I cloned the latest tesstrain from git and followed this page:
https://github.com/tesseract-ocr/tesstrain

That document recommends (https://github.com/tesseract-ocr/tesstrain#provide-ground-truth) trying the training with the ocrd-testset.zip files. I unzipped the contents into a folder named 'data/foo-ground-truth/'. I had created the 'data' folder myself to hold the files fetched by make tesseract-langdata, as stated in the document.

So I ran make training, and the result was a lot of error messages:
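Before starting a run, a quick sanity check of the unzipped folder can be sketched as below. It assumes, as the tesstrain README describes, that every transcription file `NAME.gt.txt` sits next to a line image with the same base name. The function name `unpaired_gt` is hypothetical, and this sketch only looks for `.tif` and `.png` images.

```shell
# Hypothetical helper: print every *.gt.txt file in a ground-truth folder
# that has no line image (.tif or .png) with the same base name.
unpaired_gt() {
    gt_dir=$1
    for txt in "$gt_dir"/*.gt.txt; do
        # If the glob matched nothing, the literal pattern is skipped here.
        [ -e "$txt" ] || continue
        base=${txt%.gt.txt}
        if [ ! -e "$base.tif" ] && [ ! -e "$base.png" ]; then
            printf '%s\n' "$txt"
        fi
    done
}
```

Calling `unpaired_gt data/foo-ground-truth` should print nothing when every transcription has its image.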

Can't encode transcription: '<some random german phrase>' in language ''
Encoding of string failed! Failure bytes: <some hexa codes>

Side note: I needed to run it twice; it looks like the first run crashes while building the all-gt file.

It was clearly related to the unicharset, which did not include the special characters that exist in the sample ground truth.

After studying it for a while, I decided on my own to replace the unicharset file in data/foo/ with the contents of data/langdata/Latin.unicharset:

cp data/langdata/Latin.unicharset data/foo/unicharset

That completely solved the error messages, and training finally started.
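To see which characters trigger the "Can't encode transcription" errors, one option is to compare the characters used in the ground truth with the entries of the unicharset. The sketch below is not part of tesstrain, and the function name `missing_chars` is hypothetical. It assumes the unicharset is a text file whose first line is an entry count and whose remaining lines start with the character followed by a space, and it needs a UTF-8 locale to split non-ASCII text correctly.

```shell
# Hypothetical helper: list characters that appear in the *.gt.txt files
# of a folder but have no entry in the given unicharset file.
missing_chars() {
    gt_dir=$1
    unicharset=$2
    known=$(mktemp)
    used=$(mktemp)
    # Skip the count header, keep the first field of each entry.
    tail -n +2 "$unicharset" | cut -d' ' -f1 | sort -u > "$known"
    # One non-whitespace character per line, de-duplicated.
    cat "$gt_dir"/*.gt.txt | grep -o '[^[:space:]]' | sort -u > "$used"
    # Lines present only in the "used" list are the missing characters.
    comm -23 "$used" "$known"
    rm -f "$known" "$used"
}
```

For example, `missing_chars data/foo-ground-truth data/foo/unicharset` could be run before and after copying Latin.unicharset to confirm that the list becomes empty.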

After some minutes, the training BCER, which had started at 89%, went up to 99.9%. Something was clearly wrong again.

After digging around on the web, I had a hunch that the issue was that I hadn't specified a starter traineddata, so the training was running from scratch. I am not sure about that.

I then specified START_MODEL and the result was much better. The BCER started below 20% and continued to improve.

When a starter model is specified, the training process extracts the unicharset from that model and puts it in the data/eng folder. I was expecting that eng.traineddata would be using Latin.unicharset, but that does not seem to be the case (perhaps ger.traineddata does?), so copying the unicharset is still necessary. For my application I will be using eng.traineddata, so I decided to continue with the English traineddata instead of the German one (which I haven't tried).

To have a cleaner run, I decided to run the training in steps. Those were:

# let's start by cleaning the environment
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng clean
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng unicharset
# error expected (creating the foo/all-gt file)
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng unicharset
cp data/langdata/Latin.unicharset data/eng/foo.lstm-unicharset
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng training
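The same sequence can be sketched as a small script that keeps the repeated make arguments in one place and retries the unicharset target once if the first attempt fails. The names `tt_make`, `run_pipeline` and `MAKE_CMD` are hypothetical; the paths and targets are the ones used above.

```shell
# Defaults taken from the commands above; override them in the environment.
: "${TESSDATA:=/usr/local/share/tessdata/}"
: "${MODEL_NAME:=foo}"
: "${START_MODEL:=eng}"
: "${MAKE_CMD:=make}"   # hypothetical knob, handy for dry runs

# Run one tesstrain target with the shared arguments.
tt_make() {
    "$MAKE_CMD" -r TESSDATA="$TESSDATA" MODEL_NAME="$MODEL_NAME" \
        START_MODEL="$START_MODEL" "$@"
}

# Clean, build the unicharset (retrying once), swap in Latin.unicharset, train.
run_pipeline() {
    tt_make clean &&
    { tt_make unicharset || tt_make unicharset; } &&
    cp data/langdata/Latin.unicharset \
        "data/$START_MODEL/$MODEL_NAME.lstm-unicharset" &&
    tt_make training
}
```

Running `run_pipeline` from the tesstrain checkout would then replace the five commands above; the retry only happens when the first unicharset run actually fails.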

I hope this can help the Tesseract community, and any contribution is welcome.


stale bot commented May 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label May 22, 2023