Creating training data using tesstrain.sh #39

Open
InbarShapira opened this issue Jan 31, 2021 · 6 comments

Comments

@InbarShapira

When creating training data with tesstrain.sh for the LSTM model, it is not clear
whether I should use --langdata_dir langdata_lstm or --langdata_dir langdata.

The choice affects which eng.training_text file is used to generate the training data.

Which one should I use?

@Shreeshrii
Collaborator

For the LSTM model, use --langdata_dir langdata_lstm

You can limit the number of pages, if doing finetuning.
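
A minimal sketch of how such an invocation might be assembled, for illustration only. The helper name `build_tesstrain_args` and the directory names are hypothetical. The flags are the ones used in the Tesseract 4 training documentation as far as I know (`--maxpages` in particular is an assumption to verify against your copy of the script).

```python
import shlex
from typing import List, Optional


def build_tesstrain_args(
    lang: str,
    langdata_dir: str = "langdata_lstm",   # LSTM training uses langdata_lstm
    tessdata_dir: str = "tessdata",        # hypothetical path
    output_dir: str = "train_out",         # hypothetical path
    max_pages: Optional[int] = None,       # limit pages, e.g. for fine-tuning
) -> List[str]:
    """Build an argument list for tesstrain.sh (flags assumed, not verified)."""
    args = [
        "tesstrain.sh",
        "--lang", lang,
        "--linedata_only",
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
    ]
    if max_pages is not None:
        # Assumed flag name for capping the number of rendered pages.
        args += ["--maxpages", str(max_pages)]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_tesstrain_args("eng", max_pages=10)))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Tesseract's training tools installed.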

@InbarShapira
Author

So if I want to train an LSTM model from scratch that reaches the accuracy of the official Tesseract LSTM model, what training data do I need to create, and how?

@Shreeshrii
Collaborator

@udibarzi

udibarzi commented Feb 3, 2021

Thanks @Shreeshrii. I went over this documentation, and something is still not clear to me.

When I follow the instructions, the script creates a TIFF file with ~50 lines per page and ~3,700 pages, which is about 185,000 lines of text for a single font. The instructions specify ~4,000 fonts for English, so the total number of lines created would be 4,000 × 185,000 = 740,000,000, whereas according to this post (tesseract-ocr/tesseract#654 (comment)) the training set comprises only 400,000-800,000 text lines.
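
The arithmetic behind that gap can be checked with a short sketch. All figures are the approximate ones quoted in this comment; the variable names are only illustrative.

```python
# Approximate figures quoted in the comment above.
lines_per_page = 50
pages_per_font = 3_700
font_count = 4_000

# Training-set size range reported in the linked post.
reported_low, reported_high = 400_000, 800_000

lines_per_font = lines_per_page * pages_per_font
total_lines = lines_per_font * font_count

# How many times larger the generated set is than the reported upper bound.
excess_factor = total_lines / reported_high

# How many fonts at full page count would already fill the reported range.
fonts_for_low = reported_low / lines_per_font
fonts_for_high = reported_high / lines_per_font

print(f"lines per font: {lines_per_font:,}")
print(f"total lines:    {total_lines:,}")
print(f"excess factor:  {excess_factor:.0f}x the reported upper bound")
print(f"fonts needed:   {fonts_for_low:.1f} to {fonts_for_high:.1f}")
```

Under these numbers, only a handful of fonts rendered at the full page count would already cover the reported range, which is the discrepancy being asked about.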

What am I missing?

@Shreeshrii
Collaborator

Our knowledge about the training method is based on Ray Smith's posts and comments. It is possible that he experimented with different settings and the posts at different times reflect that.

https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_fast.md shows the following info for English traineddata.

Version string:4.00.00alpha:eng:synth20170629
LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1], flags=41,
iteration=6352400, sample_iteration=6352704, null_char=110, learning_rate=0.001, momentum=0.5, adam_beta=0.999

While for tessdata_best it is

eng
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1],
flags=40, iteration=814100, sample_iteration=814136, null_char=110,
learning_rate=0.001, momentum=0.5, adam_beta=0.999

Look at the number of iterations to see the difference.
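
As a rough illustration, the iteration counts can be pulled out of the two "LSTM training info" strings quoted above and compared. The helper name `parse_iterations` is hypothetical; the strings are copied from this comment.

```python
import re
from typing import Dict

# Training info strings as quoted above (tessdata_fast and tessdata_best).
FAST_INFO = (
    "Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1], flags=41, "
    "iteration=6352400, sample_iteration=6352704, null_char=110, "
    "learning_rate=0.001, momentum=0.5, adam_beta=0.999"
)
BEST_INFO = (
    "Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1], "
    "flags=40, iteration=814100, sample_iteration=814136, null_char=110, "
    "learning_rate=0.001, momentum=0.5, adam_beta=0.999"
)


def parse_iterations(info: str) -> Dict[str, int]:
    """Extract iteration and sample_iteration from a training info string."""
    # \b keeps "iteration=" from matching inside "sample_iteration=".
    iteration = re.search(r"\biteration=(\d+)", info)
    sample = re.search(r"\bsample_iteration=(\d+)", info)
    if iteration is None or sample is None:
        raise ValueError("training info string has no iteration fields")
    return {
        "iteration": int(iteration.group(1)),
        "sample_iteration": int(sample.group(1)),
    }


fast = parse_iterations(FAST_INFO)
best = parse_iterations(BEST_INFO)
ratio = fast["iteration"] / best["iteration"]

print(f"tessdata_fast iterations: {fast['iteration']:,}")
print(f"tessdata_best iterations: {best['iteration']:,}")
print(f"ratio fast/best: {ratio:.2f}")
```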

I haven't seen any post where someone has been able to replicate his results.

@kseniazhagorina

Hello.
In your instructions at https://github.com/tesseract-ocr/tessdoc/blob/main/tess4/TrainingTesseract-4.00.md#using-tesstrainsh
you mention the file tesstrain.sh at https://github.com/tesseract-ocr/tesseract/blob/main/src/training/tesstrain.sh,
but there is no such file in the tesseract repository.
You also write that:
Training with tesstrain.sh (a.k.a. tesseract 4 training) is unsupported/abandoned. Please use scripts from https://github.com/tesseract-ocr/tesstrain for training

d57b942
Could you please update the instructions?
