Creating training data using tesstrain.sh #39

InbarShapira · 2021-01-31T14:53:28Z

When creating training data with tesstrain.sh for the LSTM model, it is not clear
whether I should use --langdata_dir langdata_lstm or --langdata_dir langdata.

The choice affects which eng.training_text file will be used to generate the training data.

Which one should I use?

Shreeshrii · 2021-01-31T16:48:18Z

For the LSTM model, use --langdata_dir langdata_lstm

You can limit the number of pages if you are only fine-tuning.
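
As a rough illustration of the advice above, here is a minimal sketch that assembles a tesstrain.sh argument list pointing at langdata_lstm, with an optional page limit. The script path, font directory, tessdata directory, output directory and the page limit of 10 are placeholder assumptions, not values from this thread; the helper name build_tesstrain_command is hypothetical. The snippet only builds and prints the command, it does not run it.

```python
import shlex


def build_tesstrain_command(langdata_dir, output_dir, max_pages=None):
    """Assemble a tesstrain.sh argument list for generating LSTM line data.

    All paths below are assumed placeholders; adjust them to your checkout.
    """
    cmd = [
        "src/training/tesstrain.sh",          # assumed script location
        "--fonts_dir", "/usr/share/fonts",    # assumed font directory
        "--lang", "eng",
        "--linedata_only",                    # produce LSTM line data only
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", "./tessdata",       # assumed tessdata directory
        "--output_dir", output_dir,
    ]
    if max_pages is not None:
        # Limit the number of rendered pages, e.g. when fine-tuning.
        cmd += ["--maxpages", str(max_pages)]
    return cmd


command = build_tesstrain_command("../langdata_lstm", "./engtrain", max_pages=10)
print(shlex.join(command))
```

Leaving max_pages as None omits the page limit, which is closer to what a from-scratch run would want.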

InbarShapira · 2021-02-01T09:08:54Z

So if I want to train an LSTM model from scratch that reaches the accuracy of the official Tesseract LSTM model, what training data do I need to create, and how?

Shreeshrii · 2021-02-01T10:46:24Z

See https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#training-text-requirements

udibarzi · 2021-02-03T05:00:00Z

Thanks @Shreeshrii. I went over this documentation, and something is still not clear to me.

When following the instructions, the script creates a TIFF file with ~50 lines per page and ~3700 pages in total, which is about 185,000 lines of text for a single font. The instructions say to use ~4000 fonts for English, so the total number of lines created would be 4000 × 185,000 = 740,000,000, whereas according to this post (tesseract-ocr/tesseract#654 (comment)) the training set comprises only 400,000-800,000 text lines.

What am I missing?
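
To make the size of the mismatch concrete, the arithmetic can be sketched as below. The inputs are the approximate figures quoted in this comment (~50 lines per page, ~3700 pages, ~4000 fonts, and the 400,000-800,000 range from the linked post); the variable names are just for illustration.

```python
# Approximate figures quoted in the comment above.
LINES_PER_PAGE = 50
PAGES_PER_FONT = 3700
FONT_COUNT = 4000

# Range reported in the linked post.
REPORTED_LOW = 400_000
REPORTED_HIGH = 800_000

lines_per_font = LINES_PER_PAGE * PAGES_PER_FONT
total_lines = lines_per_font * FONT_COUNT

print(lines_per_font)                 # 185000
print(total_lines)                    # 740000000
print(total_lines // REPORTED_HIGH)   # 925
print(total_lines // REPORTED_LOW)    # 1850
```

So rendering the full text in every font would give roughly 925 to 1850 times more lines than the reported training set size.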

Shreeshrii · 2021-02-03T06:05:25Z

Our knowledge about the training method is based on Ray Smith's posts and comments. It is possible that he experimented with different settings and the posts at different times reflect that.

https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_fast.md shows the following info for English traineddata.

Version string:4.00.00alpha:eng:synth20170629
LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1], flags=41,
iteration=6352400, sample_iteration=6352704, null_char=110, learning_rate=0.001, momentum=0.5, adam_beta=0.999

While for tessdata_best it is

eng
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1],
flags=40, iteration=814100, sample_iteration=814136, null_char=110,
learning_rate=0.001, momentum=0.5, adam_beta=0.999

Look at the number of iterations to see the difference.
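
For a quick side-by-side view, the integer fields of the two "LSTM training info" dumps above can be pulled out with a small parser. This is only a sketch: the strings are copied from the dumps with the non-integer fields left out, and parse_training_info is a hypothetical helper, not part of Tesseract.

```python
import re

# Integer fields copied from the two dumps above.
FAST_INFO = "flags=41, iteration=6352400, sample_iteration=6352704, null_char=110"
BEST_INFO = "flags=40, iteration=814100, sample_iteration=814136, null_char=110"


def parse_training_info(info):
    """Return a dict of the key=integer pairs found in a training info string."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", info)}


fast = parse_training_info(FAST_INFO)
best = parse_training_info(BEST_INFO)

print(fast["iteration"])   # 6352400
print(best["iteration"])   # 814100
print(f"{fast['iteration'] / best['iteration']:.1f}")  # 7.8
```

The iteration count recorded for the tessdata_fast file is roughly 7.8 times the one recorded for tessdata_best.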

I haven't seen any post where someone has been able to replicate his results.

kseniazhagorina · 2021-10-31T08:40:46Z

Hello.
Your instructions at https://github.com/tesseract-ocr/tessdoc/blob/main/tess4/TrainingTesseract-4.00.md#using-tesstrainsh
mention the file tesstrain.sh at https://github.com/tesseract-ocr/tesseract/blob/main/src/training/tesstrain.sh
but there is no such file in the tesseract repository.
You also write that:
Training with tesstrain.sh (a.k.a tesseract 4 training) is unsupported/abandoned. Please use scripts from https://github.com/tesseract-ocr/tesstrain for training

d57b942
Could you please update the instructions?
