Training with python: run training step ? #351

forzagreen · 2023-09-18T07:58:32Z

As mentioned by @stefan6419846 in madmaze/pytesseract#508, there is a Python wrapper for training in tesstrain/src/, which unfortunately is not documented in the tesseract, tessdoc, or tesstrain repositories.

From my understanding (please correct me if I'm wrong):

It only generates lstmf files and does not perform any training.
Of the steps listed in Overview of Training Process, it stops at step 5; steps 6 and 7 must be done separately. Is that correct?
How are steps 6 and 7 performed? With Makefile commands? If you give me some pointers, I can help add these steps to the Python script.
The Python script takes a TEXTFILE and generates (for each font) box/tif/lstmf files for the whole text, not line by line. So, to generate them line by line, must we run the script once per one-line file?

Thanks in advance!

Cc: @stefan6419846

stefan6419846 · 2023-09-18T08:13:23Z

The tesstrain Python module basically creates artificial training data, for example to fine-tune on a specific font. You might find existing examples that use the old tesstrain.sh script, which should be roughly equivalent to the Python module. The Makefile approach is for "real" data only.

Rough steps for the Python module:

Extract LSTM file: combine_tessdata -e tessdata/eng.traineddata eng.lstm

Generate files:

import tesstrain

# The directory, font, and page-count variables below are placeholders
# that the caller has to define beforehand.
tesstrain.run(
    fonts_directory=fonts_directory,
    fonts=[font_name],
    language_code='eng',
    linedata_only=True,
    langdata_directory=language_data_directory,
    tessdata_directory=tessdata_directory,
    save_box_tiff=True,
    maximum_pages=maximum_pages,
    output_directory=output_directory
)

Finetune: lstmtraining --continue_from eng.lstm --model_output font_name --traineddata tessdata/eng.traineddata --train_listfile eng.training_files.txt --max_iterations 10
Convert to .traineddata file: lstmtraining --stop_training --continue_from font_name_checkpoint --traineddata tessdata/eng.traineddata --model_output target_path
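
The command-line steps above could be collected in a small Python helper, roughly like the sketch below. It only builds the argument lists and prints them; nothing is executed. The function name `build_finetune_commands` and its parameters are hypothetical and not part of tesstrain. The paths and names (`tessdata`, `font_name`, `eng.training_files.txt`, `target_path`) are the placeholders from the commands above and would need adjusting to the actual output directory.

```python
import shlex
from pathlib import Path


def build_finetune_commands(
    language_code="eng",
    tessdata_directory="tessdata",
    model_name="font_name",
    listfile="eng.training_files.txt",
    target_path="target_path",
    max_iterations=10,
):
    """Return the extract, finetune, and convert commands as argument lists.

    Hypothetical helper: it mirrors the three CLI calls from the steps above
    and does not call into tesstrain itself.
    """
    traineddata = (Path(tessdata_directory) / f"{language_code}.traineddata").as_posix()
    lstm_file = f"{language_code}.lstm"

    # Step 1: extract the LSTM model from the existing traineddata file.
    extract = ["combine_tessdata", "-e", traineddata, lstm_file]

    # Step 3: finetune, continuing from the extracted model.
    # (Step 2, tesstrain.run(...), has to run before this to create the list file.)
    finetune = [
        "lstmtraining",
        "--continue_from", lstm_file,
        "--model_output", model_name,
        "--traineddata", traineddata,
        "--train_listfile", listfile,
        "--max_iterations", str(max_iterations),
    ]

    # Step 4: convert the checkpoint into a .traineddata file.
    convert = [
        "lstmtraining",
        "--stop_training",
        "--continue_from", f"{model_name}_checkpoint",
        "--traineddata", traineddata,
        "--model_output", target_path,
    ]
    return extract, finetune, convert


if __name__ == "__main__":
    for command in build_finetune_commands():
        print(shlex.join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` once the Tesseract training tools are on the PATH, with the `tesstrain.run(...)` call placed between the extract and finetune commands.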
