Training with python: run training step ? #351

Open
forzagreen opened this issue Sep 18, 2023 · 1 comment

@forzagreen

As mentioned by @stefan6419846 in madmaze/pytesseract#508, there is a Python wrapper for training in tesstrain/src/, which unfortunately is not documented in the tesseract, tessdoc, or tesstrain repositories.

From my understanding (please correct me if I'm wrong):

  1. It only generates lstmf files and does not perform any training.
    Of the steps listed in Overview of Training Process, it stops at step 5; steps 6 and 7 must be done separately. Is that correct?

  2. How do I perform steps 6 and 7? With Makefile commands? If you give me some pointers, I can help add these steps to the Python script.

  3. The Python script takes a TEXTFILE and generates (for each font) box/tif/lstmf files for the whole text, not line by line. So, to generate them line by line, must we run the script once for each one-line file?

Thanks in advance!

Cc: @stefan6419846

@stefan6419846
Contributor

The tesstrain Python module basically creates artificial training data, for example for fine-tuning with a specific font. You might find existing examples that use the old tesstrain.sh script, which should be roughly equivalent to the Python module. The Makefile approach is for "real" data only.

Rough steps for the Python module:

  1. Extract LSTM file: combine_tessdata -e tessdata/eng.traineddata eng.lstm

  2. Generate files:

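    import tesstrain

    # The directory variables, font_name and maximum_pages are placeholders
    # that the caller has to define before this call.
    tesstrain.run(
        fonts_directory=fonts_directory,
        fonts=[font_name],
        language_code='eng',
        linedata_only=True,
        langdata_directory=language_data_directory,
        tessdata_directory=tessdata_directory,
        save_box_tiff=True,
        maximum_pages=maximum_pages,
        output_directory=output_directory
    )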
  3. Finetune: lstmtraining --continue_from eng.lstm --model_output font_name --traineddata tessdata/eng.traineddata --train_listfile eng.training_files.txt --max_iterations 10

  4. Convert to .traineddata file: lstmtraining --stop_training --continue_from font_name_checkpoint --traineddata tessdata/eng.traineddata --model_output target_path
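
For anyone who wants to drive steps 1, 3 and 4 from Python as well, here is a minimal sketch that only wraps the command lines quoted above with `subprocess`. The helper names (`extract_lstm_cmd`, `finetune_cmd`, `export_cmd`, `run_all`) are made up for illustration and are not part of tesstrain; the flags are copied from the steps above, and the paths are the same placeholder values. Step 2 (`tesstrain.run(...)`) would go between the extract and fine-tune calls.

```python
import subprocess


def extract_lstm_cmd(traineddata, lstm_out):
    # Step 1: extract the LSTM model from an existing .traineddata file.
    return ["combine_tessdata", "-e", str(traineddata), str(lstm_out)]


def finetune_cmd(lstm, model_output, traineddata, listfile, max_iterations=10):
    # Step 3: continue training from the extracted LSTM model.
    return [
        "lstmtraining",
        "--continue_from", str(lstm),
        "--model_output", str(model_output),
        "--traineddata", str(traineddata),
        "--train_listfile", str(listfile),
        "--max_iterations", str(max_iterations),
    ]


def export_cmd(checkpoint, traineddata, target_path):
    # Step 4: convert the training checkpoint into a .traineddata file.
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", str(checkpoint),
        "--traineddata", str(traineddata),
        "--model_output", str(target_path),
    ]


def run_all(commands, dry_run=True):
    # With dry_run=True the commands are only printed, so the sketch can be
    # inspected without Tesseract's training tools being installed.
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    traineddata = "tessdata/eng.traineddata"
    run_all([extract_lstm_cmd(traineddata, "eng.lstm")])
    # Step 2 (tesstrain.run(...)) would be called here to create the lstmf
    # files and eng.training_files.txt.
    run_all([
        finetune_cmd("eng.lstm", "font_name", traineddata, "eng.training_files.txt"),
        export_cmd("font_name_checkpoint", traineddata, "target_path"),
    ])
```

Running it as-is only prints the three commands; pass `dry_run=False` to `run_all` once the paths point at real files and the Tesseract training tools are on the PATH.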
