
Is it possible to train a model for multiple types of sources? #332

Open
gabriel-fsa opened this issue Feb 14, 2023 · 7 comments
Labels
stale Issues which require input by the reporter which is not provided

Comments

@gabriel-fsa

I would like to know how the default model is trained: is it trained on a large set of real images (if so, what order of magnitude), or are the training images generated automatically from different fonts?

I want to train my own model starting from the default one, but even as I keep adding more data to my dataset, the default model still seems to read low-resolution images better. I'm training with images of varying DPI containing characters, words and phrases. Should I be doing this differently?

@stweil
Collaborator

stweil commented Feb 14, 2023

We don't know exactly how the standard models were trained because that was done by Google. Only some hints are available.

@gabriel-fsa
Author

But have you ever trained a model, or do you know of any case, where the accuracy achieved with an image dataset was greater than or equal to that of the standard model? This kind of information is very scarce; I would like a rough idea of how large a dataset needs to be to get a reasonably functional model.

@stweil
Collaborator

stweil commented Feb 14, 2023

Yes, we have trained lots of models in the meantime. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR for examples.
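For reference, tesstrain consumes such ground truth as pairs of single-line images and matching `.gt.txt` transcription files in one directory. A minimal Python sketch (the directory name is only an example, the extensions are the usual conventions) to sanity-check such a dataset before training:

```python
"""Minimal sketch: sanity-check a tesstrain-style ground-truth directory.

Each .gt.txt file (the transcription of one text line) is expected to
have a matching line image (.png, .tif, .bin.png or .nrm.png).
"""
import glob
import os

GT_DIR = "data/foo-ground-truth"  # example path, use your own model name

pairs = 0
missing = []
for gt in glob.glob(os.path.join(GT_DIR, "*.gt.txt")):
    base = gt[: -len(".gt.txt")]
    if any(os.path.exists(base + ext)
           for ext in (".png", ".tif", ".bin.png", ".nrm.png")):
        pairs += 1
    else:
        missing.append(os.path.basename(gt))

print(f"{pairs} line image / transcription pairs found")
if missing:
    print(f"{len(missing)} .gt.txt files have no matching image, e.g. {missing[:3]}")
```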

@gabriel-fsa
Author


Wow, that's awesome! How have I not seen this before?

But I still have some questions:

1 - I saw that you use XML in the dataset. Is the XML just used to extract the words as .png and .gt.txt files, or is the XML used together with the whole page image?

2 - What is the order of magnitude of the datasets you usually use (100k, 1M, 10M)?

3 - Do you do a lot of data augmentation to improve recognition?

@stweil
Collaborator

stweil commented Feb 14, 2023

  1. The lines must be extracted from the PAGE XML files, and the same must be done for the page images (a minimal extraction sketch follows below). See example with extracted lines. For other GT data you still have to do this extraction yourself.
  2. That depends. reichsanzeiger-gt, for example, has 119,435 lines and GT4HistOCR has 313,173 lines, but there are also some smaller data sets.
  3. No data augmentation.
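For PAGE XML ground truth like the data sets above, a minimal extraction could look like the following Python sketch. This is not the official tooling; the namespace URL, element names and file names are just the common PAGE conventions and may need adjusting for your own files.

```python
"""Minimal sketch: extract line images and .gt.txt files from one PAGE XML
file plus its page image, producing the pairs that tesstrain expects."""
import os
import xml.etree.ElementTree as ET

from PIL import Image  # pip install pillow

# PAGE XML versions differ only in the date part of the namespace URL.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def extract_lines(page_xml, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    page = ET.parse(page_xml).getroot().find("pc:Page", NS)
    # imageFilename is usually given relative to the XML file.
    image = Image.open(os.path.join(os.path.dirname(page_xml),
                                    page.get("imageFilename")))
    stem = os.path.splitext(os.path.basename(page_xml))[0]

    for i, line in enumerate(page.iterfind(".//pc:TextLine", NS)):
        # Bounding box of the line polygon ("x,y x,y ..." in Coords/@points).
        points = [tuple(map(int, p.split(",")))
                  for p in line.find("pc:Coords", NS).get("points").split()]
        xs, ys = zip(*points)
        crop = image.crop((min(xs), min(ys), max(xs), max(ys)))

        # Transcription of the line.
        text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)

        base = os.path.join(out_dir, f"{stem}_{i:04d}")
        crop.save(base + ".png")
        with open(base + ".gt.txt", "w", encoding="utf-8") as f:
            f.write(text.strip() + "\n")


if __name__ == "__main__":
    # Example file names -- replace with your own data.
    extract_lines("page_0001.xml", "foo-ground-truth")
```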

@gabriel-fsa
Author

One last question, I swear.

Does training a model with multiple fonts have much impact on accuracy? That is, several images with different fonts, always keeping a balanced proportion between them, of course.

@stale

stale bot commented May 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label (Issues which require input by the reporter which is not provided) on May 22, 2023.