
Is it possible to train a model for multiple types of sources? #332

Open
gabriel-fsa opened this issue Feb 14, 2023 · 7 comments
Labels
stale Issues which require input by the reporter which is not provided

Comments

@gabriel-fsa

I would like to know how the default model is trained: is it trained on a large set of real images (if so, what order of magnitude), or are the training images generated automatically from different fonts?

I want to train my own model starting from the default one, but even as I keep adding more data to my dataset, the default model still seems to read low-resolution images better. I'm training with images of varying DPI containing characters, words and phrases. Should I be doing this differently?

@stweil
Collaborator

stweil commented Feb 14, 2023

We don't know exactly how the standard models were trained because that was done by Google. Only some hints are available.

@gabriel-fsa
Author

But have you ever trained a model, or do you know of any case, where the accuracy achieved with an image dataset was greater than or equal to that of the standard model? This kind of information is very scarce; I would like a rough idea of how large a dataset needs to be to get a reasonably functional model.

@stweil
Collaborator

stweil commented Feb 14, 2023

Yes, we have trained lots of models in the meantime. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR for examples.
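For reference, tesstrain consumes such ground truth as pairs of single-line images and matching `.gt.txt` transcription files in one directory. A minimal Python sketch (the directory name is only an example, the extensions are the usual conventions) to sanity-check such a dataset before training:

```python
"""Minimal sketch: sanity-check a tesstrain-style ground-truth directory.

Each .gt.txt file (the transcription of one text line) is expected to
have a matching line image (.png, .tif, .bin.png or .nrm.png).
"""
import glob
import os

GT_DIR = "data/foo-ground-truth"  # example path, use your own model name

pairs = 0
missing = []
for gt in glob.glob(os.path.join(GT_DIR, "*.gt.txt")):
    base = gt[: -len(".gt.txt")]
    if any(os.path.exists(base + ext)
           for ext in (".png", ".tif", ".bin.png", ".nrm.png")):
        pairs += 1
    else:
        missing.append(os.path.basename(gt))

print(f"{pairs} line image / transcription pairs found")
if missing:
    print(f"{len(missing)} .gt.txt files have no matching image, e.g. {missing[:3]}")
```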

@gabriel-fsa
Author


Wow, that's awesome! How have I not seen this before?

But I still have some questions:

1 - I saw that you use XML in the dataset. Is the XML just used to extract the words as .png and .gt.txt files, or is the XML used together with the whole page image?

2 - What is the order of magnitude of the datasets you usually use (100k, 1M, 10M)?

3 - Do you do a lot of data augmentation to improve recognition?

@stweil
Collaborator

stweil commented Feb 14, 2023

  1. The lines must be extracted from the PAGE XML files, and the same must be done for the page images (a minimal extraction sketch follows below). See example with extracted lines. For other GT data you still have to do this extraction yourself.
  2. That depends. reichsanzeiger-gt, for example, has 119,435 lines and GT4HistOCR has 313,173 lines, but there are also some smaller data sets.
  3. No data augmentation.
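For PAGE XML ground truth like the data sets above, a minimal extraction could look like the following Python sketch. This is not the official tooling; the namespace URL, element names and file names are just the common PAGE conventions and may need adjusting for your own files.

```python
"""Minimal sketch: extract line images and .gt.txt files from one PAGE XML
file plus its page image, producing the pairs that tesstrain expects."""
import os
import xml.etree.ElementTree as ET

from PIL import Image  # pip install pillow

# PAGE XML versions differ only in the date part of the namespace URL.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def extract_lines(page_xml, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    page = ET.parse(page_xml).getroot().find("pc:Page", NS)
    # imageFilename is usually given relative to the XML file.
    image = Image.open(os.path.join(os.path.dirname(page_xml),
                                    page.get("imageFilename")))
    stem = os.path.splitext(os.path.basename(page_xml))[0]

    for i, line in enumerate(page.iterfind(".//pc:TextLine", NS)):
        # Bounding box of the line polygon ("x,y x,y ..." in Coords/@points).
        points = [tuple(map(int, p.split(",")))
                  for p in line.find("pc:Coords", NS).get("points").split()]
        xs, ys = zip(*points)
        crop = image.crop((min(xs), min(ys), max(xs), max(ys)))

        # Transcription of the line.
        text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)

        base = os.path.join(out_dir, f"{stem}_{i:04d}")
        crop.save(base + ".png")
        with open(base + ".gt.txt", "w", encoding="utf-8") as f:
            f.write(text.strip() + "\n")


if __name__ == "__main__":
    # Example file names -- replace with your own data.
    extract_lines("page_0001.xml", "foo-ground-truth")
```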

@gabriel-fsa
Author

One last question, I swear.

Does training a model with multiple fonts have much impact on accuracy? That is, several images with different fonts, always keeping a balanced proportion between them, of course.

@stale

stale bot commented May 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label (Issues which require input by the reporter which is not provided) on May 22, 2023.