Training Tesseract OCR for a specific document #360

Open
mumarsyal opened this issue Nov 7, 2023 · 4 comments

Comments

@mumarsyal

I have recently started learning and experimenting with Tesseract OCR. I have already trained a model for a new font using tesstrain.

My use case now is to train Tesseract 5 for the specific document attached below.

[Image: Ptcl_bill_0000]

I have found some articles and tutorials about training for a new font or a new language, but I couldn't find anything about training for a custom document.

Is it possible to train Tesseract 5 for my document? If yes, please give me some guidelines on how to proceed, and let me know whether I need any tools besides Tesseract itself to prepare the training data.

I have Tesseract 5 installed on Ubuntu 22.04.

@stefan6419846
Contributor

Could you please elaborate on what you are trying to achieve by training a specific document (type)? What do you expect to change compared to using the existing models?

@mumarsyal
Author

Thank you for your response, @stefan6419846.

I ran Tesseract's default English model on this image and the output is very bad. So I want to train Tesseract specifically for this document to improve the output, but I don't know how to generate the training dataset (line images, *.gt.txt and box files) from these images. If you could suggest some tools for creating the dataset from these images, that would be wonderful.

@stefan6419846
Contributor

I have not tried it, but I would argue that better preprocessing on your side (for example, feeding Tesseract specific ROIs, each with its own appropriate preprocessing, instead of the whole page) might be easier and sufficient.
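A minimal sketch of the per-ROI idea, assuming Pillow and pytesseract are installed. The `Roi` class, the `roi_to_box` and `ocr_rois` helpers, the fractional coordinates, and the choice of grayscale, autocontrast and 2x upscaling are all illustrative assumptions, not something tested on this bill; the right regions and preprocessing would have to be found by experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roi:
    """A region of interest given as fractions (0.0 to 1.0) of the page size."""
    name: str
    left: float
    top: float
    right: float
    bottom: float
    psm: int = 7  # Tesseract page segmentation mode; 7 treats the crop as one text line


def roi_to_box(roi, width, height):
    """Convert a fractional ROI to a pixel box (left, upper, right, lower)."""
    return (
        round(roi.left * width),
        round(roi.top * height),
        round(roi.right * width),
        round(roi.bottom * height),
    )


def ocr_rois(image_path, rois, lang="eng", scale=2):
    """Crop each ROI, preprocess it, and OCR it separately."""
    # Imported here so the geometry helper above works without these packages.
    from PIL import Image, ImageOps
    import pytesseract

    page = Image.open(image_path)
    results = {}
    for roi in rois:
        crop = page.crop(roi_to_box(roi, page.width, page.height))
        # Hypothetical preprocessing: grayscale, stretch contrast, upscale.
        crop = ImageOps.autocontrast(ImageOps.grayscale(crop))
        crop = crop.resize((crop.width * scale, crop.height * scale))
        text = pytesseract.image_to_string(
            crop, lang=lang, config=f"--psm {roi.psm}"
        )
        results[roi.name] = text.strip()
    return results
```

Calling `ocr_rois("bill.png", [Roi("amount", 0.6, 0.1, 0.95, 0.15)])` would return a dict mapping each ROI name to its recognized text; the file name and coordinates here are placeholders. Fractional coordinates are used so the same ROI list can be reused across scans of different resolution.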

@linxyu1

linxyu1 commented Nov 15, 2023

Hello, maybe you can use jTessBoxEditor, but it is a heavy workload.
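If the line images are cropped by hand or by script, the transcription side can be automated a little. As far as I understand tesstrain, it expects each line image (`.png` or `.tif`) to sit next to a `.gt.txt` file with the same base name, and it derives the box files from those pairs itself. The sketch below writes the `.gt.txt` files from a dict of transcriptions; the `write_ground_truth` helper and the `transcriptions` mapping are hypothetical names for illustration only.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".tif"}


def write_ground_truth(line_dir, transcriptions):
    """Write <stem>.gt.txt beside each line image.

    `transcriptions` maps an image file name to its text.
    Returns the names of images that have no transcription yet.
    """
    line_dir = Path(line_dir)
    missing = []
    for image in sorted(line_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # ignore anything that is not a line image
        text = transcriptions.get(image.name)
        if text is None:
            missing.append(image.name)
            continue
        # One line of text per file, UTF-8 encoded.
        gt_path = image.with_suffix(".gt.txt")
        gt_path.write_text(text.strip() + "\n", encoding="utf-8")
    return missing
```

The returned list makes it easy to see which line images still need to be transcribed before starting a training run.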
