Bad box coordinates in boxfile string! #338

Open
khashashin opened this issue Mar 26, 2023 · 7 comments
Labels
stale Issues which require input by the reporter which is not provided


@khashashin

khashashin commented Mar 26, 2023

I have prepared the following ground truth files:

../tesstrain/data/Chechen-ground-truth
|-- 1.box
|-- 1.gt.txt
|-- 1.png
|-- 10.box
|-- 10.gt.txt
|-- 10.png
|-- 11.box
|-- 11.gt.txt
|-- 11.png
|-- 12.box
|-- 12.gt.txt
|-- 12.png

The box files use the WordStr format. Here, for example, is the content of the file 1.box:

WordStr	65 61 1556 254	0	#НЕКЪАШ А
	65 61 1556 254	0

In the file 1.gt.txt I then have the corresponding text:

НЕКЪАШ А

And here is the image:

image

Running the command make training MODEL_NAME=Chechen START_MODEL=rus TESSDATA=../tesseract/tessdata gives me an error:

set -x; \
tesseract "data/Chechen-ground-truth/1.png" data/Chechen-ground-truth/1 --psm 13 lstm.train
+ tesseract data/Chechen-ground-truth/1.png data/Chechen-ground-truth/1 --psm 13 lstm.train
Bad box coordinates in boxfile string!  65 61 1556 254  0
No block overlapping textline: НЕКЪАШ А
Failed to read pages from data/Chechen-ground-truth/1.png
Error during processing.
make: *** [Makefile:258: data/Chechen-ground-truth/1.lstmf] Error 1

I'm using Tesseract version 5.3.0.
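
One check that can be run on a hand-written box file before training is whether each box is non-empty and lies inside the image. Box coordinates are measured from the bottom-left corner of the image. The sketch below is only an illustration: `box_coords` and `fits_image` are hypothetical helper names, the image sizes are made up, and this is not necessarily the condition that makes Tesseract print "Bad box coordinates in boxfile string!".

```python
def box_coords(line):
    """Return (left, bottom, right, top, page) from one box-file line."""
    body = line
    if line.startswith("WordStr"):
        # In a WordStr line the transcription follows '#'; drop it before parsing.
        body = line.split("#", 1)[0]
    fields = body.split()[-5:]  # the last five fields are the numbers
    left, bottom, right, top, page = (int(f) for f in fields)
    return left, bottom, right, top, page


def fits_image(coords, width, height):
    """True if the box is non-empty and lies inside a width x height image."""
    left, bottom, right, top, _page = coords
    return 0 <= left < right <= width and 0 <= bottom < top <= height


# First line of 1.box as quoted above (fields separated by tabs).
line = "WordStr\t65 61 1556 254\t0\t#НЕКЪАШ А"
coords = box_coords(line)
print(coords)
print(fits_image(coords, 1600, 300))  # hypothetical image size
print(fits_image(coords, 209, 43))    # another hypothetical image size
```

Comparing the result against the real width and height of 1.png would show whether the coordinates in the box file belong to that image at all.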

@zdenop
Contributor

zdenop commented Mar 27, 2023

Please have a look at https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip to see how to prepare custom data for training.

@khashashin
Author

@zdenop Thanks for your reply. This data does not contain any box files at all, so how does Tesseract know which character is which?

@zdenop
Contributor

zdenop commented Mar 28, 2023

Did you try to follow the instructions on https://github.com/tesseract-ocr/tesstrain/?
As far as I can see, there are no instructions there about creating box files ;-)

@khashashin
Author

@zdenop Thanks. After I removed the *.box files from the ground truth folder, the training could start, but the first step of the training (the tesstrain script) was to create the box files, and the coordinates look weird to me. Here is an example of a box file that tesstrain generated for me:

Н 0 0 209 43 0
Е 0 0 209 43 0
К 0 0 209 43 0
Ъ 0 0 209 43 0
А 0 0 209 43 0
Ш 0 0 209 43 0
  0 0 209 43 0
А 0 0 209 43 0
	 0 0 209 43 0

This was generated for the following image:
image

I only put the *.png and *.gt.txt files in the ground truth folder. The content of my 1.gt.txt was:

НЕКЪАШ А

I just wonder how this works and whether there is an article about the process. I have not found anything about version 5, which seems relatively new, right? There are a lot of tutorials and examples for version 4, but they describe a different process.

P.S. The model created by the training was able to recognize characters that it did not recognize before the training (I used the model rus.traineddata as the starting point and trained it further).
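
The generated box file quoted above has a regular shape: one row per character, every row carrying the same box `0 0 <width> <height>` on page 0, and a final row that starts with a tab. A minimal sketch that reproduces that shape follows. It is an illustration only, not tesstrain's actual script; `line_box_text` is a hypothetical name, and keeping combining marks attached to their base character is an assumption that the example text does not exercise.

```python
import unicodedata


def line_box_text(text, width, height):
    """Build box-file text in which every character gets the whole-line box."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch):
            # Keep a combining mark together with its base character.
            units[-1] += ch
        else:
            units.append(ch)
    rows = [f"{unit} 0 0 {width} {height} 0" for unit in units]
    # Final row: a tab in place of a character, as in the quoted file.
    rows.append(f"\t 0 0 {width} {height} 0")
    return "\n".join(rows) + "\n"


# Text from 1.gt.txt and the box size seen in the generated file.
print(line_box_text("НЕКЪАШ А", 209, 43), end="")
```

Read this way, identical coordinates on every row mean that the box describes the whole line image rather than the position of each individual character.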

@zdenop
Contributor

zdenop commented Apr 1, 2023

Did you read and follow https://github.com/tesseract-ocr/tesstrain?
Where is it written that the first stage is to create box files?

@khashashin
Author

> Did you read and follow https://github.com/tesseract-ocr/tesstrain?

Yes.

> Where is it written that the first stage is to create box files?

@zdenop It isn't written anywhere. tesstrain created the *.box files itself as its first step, and this is not mentioned in tesstrain's README.

@stale

stale bot commented May 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label May 22, 2023