Ground truth: spaces before and after text? #335

Open
jbarth-ubhd opened this issue Feb 16, 2023 · 10 comments

Comments

@jbarth-ubhd

jbarth-ubhd commented Feb 16, 2023

I created *.exp0.gt.txt files as a basis for manual ground truth creation using Shreeshrii's shell script. Each file contains a space before and after the text (and no newline). Example:

01127778-001.exp0.gt.txt: " Verfahrenstechnik / -70 B 2813 "
01127778-002.exp0.gt.txt: " Verfahrenstechnik, Forschung und Lehre, (Zsgest. Ue "
01127778-003.exp0.gt.txt: " verf. von Kurt Schiefer und Kurt Boekmer,) "
01127778-004.exp0.gt.txt: " Düsseldorf: Verfahrenstechnische Ges, im Verein Deut- ! "
01127778-005.exp0.gt.txt: " scher Ingenieure 1967. 185 S, 8° "
01127778-006.exp0.gt.txt: " Frühere Ausg. 3. u.d.T.: Verfahrenstechnik im In- und "
01127778-007.exp0.gt.txt: " Ausland. "

... but The-Hallucination-Effect states
»Example 2: Your training text frequently includes a Space at the beginning of your sentences or at the end. Might result in slow training, non-convergence & even model corruption.«

My question: should the ground truth contain these leading and trailing spaces or not?

The single-line images are cropped very tightly, with no blank margin before or after the text. Example:
[image]

@jbarth-ubhd
Author

ok, ocrd-testset.zip *.gt.txt contain no spaces before/after, but \n

@vishakraj25

vishakraj25 commented Feb 24, 2023

> ok, ocrd-testset.zip *.gt.txt contain no spaces before/after, but \n

@jbarth-ubhd, I see neither a space nor a newline at the end of the *.gt.txt files.

@jbarth-ubhd
Author

I do see newlines (4th line of the od output below):

jb@xxx:~/Downloads/ocrd-testset> cat *.gt.txt|od -c|head
0000000   i   c   h       d   e   n   k   e   .       A   b   e   r    
0000020   w   a   s       d   i   e     305 277   e   l   i   g   e    
0000040   F   r   a   u       G   e   h   e   i   m   r 303 244   t   h
0000060   i   n  \n 342 200 236   D   a   s       k   a   n   n       i
0000100   c   h       n   i   c   h   t   ,       c   '   e   s   t    
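The same check can be done per file instead of on the concatenated stream. The following is a minimal sketch, assuming the ground truth files are UTF-8 encoded and match `*.gt.txt`; the function names `edge_whitespace` and `report` are made up for this illustration and are not part of tesstrain.

```python
from pathlib import Path


def edge_whitespace(raw: bytes) -> tuple:
    """Return (leading, trailing) whitespace of a UTF-8 encoded text."""
    text = raw.decode("utf-8")
    # Everything that lstrip()/rstrip() would remove is edge whitespace.
    leading = text[: len(text) - len(text.lstrip())]
    trailing = text[len(text.rstrip()):]
    return leading, trailing


def report(directory: str) -> None:
    """Print the edge whitespace of every *.gt.txt file in a directory."""
    for path in sorted(Path(directory).glob("*.gt.txt")):
        leading, trailing = edge_whitespace(path.read_bytes())
        print(f"{path.name}: leading={leading!r} trailing={trailing!r}")
```

Calling `report(".")` in the directory with the ground truth prints one line per file, so a stray space and a final `\n` can be told apart at a glance.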

@stweil
Collaborator

stweil commented Mar 1, 2023

Ground truth line text must not have spaces before or after the text.
It may end with a linefeed (which gets added automatically by many editors).
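Existing files that violate this rule could be cleaned up along the following lines. This is only a sketch under assumptions: the files are UTF-8, they match `*.gt.txt`, and ending each file with exactly one linefeed is a choice made here (the rule above merely allows it). The names `normalize_gt_line` and `normalize_gt_files` are hypothetical.

```python
from pathlib import Path


def normalize_gt_line(text: str) -> str:
    """Strip whitespace around the text and end it with one linefeed."""
    return text.strip() + "\n"


def normalize_gt_files(directory: str) -> int:
    """Rewrite *.gt.txt files in place; return how many files changed."""
    changed = 0
    for path in sorted(Path(directory).glob("*.gt.txt")):
        original = path.read_text(encoding="utf-8")
        cleaned = normalize_gt_line(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Only whitespace at the edges is touched; spaces between words stay as they are. Running it on a copy of the data first is advisable, since the files are overwritten.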

@jbarth-ubhd
Author

jbarth-ubhd commented Mar 1, 2023

Just tried it again with #7 and https://github.com/ocropus/hocr-tools/blob/master/hocr-extract-images , the generated .exp0.gt.txt files contain spaces before & after:

308-119.exp0.gt.txt: " | == zz NN NN ANNE NZZ SE anli : "
308-120.exp0.gt.txt: " über 1 BONS DD DD SS EN U = NS utfer]pras "
308-121.exp0.gt.txt: " Datei ihrem unfeligen zZ — SS AN . LEE KA 5 XS Ode bot. "
308-122.exp0.gt.txt: " N ein SH 7 DD SS 7 ea san Zn EFF LEE Z—_-- BEN UNE x "

@stale

stale bot commented May 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the label "stale" (Issues which require input by the reporter which is not provided) on May 22, 2023
@jbarth-ubhd
Author

bump

stale bot removed the label "stale" (Issues which require input by the reporter which is not provided) on Jun 28, 2023
@stweil
Collaborator

stweil commented Jun 28, 2023

> Just tried it again with #7 and https://github.com/ocropus/hocr-tools/blob/master/hocr-extract-images , the generated .exp0.gt.txt files contain spaces before & after

Then I assume that the original data (hOCR) already contains such spaces. Do you have a link to an example?

@jbarth-ubhd
Author

jbarth-ubhd commented Jun 28, 2023

The .hocr file does not contain such spaces (<span ...>abcdefg</span>), but the .exp0.gt.txt file does.

See https://digi.ub.uni-heidelberg.de/diglitData/v/tesstrain-issue-335.zip for a complete test environment; main script is Shreeshrii-script.

@jbollacke

jbollacke commented Nov 9, 2023

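The spaces before and after the line occur if your hOCR file is indented.

hocr-extract-images uses a regex to replace each run of one or more whitespace characters with a single space (see line 20), so the line break and indentation before the first word and after the last word of a line each become one space.

I am not sure whether they had indentation in mind, though.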
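The effect can be reproduced in isolation. The sketch below is not the actual hocr-extract-images code: `raw_text` is an assumed example of the text content of one indented line element after the tags are removed, and both function names are hypothetical. It only shows that collapsing whitespace runs leaves one space at each end unless the result is also stripped.

```python
import re

# Assumed text content of an indented line element: a line break and
# indentation precede and follow the actual text.
raw_text = "\n      Ausland.\n     "


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space (no stripping)."""
    return re.sub(r"\s+", " ", text)


def collapse_and_strip(text: str) -> str:
    """Collapse whitespace runs, then drop the spaces left at the edges."""
    return re.sub(r"\s+", " ", text).strip()


print(repr(collapse_whitespace(raw_text)))  # ' Ausland. '
print(repr(collapse_and_strip(raw_text)))   # 'Ausland.'
```

The first result has the same shape as the `" Ausland. "` ground truth reported at the top of this issue, which is consistent with indentation being the cause.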
