lstmeval: Improve output by ensuring 'Truth:' text is encoded the same way as OCR output… #3421

Open: wants to merge 1 commit into base: main

nickjwhite

This ensures that transformations like Unicode normalisation are done on
the truth output as well as the OCR output, so that you can compare
the two properly.

Before this a perfect OCR result could show different lines for Truth and
OCR if the OCR output included characters that were normalised.
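
The effect described here can be sketched in a few lines of Python. This is only an illustration, not the lstmeval code: `normalize_for_compare` is a hypothetical helper, and NFKC is assumed as a stand-in for whatever transformations are applied to the OCR output.

```python
import unicodedata

def normalize_for_compare(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: apply one normalisation form to a string."""
    return unicodedata.normalize(form, text)

truth = "904)² (p. 142)"      # ground truth keeps the superscript two
ocr_raw = "904)² (p. 142)"    # a perfect recognition result

# Only the OCR side is normalised (assumed NFKC), as in the old behaviour.
ocr_shown = normalize_for_compare(ocr_raw)

# Before: a perfect result still looks like a mismatch.
mismatch_before = truth != ocr_shown

# After: both sides go through the same transformation and agree.
match_after = normalize_for_compare(truth) == ocr_shown

print(mismatch_before, match_after)
```

Passing both strings through the same function is the whole point: the comparison no longer depends on which normalisation form is in use, only on it being the same for both sides.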

@Shreeshrii
Collaborator

@nickjwhite Please provide a sample demonstrating this.

> Before this a perfect OCR result could show different lines for Truth and
> OCR if the OCR output included characters that were normalised.

I had noticed this in the past but do not have any ready example to test and verify.

@Shreeshrii
Collaborator

Is this issue related?

@wollmers

wollmers commented Sep 7, 2021

> Is this issue related?

No, this issue looks more like the wrong normalisation form being used, one which normalises long s (ſ) to s:

$ perl -e 'use utf8; use Unicode::Normalize; print NFC("ſ"),"\n";'
ſ
$ perl -e 'use utf8; use Unicode::Normalize; print NFKC("ſ"),"\n";'
s

@Shreeshrii
Collaborator

Shreeshrii commented Dec 6, 2021

Ok, I have a sample now.

eng Praja exp0_159

Ground Truth: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore
OCR via CLI using custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore
OCR via lstmeval using same custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)2 (p. 142) alaḵkaḻi- ... in the Coimbatore

The superscript 2 (²) is normalized to a plain digit 2 in the lstmeval output.

@Shreeshrii
Collaborator

Similarly for the trademark symbol (™):

san Guru_Italic 0000203 exp0_0

GT: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach
CLI OCR: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach
lstmeval OCR: TOPOGRAPHIC FASHIONABLE WETTER CoreTM?2 problem ALLOWED) *Call YOU, Kanpur coach
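
To check which characters in a ground-truth line are affected by compatibility normalisation, a small scan like the following may help. It is a sketch that assumes NFKC is the form involved, as the Perl example earlier in the thread suggests. `changed_by_nfkc` is a hypothetical helper name, and it tests each code point separately, so it does not account for reordering or composition of combining sequences.

```python
import unicodedata

def changed_by_nfkc(text: str):
    """Return (char, code point, NFKC result) for each character NFKC alters."""
    changes = []
    for ch in text:
        folded = unicodedata.normalize("NFKC", ch)
        if folded != ch:
            changes.append((ch, f"U+{ord(ch):04X}", folded))
    return changes

# Fragments taken from the two samples above.
for line in ("904)² (p. 142)", "Core™2 problem"):
    print(line, "->", changed_by_nfkc(line))
```

Under that assumption the scan reports U+00B2 mapping to `2` and U+2122 mapping to `TM`, matching the differences between the CLI and lstmeval lines.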

@stweil I have attached a zip file with the custom IAST traineddata.

IAST_0.267000_136760_880600.zip

stweil added this to To do: Bug fixes for release 5 in Tesseract next on Jan 8, 2022