lstmeval: Improve output by ensuring 'Truth:' text is encoded the same way as OCR output… #3421

nickjwhite · 2021-05-11T09:47:04Z

This ensures that transformations like Unicode normalisation are applied to
the truth output as well as the OCR output, so that you can compare
the two properly.

Before this a perfect OCR result could show different lines for Truth and
OCR if the OCR output included characters that were normalised.
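
The mismatch described above can be sketched in a few lines of Python. This is only an illustration, not the lstmeval code: the helper `normalise` is hypothetical, and the choice of NFKC as the normalisation form is an assumption made for the demonstration. The sample string is taken from later in this thread.

```python
import unicodedata


def normalise(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: apply one Unicode normalisation form.

    NFKC is an assumption here; the thread does not state which form is used.
    """
    return unicodedata.normalize(form, text)


truth = "904)\u00b2 (p. 142)"   # ground truth keeps SUPERSCRIPT TWO (U+00B2)
ocr = normalise(truth)          # a "perfect" OCR result, after normalisation

# Comparing the raw truth against the normalised OCR text reports a mismatch.
print(truth == ocr)

# Encoding the truth the same way as the OCR output makes the lines comparable.
print(normalise(truth) == ocr)
```

The first comparison fails even though the recognition was perfect; the second succeeds because both sides went through the same transformation.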


Shreeshrii · 2021-09-07T16:07:25Z

@nickjwhite Please provide a sample demonstrating this:

"Before this a perfect OCR result could show different lines for Truth and
OCR if the OCR output included characters that were normalised."

I had noticed this in the past but do not have any ready example to test and verify.

Shreeshrii · 2021-09-07T16:10:55Z

Is this issue related?

wollmers · 2021-09-07T17:35:30Z

"Is this issue related?"

No, this issue looks more like the wrong normalisation form, which normalises long_s to s:

$ perl -e 'use utf8; use Unicode::Normalize; print NFC("ſ"),"\n";'
ſ
$ perl -e 'use utf8; use Unicode::Normalize; print NFKC("ſ"),"\n";'
s

Shreeshrii · 2021-12-06T15:39:50Z

Ok, I have a sample now.

Ground Truth: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore
OCR via CLI using custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore
OCR via lstmeval using same custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)2 (p. 142) alaḵkaḻi- ... in the Coimbatore

With lstmeval, the superscript 2 (²) is being normalized to the digit 2.

Shreeshrii · 2021-12-06T15:50:33Z

Similarly for the trademark symbol:

GT: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach
CLI OCR: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach
lstmeval OCR: TOPOGRAPHIC FASHIONABLE WETTER CoreTM?2 problem ALLOWED) *Call YOU, Kanpur coach
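
For reference, the characters mentioned in this thread can be checked against the two normalisation forms named earlier. The sketch below uses Python's standard `unicodedata` module; the helper `forms` is hypothetical, and nothing here confirms which form lstmeval actually applies. It only shows that compatibility normalisation (NFKC) rewrites these characters while canonical normalisation (NFC) leaves them alone.

```python
import unicodedata

# Characters that appear in the samples in this thread.
samples = {
    "superscript two": "\u00b2",   # ²
    "trade mark sign": "\u2122",   # ™
    "long s": "\u017f",            # ſ
}


def forms(ch: str) -> tuple[str, str]:
    """Hypothetical helper: return the (NFC, NFKC) forms of a string."""
    return unicodedata.normalize("NFC", ch), unicodedata.normalize("NFKC", ch)


for name, ch in samples.items():
    nfc, nfkc = forms(ch)
    print(f"{name}: NFC={nfc!r} NFKC={nfkc!r}")
```

Under NFKC, ² becomes 2, ™ becomes TM and ſ becomes s. The "?" in the lstmeval line above is not accounted for by this mapping.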

@stweil I have attached a zip file with the custom IAST traineddata.

IAST_0.267000_136760_880600.zip

stweil added this to "To do: Bug fixes for release 5" in "Tesseract next" on Jan 8, 2022
