'confidence' for a single character differs between 'MeanTextConf()' and 'Confidence(level)' #4135

NewUserHa · 2023-10-03T20:18:10Z

Current Behavior

Compiling an executable is inconvenient, so I used the C++ binding from Python (tesserocr) here. I examined its code, and it calls TessBaseAPI in C++ directly.

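import tesserocr

# Empty tessdata path, Simplified Chinese model, page segmentation mode 10 (single character).
api = tesserocr.PyTessBaseAPI(r"", 'chi_sim', 10)
api.SetVariable("save_blob_choices", "T")
api.SetVariable("lstm_choice_mode", "2")

api.SetImageFile(r"...")
api.Recognize()

# Collect the symbol-level result first, then every alternative from its choice iterator.
choices = []
ri = api.GetIterator()
level = tesserocr.RIL.SYMBOL
for r in tesserocr.iterate_level(ri, level):
    choices.append([r.GetUTF8Text(level), r.Confidence(level)])
    for c in r.GetChoiceIterator():
        choices.append([c.GetUTF8Text(), c.Confidence()])

# First output line: the choice iterator on its own.
print([(c.GetUTF8Text(), c.Confidence()) for c in api.GetIterator().GetChoiceIterator()])
# Second output line: text, MeanTextConf(), the collected choices, MapWordConfidences().
print(api.GetUTF8Text().strip(), api.MeanTextConf(), choices, api.MapWordConfidences() if api.GetUTF8Text() else 0)

api.End()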

output:

[('员', 91.3373794555664), ('灵', 0.0)]
员 93 [['员', 99.01185607910156], ['员', 91.3373794555664], ['灵', 0.0]] [('员', 93)]

[('会', 92.27279663085938), ('针', 0.0), ('，', 0.0), ('|', 0.0), ('。', 0.0)]
会 96 [['会', 99.53471374511719], ['会', 92.27279663085938], ['针', 0.0], ['，', 0.0], ['|', 0.0], ['。', 0.0]] [('会', 96)]

[('患', 64.94859313964844), ('理', 0.0), ('写', 0.0), ('上', 0.0), ('雪', 0.0), ('由', 0.0)]
患 0 [['患', 80.64909362792969], ['患', 64.94859313964844], ['理', 0.0], ['写', 0.0], ['上', 0.0], ['雪', 0.0], ['由', 0.0]] [('患', 0)]

For most single characters, the value returned by MeanTextConf() appears to be the mean of the two confidences, e.g. for '会', (99 + 92)/2 ≈ 96. For the first case, '员', (99 + 91)/2 ≈ 95, not the 93 shown above.
Moreover, for the single character '患', both MapWordConfidences() and MeanTextConf() return 0 (zero), which is obviously wrong.
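The arithmetic above can be checked with a short script. This is only a sketch of the hypothesis stated in this report (that MeanTextConf() is the mean of the symbol-level confidence and the top choice-iterator confidence). It is not a description of how Tesseract actually computes the value. The numbers are copied from the output above, and `mean_of_pair` is a hypothetical helper name.

```python
# Values copied from the output above:
# (character, r.Confidence(level), top ChoiceIterator confidence, MeanTextConf())
samples = [
    ("员", 99.01185607910156, 91.3373794555664, 93),
    ("会", 99.53471374511719, 92.27279663085938, 96),
    ("患", 80.64909362792969, 64.94859313964844, 0),
]


def mean_of_pair(symbol_conf, choice_conf):
    """Hypothetical helper: mean of the two confidences (the guess being tested)."""
    return (symbol_conf + choice_conf) / 2


for char, symbol_conf, choice_conf, reported in samples:
    mean = mean_of_pair(symbol_conf, choice_conf)
    # Compare the rounded mean with what MeanTextConf() actually reported.
    print(char, round(mean), reported, round(mean) == reported)
```

Only '会' matches under this guess. '员' would give 95 instead of 93, and '患' would give 73 instead of 0, so the mean-of-two explanation does not cover all three cases.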

pictures of above examples: (all are extracted from font files with same point size)

Expected Behavior

In the above examples, since each image contains a single character, the confidence should be 99.01185607910156 and 99.53471374511719 (the values returned by r.Confidence(level)), because that is the correct character.

Suggested Fix

as above

tesseract -v

Bundled, but:
tesseract 5.3.1
leptonica-1.83.1 (Jun 13 2023, 19:19:21) [MSC v.1935 LIB Release x64]
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.4) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libwebp 1.3.0 : libopenjp2 2.5.0

Operating System

Windows 10

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

No response


zdenop · 2023-10-06T18:00:24Z

We do not support 3rd party projects (tesseract wrappers).
Please reproduce the problem with C++ code (full working code + input image for testing and maybe the desired output; AFAIK we do not have an active Chinese developer).
Also, provide information if some previous version (e.g. 4.x) was working as you expected.
Thanks.
