RFC: Improve positioning of symbol bounding boxes #3787

p12tic · 2022-04-10T21:03:35Z

This PR improves the positions of symbol bounding boxes when an LSTM model is used. Up to 20 times fewer errors have been observed in sample images.

This PR still requires a potentially significant amount of work. Please let me know whether the approach is sensible in principle; if the PR makes sense, I will spend time polishing it.

When using LSTM models, the accuracy of character bounding boxes is low, with many blobs assigned to the wrong characters. This is because the LSTM model output provides only approximate character positions without boundary data. As a result, the input blobs cannot be accurately mapped to characters, which compromises the accuracy of character bounding boxes.

Currently this problem is solved as follows. Character boundaries are computed from the character positions in the LSTM output by placing each boundary midway between two adjacent character positions. Each blob is then assigned to the character whose interval contains the center of the blob. In other words, blobs are assigned to the nearest characters.
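To make the failure mode concrete, here is a minimal Python sketch of the nearest-character assignment described above. It is not the tesseract implementation: the function names, the `(left, right)` blob representation, and the sample numbers are all assumptions chosen for illustration, and the character positions are assumed to be sorted.

```python
from bisect import bisect_right


def midpoint_boundaries(char_positions):
    """Place one boundary midway between each pair of adjacent character positions."""
    return [(a + b) / 2 for a, b in zip(char_positions, char_positions[1:])]


def assign_blobs_nearest(char_positions, blobs):
    """Return, for each blob (left, right), the index of the character
    whose interval contains the blob's horizontal center."""
    bounds = midpoint_boundaries(char_positions)
    return [bisect_right(bounds, (left + right) / 2) for left, right in blobs]


# Hypothetical data: three characters whose LSTM positions drifted to the right,
# and three well-separated blobs from the segmenter.
char_positions = [12, 24, 36]
blobs = [(0, 8), (10, 18), (20, 28)]
print(assign_blobs_nearest(char_positions, blobs))  # -> [0, 0, 1]
```

In this made-up example the second blob is attached to the first character and the third character receives no blob at all, even though the blobs themselves are cleanly separated.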

This unfortunately produces a lot of errors because the character positions in the LSTM output have a tendency to drift, thus the nearest character is often not the right one.

Fortunately while the LSTM model produces approximate positions, the blob boundaries produced by the regular segmenter are pretty good. Most of the time a single blob corresponds to a single character and vice-versa.

This observation is used to build an optimization algorithm that treats the output of the regular segmenter as a template against which the LSTM model output is matched. The best match is selected by assigning a cost to each unwanted property of the outcome and then minimizing the total cost of the solution.
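The idea of cost-based matching can be sketched with a small dynamic program in Python. This is only an illustration under stated assumptions, not the code in this PR: the three cost terms, their weights, and all names below are hypothetical. Each character receives a contiguous run of zero or more blobs, and the assignment with the lowest total cost wins.

```python
import math

# Hypothetical cost weights, chosen only for this illustration.
EMPTY_COST = 5.0       # a character ends up with no blob
EXTRA_BLOB_COST = 2.0  # each blob beyond the first assigned to one character
DRIFT_WEIGHT = 0.1     # per pixel between the LSTM position and the group center


def group_cost(char_x, group):
    """Cost of giving the contiguous blob run `group` to the character at char_x."""
    if not group:
        return EMPTY_COST
    center = (group[0][0] + group[-1][1]) / 2
    return EXTRA_BLOB_COST * (len(group) - 1) + DRIFT_WEIGHT * abs(char_x - center)


def assign_blobs_min_cost(char_positions, blobs):
    """Return the character index for each blob that minimizes the total cost."""
    n_chars, n_blobs = len(char_positions), len(blobs)
    # best[i][j]: minimal cost of giving the first j blobs to the first i characters.
    best = [[math.inf] * (n_blobs + 1) for _ in range(n_chars + 1)]
    take = [[0] * (n_blobs + 1) for _ in range(n_chars + 1)]
    best[0][0] = 0.0
    for i in range(1, n_chars + 1):
        for j in range(n_blobs + 1):
            for k in range(j + 1):  # character i-1 takes the last k of the j blobs
                cost = best[i - 1][j - k] + group_cost(char_positions[i - 1], blobs[j - k:j])
                if cost < best[i][j]:
                    best[i][j], take[i][j] = cost, k
    # Walk back through the recorded choices to recover the assignment.
    assignment = [0] * n_blobs
    j = n_blobs
    for i in range(n_chars, 0, -1):
        k = take[i][j]
        for b in range(j - k, j):
            assignment[b] = i - 1
        j -= k
    return assignment


# Same hypothetical drifted positions as a nearest-character scheme would struggle with.
print(assign_blobs_min_cost([12, 24, 36], [(0, 8), (10, 18), (20, 28)]))  # -> [0, 1, 2]
```

Because leaving a character without a blob is expensive compared with a few pixels of drift, the one-to-one assignment wins here even though every LSTM position is offset from its blob.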

This reliably fixes the most frequent error in the current solution, in which blobs are simply assigned to the wrong character. As a result, the new algorithm produces up to 20 times fewer errors.

This can be improved further, because the root cause of most of the remaining errors is the segmenter producing a single blob for multiple characters. The algorithm could be biased to split blobs in places where the segmenter often makes errors, such as near the character "t".
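One way such a bias could look, sketched in Python: the cost of letting several recognized characters share (and therefore split) one blob is scaled down when a character known to cause merged blobs is involved. The function, the base cost, and the discount table are hypothetical; only the character "t" comes from the description above.

```python
# Hypothetical values for illustration; only "t" is mentioned in the PR description.
BASE_SPLIT_COST = 2.0
SPLIT_DISCOUNT = {"t": 0.5}  # factor below 1.0 makes splitting near this character cheaper


def split_cost(chars_sharing_blob):
    """Cost of splitting one segmenter blob among the given recognized characters."""
    if len(chars_sharing_blob) <= 1:
        return 0.0  # a blob holding a single character needs no split
    factor = min(SPLIT_DISCOUNT.get(c, 1.0) for c in chars_sharing_blob)
    return BASE_SPLIT_COST * (len(chars_sharing_blob) - 1) * factor


print(split_cost("rn"))  # -> 2.0
print(split_cost("st"))  # -> 1.0
```

With a table like this, an optimizer that minimizes total cost would prefer to split a blob next to "t" over splitting an equally plausible blob elsewhere.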

Fixes #1712.

One of the example images I've used:

Before this PR, tesseract produced 116 errors in determining character bounding boxes (the count may be inaccurate because it was done manually).

After this PR, tesseract produced only 5 errors in determining character bounding boxes.

p12tic · 2022-04-18T20:52:49Z

@stweil Just a friendly ping :-)


rmast · 2022-08-13T23:25:26Z

I've done some testing with this branch merged to the current main. It's still not perfect.
With oem 0:

With LSTM:

Original image:

Program code of bounding boxes:

import cv2
import matplotlib as mpl
import matplotlib.pyplot as plt
import pytesseract

img3 = cv2.imread('/home/rmast/plaatjes/out1496-3078-212-39.png')
h, w, _ = img3.shape # assumes color image
# Ask tesseract for per-character boxes; the commented-out config selects the legacy engine (oem 0).
pytess_result = pytesseract.image_to_boxes(img3, lang='nld+lat+Latin+eng',
        config="--psm 7 -c tessedit_create_boxfile=1", output_type=pytesseract.Output.DICT)
        #config="--psm 7 --oem 0 -c tessedit_create_boxfile=1", output_type=pytesseract.Output.DICT)
print(pytess_result)
# Box coordinates are measured from the bottom-left corner, so flip y for OpenCV.
for j in range(len(pytess_result["char"])):
    left = pytess_result["left"][j]
    bottom = pytess_result["bottom"][j]
    right = pytess_result["right"][j]
    top = pytess_result["top"][j]
    cv2.rectangle(img3, (left, h - top - 1), (right, h - bottom - 1), (255, 0, 0), 1)
# Show the image with the boxes drawn on it.
mpl.use('tkAgg')
plt.imshow(img3)
plt.show()

oem 0:

With LSTM:

Original image:

Without your patch:

lv-saharan · 2023-10-26T05:52:04Z

Why is this still not merged?

rmast · 2023-10-26T06:37:56Z

I guess it's a lack of testing capacity for core functionality. Since I discovered that and some other low-level segmentation bugs, my focus has shifted to EasyOCR.


Improve flexibility of MoveAndClipBox

dbb2adb

wollmers mentioned this pull request May 2, 2022

Repair boundingbox of individual characters of textangle 90 text #3599

Closed

wollmers mentioned this pull request May 10, 2022

LSTM Engine Diplopia Issue and Inaccurate HOCR Character Level Box Dimensions #3477

Open

p12tic force-pushed the improve-symbol-positions branch from cbe83ec to 51a3398 Compare May 15, 2022 21:15

amitdo added the bounding box label Jun 20, 2022
