
Incorrect word coordinates #1712

Open
troplin opened this issue Jun 28, 2018 · 15 comments · May be fixed by #3787

Comments

@troplin

troplin commented Jun 28, 2018

Environment

  • Tesseract Version:

    tesseract --version
    tesseract 4.0.0-beta.1
    leptonica-1.76.0 (Jun 26 2018, 18:21:40) [MSC v.1900 LIB Release x86]
     libgif 5.1.4 : libjpeg 9b : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
    Found AVX
    Found SSE
    

    (says beta.1, but beta.3 seems to be correct)

  • Commit Number: AppVeyor: 4.0.0-beta.3.1776

  • Platform: Windows 10 64bit (but tesseract running as 32bit)

  • Tessdata: tessdata-fast

We've integrated the engine in our (closed-source) application using the C API, so I cannot share the actual code.

What I do is basically just iterate over the result iterator using RIL_WORD, get the text, bounding box and baseline for each word, and then create a PDF from them, with the recognized text as a red overlay and the bounding boxes drawn in green for better visibility.
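For reference, here is a minimal sketch of that loop via the C API (tesseract/capi.h); it is not the actual application code. The image name comes from this issue, while the "eng" language and the null datapath are placeholders, and the PDF generation itself is omitted.

#include <leptonica/allheaders.h>
#include <tesseract/capi.h>
#include <cstdio>

int main() {
  TessBaseAPI* api = TessBaseAPICreate();
  if (TessBaseAPIInit3(api, nullptr, "eng") != 0) return 1;  // placeholder language/datapath

  Pix* image = pixRead("andromeda.png");
  TessBaseAPISetImage2(api, image);
  TessBaseAPIRecognize(api, nullptr);

  TessResultIterator* ri = TessBaseAPIGetIterator(api);
  if (ri != nullptr) {
    const TessPageIterator* pi = TessResultIteratorGetPageIteratorConst(ri);
    do {
      char* word = TessResultIteratorGetUTF8Text(ri, RIL_WORD);
      int left, top, right, bottom, x1, y1, x2, y2;
      if (word != nullptr &&
          TessPageIteratorBoundingBox(pi, RIL_WORD, &left, &top, &right, &bottom) &&
          TessPageIteratorBaseline(pi, RIL_WORD, &x1, &y1, &x2, &y2)) {
        // In the real application these values feed the red text overlay and
        // the green boxes in the generated PDF.
        printf("%s bbox=(%d,%d,%d,%d) baseline=(%d,%d)-(%d,%d)\n",
               word, left, top, right, bottom, x1, y1, x2, y2);
      }
      TessDeleteText(word);
    } while (TessResultIteratorNext(ri, RIL_WORD));
    TessResultIteratorDelete(ri);
  }

  TessBaseAPIEnd(api);
  TessBaseAPIDelete(api);
  pixDestroy(&image);
  return 0;
}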

But the official PDF output config has the same flaw; it's just more difficult to spot:

tesseract andromeda.png andromeda.tess4cli pdf

produces the following PDF:
andromeda.tess4cli.pdf

Files:

Behavior can be reproduced using the following PNG (which is a part of a bigger file):

andromeda

Current Behavior:

The coordinates are correct for most words.
But for some words, there seems to be an error in computing the boundaries between the words.
Curiously, the wrong boundary is always before the last character of the previous word.
It almost looks like some kind of off-by-one error.

grafik

grafik

Expected Behavior:

Result with tesseract 3:

grafik

grafik

@willaaam

willaaam commented Sep 3, 2018

Seems like this could be related to the issue I'm facing.

#1192 (comment)

@amitdo
Collaborator

amitdo commented Oct 5, 2018

I don't think there is a bug in this case.

The lstm network only returns xcoords. AFAIK, each xcoord is just one spot (pixel) in each recognized glyph. The spot can be in the beginning, middle or end of each glyph. Because the lstm net does not produce bboxes, tesseract tries to estimate the bboxes from these xcoords. This process can't be done accurately.

In cases where some bboxes are just complete garbage, there is certainly a bug somewhere, probably in the conversion from xcoords to bboxes.

Don't mix up the 'no bug' cases with the buggy cases.
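To make that concrete, here is a tiny illustration (not Tesseract's actual code) of the kind of estimation described above: one x coordinate per recognized glyph, with character bounds obtained by cutting halfway between neighbouring coordinates. Any drift in the x coordinates shifts the estimated boxes accordingly.

#include <utility>
#include <vector>

// Given one x position per recognized glyph on a text line, estimate each
// glyph's horizontal extent by placing the cut halfway between neighbouring
// positions (illustration only).
std::vector<std::pair<int, int>> EstimateCharBounds(const std::vector<int>& xcoords,
                                                    int line_left, int line_right) {
  std::vector<std::pair<int, int>> bounds;
  for (size_t i = 0; i < xcoords.size(); ++i) {
    int left = (i == 0) ? line_left : (xcoords[i - 1] + xcoords[i]) / 2;
    int right = (i + 1 == xcoords.size()) ? line_right : (xcoords[i] + xcoords[i + 1]) / 2;
    bounds.emplace_back(left, right);
  }
  return bounds;
}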

@amitdo
Collaborator

amitdo commented Oct 5, 2018

Thinking about this again.
It's possible that there is a small bug here.

As said, the cases with complete garbage boxes are certainly bugs.

@Sintun
Contributor

Sintun commented Oct 6, 2018

Hey there, I'm still working on this bug.
By looking at the internal values of tesseract applied to
test_3
I tracked the issue down to RecodeBeamSearch::Decode.
After the application of this function
beam_[beam_size_ - 1]->beams_[beam_index].get(h).data
contains slightly off positions, duplicates, and unichar_ids for all valid values of beam_index and h.
For example, the positions of u and s are attributed to the left and right half of the letter u.

It is possible that this function works correctly but gets the off values from the LSTM recognizer. I will investigate this the next time I can take a look at it.

Reproducible by dropping the following code at the end of RecodeBeamSearch::Decode

{
  // Debug dump: for every surviving entry of the last beam, walk its best
  // path and print the x-break position, certainty, duplicate flag and
  // unichar_id of each node.
  GenericVector<const RecodeNode*> _best_nodes;
  GenericVector<const RecodeNode*>* best_nodes = &_best_nodes;
  const RecodeBeam* last_beam = beam_[beam_size_ - 1];
  unsigned node_ct = 0;
  for (int c = 0; c < NC_COUNT; ++c) {
    NodeContinuation cont = static_cast<NodeContinuation>(c);
    for (int is_dawg = 0; is_dawg < 2; ++is_dawg) {
      int beam_index = BeamIndex(is_dawg, cont, 0);
      int heap_size = last_beam->beams_[beam_index].size();
      for (int h = 0; h < heap_size; ++h) {
        const RecodeNode* node = &last_beam->beams_[beam_index].get(h).data;
        ExtractPath(node, best_nodes);
        ++node_ct;

        int dbg_ct = 0;
        for (unsigned i = 0; i < best_nodes->size(); ++i) {
          if ((*best_nodes)[i]->unichar_id != INVALID_UNICHAR_ID) {
            // i is the timestep index; 2.75 scales it back to image x
            // coordinates (factor specific to this image, see note below).
            printf("~ %d %d x-break pos %d, conf:%f, duplicate:%d, u-id:%d\n",
                   node_ct, dbg_ct++, int(std::round(i * 2.75f)),
                   (*best_nodes)[i]->certainty, int((*best_nodes)[i]->duplicate),
                   (*best_nodes)[i]->unichar_id);
          }
        }
      }
    }
  }
}

Multiplication by 2.75 is done in order to obtain the original image coordinates. This factor probably only applies to exactly this image.

Example output (for one of the beam_index, h combinations):

~ 2 13 x-break pos 61, conf:-0.085068, duplicate:1, u-id:101
~ 2 14 x-break pos 72, conf:-0.085045, duplicate:0, u-id:91
~ 2 15 x-break pos 74, conf:-0.085093, duplicate:1, u-id:91
~ 2 16 x-break pos 77, conf:-0.085023, duplicate:1, u-id:91
~ 2 17 x-break pos 80, conf:-0.089978, duplicate:1, u-id:91
~ 2 18 x-break pos 83, conf:-0.085023, duplicate:1, u-id:91
~ 2 19 x-break pos 91, conf:-0.085059, duplicate:0, u-id:94
~ 2 20 x-break pos 94, conf:-0.086384, duplicate:1, u-id:94
~ 2 21 x-break pos 96, conf:-0.085012, duplicate:1, u-id:94
~ 2 22 x-break pos 99, conf:-0.085054, duplicate:1, u-id:94
~ 2 23 x-break pos 102, conf:-0.085051, duplicate:1, u-id:94
~ 2 24 x-break pos 113, conf:-0.085026, duplicate:0, u-id:89
~ 2 25 x-break pos 116, conf:-0.085202, duplicate:1, u-id:89

u-id 91 and 94 are the letters u and s. The x-break positions show that both are attributed to the position of u in the original image.

@Sintun
Contributor

Sintun commented Oct 21, 2018

Hey there,

I invested another day and got into the LSTM decoder. I found nothing wrong with it, so I tested the outcome of different LSTM models for the same and for different languages. It turns out that this behavior is strongly model dependent: for example, the standard and best English models as well as the fast German one give the correct bounding boxes for the "thousand Billion" example, while other models for these languages fail to do so.
So I came to the conclusion that the LSTM output is unreliable for exact character & word positions. Nonetheless, the output of the segmenter is still very good; it just gets discarded before the lines go into the LSTM.

I will write a little function that first uses Tesseract's layout analysis to obtain word (and probably character) positions and then overwrites the LSTM results with the results of the layout analysis when possible.

So the source of the bug is unfixable for me, but a workaround is possible.
I don't think workarounds should be part of Tesseract, so I will do this outside of Tesseract after getting the necessary information via the API.

At the moment I'm planning to create this workaround function only for my own project, but I'm willing to integrate it into the Tesseract command line tool if someone is interested in using it.
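For illustration only (this is not Sintun's workaround code; the Box type and the overlap heuristic are assumptions): given the word boxes coming out of recognition and the word boxes from layout analysis, one way to "overwrite when possible" is to replace each recognized box with the layout box that overlaps it most horizontally, keeping the original box when nothing overlaps.

#include <vector>

struct Box { int left, top, right, bottom; };  // simple placeholder box type

// Horizontal overlap in pixels between two boxes (0 if disjoint).
static int HorizontalOverlap(const Box& a, const Box& b) {
  int lo = a.left > b.left ? a.left : b.left;
  int hi = a.right < b.right ? a.right : b.right;
  return hi > lo ? hi - lo : 0;
}

// For each recognized word box, pick the layout-analysis word box that
// overlaps it most in x and use that instead; keep the original box if
// nothing overlaps.
std::vector<Box> SnapToLayoutBoxes(const std::vector<Box>& recognized_boxes,
                                   const std::vector<Box>& layout_boxes) {
  std::vector<Box> result;
  result.reserve(recognized_boxes.size());
  for (const Box& word : recognized_boxes) {
    int best_overlap = 0;
    Box best = word;
    for (const Box& candidate : layout_boxes) {
      int overlap = HorizontalOverlap(word, candidate);
      if (overlap > best_overlap) {
        best_overlap = overlap;
        best = candidate;
      }
    }
    result.push_back(best);
  }
  return result;
}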

@Shreeshrii
Collaborator

I will write a little function that first uses Tesseract's layout analysis to obtain word (and probably character) positions and then overwrites the LSTM results with the results of the layout analysis when possible.

@Sintun I think this function will be useful for many Tesseract users, especially those who have been relying on the character coordinates in older versions. It would be great if you would integrate it into Tesseract.

@Shreeshrii
Collaborator

Shreeshrii commented Oct 22, 2018 via email

@FarhadKhalafi

I am wondering if what I see in this screenshot is related to the same problem. The text is correctly extracted but the bounding rectangles are wrong for a couple of words.

image

@Sintun
Contributor

Sintun commented Oct 26, 2018

Hi @FarhadKhalafi,
yes, that is this bug. I have a working workaround (for my own project) that reliably restores the word boxes and gives reasonable results for character positions. It should also work for #1192. I'm planning on integrating it into Tesseract within the next 2-3 weeks.

@FarhadKhalafi

Hi @Sintun,
Thanks for the update. Accurate location of, at least, words is often critical. I use the OCR text for table extraction, and this kind of shift will break my algorithms. If you are willing to share your fix, I might be able to assist or at least test.

@Sintun
Contributor

Sintun commented Oct 26, 2018

Testing, specifically, would be greatly appreciated :) Sharing the workaround code would be of little value, because in my project it solves the problem after the Tesseract API calls (even after the tess object's destruction). I will refactor it so that it works within Tesseract.

@willaaam

Even sharing that might help someone, Sintun. A person on my team is writing that kind of after-API fix as well, in Python (it corrects the hOCR). I'll be sharing that too when it's done.

@Sintun
Contributor

Sintun commented Nov 11, 2018

Hey there, I just started writing my workaround and looked at the state of affairs.
I had to realize that Ray created a native workaround using the historically grown Tesseract objects. From my perspective, this is the best case, because I wouldn't have had the time to do this within the Tesseract framework. A closer look reveals that it is similar to my external approach, so I'm stopping my work on this.
As can be seen in #1192, it should find its way into the git repo within the next few days.

@zdenop
Contributor

zdenop commented Nov 12, 2018

Please test current code.

@FarhadKhalafi

FarhadKhalafi commented Nov 14, 2018

I tested the new code for accurate placement of bounding rectangles for recognized words. The code works much better in identifying the left and right edges of words. However, I did observe certain anomalies with the vertical placement of bounding rectangles. I have attached a ZIP of two TIFF files. The file named 85754_full.tif shows the vertical placement problem:

image

If I crop this image and zoom in on the problem area, the anomaly disappears (see 85754_crop.tif):
image

My guess is that the surrounding boxes influence the vertical placement when the text lines are detected.

85754.zip

@tesseract-ocr tesseract-ocr deleted a comment from Shreeshrii Aug 16, 2019
@tesseract-ocr tesseract-ocr deleted a comment from Shreeshrii Aug 16, 2019
p12tic added a commit to p12tic/tesseract that referenced this issue Apr 10, 2022
p12tic added a commit to p12tic/tesseract that referenced this issue Apr 10, 2022
p12tic added a commit to p12tic/tesseract that referenced this issue Apr 10, 2022
When using LSTM models the accuracy of character bounding boxes is low,
with many blobs assigned to the wrong characters. This is caused by the
fact that the LSTM model output provides only approximate character
positions without boundary data. As a result the input blobs cannot be
accurately mapped to characters, which compromises the accuracy of the
character bounding boxes as well.

Currently this problem is solved as follows. The character boundaries are
computed from the character positions in the LSTM output by placing the
boundaries at the middle between two character positions. The blobs are
then assigned according to which character the center of the blob falls
into. In other words, the blobs are assigned to the nearest characters.

This unfortunately produces a lot of errors, because the character
positions in the LSTM output have a tendency to drift, so the nearest
character is often not the right one.

Fortunately, while the LSTM model produces approximate positions, the
blob boundaries produced by the regular segmenter are pretty good. Most
of the time a single blob corresponds to a single character and
vice versa.

The above is used to create an optimization algorithm that treats the
output of the regular segmenter as a template to which the LSTM model
output is matched. The best match is selected by assigning each unwanted
property of the outcome a cost and then minimizing the total cost of the
solution.

This reliably solves the most frequent error in the current solution,
where blobs are simply assigned to the wrong character. As a result the
new algorithm produces up to 20 times fewer errors.

Fixes tesseract-ocr#1712.
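As a rough sketch of the cost-minimization idea described above (this is not the actual implementation in the linked pull request; names and the cost function are made up): assign the segmenter's blobs to LSTM character positions so that the assignment stays in reading order and the summed distance between blob centres and character positions is minimal, instead of snapping each blob independently to its nearest character.

#include <cmath>
#include <limits>
#include <vector>

// Illustration of matching blobs to character positions by minimizing a
// total cost.  dp[i][j] is the minimal cost of assigning the first i blobs
// when blob i-1 sits on character j; assignments may not move backwards.
std::vector<int> AssignBlobsToChars(const std::vector<int>& blob_centers,
                                    const std::vector<int>& char_positions) {
  const size_t nb = blob_centers.size(), nc = char_positions.size();
  if (nb == 0 || nc == 0) return std::vector<int>(nb, -1);
  const double kInf = std::numeric_limits<double>::infinity();
  std::vector<std::vector<double>> dp(nb + 1, std::vector<double>(nc, kInf));
  std::vector<std::vector<int>> back(nb + 1, std::vector<int>(nc, -1));
  for (size_t j = 0; j < nc; ++j) dp[0][j] = 0.0;  // nothing assigned yet
  for (size_t i = 1; i <= nb; ++i) {
    for (size_t j = 0; j < nc; ++j) {
      double cost = std::abs(static_cast<double>(blob_centers[i - 1]) - char_positions[j]);
      for (size_t k = 0; k <= j; ++k) {  // previous blob on an earlier (or the same) char
        if (dp[i - 1][k] + cost < dp[i][j]) {
          dp[i][j] = dp[i - 1][k] + cost;
          back[i][j] = static_cast<int>(k);
        }
      }
    }
  }
  // Pick the cheapest final character and walk the backpointers.
  size_t best = 0;
  for (size_t j = 1; j < nc; ++j)
    if (dp[nb][j] < dp[nb][best]) best = j;
  std::vector<int> assignment(nb);
  for (size_t i = nb; i > 0; --i) {
    assignment[i - 1] = static_cast<int>(best);
    best = static_cast<size_t>(back[i][best]);
  }
  return assignment;
}

This only shows the shape of such an optimization; the cost terms and matching in the actual pull request are more involved.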
@p12tic p12tic linked a pull request Apr 10, 2022 that will close this issue
p12tic added a commit to p12tic/tesseract that referenced this issue May 15, 2022
p12tic added a commit to p12tic/tesseract that referenced this issue Oct 29, 2022
GerHobbelt pushed a commit to GerHobbelt/tesseract that referenced this issue Aug 24, 2023