Wrong model causes negative array index when calling DetectOrientationScript() #4062

todd-richmond · 2023-04-27T01:18:14Z

Current Behavior

Latest Tesseract 5.3.1 has a bug in DetectOrientationScript() that causes it to randomly fail when attempting to detect image orientation for 90 degree image scans. The problem is that getting a script id can return -1 which then indexes into a "done" array
On my system this fails about 20% of the time on the exact same image input
Error found with valgrind

Expected Behavior

Trivial fix to not write to done[-1] which then allows DetectOrientationScript() to always return the same results

Suggested Fix

--- src/ccmain/osdetect.cpp.org	2023-04-26 16:26:18.618274574 -0700
+++ src/ccmain/osdetect.cpp	2023-04-26 16:54:37.041375058 -0700
@@ -488,10 +488,12 @@
         }
       }
       // Script already processed before.
-      if (done[id]) {
-        continue;
+      if (id != -1) {
+        if (done[id]) {
+          continue;
+        }
+        done[id] = true;
       }
-      done[id] = true;
 
       unichar = tess_->unicharset.id_to_unichar(choice->unichar_id());

tesseract -v

No response

Operating System

RHEL 8

Other Operating System

No response

uname -a

No response

Compiler

gcc 12.2

CPU

No response

Virtualization / Containers

No response

Other Information

No response

The text was updated successfully, but these errors were encountered:

stweil · 2023-04-27T04:13:10Z

Is this a regression, or do you see the same issue also with older releases? How did you start tesseract, so that the problematic code is executed?

todd-richmond · 2023-04-27T05:11:44Z

I'm using my own code that just calls SetImage and then DetectOrientationScript, but if you look at the diff, it is quite trivial so inspection should be enough. "id" can be -1 from choice->script_id() and when it is there is no check before using it as an array index. GitHub "blame" shows it back from 3.0
96e8b51

on a side note, this method is much, must faster than using Recognize and iterator calls to determine image rotation as shown in the tesseract samples page

zdenop · 2023-04-27T05:27:59Z

Can you please also provide some example image to test the issue?

todd-richmond · 2023-04-27T16:21:54Z

scan-90.zip
Here is a scanned costco receipt, but most any image should work

stweil · 2023-04-27T16:50:17Z

I tried tesseract scan-90.tiff - --psm 0 and never get a negative script id. Please paste your test code.

Maybe you can also test this patch which uses a packed vector of boolean values:

--- a/src/ccmain/osdetect.cpp
+++ b/src/ccmain/osdetect.cpp
@@ -460,7 +460,7 @@ ScriptDetector::ScriptDetector(const std::vector<int> *allowed_scripts, OSResult
 // adding this blob.
 void ScriptDetector::detect_blob(BLOB_CHOICE_LIST *scores) {
   for (int i = 0; i < 4; ++i) {
-    bool done[kMaxNumberOfScripts] = {false};
+    std::vector<bool> done(kMaxNumberOfScripts);
 
     BLOB_CHOICE_IT choice_it;
     choice_it.set_to_list(scores + i);
@@ -488,7 +488,7 @@ void ScriptDetector::detect_blob(BLOB_CHOICE_LIST *scores) {
         }
       }
       // Script already processed before.
-      if (done[id]) {
+      if (done.at(id)) {
         continue;
       }
       done[id] = true;

todd-richmond · 2023-04-27T17:11:18Z

your new patch still calls done.at(-1) which is invalid

The code to repro is quite simple. The returned "degree" value is indeterminate because the -1 index randomly breaks the loop early depending on an invalid stack value

int degree;
float dconf, sconf;
const char *script;

ocr->Init(ocrdir.c_str(), 'eng');
ocr->SetImage(pix);
if (ocr->DetectOrientationScript(&degree, &dconf, &script, &sconf) && degree && dconf >= 0.9) {
            Pix *rpix = pixRotateOrth(pix, (360 - degree) / 90);

            if (rpix) {
                ocr->SetImage(rpix);
                pixDestroy(&rpix);
            }
}

stweil · 2023-04-27T18:05:57Z

Calling done.at(-1) should raise an exception, so your program would no longer show random behaviour but fail.

I assume that your code is incomplete and misses some initialisation which is part of tesseract (which does not fail because the index is never negative). Therefore I asked for your complete code. With the snippet above I still have to guess how you created ocr.

todd-richmond · 2023-04-27T18:39:20Z

? by definition the Init call succeeds (using only 'eng' data) because 80% of the time rotation is detected and text is extracted. The index is absolutely -1 on every call, but dereferencing an array index of -1 is ALWAYS a bug with indeterminate value
An exception does not help whatsoever because the program crashes instead of returning correct results. With this fix, valgrind no longer shows a branch on uninitialized var and rotation value is always returned correctly

ocr->Init(ocrdir.c_str(), 'eng')

stweil · 2023-04-27T19:01:15Z

Is it possible to provide a complete working example code including main()?

It's not necessary to repeat that a negative index is not okay because nobody questions that. But according to my tests the tesseract executable never uses an illegal index here.

Supporting your use case might be useful (maybe with a different patch).

But it might also be undesired behaviour which requires better documentation how that function should be used to avoid negative indexes. Raising an exception when an API is not used according to the documentation would be a valid fix.

I still don't know which of the two options is the better one.

todd-richmond · 2023-04-27T19:14:33Z

you can paste those 3 lines into any test snippet that reads a Pix object (such as the ones on tesseract site) - there really is nothing special here other than init(), setImage() and DetectOrientationScript() in order

There are quite a few stackoverflow and other internet questions asking how to detect image rotation. They all recommend Recognize() with a 0 PSM, but that is incredibly slow. This solution runs dramatically faster and since you have to rotate before calling Recognize with a non-0 PSM, it can lead to dramatic improvements on bulk OCR

If this API solution is ok, it would help to add internal image rotation as an option to Tesseract API enabled through config or some other option? It's a common problem with a simple solution, but very hard to track down (took me a couple days to find and debug)

stweil · 2023-04-27T19:24:31Z

Sorry, I don't have the time to collect code snippets to build a test program for this issue. I am afraid that without a complete working example code I cannot help.

todd-richmond · 2023-04-27T20:58:00Z

here is sample code

  const char* inputfile = "scan-90.tiff";                             
  int degree;
  float dconf, sconf;          
  const char *script;          
  PIX *image = pixRead(inputfile);
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init("tessdata", "eng"); 
  api->SetImage(image);
  if (api->DetectOrientationScript(&degree, &dconf, &script, &sconf) && degree && dconf >= 15.0) {
    Pix *rimage = pixRotateOrth(image, (360 - degree) / 90);        

    if (rimage) {                          
      api->SetImage(rimage);               
      pixDestroy(&rimage);
    }        
  }  
  pixDestroy(&image);        
  api->Recognize(NULL); 
  char *text = api->GetUTF8Text();            
  printf("degree: %d  confidence: %f\n\n%s", degree, dconf, text);

amitdo · 2023-04-28T09:36:05Z

https://github.com/tesseract-ocr/tessdoc/blob/main/examples/OSD_example.cc

The code currently does not check the returned values from the methods it calls.

todd-richmond · 2023-04-28T16:10:33Z

My sample code does check error codes, but that doesn't matter because it returns correct results with my trivial 2 line patch - never a failure

amitdo · 2023-05-05T17:14:43Z

and since you have to rotate before calling Recognize with a non-0 PSM

You don't need to rotate with PSM 1.

stweil · 2023-05-06T08:26:45Z

@todd-richmond, add destructors to your sample code to avoid memory leaks:

  delete[] text;
  delete api;

The examples were also incomplete. This will be fixed with tesseract-ocr/tessdoc#110.

Even with your patched code, there remains an illegal memory read:

../../../src/ccmain/osdetect.cpp:525:7: runtime error: index -1 out of bounds for type 'float [120]'

todd-richmond · 2023-05-06T16:30:33Z

@todd-richmond, add destructors to your sample code to avoid memory leaks:
  delete[] text;
  delete api;
The examples were also incomplete. This will be fixed with tesseract-ocr/tessdoc#110.

Even with your patched code, there remains an illegal memory read:
../../../src/ccmain/osdetect.cpp:525:7: runtime error: index -1 out of bounds for type 'float [120]'

my runtime code also has deletes - this is just a snippet that was asked for. I don't have valgrind or other errors with my patch in my app and the API returns expected results 100% of the time. I see the vector<> change was merged so I can't speak to what happens with that since it doesn't check the subscript either

todd-richmond · 2023-05-06T16:35:10Z

I'm not quite sure why it has been so hard to get this patch accepted. I contribute small patches to dozens of open source projects we use at work and this example solves a problem that a LOT of people on the net have asked about. I found the alternative reading Tesseract headers after finding the common suggestion far too slow. The API call is documented as official, but fails in the same code + image I gave. The trivial patch resolves the bug w/o changing any internal execution paths

stweil · 2023-05-06T17:11:30Z

Your patch does not fix the problem, because there remains an array access with index -1 (see my previous comment). You can reproduce the problem by using a sanitizer build ('--disable-openmp' '--disable-shared' '--enable-debug' 'CXX=clang++-11' 'CXXFLAGS=-D_GLIBCXX_DEBUG -Wall -g -O0 -fsanitize=address,undefined -fstack-protector-strong -ftrapv').

In addition I get wrong results for an image without rotation and with mostly non-Latin script:

./test test/testing/devatest.png 
src/ccmain/osdetect.cpp:530:7: runtime error: index -1 out of bounds for type 'float [120]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccmain/osdetect.cpp:530:7 in 
degree: 270  confidence: 93.000000, script Latin confidence 2.000000

If I apply your code to your test image, I get a rotation of 270°. I get the same result when I rotate that image by 90° (which makes it upright).

amitdo · 2023-05-06T20:41:04Z

In addition I get wrong results for an image without rotation and with mostly non-Latin script

That's because he used "eng", which is fine if the user knows that the text in the image is written in English and only wants to find the orientation of the page.

stweil · 2023-05-06T20:51:11Z

Then it should work for the image provided which only contains English text, but it does not.

If I replace "eng" by "osd", all tests succeed and there is no longer a negative array index.
So it looks like the issue is solved: use "osd" and everything is fine. If "osd" is not used, fail with an exception because of a wrong API call.

amitdo · 2023-05-06T21:54:23Z

Then it should work for the image provided which only contains English text, but it does not.

You need to use an old eng.traineddata with legacy only data.

It will probably also work with a trainedata from the tessdata repo, but you may need to change the oem to 0.

stweil · 2023-05-07T07:10:35Z

Thank you. Indeed, the test code works when I replace eng.traineddata from tessdata_fast or tessdata_best by the one from tessdata which provides the required legacy model.

So the rule is to use "osd" or any other model which supports the legacy engine. Then there is no negative index. Otherwise raise an exception to avoid undefined behaviour.

It looks like this issue can be closed.

zdenop · 2023-05-07T19:34:23Z

I am not sure if we are able to check the presence of the legacy model in OSD functions.
At least the tesseract can trigger a warning in case of id == -1 so a user could understand the problem...

stweil · 2023-05-09T17:22:35Z

main would have to test for api.tesseract_->font_table_size_ != 0 or api.tesseract_->tessedit_ocr_engine_mode != OEM_LSTM_ONLY, but that API is not public. The only public API which can be used to check for a valid model with legacy support which I have found is api.tesseract()->AnyTessLang(). It works after adding 5 (!) include directories for the build process (PR #4073).

todd-richmond · 2023-05-09T21:24:36Z

my sample code was run using tessdata_fast, not tessdata or tessdata_best because I care more about speed than recognition at scale. It tries to detect rotation before setting the PSM and calling recognize so that I can use different configurable modes that don't support OSD w/o having to use slow PSM 1 first.

@stweil , there is no subsequent valgrind issue using my sample and patched library (but I wasn't using debug mode). However, you are correct that the 0 degree rotation is still returning the same results so my code does fail

I tried PSM 0 with Recognize(), but it is always returning 270 degrees no matter what. I tried DetectOS, DetectOrientationScript and GetOsdText, but all fail on unrotated content even when tesseract cmdline program returns correct rotation.

In short, I need fast scanning with configurable PSM, but with 90 degree rotation handling and running Recognize twice is too slow. It seems I need to use only PSM modes that enable OSD

zdenop · 2023-05-10T05:53:56Z

fast and best models (without legacy models) does not provide always right result. See e.g. forum post How to get the correct text orientation with tesseract.

tessdata should be best models with legacy models. IMO you can unpack it and try to include legacy models to your custom fast model... (see tool combine_tessdata).

Regarding the speed you can have a look at leptonica (see e.g. https://bucket401.blogspot.com/2023/05/detecting-page-orientation-c.html, there is also example for fixing a skew). DanBloomberg (Leptonica author) is very experienced and he provide excellent support on leptonica (just create issue if you have problem or questions).

todd-richmond mentioned this issue Apr 27, 2023

Detect text rotation without running recognition #3836

Closed

todd-richmond changed the title ~~Uninitialized variable causes DetectOrientationScript() to fail randomly~~ Negative array index causes DetectOrientationScript() to fail randomly Apr 27, 2023

amitdo added the OSD Orientation and Script Detection label Apr 28, 2023

stweil mentioned this issue May 5, 2023

Replace bool array by more compact vector #4067

Merged

stweil changed the title ~~Negative array index causes DetectOrientationScript() to fail randomly~~ Wrong model causes negative array index when calling DetectOrientationScript() May 7, 2023

stweil mentioned this issue May 9, 2023

Abort with error message if OSD is requested with LSTM-only model #4073

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong model causes negative array index when calling DetectOrientationScript() #4062

Wrong model causes negative array index when calling DetectOrientationScript() #4062

todd-richmond commented Apr 27, 2023 •

edited by amitdo

stweil commented Apr 27, 2023 •

edited

todd-richmond commented Apr 27, 2023

zdenop commented Apr 27, 2023

todd-richmond commented Apr 27, 2023

stweil commented Apr 27, 2023 •

edited

todd-richmond commented Apr 27, 2023 •

edited

stweil commented Apr 27, 2023 •

edited

todd-richmond commented Apr 27, 2023

stweil commented Apr 27, 2023

todd-richmond commented Apr 27, 2023 •

edited

stweil commented Apr 27, 2023

todd-richmond commented Apr 27, 2023 •

edited

amitdo commented Apr 28, 2023 •

edited

todd-richmond commented Apr 28, 2023

amitdo commented May 5, 2023

stweil commented May 6, 2023 •

edited

todd-richmond commented May 6, 2023

todd-richmond commented May 6, 2023

stweil commented May 6, 2023 •

edited

amitdo commented May 6, 2023 •

edited

stweil commented May 6, 2023

amitdo commented May 6, 2023

stweil commented May 7, 2023

zdenop commented May 7, 2023

stweil commented May 9, 2023 •

edited

todd-richmond commented May 9, 2023

zdenop commented May 10, 2023

Wrong model causes negative array index when calling DetectOrientationScript() #4062

Wrong model causes negative array index when calling DetectOrientationScript() #4062

Comments

todd-richmond commented Apr 27, 2023 • edited by amitdo

Current Behavior

Expected Behavior

Suggested Fix

tesseract -v

Operating System

Other Operating System

uname -a

Compiler

CPU

Virtualization / Containers

Other Information

stweil commented Apr 27, 2023 • edited

todd-richmond commented Apr 27, 2023

zdenop commented Apr 27, 2023

todd-richmond commented Apr 27, 2023

stweil commented Apr 27, 2023 • edited

todd-richmond commented Apr 27, 2023 • edited

stweil commented Apr 27, 2023 • edited

todd-richmond commented Apr 27, 2023

stweil commented Apr 27, 2023

todd-richmond commented Apr 27, 2023 • edited

stweil commented Apr 27, 2023

todd-richmond commented Apr 27, 2023 • edited

amitdo commented Apr 28, 2023 • edited

todd-richmond commented Apr 28, 2023

amitdo commented May 5, 2023

stweil commented May 6, 2023 • edited

todd-richmond commented May 6, 2023

todd-richmond commented May 6, 2023

stweil commented May 6, 2023 • edited

amitdo commented May 6, 2023 • edited

stweil commented May 6, 2023

amitdo commented May 6, 2023

stweil commented May 7, 2023

zdenop commented May 7, 2023

stweil commented May 9, 2023 • edited

todd-richmond commented May 9, 2023

zdenop commented May 10, 2023

todd-richmond commented Apr 27, 2023 •

edited by amitdo

stweil commented Apr 27, 2023 •

edited

stweil commented Apr 27, 2023 •

edited

todd-richmond commented Apr 27, 2023 •

edited

stweil commented Apr 27, 2023 •

edited

todd-richmond commented Apr 27, 2023 •

edited

todd-richmond commented Apr 27, 2023 •

edited

amitdo commented Apr 28, 2023 •

edited

stweil commented May 6, 2023 •

edited

stweil commented May 6, 2023 •

edited

amitdo commented May 6, 2023 •

edited

stweil commented May 9, 2023 •

edited