Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong model causes negative array index when calling DetectOrientationScript() #4062

Open
todd-richmond opened this issue Apr 27, 2023 · 27 comments
Labels
OSD Orientation and Script Detection

Comments

@todd-richmond
Copy link

todd-richmond commented Apr 27, 2023

Current Behavior

Latest Tesseract 5.3.1 has a bug in DetectOrientationScript() that causes it to randomly fail when attempting to detect image orientation for 90 degree image scans. The problem is that getting a script id can return -1 which then indexes into a "done" array
On my system this fails about 20% of the time on the exact same image input
Error found with valgrind

Expected Behavior

Trivial fix to not write to done[-1] which then allows DetectOrientationScript() to always return the same results

Suggested Fix

--- src/ccmain/osdetect.cpp.org	2023-04-26 16:26:18.618274574 -0700
+++ src/ccmain/osdetect.cpp	2023-04-26 16:54:37.041375058 -0700
@@ -488,10 +488,12 @@
         }
       }
       // Script already processed before.
-      if (done[id]) {
-        continue;
+      if (id != -1) {
+        if (done[id]) {
+          continue;
+        }
+        done[id] = true;
       }
-      done[id] = true;
 
       unichar = tess_->unicharset.id_to_unichar(choice->unichar_id());

tesseract -v

No response

Operating System

RHEL 8

Other Operating System

No response

uname -a

No response

Compiler

gcc 12.2

CPU

No response

Virtualization / Containers

No response

Other Information

No response

@stweil
Copy link
Contributor

stweil commented Apr 27, 2023

Is this a regression, or do you see the same issue also with older releases? How did you start tesseract, so that the problematic code is executed?

@todd-richmond
Copy link
Author

I'm using my own code that just calls SetImage and then DetectOrientationScript, but if you look at the diff, it is quite trivial so inspection should be enough. "id" can be -1 from choice->script_id() and when it is there is no check before using it as an array index. GitHub "blame" shows it back from 3.0
96e8b51

on a side note, this method is much, must faster than using Recognize and iterator calls to determine image rotation as shown in the tesseract samples page

@zdenop
Copy link
Contributor

zdenop commented Apr 27, 2023

Can you please also provide some example image to test the issue?

@todd-richmond
Copy link
Author

scan-90.zip
Here is a scanned costco receipt, but most any image should work

@todd-richmond todd-richmond changed the title Uninitialized variable causes DetectOrientationScript() to fail randomly Negative array index causes DetectOrientationScript() to fail randomly Apr 27, 2023
@stweil
Copy link
Contributor

stweil commented Apr 27, 2023

I tried tesseract scan-90.tiff - --psm 0 and never get a negative script id. Please paste your test code.

Maybe you can also test this patch which uses a packed vector of boolean values:

--- a/src/ccmain/osdetect.cpp
+++ b/src/ccmain/osdetect.cpp
@@ -460,7 +460,7 @@ ScriptDetector::ScriptDetector(const std::vector<int> *allowed_scripts, OSResult
 // adding this blob.
 void ScriptDetector::detect_blob(BLOB_CHOICE_LIST *scores) {
   for (int i = 0; i < 4; ++i) {
-    bool done[kMaxNumberOfScripts] = {false};
+    std::vector<bool> done(kMaxNumberOfScripts);
 
     BLOB_CHOICE_IT choice_it;
     choice_it.set_to_list(scores + i);
@@ -488,7 +488,7 @@ void ScriptDetector::detect_blob(BLOB_CHOICE_LIST *scores) {
         }
       }
       // Script already processed before.
-      if (done[id]) {
+      if (done.at(id)) {
         continue;
       }
       done[id] = true;

@todd-richmond
Copy link
Author

todd-richmond commented Apr 27, 2023

your new patch still calls done.at(-1) which is invalid

The code to repro is quite simple. The returned "degree" value is indeterminate because the -1 index randomly breaks the loop early depending on an invalid stack value

int degree;
float dconf, sconf;
const char *script;

ocr->Init(ocrdir.c_str(), 'eng');
ocr->SetImage(pix);
if (ocr->DetectOrientationScript(&degree, &dconf, &script, &sconf) && degree && dconf >= 0.9) {
            Pix *rpix = pixRotateOrth(pix, (360 - degree) / 90);

            if (rpix) {
                ocr->SetImage(rpix);
                pixDestroy(&rpix);
            }
}

@stweil
Copy link
Contributor

stweil commented Apr 27, 2023

Calling done.at(-1) should raise an exception, so your program would no longer show random behaviour but fail.

I assume that your code is incomplete and misses some initialisation which is part of tesseract (which does not fail because the index is never negative). Therefore I asked for your complete code. With the snippet above I still have to guess how you created ocr.

@todd-richmond
Copy link
Author

? by definition the Init call succeeds (using only 'eng' data) because 80% of the time rotation is detected and text is extracted. The index is absolutely -1 on every call, but dereferencing an array index of -1 is ALWAYS a bug with indeterminate value
An exception does not help whatsoever because the program crashes instead of returning correct results. With this fix, valgrind no longer shows a branch on uninitialized var and rotation value is always returned correctly

ocr->Init(ocrdir.c_str(), 'eng')

@stweil
Copy link
Contributor

stweil commented Apr 27, 2023

Is it possible to provide a complete working example code including main()?

It's not necessary to repeat that a negative index is not okay because nobody questions that. But according to my tests the tesseract executable never uses an illegal index here.

Supporting your use case might be useful (maybe with a different patch).

But it might also be undesired behaviour which requires better documentation how that function should be used to avoid negative indexes. Raising an exception when an API is not used according to the documentation would be a valid fix.

I still don't know which of the two options is the better one.

@todd-richmond
Copy link
Author

todd-richmond commented Apr 27, 2023

you can paste those 3 lines into any test snippet that reads a Pix object (such as the ones on tesseract site) - there really is nothing special here other than init(), setImage() and DetectOrientationScript() in order

There are quite a few stackoverflow and other internet questions asking how to detect image rotation. They all recommend Recognize() with a 0 PSM, but that is incredibly slow. This solution runs dramatically faster and since you have to rotate before calling Recognize with a non-0 PSM, it can lead to dramatic improvements on bulk OCR

If this API solution is ok, it would help to add internal image rotation as an option to Tesseract API enabled through config or some other option? It's a common problem with a simple solution, but very hard to track down (took me a couple days to find and debug)

@stweil
Copy link
Contributor

stweil commented Apr 27, 2023

Sorry, I don't have the time to collect code snippets to build a test program for this issue. I am afraid that without a complete working example code I cannot help.

@todd-richmond
Copy link
Author

todd-richmond commented Apr 27, 2023

here is sample code

  const char* inputfile = "scan-90.tiff";                             
  int degree;
  float dconf, sconf;          
  const char *script;          
  PIX *image = pixRead(inputfile);
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init("tessdata", "eng"); 
  api->SetImage(image);
  if (api->DetectOrientationScript(&degree, &dconf, &script, &sconf) && degree && dconf >= 15.0) {
    Pix *rimage = pixRotateOrth(image, (360 - degree) / 90);        

    if (rimage) {                          
      api->SetImage(rimage);               
      pixDestroy(&rimage);
    }        
  }  
  pixDestroy(&image);        
  api->Recognize(NULL); 
  char *text = api->GetUTF8Text();            
  printf("degree: %d  confidence: %f\n\n%s", degree, dconf, text); 

@amitdo amitdo added the OSD Orientation and Script Detection label Apr 28, 2023
@amitdo
Copy link
Collaborator

amitdo commented Apr 28, 2023

https://github.com/tesseract-ocr/tessdoc/blob/main/examples/OSD_example.cc

The code currently does not check the returned values from the methods it calls.

@todd-richmond
Copy link
Author

My sample code does check error codes, but that doesn't matter because it returns correct results with my trivial 2 line patch - never a failure

@amitdo
Copy link
Collaborator

amitdo commented May 5, 2023

and since you have to rotate before calling Recognize with a non-0 PSM

You don't need to rotate with PSM 1.

@stweil
Copy link
Contributor

stweil commented May 6, 2023

@todd-richmond, add destructors to your sample code to avoid memory leaks:

  delete[] text;
  delete api;

The examples were also incomplete. This will be fixed with tesseract-ocr/tessdoc#110.

Even with your patched code, there remains an illegal memory read:

../../../src/ccmain/osdetect.cpp:525:7: runtime error: index -1 out of bounds for type 'float [120]'

@todd-richmond
Copy link
Author

@todd-richmond, add destructors to your sample code to avoid memory leaks:

  delete[] text;
  delete api;

The examples were also incomplete. This will be fixed with tesseract-ocr/tessdoc#110.

Even with your patched code, there remains an illegal memory read:

../../../src/ccmain/osdetect.cpp:525:7: runtime error: index -1 out of bounds for type 'float [120]'

my runtime code also has deletes - this is just a snippet that was asked for. I don't have valgrind or other errors with my patch in my app and the API returns expected results 100% of the time. I see the vector<> change was merged so I can't speak to what happens with that since it doesn't check the subscript either

@todd-richmond
Copy link
Author

I'm not quite sure why it has been so hard to get this patch accepted. I contribute small patches to dozens of open source projects we use at work and this example solves a problem that a LOT of people on the net have asked about. I found the alternative reading Tesseract headers after finding the common suggestion far too slow. The API call is documented as official, but fails in the same code + image I gave. The trivial patch resolves the bug w/o changing any internal execution paths

@stweil
Copy link
Contributor

stweil commented May 6, 2023

Your patch does not fix the problem, because there remains an array access with index -1 (see my previous comment). You can reproduce the problem by using a sanitizer build ('--disable-openmp' '--disable-shared' '--enable-debug' 'CXX=clang++-11' 'CXXFLAGS=-D_GLIBCXX_DEBUG -Wall -g -O0 -fsanitize=address,undefined -fstack-protector-strong -ftrapv').

In addition I get wrong results for an image without rotation and with mostly non-Latin script:

./test test/testing/devatest.png 
src/ccmain/osdetect.cpp:530:7: runtime error: index -1 out of bounds for type 'float [120]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccmain/osdetect.cpp:530:7 in 
degree: 270  confidence: 93.000000, script Latin confidence 2.000000

If I apply your code to your test image, I get a rotation of 270°. I get the same result when I rotate that image by 90° (which makes it upright).

@amitdo
Copy link
Collaborator

amitdo commented May 6, 2023

In addition I get wrong results for an image without rotation and with mostly non-Latin script

That's because he used "eng", which is fine if the user knows that the text in the image is written in English and only wants to find the orientation of the page.

@stweil
Copy link
Contributor

stweil commented May 6, 2023

Then it should work for the image provided which only contains English text, but it does not.

If I replace "eng" by "osd", all tests succeed and there is no longer a negative array index.
So it looks like the issue is solved: use "osd" and everything is fine. If "osd" is not used, fail with an exception because of a wrong API call.

@amitdo
Copy link
Collaborator

amitdo commented May 6, 2023

Then it should work for the image provided which only contains English text, but it does not.

You need to use an old eng.traineddata with legacy only data.

It will probably also work with a trainedata from the tessdata repo, but you may need to change the oem to 0.

@stweil
Copy link
Contributor

stweil commented May 7, 2023

Thank you. Indeed, the test code works when I replace eng.traineddata from tessdata_fast or tessdata_best by the one from tessdata which provides the required legacy model.

So the rule is to use "osd" or any other model which supports the legacy engine. Then there is no negative index. Otherwise raise an exception to avoid undefined behaviour.

It looks like this issue can be closed.

@stweil stweil changed the title Negative array index causes DetectOrientationScript() to fail randomly Wrong model causes negative array index when calling DetectOrientationScript() May 7, 2023
@zdenop
Copy link
Contributor

zdenop commented May 7, 2023

I am not sure if we are able to check the presence of the legacy model in OSD functions.
At least the tesseract can trigger a warning in case of id == -1 so a user could understand the problem...

@stweil
Copy link
Contributor

stweil commented May 9, 2023

main would have to test for api.tesseract_->font_table_size_ != 0 or api.tesseract_->tessedit_ocr_engine_mode != OEM_LSTM_ONLY, but that API is not public. The only public API which can be used to check for a valid model with legacy support which I have found is api.tesseract()->AnyTessLang(). It works after adding 5 (!) include directories for the build process (PR #4073).

@todd-richmond
Copy link
Author

my sample code was run using tessdata_fast, not tessdata or tessdata_best because I care more about speed than recognition at scale. It tries to detect rotation before setting the PSM and calling recognize so that I can use different configurable modes that don't support OSD w/o having to use slow PSM 1 first.

@stweil , there is no subsequent valgrind issue using my sample and patched library (but I wasn't using debug mode). However, you are correct that the 0 degree rotation is still returning the same results so my code does fail

I tried PSM 0 with Recognize(), but it is always returning 270 degrees no matter what. I tried DetectOS, DetectOrientationScript and GetOsdText, but all fail on unrotated content even when tesseract cmdline program returns correct rotation.

In short, I need fast scanning with configurable PSM, but with 90 degree rotation handling and running Recognize twice is too slow. It seems I need to use only PSM modes that enable OSD

@zdenop
Copy link
Contributor

zdenop commented May 10, 2023

fast and best models (without legacy models) does not provide always right result. See e.g. forum post How to get the correct text orientation with tesseract.

tessdata should be best models with legacy models. IMO you can unpack it and try to include legacy models to your custom fast model... (see tool combine_tessdata).

Regarding the speed you can have a look at leptonica (see e.g. https://bucket401.blogspot.com/2023/05/detecting-page-orientation-c.html, there is also example for fixing a skew). DanBloomberg (Leptonica author) is very experienced and he provide excellent support on leptonica (just create issue if you have problem or questions).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OSD Orientation and Script Detection
Projects
None yet
Development

No branches or pull requests

4 participants