Tests/Tesseract: Differences between Pyocr and reference output #52

jflesch · 2016-12-06T14:48:33Z

For some reason there are differences between the references and the actual results. And it seems the actual results are good, so it's probably a bug in update_test_data.sh

This package is a bit more involved because it assumes a lot of paths being there in a FHS compliant way, so we need to patch the data and binary directories for Tesseract and Cuneiform. I've also tried to get the tests working, but they produce different results comparing input/output. This is probably related to the following issue: openpaperwork/pyocr#52 So I've disabled certain tests that fail but don't generally impede the functionality of pyocr. Tested by building against Python 3.3, 3.4, 3.5 and 3.6. Signed-off-by: aszlig <aszlig@redmoonstudios.org>

This is a bit more involved, because Tesseract 3.05.00 comes not only with improvements but also with a few quirks we need to deal with.

The first quirk is that the order of the arguments of the `tesseract' command now matters and the list of configurations has to be at the end of the command line. So we add a new attribute tesseract_flags to the BaseBuilder class that contains a list of all the flags to pass to `tesseract'. The tesseract_configs attribute, however, remains pretty much the same but now only contains a list of configs instead of being mixed with flag arguments.

Another quirk has to do with Leptonica >= 1.74, which Tesseract 3.05.00 now requires. Leptonica has special handling of files that reside in /tmp and assumes that they are internal temporary files of Leptonica. In order to deal with it, we now run Tesseract in a temporary directory, which contains the input/output files, and use the relative names of these files, because Leptonica only searches for path names beginning with /tmp.

Fortunately the last item we need to address is not really a quirk, but an API change. In Tesseract 3.05.00 there is now a new function called TessBaseAPIDetectOrientationScript(), which doesn't fill the OSResults object anymore but instead allows passing the values we're interested in directly by reference. We need to use this new function because the old function TessBaseAPIDetectOS() now *always* returns false.

Ran the test suite successfully with Python 3.5 and both Tesseract 3.04.01 and 3.05.00 except the following tests, which also didn't succeed prior to this commit:

* cuneiform:TestTxt.test_basic
* cuneiform:TestTxt.test_european
* cuneiform:TestTxt.test_french
* cuneiform:TestWordBox.test_basic
* cuneiform:TestWordBox.test_european
* cuneiform:TestWordBox.test_french
* libtesseract:TestBasicDoc.test_basic
* libtesseract:TestDigitLineBox.test_digits
* libtesseract:TestLineBox.test_japanese
* libtesseract:TestTxt.test_japanese
* libtesseract:TestWordBox.test_japanese
* tesseract:TestDigitLineBox.test_digits
* tesseract:TestTxt.test_japanese

The failure of these test cases is probably related to issue openpaperwork#52, but from looking at the failures it doesn't seem to be related to this change anyway.

Signed-off-by: aszlig <aszlig@redmoonstudios.org>
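The first two quirks described in this commit message can be illustrated with a minimal sketch. It is not the actual pyocr code: the function names `build_tesseract_command` and `prepare_run`, the file names `input.bmp` and `output`, and the directory prefix are all hypothetical. It only shows the idea of putting flags before config names and of referring to files by relative name inside a dedicated working directory.

```python
import os
import tempfile


def build_tesseract_command(input_name, output_base, flags, configs):
    """Return an argv list with the flags placed before the config names."""
    # Flags (for example "-l", "eng") come first; config names (for example
    # "hocr") are appended last, as the commit message says 3.05.00 requires.
    return ["tesseract", input_name, output_base] + list(flags) + list(configs)


def prepare_run(flags, configs, input_name="input.bmp", output_base="output"):
    """Create a working directory and a command that uses relative names."""
    workdir = tempfile.mkdtemp(prefix="ocr_sketch_")
    # The caller would save the image at this absolute path ...
    input_path = os.path.join(workdir, input_name)
    # ... while the command itself only mentions the relative file names.
    command = build_tesseract_command(input_name, output_base, flags, configs)
    return workdir, input_path, command
```

A caller would then write the image to `input_path`, run the command with something like `subprocess.call(command, cwd=workdir)`, read the output file from `workdir`, and finally delete the directory (for example with `shutil.rmtree(workdir)`).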

QuLogic · 2017-12-30T01:20:51Z

Which tests are known failures? Can they be marked as such at least?

On Fedora with tesseract 3.05.01-1 and cuneiform 1.1.0-25, I get the following failures:

tests.tests_cuneiform.TestTxt:test_french
tests.tests_cuneiform.TestWordBox:test_basic
tests.tests_cuneiform.TestWordBox:test_european
tests.tests_cuneiform.TestWordBox:test_french

tests.tests_libtesseract.TestBasicDoc:test_basic
tests.tests_libtesseract.TestContext:test_version
tests.tests_libtesseract.TestDigitLineBox:test_digits
tests.tests_libtesseract.TestLineBox:test_japanese
tests.tests_libtesseract.TestTxt:test_basic
tests.tests_libtesseract.TestTxt:test_european
tests.tests_libtesseract.TestTxt:test_japanese
tests.tests_libtesseract.TestTxt:test_multi
tests.tests_libtesseract.TestWordBox:test_japanese

tests.tests_tesseract.TestContext:test_version
tests.tests_tesseract.TestDigitLineBox:test_digits
tests.tests_tesseract.TestTxt:test_basic
tests.tests_tesseract.TestTxt:test_european
tests.tests_tesseract.TestTxt:test_japanese
tests.tests_tesseract.TestTxt:test_multi

Compared to NixOS in the linked commits, that means I get a slightly better working cuneiform but tesseract fails the basic, european and multi tests. The basic test seems to have some odd character twiddling with "ocr" vs "cor".
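As a rough illustration of what "marking" known failures could look like, here is a minimal sketch using the standard library's `unittest.expectedFailure` decorator. The class `TestTxtSketch` and its assertions are stand-ins invented for this example; they are not the real pyocr tests.

```python
import io
import unittest


class TestTxtSketch(unittest.TestCase):
    def test_basic(self):
        # Stand-in for a comparison that matches the reference output.
        self.assertEqual("reference output", "reference output")

    @unittest.expectedFailure
    def test_japanese(self):
        # Stand-in for a comparison that is known to differ from the reference.
        self.assertEqual("actual output", "reference output")


def run_sketch():
    """Run the sketch test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTxtSketch)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

With this marking, the run is reported as successful while the known failure is still counted separately, and a test that unexpectedly starts passing is reported as an unexpected success.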

jflesch · 2017-12-30T12:07:22Z

This is going to be tricky. It depends on the exact version of Tesseract and on the exact compilation options of Tesseract and Liblept. I haven't yet found a way to avoid having to check the results manually each time :(

(Note that this is not the topic of this ticket).
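For the manual checking mentioned above, a small helper that summarises how far an OCR result is from its reference can make the review quicker. The following is only a sketch built on the standard `difflib` module; the function name `describe_difference` and the sample strings are assumptions made for this example, and nothing like it is claimed to exist in pyocr.

```python
import difflib


def describe_difference(reference, actual):
    """Return a similarity ratio and a unified diff between two OCR texts."""
    ratio = difflib.SequenceMatcher(None, reference, actual).ratio()
    diff = list(difflib.unified_diff(
        reference.splitlines(), actual.splitlines(),
        fromfile="reference", tofile="actual", lineterm=""))
    return ratio, diff


if __name__ == "__main__":
    # A character swap like "ocr" vs "cor" gives a high but imperfect ratio.
    ratio, diff = describe_difference("pyocr test", "pycor test")
    print(ratio)
    print("\n".join(diff))
```

Whether a given ratio is acceptable would still be a human decision; the helper only makes the differences visible.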

QuLogic · 2017-12-31T07:04:43Z

Building tesseract 3.05.00 from source (and using leptonica binaries), at least the basic tests pass again. Since the expected results are correct (at least as a human would read them), I think it might be a regression with tesseract. I will bisect and see if I can figure out what's up there.

QuLogic · 2017-12-31T21:39:50Z

So at least the new failure is tesseract-ocr/tesseract#1253; at what point did the remaining tests pass?

jflesch · 2018-01-01T10:58:29Z

Quite frankly, I don't even remember the last time I was able to successfully pass all the tests at once :/

aszlig mentioned this issue Apr 9, 2017

Add support for Tesseract version 3.05.00 #62

Merged

jflesch added bug to study labels Apr 26, 2017
