This repository has been archived by the owner on Jun 14, 2018. It is now read-only.

[Libtesseract] Reduce calls to tesseract_raw.init() #89

Open
wwqgtxx opened this issue Jan 6, 2018 · 5 comments

Comments

@wwqgtxx

wwqgtxx commented Jan 6, 2018

When I call image_to_string() frequently, I find that tesseract_raw.init() takes most of the CPU time (according to pstat). Reading the code of image_to_string(), I found that it calls init() to get a new libtesseract handle on every call. My suggestion is to cache the libtesseract handle, either in a thread-local cache or in a class, and reuse it. I expect this would make the program run faster.
Thanks.
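
A minimal sketch of the thread-local cache idea, for illustration only. fake_init is a hypothetical stand-in for tesseract_raw.init() so the snippet runs without libtesseract, and get_handle is an invented helper name, not part of PyOCR. Note that the sketch never frees the cached handles.

```python
import threading

# One cache per thread: each thread sees its own `handles` dict.
_local = threading.local()


def fake_init(lang=None):
    """Hypothetical stand-in for tesseract_raw.init(); returns a dummy handle."""
    return {"lang": lang}


def get_handle(lang="eng", init=fake_init):
    """Return this thread's cached handle for `lang`, creating it on first use."""
    cache = getattr(_local, "handles", None)
    if cache is None:
        cache = _local.handles = {}
    if lang not in cache:
        # The expensive call happens once per (thread, language) pair.
        cache[lang] = init(lang=lang)
    return cache[lang]
```

With the real library, init would presumably be tesseract_raw.init, and something would still have to call cleanup() on every cached handle.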

@jflesch
Member

jflesch commented Jan 6, 2018

The problem here is that init() provides a handle that must be freed with cleanup(). With PyOCR's current API, it's hard to figure out the best time to free it.
Some programs may want to keep the same handle for as long as they are running, but others (like Paperwork, for instance) prefer to have it freed as soon as it is no longer used.

So I think this change implies changing the API in a non-backward-compatible way. The API is the same for all the modules, so it would have to be changed in all the others too.

jflesch changed the title from "Could increase tesseract_raw.init()'s call" to "Reduce calls to tesseract_raw.init()" Jan 6, 2018
jflesch changed the title from "Reduce calls to tesseract_raw.init()" to "[Libtesseract] Reduce calls to tesseract_raw.init()" Jan 6, 2018
@wwqgtxx
Author

wwqgtxx commented Jan 6, 2018

My own patch adds an optional keyword argument to image_to_string():

-def image_to_string(image, lang=None, builder=None):
+def image_to_string(image, lang=None, builder=None, tesseract_raw_handle=None):
     if builder is None:
         builder = builders.TextBuilder()
-    handle = tesseract_raw.init(lang=lang)
+    if tesseract_raw_handle is None:
+        handle = tesseract_raw.init(lang=lang)
+    else:
+        handle = tesseract_raw_handle

     lvl_line = tesseract_raw.PageIteratorLevel.TEXTLINE
     lvl_word = tesseract_raw.PageIteratorLevel.WORD

     try:
-        # XXX(Jflesch): Issue #51:
-        # Tesseract TessBaseAPIRecognize() may segfault when the target
-        # language is not available
-        clang = lang if lang else "eng"
-        for lang_item in clang.split("+"):
-            if lang_item not in tesseract_raw.get_available_languages(handle):
-                raise TesseractError(
-                    "no lang",
-                    "language {} is not available".format(lang_item)
-                )
+        if tesseract_raw_handle is None:
+            # XXX(Jflesch): Issue #51:
+            # Tesseract TessBaseAPIRecognize() may segfault when the target
+            # language is not available
+            clang = lang if lang else "eng"
+            for lang_item in clang.split("+"):
+                if lang_item not in tesseract_raw.get_available_languages(handle):
+                    raise TesseractError(
+                        "no lang",
+                        "language {} is not available".format(lang_item)
+                    )

         tesseract_raw.set_page_seg_mode(
             handle, builder.tesseract_layout
@@ -159,7 +163,8 @@ def image_to_string(image, lang=None, builder=None):
                 break

     finally:
-        tesseract_raw.cleanup(handle)
+        if tesseract_raw_handle is None:
+            tesseract_raw.cleanup(handle)

     return builder.get_output()

and I init and clean up the handle myself:

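    from pyocr import builders, libtesseract

    # `images` is an iterable of PIL images supplied by the caller.
    # The tesseract_raw_handle argument only exists with the patch above.
    tesseract_raw_handle = libtesseract.tesseract_raw.init("eng")
    try:
        for image in images:
            libtesseract.image_to_string(
                image,
                lang="eng",
                builder=builders.DigitBuilder(7),
                tesseract_raw_handle=tesseract_raw_handle,
            )
    finally:
        libtesseract.tesseract_raw.cleanup(tesseract_raw_handle)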

@wwqgtxx
Author

wwqgtxx commented Jan 7, 2018

Maybe adding a new class-based API, such as an ImageToString class, is an alternative way to solve this problem. We could use weakref.finalize to force a call to cleanup() when the ImageToString instance is garbage-collected, in case the user forgets to free the handle. Of course, telling users to write "with ImageToString() as i:" so that cleanup() is called in __exit__ would be the best way.
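
A rough sketch of what such a class could look like, under stated assumptions: the class body, its constructor arguments, and the close() and closed names are all hypothetical, and fake_init / fake_cleanup are stand-ins for tesseract_raw.init() and tesseract_raw.cleanup() so the snippet runs without libtesseract.

```python
import weakref


class ImageToString:
    """Hypothetical wrapper that owns one handle and frees it exactly once."""

    def __init__(self, init, cleanup, lang="eng"):
        self.handle = init(lang=lang)
        # The callback gets the handle, not self; a reference to self here
        # would keep the object alive and it would never be collected.
        self._finalizer = weakref.finalize(self, cleanup, self.handle)

    def close(self):
        # A finalizer runs its callback at most once, so close() is idempotent.
        self._finalizer()

    @property
    def closed(self):
        return not self._finalizer.alive

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


# Stand-ins for the real init/cleanup, used only to make the sketch runnable.
freed = []


def fake_init(lang=None):
    return object()


def fake_cleanup(handle):
    freed.append(handle)


with ImageToString(fake_init, fake_cleanup) as ocr:
    handle = ocr.handle  # recognition calls would reuse this handle
```

The with block frees the handle deterministically; weakref.finalize is only the fallback for instances that are dropped without being closed.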

@jflesch
Member

jflesch commented Jan 7, 2018

Interesting idea. But it is still a new API. So I'll consider it, but for the next major version (PyOCR2 :).

@wwqgtxx
Author

wwqgtxx commented Jan 17, 2018

One more note: before reusing the handle, we need to call TessBaseAPIClearAdaptiveClassifier, so that recognizing one picture does not change Tesseract's internal state for the next, different picture.
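
A small sketch of where that reset would sit in a reuse loop. Everything here is an assumption made for illustration: recognize_all, recognize, and reset are hypothetical names, and reset stands for some wrapper around TessBaseAPIClearAdaptiveClassifier (this thread does not show whether tesseract_raw exposes one).

```python
def recognize_all(handle, images, recognize, reset):
    """Run `recognize` on every image with one shared handle.

    `reset(handle)` is called before each image so that state adapted
    to the previous image does not influence the next one.
    """
    results = []
    for image in images:
        reset(handle)
        results.append(recognize(handle, image))
    return results


# Stub callables, used only to show the call order.
calls = []


def fake_reset(handle):
    calls.append("reset")


def fake_recognize(handle, image):
    calls.append("recognize")
    return image.upper()


texts = recognize_all("handle", ["a", "b"], fake_recognize, fake_reset)
```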
