This repository has been archived by the owner on Jun 14, 2018. It is now read-only.

tessedit_char_whitelist . detect only predefined chars . #78

Open
MyraBaba opened this issue Oct 2, 2017 · 15 comments


MyraBaba commented Oct 2, 2017

Hi,

We are using pyocr to read labels that contain only alphanumeric characters (letters and digits).

How can I restrict detection to a specific list of characters?

I tried the following.

In libtesseract/__init__.py:

if "label" in builder.tesseract_configs:
            tesseract_raw.set_is_label(handle, True)

and in tesseract_raw.py:

def set_is_label(handle, mode):
    global g_libtesseract
    assert(g_libtesseract)

    if mode:
        # wl = b"0123456789ABCDEFGHIJKLMNOPRSTUVYZXW"
        wl = b"0123456789ABNOPRSTUVYZXW"

    else:
        wl = b""

    g_libtesseract.TessBaseAPISetVariable(
        ctypes.c_void_p(handle),
        b"tessedit_char_whitelist",
        wl
    )

But I couldn't get it to work.

Is there a simpler way to do it, like:

tool.image_to_string(
            Image.open("tmp.png"),
            lang="eng",
            tessedit_char_whitelist="0123456789ABNOPRSTUVYZXW",
            builder=pyocr.builders.LineBoxBuilder()
        )

thanks
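
A minimal sketch of a parametrized variant of the patch above, so the whitelist is not hard-coded. The helper name `set_char_whitelist` and its `lib` argument are hypothetical (not part of pyocr); `lib` is assumed to be the already-loaded libtesseract ctypes object that the patch calls `g_libtesseract`.

```python
import ctypes


def set_char_whitelist(lib, handle, chars):
    """Ask Tesseract to recognise only the characters in `chars`.

    `lib` is assumed to be the loaded libtesseract ctypes object and
    `handle` the integer API handle, as in the patch above.
    """
    ok = lib.TessBaseAPISetVariable(
        ctypes.c_void_p(handle),
        b"tessedit_char_whitelist",
        chars.encode("ascii"),  # the C API takes a byte string
    )
    # The C function returns a boolean-like status; returning it makes
    # a rejected variable visible instead of failing silently.
    return bool(ok)
```

Checking the return value is one way to tell "the variable was rejected" apart from "the variable was set but had no visible effect".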

@jflesch jflesch added the support label Oct 2, 2017
Member

jflesch commented Oct 2, 2017

From what I can tell, your patch should give the expected behavior. The only question is: where exactly did you put your call to set_is_label() in libtesseract/__init__.py?

Author

MyraBaba commented Oct 2, 2017

I put it in the image_to_string function:

def image_to_string(image, lang=None, builder=None):
    if builder is None:
        builder = builders.TextBuilder()
    handle = tesseract_raw.init(lang=lang)

    lvl_line = tesseract_raw.PageIteratorLevel.TEXTLINE
    lvl_word = tesseract_raw.PageIteratorLevel.WORD

    try:
        # XXX(Jflesch): Issue #51:
        # Tesseract TessBaseAPIRecognize() may segfault when the target
        # language is not available
        clang = lang if lang else "eng"
        if clang not in tesseract_raw.get_available_languages(handle):
            raise TesseractError(
                "no lang",
                "language {} is not available".format(clang)
            )

        tesseract_raw.set_page_seg_mode(
            handle, builder.tesseract_layout
        )

        tesseract_raw.set_image(handle, image)
        if "digits" in builder.tesseract_configs:
            tesseract_raw.set_is_numeric(handle, True)

        ## LABEL SPECIFIC ###

        if "label" in builder.tesseract_configs:
            tesseract_raw.set_is_label(handle, True)


        # XXX(JFlesch): PageIterator and ResultIterator are actually the
        # very same thing. If it changes, we are screwed.
        tesseract_raw.recognize(handle)
        res_iterator = tesseract_raw.get_iterator(handle)
        if res_iterator is None:
            raise TesseractError(
                "no script", "no script detected"
            )
        page_iterator = tesseract_raw.result_iterator_get_page_iterator(
            res_iterator
        )

        while True:
            if tesseract_raw.page_iterator_is_at_beginning_of(
                    page_iterator, lvl_line):
                (r, box) = tesseract_raw.page_iterator_bounding_box(
                    page_iterator, lvl_line
                )
                assert(r)
                box = _tess_box_to_pyocr_box(box)
                builder.start_line(box)

            last_word_in_line = tesseract_raw.page_iterator_is_at_final_element(
                page_iterator, lvl_line, lvl_word
            )

            word = tesseract_raw.result_iterator_get_utf8_text(
                res_iterator, lvl_word
            )

            if word is not None and word != "":
                (r, box) = tesseract_raw.page_iterator_bounding_box(
                    page_iterator, lvl_word
                )
                assert(r)
                box = _tess_box_to_pyocr_box(box)
                builder.add_word(word, box)

                if last_word_in_line:
                    builder.end_line()

            if not tesseract_raw.page_iterator_next(page_iterator, lvl_word):
                break

    finally:
        tesseract_raw.cleanup(handle)

    return builder.get_output()


Author

MyraBaba commented Oct 2, 2017

By the way, is there an easy way to do this without a patch?

Member

jflesch commented Oct 2, 2017

Currently, no. This should be handled by a custom Builder class, but the hooks such a builder would need don't exist yet.

Member

jflesch commented Oct 2, 2017

I guess your modifications should work. I'll try to have a look at home when possible (I'm at work currently), but my life is a little bit complicated right now, so it may take a while, sorry.

Author

MyraBaba commented Oct 2, 2017

Thanks a lot.

I wish you a good path, full of light, on your life's journey, my friend.

Thx

Member

jflesch commented Oct 2, 2017

Actually, if you're ok with using Tesseract (through fork()+exec()) instead of libtesseract, you can use a custom builder.

Something along those lines should work:

from PIL import Image

import pyocr
import pyocr.builders
import pyocr.tesseract


class MyBuilder(pyocr.builders.TextBuilder):
    def __init__(self):
        # Let TextBuilder set up tesseract_configs before extending it
        super().__init__()
        self.tesseract_configs += ["-c", "tessedit_char_whitelist=0123456789ABNOPRSTUVYZXW"]


builder = MyBuilder()
txt = pyocr.tesseract.image_to_string(
    Image.open('test.png'),
    builder=builder
)

Member

jflesch commented Oct 2, 2017

(I haven't tested though)

Author

MyraBaba commented Oct 2, 2017

For testing purposes: in tesseract.py, line 265, there is:

    command += configs

While debugging, I see the following command and config variables:

command: <class 'list'>: ['tesseract', 'input.bmp', 'output', '-l', 'eng', '-psm', '1', 'hocr', '-c tessedit_char_whitelist=01239ABN']

config:<class 'list'>: ['hocr', '-c tessedit_char_whitelist=01239ABN']

Even with this command, the result is still NOT whitelisted. I am confused at this point...

Member

jflesch commented Oct 2, 2017

If I remember correctly, the argument order does matter to Tesseract, so if you're using the LineBoxBuilder as base, I would actually suggest the following builder:

class MyBuilder(pyocr.builders.LineBoxBuilder):
    def __init__(self):
        super().__init__()
        self.tesseract_configs = ["-c", "tessedit_char_whitelist=0123456789ABNOPRSTUVYZXW"] + self.tesseract_configs

The idea with this change is to have the arguments "-c" "tessedit_char_whitelist..." before the file config argument hocr.
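
The ordering described above can be sketched as a small standalone function. `build_tesseract_command` is a hypothetical helper written for illustration, not pyocr's own code; the argument layout simply mirrors the debug output quoted earlier in this thread, including the `-psm` spelling (newer Tesseract releases spell it `--psm`).

```python
def build_tesseract_command(input_file, output_base, lang, psm,
                            variables, config_files):
    """Assemble an argv list with every "-c name=value" pair placed
    before the config file names (for example "hocr")."""
    command = ["tesseract", input_file, output_base,
               "-l", lang, "-psm", str(psm)]
    for name, value in variables:
        # Two separate list elements: the flag, then its argument.
        command += ["-c", "{}={}".format(name, value)]
    # Config file names go last, after all "-c" pairs.
    command += list(config_files)
    return command


cmd = build_tesseract_command(
    "input.bmp", "output", "eng", 1,
    [("tessedit_char_whitelist", "0123456789ABNOPRSTUVYZXW")],
    ["hocr"],
)
print(cmd)
```

Printing the list before launching the process makes it easy to compare against what works when typed directly in a shell.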

Member

jflesch commented Oct 2, 2017

Also, did you make sure to explicitly use pyocr.tesseract instead of the first tool/module returned by pyocr.get_available_tools()?
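
One way to avoid depending on the order of the tool list is to select a tool by its reported name. This is only a sketch: `pick_tool` is a hypothetical helper, and it assumes each tool module exposes `get_name()`, as pyocr's tool modules do.

```python
def pick_tool(tools, wanted_name):
    """Return the first tool whose get_name() equals wanted_name.

    Returns None when no tool matches, so the caller can fail loudly
    instead of silently falling back to another backend.
    """
    for tool in tools:
        if tool.get_name() == wanted_name:
            return tool
    return None
```

It would be called as `pick_tool(pyocr.get_available_tools(), name)`. The exact name string for the command-line backend is an assumption to verify by printing `get_name()` for each available tool; importing `pyocr.tesseract` directly, as in the builder examples in this thread, sidesteps the question.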

Author

MyraBaba commented Oct 2, 2017

finally 👍

ATTENTION : The ARGument order DOES MATTER to Tesseract

Thanks now all is fine and working..

['tesseract', 'input.bmp', 'output', '-l', 'eng', '-psm', '1', '-c', 'tessedit_char_whitelist=39BN', 'hocr', '-c tessedit_char_whitelist=01239ABN']

My Best...
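
In the working command above, the whitelist appears as two list elements ('-c' and 'tessedit_char_whitelist=39BN'), while the leftover trailing entry is a single element containing a space. When a process is started from an argument list, each element reaches the program as exactly one argument, so the two forms are not equivalent. A minimal sketch of normalizing such entries, using a hypothetical helper `split_config_tokens`:

```python
import shlex


def split_config_tokens(configs):
    """Split entries such as "-c name=value" into separate arguments."""
    tokens = []
    for entry in configs:
        # shlex.split applies shell-like word splitting, so an entry
        # without whitespace (e.g. "hocr") comes back unchanged.
        tokens.extend(shlex.split(entry))
    return tokens


print(split_config_tokens(["hocr", "-c tessedit_char_whitelist=01239ABN"]))
# → ['hocr', '-c', 'tessedit_char_whitelist=01239ABN']
```

Because `shlex.split` also interprets quotes and backslashes, a whitelist containing those characters would need different handling; building the list with separate elements from the start avoids the issue.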


mit456 commented Mar 20, 2018

@MyraBaba @jflesch I am also trying to build a custom LineBoxBuilder. For now I am applying tessedit_char_blacklist=K as a test, but I also need to apply other config parameters such as tessedit_enable_dict_correction, language_model_ngram_order, etc. It seems the configurations are not being applied.
This is the code I am using:

class TesseractCustomBuilder(pyocr.builders.LineBoxBuilder):
    def __init__(self):
        super().__init__()
        self.tesseract_configs = ['-c tessedit_char_blacklist=K'] + self.tesseract_configs


builder = TesseractCustomBuilder()
boxes = pyocr.tesseract.image_to_string(Image.fromarray(image),
                                        builder=builder)

This is what gets printed at line 277 of tesseract.py: ['-c tessedit_char_blacklist=K', 'hocr'], but K is still being detected.

Please have a look and tell me if I am making a mistake.

Member

jflesch commented Mar 20, 2018

@mit456 Have you tried from the command line directly ? (to make sure that Tesseract actually takes into account the option you specified)


mit456 commented Mar 20, 2018

@jflesch I found the parameters with tesseract --print-parameters, but passing them from the command line does not work either, so I don't think it's a pyocr problem. Next time I will try the CLI first.
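
A rough sketch of scripting the "try the CLI first" check: it lists the parameter names that `tesseract --print-parameters` reports, so a name can be verified before it is passed through pyocr. Both helper names are hypothetical, and the parsing assumes each parameter is printed on its own line with the name as the first whitespace-separated field. As noted above, a parameter being listed does not guarantee that it changes the result.

```python
import subprocess


def parameter_names(print_parameters_output):
    """Collect the first whitespace-separated field of every line."""
    names = set()
    for line in print_parameters_output.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def known_parameters():
    """Run `tesseract --print-parameters` and return the listed names.

    Requires the tesseract binary on PATH; stdout and stderr are both
    scanned because the output stream is not assumed here.
    """
    result = subprocess.run(
        ["tesseract", "--print-parameters"],
        capture_output=True, text=True, check=True,
    )
    return parameter_names(result.stdout + result.stderr)
```

For example, `"tessedit_char_blacklist" in known_parameters()` would indicate whether the installed Tesseract knows that parameter at all.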
