C# Tesseract 3.02 How I access each character of word from image #64

ominouse · 2014-01-12T07:25:51Z

Hi, I'm a newbie here.
First, I need to draw a rectangle around each character of a word in an image.
In the old version of Tesseract I found that we can access each character with:

foreach (tessnet2.Character c in word.CharList)
e.Graphics.DrawRectangle..........

But now I'm working on a C# WinForms app with Tesseract 3.02:

TesseractEngine a = new TesseractEngine(@"./tessdata", "eng", EngineMode.TesseractAndCube);
Tesseract.Page page1 = a.Process(image);
foreach ( ....... in page1)
{
// draw rectangle from (bounding box of each character)
}

Question 1: How do I access each character of page1?

I tried several approaches, such as PageIteratorLevel, and could get parts of the page like the first line, first word, or first block, but I can't get their first character.
I also noticed that in the hOCR text output for page1, each element (word, line, block) has a bounding box value.

Question 2: How do I get the bounding box value of each element? (I found only one method, "TryGetBoundingBox", and it returns only a boolean.)

Thank you.

charlesw · 2014-01-13T02:01:19Z

Answer for Q1:

Check out the console sample provided, as it gives an example of how to iterate through the results. That said, something like the following should work:

using (var iter = page.GetIterator()) {
    iter.Begin();
    do {
        do {
            do {
                do {
                    do {
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Block)) {
                            // do whatever you need to do when a block (top most level result) is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Para)) {
                            // do whatever you need to do when a paragraph is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine)) {
                            // do whatever you need to do when a line of text is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Word)) {
                            // do whatever you need to do when a word is encountered.
                        }

                        // get bounding box for symbol
                        Rect symbolBounds;
                        if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds)) {
                            // do whatever you want with bounding box for the symbol
                        }
                    } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));   // next symbol in this word
                } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));     // next word in this line
            } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));         // next line in this paragraph
        } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));                // next paragraph in this block
    } while (iter.Next(PageIteratorLevel.Block));                                            // next block on the page
}

Note that the general result hierarchy is as follows:

Block -> Para -> TextLine -> Word -> Symbol

I.e. the result set can contain many Blocks, which can in turn contain many Paragraphs and so on.
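
Since Symbol is the lowest level of that hierarchy, a flat loop over symbols may be enough when only the character boxes are needed, as in the original question. The following is a minimal sketch, not part of the console sample: the SymbolBoxes class and its method names are made up for illustration, it assumes the wrapper's Rect exposes X1, Y1, Width and Height, and it assumes the boxes are in the pixel coordinates of the processed image (so they would need scaling if the image is displayed at a different size).

```csharp
using System.Collections.Generic;
using System.Drawing;
using Tesseract;

// Hypothetical helper, for illustration only.
public static class SymbolBoxes
{
    // Collects one rectangle per recognised symbol (character) on the page.
    public static List<Rectangle> Collect(Page page)
    {
        var boxes = new List<Rectangle>();
        using (var iter = page.GetIterator())
        {
            iter.Begin();
            do
            {
                Rect bounds;
                // Returns false when no box is available for the current symbol.
                if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out bounds))
                {
                    // Assumption: Rect exposes X1, Y1, Width and Height.
                    boxes.Add(new Rectangle(bounds.X1, bounds.Y1, bounds.Width, bounds.Height));
                }
            } while (iter.Next(PageIteratorLevel.Symbol)); // advance to the next symbol on the page
        }
        return boxes;
    }

    // Draws the collected rectangles, e.g. from a control's Paint handler.
    public static void Draw(Graphics g, IEnumerable<Rectangle> boxes)
    {
        using (var pen = new Pen(Color.Red))
        {
            foreach (var box in boxes)
            {
                g.DrawRectangle(pen, box);
            }
        }
    }
}
```

In a WinForms app, Collect could be called once after a.Process(image), and Draw could then be called with e.Graphics inside the Paint event of the control that shows the image.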

Answer for Question 2:

As shown above, the TryGetBoundingBox method returns the bounds in an out parameter, much like Dictionary.TryGetValue does. The boolean return value only tells you whether a bounding box was available.
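
For readers unfamiliar with this Try-pattern, here is a small self-contained sketch using Dictionary.TryGetValue. The Box struct and the TryGetBox helper are hypothetical stand-ins that only mimic the shape of TryGetBoundingBox; they are not part of the Tesseract wrapper.

```csharp
using System;
using System.Collections.Generic;

public static class TryPatternDemo
{
    // Hypothetical stand-in for a bounding box; not the wrapper's Rect type.
    public struct Box
    {
        public int X, Y, Width, Height;
    }

    // Hypothetical lookup following the Try-pattern: the bool reports success,
    // and the actual value comes back through the out parameter.
    public static bool TryGetBox(IDictionary<string, Box> boxes, string key, out Box box)
    {
        return boxes.TryGetValue(key, out box);
    }

    public static void Main()
    {
        var boxes = new Dictionary<string, Box>
        {
            { "A", new Box { X = 10, Y = 20, Width = 8, Height = 12 } }
        };

        Box found;
        if (TryGetBox(boxes, "A", out found))
        {
            // The out parameter now holds the stored box.
            Console.WriteLine("A: {0},{1} {2}x{3}", found.X, found.Y, found.Width, found.Height);
        }

        Box missing;
        if (!TryGetBox(boxes, "B", out missing))
        {
            // On failure the out parameter is left at its default value.
            Console.WriteLine("B: no box available");
        }
    }
}
```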

kndnath · 2019-09-01T14:59:43Z

Hi Charles,

Hope you're doing great.

I am new to this. I can get the required text from a small picture or a test picture, but not from the actual picture:

How do I extract a BIB# from a photograph?
How do I recognize the BIB# area within the whole photograph?

Thanks.

tdhintz · 2019-09-01T15:07:04Z

Use OpenCV to find and crop the region. There is a guy with demos written in Python that aren't too hard to translate to .NET.

ominouse closed this as completed Jan 13, 2014

Codendaal1120 mentioned this issue Mar 5, 2014

Read each Symbol #81

Closed

azs-rahimi mentioned this issue Nov 25, 2014

Finding position of block is so slow #138

Closed
