C# Tesseract 3.02 How I access each character of word from image #64

Closed
ominouse opened this issue Jan 12, 2014 · 3 comments
@ominouse

Hi, I'm a newbie here.
I need to draw a rectangle around each character of a word in an image.
In an older version of Tesseract (tessnet2), I found that each character can be accessed like this:

foreach (tessnet2.Character c in word.CharList)
e.Graphics.DrawRectangle..........


But now I'm working on a C# WinForms application with Tesseract 3.02:

TesseractEngine a = new TesseractEngine(@"./tessdata", "eng", EngineMode.TesseractAndCube);
Tesseract.Page page1 = a.Process(image);
foreach ( ....... in page1)
{
// draw rectangle from (bounding box of each character)
}

Question 1: How do I access each character of page1?

I tried several approaches, such as PageIteratorLevel, and could get parts of the page like the first line, first word, or first block, but I can't get the first character of any of them.
I also noticed that in the HOCR text result from page1, each element (word, line, block) has a bounding box value.

Question 2: How do I get the bounding box value of each element? (I found only one method, "TryGetBoundingBox", and it returns only a boolean.)

Thank you.

@charlesw
Owner

Answer for Q1:

Check out the console sample provided, as it gives an example of how to iterate through the results. Something like the following should work:

using (var iter = page.GetIterator()) {
    iter.Begin();
    // Next(level, element) is understood here to advance to the next `element`
    // inside the current `level`, returning false at the last one.
    do {
        do {
            do {
                do {
                    do {
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Block)) {
                            // do whatever you need to do when a block (top most level result) is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Para)) {
                            // do whatever you need to do when a paragraph is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine)) {
                            // do whatever you need to do when a line of text is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Word)) {
                            // do whatever you need to do when a word is encountered.
                        }

                        // get bounding box for the current symbol (character)
                        Rect symbolBounds;
                        if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds)) {
                            // do whatever you want with bounding box for the symbol
                        }
                    } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
            } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
        } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
    } while (iter.Next(PageIteratorLevel.Block));
}

Note that the general result hierarchy is as follows:

Block -> Para -> TextLine -> Word -> Symbol

I.e. the result set can contain many Blocks, which can in turn contain many Paragraphs and so on.
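
If only the per-character rectangles are needed, as in the original question, a flatter loop may be enough. The following is a minimal sketch, not the wrapper's official sample: the class and method names (SymbolBoxes, Collect, Draw) are hypothetical, and it assumes the wrapper version in use provides the single-argument iter.Next(PageIteratorLevel.Symbol) overload and the Rect properties X1, Y1, Width and Height.

```csharp
using System.Collections.Generic;
using System.Drawing;
using Tesseract;

// Hypothetical helper, named for illustration only.
static class SymbolBoxes
{
    // Walks the page symbol by symbol and collects each bounding box.
    // Assumes iter.Next(PageIteratorLevel.Symbol) advances across word,
    // line and block boundaries until no symbols remain.
    public static List<Rect> Collect(Page page)
    {
        var boxes = new List<Rect>();
        using (var iter = page.GetIterator())
        {
            iter.Begin();
            do
            {
                Rect bounds;
                if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out bounds))
                {
                    boxes.Add(bounds);
                }
            } while (iter.Next(PageIteratorLevel.Symbol));
        }
        return boxes;
    }

    // Draws one red rectangle per collected box onto the given surface.
    public static void Draw(Graphics g, IEnumerable<Rect> boxes)
    {
        using (var pen = new Pen(Color.Red, 1f))
        {
            foreach (var b in boxes)
            {
                g.DrawRectangle(pen, b.X1, b.Y1, b.Width, b.Height);
            }
        }
    }
}
```

In a WinForms Paint handler this could be used as SymbolBoxes.Draw(e.Graphics, boxes), with boxes collected once after engine.Process(image). The coordinates are in pixels of the processed image, so they need scaling if the image is displayed at a different size.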

Answer for Question 2:

As shown above, the TryGetBoundingBox method returns the bounds in an out parameter, much like Dictionary.TryGetValue does. The boolean return value only indicates whether a bounding box was available.
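
For readers unfamiliar with this "Try" pattern, here is a small self-contained sketch using Dictionary.TryGetValue from the .NET base library. The class name TryPatternDemo and the sample data are made up for illustration; the point is only that the result arrives through the out variable while the return value reports success.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class, named for illustration only.
public static class TryPatternDemo
{
    // Looks up a key and reports what the Try call produced.
    public static string Describe(Dictionary<string, int> widths, string key)
    {
        int width; // filled in by TryGetValue through the out parameter
        if (widths.TryGetValue(key, out width))
        {
            return "found: " + width;
        }
        // On failure the out variable is set to default(int), i.e. 0.
        return "missing; width is " + width;
    }

    public static void Main()
    {
        var widths = new Dictionary<string, int> { { "a", 12 } };
        Console.WriteLine(Describe(widths, "a")); // prints "found: 12"
        Console.WriteLine(Describe(widths, "b")); // prints "missing; width is 0"
    }
}
```

TryGetBoundingBox is used the same way: declare a Rect variable, pass it with out, and read it only when the call returns true.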

@kndnath

kndnath commented Sep 1, 2019

Hi Charles,

Hope you're doing great.

I am new to this. I can get the required text from a small test picture, but not from the actual photograph:

  1. How do I extract a BIB# from a photograph?

  2. How do I recognize the BIB# area within the whole photograph?

Thanks.

@tdhintz
Contributor

tdhintz commented Sep 1, 2019 via email
