Read each Symbol #81

Codendaal1120 · 2014-03-05T13:59:31Z

Hi charlesw,

I have an issue similar to the one described in #64.

If I use the example you provided:


using (var iter = page.GetIterator())
{
    do
    {
        do
        {
            do
            {
                if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
                {
                    p++;
                }
            } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Block));
        } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
    } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
}


using (var iter = page.GetIterator())
{
    do
    {
        l++;
    } while (iter.Next(PageIteratorLevel.Para));
}

With the code above I get p = 2 and l = 16, but GetHOCRText returns 14 paragraphs.

What I would like is to capture each symbol and its confidence, grouped by word and line. How can I iterate through each line, word, and then symbol?

Regards

charlesw · 2014-03-05T20:34:17Z

Hi, you're not getting all paragraphs because you're not iterating through all blocks (these are the topmost element). To iterate through each symbol you need to change the inner loop's condition to while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol)); as in the example below, which iterates through the entire document hierarchy.

using (var iter = page.GetIterator()) {
    iter.Begin();
    do {
        do {
            do {
                do {
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Block)) {
                        logger.Log("New block");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Para)) {
                        logger.Log("New paragraph");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine)) {
                        logger.Log("New line");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Word)) {
                        logger.Log("New word");
                    }
                    logger.Log(iter.GetText(PageIteratorLevel.Symbol));

                    // get bounding box for symbol
                    Rect symbolBounds;
                    if(iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds)) {
                        // do whatever you want with bounding box for the symbol
                    }
                } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                // DO any word post processing here (e.g. group symbols by word)
            } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
        } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
    } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
}

Codendaal1120 · 2014-03-06T05:39:58Z

Hi, I am still not getting the required results. I am using the attached image, which contains some random text. The actual numbers are:
Paragraphs = 5
Lines = 27
Words = 429
Characters = 2204 (Without spaces)

The HTML from GetHOCRText shows 5 paragraphs, 27 lines and 283 words. It's not correct, but close.

The loop code is:


using (var iter = page.GetIterator())
{
    iter.Begin();
    do
    {
        do
        {
            do
            {
                do
                {
                    //if (iter.IsAtBeginningOf(PageIteratorLevel.Block))
                    //{
                    //    logger.Log("New block");
                    //}
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
                    {
                        p++;//counts paragraph
                        //logger.Log("New paragraph");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
                    {
                        l++;//count lines
                        //logger.Log("New line");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Word))
                    {
                        w++;//count words
                        //logger.Log("New word");
                    }
                    s++;//count symbols
                    //logger.Log(iter.GetText(PageIteratorLevel.Symbol));
                    // get bounding box for symbol
                    Rect symbolBounds;
                    if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds))
                    {
                        // do whatever you want with bounding box for the symbol
                    }
                } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                // DO any word post processing here (e.g. group symbols by word)
            } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
        } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
    } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
}
Debug.WriteLine("Paragraphs = " + p);
Debug.WriteLine("Lines = " + l);
Debug.WriteLine("Words = " + w);
Debug.WriteLine("Symbols = " + s);

The output is:
Paragraphs = 2
Lines = 8
Words = 74
Symbols = 632

The image I am using is attached.

charlesw · 2014-03-06T08:01:12Z

Oops, it needed one more loop to iterate through the blocks themselves. My sample program:

using System;
using System.IO;

namespace Tesseract.Issue81
{
    class Program
    {
        public static void Main(string[] args)
        {
            var testImagePath = "./test.jpg";
            if (args.Length > 0) {
                testImagePath = args[0];
            }

            try {
                using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default)) {
                    using (var img = Pix.LoadFromFile(testImagePath)) {
                        using (var page = engine.Process(img)) {
                            var text = page.GetHOCRText(1);
                            File.WriteAllText("test.html", text);
                            //Console.WriteLine("Text: {0}", text);
                            //Console.WriteLine("Mean confidence: {0}", page.GetMeanConfidence());

                            int p = 0;
                            int l = 0;
                            int w = 0;
                            int s = 0;
                            using (var iter = page.GetIterator()) {
                                iter.Begin();
                                do {
                                    do
                                    {
                                        do
                                        {
                                            do
                                            {
                                                do
                                                {
                                                    //if (iter.IsAtBeginningOf(PageIteratorLevel.Block))
                                                    //{
                                                    //    logger.Log("New block");
                                                    //}
                                                    if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
                                                    {
                                                        p++;//counts paragraph
                                                        //logger.Log("New paragraph");
                                                    }
                                                    if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
                                                    {
                                                        l++;//count lines
                                                        //logger.Log("New line");
                                                    }
                                                    if (iter.IsAtBeginningOf(PageIteratorLevel.Word))
                                                    {
                                                        w++;//count words
                                                        //logger.Log("New word");
                                                    }
                                                    s++;//count symbols
                                                    //logger.Log(iter.GetText(PageIteratorLevel.Symbol));
                                                    // get bounding box for symbol
                                                    Rect symbolBounds;
                                                    if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds))
                                                    {
                                                        // do whatever you want with bounding box for the symbol
                                                    }
                                                } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                                                // DO any word post processing here (e.g. group symbols by word)
                                            } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
                                        } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
                                    } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
                                } while(iter.Next(PageIteratorLevel.Block));
                            }
                            Console.WriteLine("Paragraphs = " + p);
                            Console.WriteLine("Lines = " + l);
                            Console.WriteLine("Words = " + w);
                            Console.WriteLine("Symbols = " + s);
                        }
                    }
                }
            } catch (Exception e) {
                Console.WriteLine("Unexpected Error: " + e.Message);
                Console.WriteLine("Details: ");
                Console.WriteLine(e.ToString());
            }
            Console.Write("Press any key to continue . . . ");
            Console.ReadKey(true);
        }
    }
}

With the attached image, this returns the following, which matches the results from the hOCR output:

Paragraphs = 5
Lines = 27
Words = 286
Symbols = 1926
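
The original question also asked about keeping each symbol's confidence and grouping the symbols by word and line. The following is only a minimal sketch of one way to do that, not something from the wrapper itself: SymbolCollector, LineInfo, WordInfo and SymbolInfo are hypothetical helper types, and the sample values in Main are hand-written stand-ins for what the iterator would return.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper types for illustration; they are not part of the Tesseract wrapper.
public class SymbolInfo
{
    public string Text;
    public float Confidence;
}

public class WordInfo
{
    public readonly List<SymbolInfo> Symbols = new List<SymbolInfo>();
}

public class LineInfo
{
    public readonly List<WordInfo> Words = new List<WordInfo>();
}

public class SymbolCollector
{
    public readonly List<LineInfo> Lines = new List<LineInfo>();

    // Call once per symbol, in reading order. The two flags are meant to be the
    // results of IsAtBeginningOf(TextLine) and IsAtBeginningOf(Word) for that symbol.
    public void Add(string text, float confidence, bool startsLine, bool startsWord)
    {
        // Open a new line when the symbol starts one (or when nothing exists yet).
        if (startsLine || Lines.Count == 0)
        {
            Lines.Add(new LineInfo());
        }
        LineInfo line = Lines[Lines.Count - 1];

        // A fresh line has no words, so its first symbol always opens a word.
        if (startsWord || line.Words.Count == 0)
        {
            line.Words.Add(new WordInfo());
        }
        WordInfo word = line.Words[line.Words.Count - 1];

        word.Symbols.Add(new SymbolInfo { Text = text, Confidence = confidence });
    }
}

public static class Demo
{
    public static void Main()
    {
        var collector = new SymbolCollector();

        // Hand-written sample data standing in for iterator output.
        collector.Add("H", 95f, true, true);   // first line, first word
        collector.Add("i", 90f, false, false); // same word
        collector.Add("a", 80f, false, true);  // second word on the first line
        collector.Add("b", 70f, true, true);   // second line

        Console.WriteLine("Lines = " + collector.Lines.Count);
        Console.WriteLine("Words in first line = " + collector.Lines[0].Words.Count);
    }
}
```

In the sample program above, the call would go inside the innermost loop, next to s++, along the lines of collector.Add(iter.GetText(PageIteratorLevel.Symbol), iter.GetConfidence(PageIteratorLevel.Symbol), iter.IsAtBeginningOf(PageIteratorLevel.TextLine), iter.IsAtBeginningOf(PageIteratorLevel.Word)). This assumes GetConfidence is exposed by the iterator in the wrapper version you are using, so check that before relying on it.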

charlesw added the question label Mar 5, 2014

charlesw closed this as completed Jun 22, 2014

azs-rahimi mentioned this issue Nov 25, 2014

Finding position of block is so slow #138

Closed
