Read each Symbol #81

Closed
Codendaal1120 opened this issue Mar 5, 2014 · 3 comments

@Codendaal1120

Hi charlesw

I have an issue similar to the one described in #64.

This is the example you provided:

using (var iter = page.GetIterator()) {
    do {
        do {
            do {
                if (iter.IsAtBeginningOf(PageIteratorLevel.Para)) {
                    p++;
                }
            } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Block));
        } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
    } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
}
using (var iter = page.GetIterator()) {
    do {
        l++;
    } while (iter.Next(PageIteratorLevel.Para));
}

Using the code above I get p = 2 and l = 16, whereas GetHOCRText returns 14 paragraphs.

What I would like is to capture each symbol and its confidence, grouped by word and line. How can I iterate through each line, then each word, and then each symbol?

Regards

@charlesw
Owner

charlesw commented Mar 5, 2014

Hi, you're not getting all paragraphs because you're not iterating through all blocks (these are the topmost element). To iterate through each symbol, you need to change the inner loop's while to while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol)); See below for an example that iterates through the entire document hierarchy.

using (var iter = page.GetIterator()) {
    iter.Begin();
    do {
        do {
            do {
                do {
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Block)) {
                        logger.Log("New block");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Para)) {
                        logger.Log("New paragraph");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine)) {
                        logger.Log("New line");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Word)) {
                        logger.Log("New word");
                    }
                    logger.Log(iter.GetText(PageIteratorLevel.Symbol));

                    // get bounding box for symbol
                    Rect symbolBounds;
                    if(iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds)) {
                        // do whatever you want with bounding box for the symbol
                    }
                } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                // DO any word post processing here (e.g. group symbols by word)
            } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
        } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
    } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
}

@Codendaal1120
Author

Hi, I am still not getting the required results. I am using the attached image, which contains some random text. The actual numbers are:
Paragraphs = 5
Lines = 27
Words = 429
Characters = 2204 (without spaces)

The HTML from GetHOCRText shows 5 paragraphs, 27 lines and 283 words. It's not exact, but close.

The loop code is:

using (var iter = page.GetIterator())
{
    iter.Begin();
    do
    {
        do
        {
            do
            {
                do
                {
                    //if (iter.IsAtBeginningOf(PageIteratorLevel.Block))
                    //{
                    //    logger.Log("New block");
                    //}
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
                    {
                        p++;//counts paragraph
                        //logger.Log("New paragraph");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
                    {
                        l++;//count lines
                        //logger.Log("New line");
                    }
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Word))
                    {
                        w++;//count words
                        //logger.Log("New word");
                    }
                    s++;//count symbols
                    //logger.Log(iter.GetText(PageIteratorLevel.Symbol));
                    // get bounding box for symbol
                    Rect symbolBounds;
                    if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds))
                    {
                        // do whatever you want with bounding box for the symbol
                    }
                } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                // DO any word post processing here (e.g. group symbols by word)
            } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
        } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
    } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
}
Debug.WriteLine("Pragraphs = " + p);
Debug.WriteLine("Lines = " + l);
Debug.WriteLine("Words = " + w);
Debug.WriteLine("Symbols = " + s);

The output is:
Pragraphs = 2
Lines = 8
Words = 74
Symbols = 632

The image I am using is attached (testplain).

@charlesw
Owner

charlesw commented Mar 6, 2014

Oops, it needed one more loop to iterate through the blocks themselves. My sample program:

using System;
using System.IO;

namespace Tesseract.Issue81
{
    class Program
    {
        public static void Main(string[] args)
        {
            var testImagePath = "./test.jpg";
            if (args.Length > 0) {
                testImagePath = args[0];
            }

            try {
                using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default)) {
                    using (var img = Pix.LoadFromFile(testImagePath)) {
                        using (var page = engine.Process(img)) {
                            var text = page.GetHOCRText(1);
                            File.WriteAllText("test.html", text);
                            //Console.WriteLine("Text: {0}", text);
                            //Console.WriteLine("Mean confidence: {0}", page.GetMeanConfidence());

                            int p = 0;
                            int l = 0;
                            int w = 0;
                            int s = 0;
                            using (var iter = page.GetIterator()) {
                                iter.Begin();
                                do {
                                    do
                                    {
                                        do
                                        {
                                            do
                                            {
                                                do
                                                {
                                                    //if (iter.IsAtBeginningOf(PageIteratorLevel.Block))
                                                    //{
                                                    //    logger.Log("New block");
                                                    //}
                                                    if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
                                                    {
                                                        p++;//counts paragraph
                                                        //logger.Log("New paragraph");
                                                    }
                                                    if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
                                                    {
                                                        l++;//count lines
                                                        //logger.Log("New line");
                                                    }
                                                    if (iter.IsAtBeginningOf(PageIteratorLevel.Word))
                                                    {
                                                        w++;//count words
                                                        //logger.Log("New word");
                                                    }
                                                    s++;//count symbols
                                                    //logger.Log(iter.GetText(PageIteratorLevel.Symbol));
                                                    // get bounding box for symbol
                                                    Rect symbolBounds;
                                                    if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds))
                                                    {
                                                        // do whatever you want with bounding box for the symbol
                                                    }
                                                } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                                                // DO any word post processing here (e.g. group symbols by word)
                                            } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
                                        } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
                                    } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
                                } while(iter.Next(PageIteratorLevel.Block));
                            }
                            Console.WriteLine("Pragraphs = " + p);
                            Console.WriteLine("Lines = " + l);
                            Console.WriteLine("Words = " + w);
                            Console.WriteLine("Symbols = " + s);        
                        }
                    }
                }
            } catch (Exception e) {
                Console.WriteLine("Unexpected Error: " + e.Message);
                Console.WriteLine("Details: ");
                Console.WriteLine(e.ToString());
            }
            Console.Write("Press any key to continue . . . ");
            Console.ReadKey(true);
        }
    }
}

Using the attached image, this returns the following, which matches the results from the hOCR output:

Pragraphs = 5
Lines = 27
Words = 286
Symbols = 1926
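
The original question also asked how to keep each symbol with its confidence, grouped by word and line. The following is a minimal sketch of one way to do that, not code from this thread. SymbolInfo and SymbolGrouper are hypothetical helper types, not part of the Tesseract wrapper. The sketch assumes the symbols were collected in reading order (for example inside the innermost loop above), with Text taken from iter.GetText(PageIteratorLevel.Symbol), Confidence from iter.GetConfidence(PageIteratorLevel.Symbol), and the two flags from iter.IsAtBeginningOf for TextLine and Word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One recognised symbol plus the flags reported at its position.
// Hypothetical type for illustration; not part of the Tesseract wrapper.
public sealed class SymbolInfo
{
    public string Text;
    public float Confidence;
    public bool StartsLine; // assumed: iter.IsAtBeginningOf(PageIteratorLevel.TextLine)
    public bool StartsWord; // assumed: iter.IsAtBeginningOf(PageIteratorLevel.Word)
}

public static class SymbolGrouper
{
    // Groups a flat, reading-order symbol stream into lines -> words -> symbols.
    public static List<List<List<SymbolInfo>>> Group(IEnumerable<SymbolInfo> symbols)
    {
        var lines = new List<List<List<SymbolInfo>>>();
        foreach (var symbol in symbols)
        {
            // The very first symbol always opens a line, even if its flag is unset.
            if (symbol.StartsLine || lines.Count == 0)
            {
                lines.Add(new List<List<SymbolInfo>>());
            }
            var line = lines[lines.Count - 1];

            // A fresh line has no words yet, so its first symbol opens one.
            if (symbol.StartsWord || line.Count == 0)
            {
                line.Add(new List<SymbolInfo>());
            }
            line[line.Count - 1].Add(symbol);
        }
        return lines;
    }

    // Concatenates the symbol texts of one word.
    public static string WordText(List<SymbolInfo> word)
    {
        return string.Concat(word.Select(s => s.Text));
    }

    // Mean confidence over the symbols of one word (words are never empty here).
    public static float MeanConfidence(List<SymbolInfo> word)
    {
        return word.Average(s => s.Confidence);
    }
}
```

Keeping the grouping separate from the iterator loop means the loop only has to append one SymbolInfo per symbol, and the word and line structure can be rebuilt or checked afterwards without re-running OCR.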
