Read each Symbol #81
Comments
Hi, you're not getting all paragraphs because you're not iterating through all blocks (these are the topmost elements). To iterate through each symbol you need to change the inner loop's while to
Hi, I am still not getting the required results. I am using the attached image, which contains some random text. The actual numbers are: The HTML from GetHOCRText shows 5 paragraphs, 27 lines and 283 words. It's not correct, but close. The loop code is
The output is:
Oops, it needed one more loop to iterate through the blocks themselves. My sample program:
This returns the following for the attached image, which matches the results from the HOCR output: Pragraphs = 5
Hi charlesw
I have a similar issue as described in #64.
If I use the example you provided:

    using (var iter = page.GetIterator())
    {
        do
        {
            do
            {
                do
                {
                    if (iter.IsAtBeginningOf(PageIteratorLevel.Para)) { p++; }
                } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Block));
            } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
        } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
    }

    using (var iter = page.GetIterator())
    {
        do
        {
            l++;
        } while (iter.Next(PageIteratorLevel.Para));
    }
Using the code above I get p = 2 and l = 16, whereas GetHOCRText reports 14 paragraphs.
What I would like is to capture each symbol and its confidence, grouped by word and line. How can I iterate through each line, then each word, and then each symbol?
Regards
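For readers with the same question, here is a minimal sketch of one way to group symbols by word and line. It is not the maintainer's sample program, which is not reproduced in this thread. Instead of nested loops it advances a single iterator one symbol at a time and uses IsAtBeginningOf to detect where a new line or word starts. The tessdata folder, the image file name and the class name are assumptions for illustration, and the output format is arbitrary.

```csharp
using System;
using Tesseract;

class SymbolDump
{
    static void Main()
    {
        // Assumed paths: point these at your own tessdata folder and image.
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(@"./sample.png"))
        using (var page = engine.Process(img))
        using (var iter = page.GetIterator())
        {
            int line = 0, word = 0;
            iter.Begin();
            do
            {
                // A symbol that starts a text line also starts a new group of words.
                if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
                {
                    line++;
                    word = 0;
                }

                // Print the whole word once, when its first symbol is reached.
                if (iter.IsAtBeginningOf(PageIteratorLevel.Word))
                {
                    word++;
                    Console.WriteLine("line {0}, word {1}: {2}",
                        line, word, iter.GetText(PageIteratorLevel.Word));
                }

                // Per-symbol text and confidence.
                Console.WriteLine("    '{0}' ({1:F1})",
                    iter.GetText(PageIteratorLevel.Symbol),
                    iter.GetConfidence(PageIteratorLevel.Symbol));
            } while (iter.Next(PageIteratorLevel.Symbol));
        }
    }
}
```

Walking at the Symbol level with the single-argument Next visits every symbol on the page regardless of which block it belongs to, so it should not miss paragraphs the way a loop that skips the block level can. Counting paragraphs or blocks in the same pass would only need further IsAtBeginningOf checks for PageIteratorLevel.Para and PageIteratorLevel.Block.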