Merge lines using bounding boxes #59

Open
danvk opened this issue Apr 30, 2015 · 1 comment

danvk commented Apr 30, 2015

I'm currently doing this with the OCR'd text directly, mostly out of expedience. Lines with similar widths are joined.

But it would be better to do this with the bounding boxes from ocropus-gpageseg. For example, in 712393b, the first line of the paragraph is indented. The right edges of the lines in the paragraph are all close to one another, even though the first line has fewer characters.
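As a rough illustration of the right-edge idea, here is a minimal sketch. The `Box` type, the pixel values, and the 15-pixel tolerance are all assumptions made up for the example; they are not the ocropus-gpageseg output format or anything in extract_ocropy_text.py.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical line bounding box in pixels (not an ocropus type)."""
    x0: int  # left
    y0: int  # top
    x1: int  # right
    y1: int  # bottom


def right_edges_align(a: Box, b: Box, tolerance: int = 15) -> bool:
    """True when two line boxes end at roughly the same x coordinate."""
    return abs(a.x1 - b.x1) <= tolerance


def group_by_right_edge(boxes: List[Box], tolerance: int = 15) -> List[List[Box]]:
    """Group consecutive lines (top to bottom) whose right edges align."""
    groups: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y0):
        if groups and right_edges_align(groups[-1][-1], box, tolerance):
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


# Made-up boxes: an indented first line, two full lines, a short last line.
lines = [
    Box(60, 10, 400, 30),
    Box(20, 35, 398, 55),
    Box(20, 60, 402, 80),
    Box(20, 85, 150, 105),
]
groups = group_by_right_edge(lines)
```

With these sample boxes the indented first line lands in the same group as the two full lines, because only the right edge is compared. The short last line ends up in its own group, so right edges alone would not be enough to decide where a paragraph ends.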

Vertical gaps between lines could also be used as cues here.
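One possible way to use the gaps, sketched under assumptions: compare each gap to the median gap on the page and treat anything much larger as a paragraph break. The `(top, bottom)` tuples and the 1.5 factor are illustrative choices, not values from the project.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical (top, bottom) y-extents of line boxes, sorted top to bottom.
Extent = Tuple[int, int]


def paragraph_breaks(extents: List[Extent], factor: float = 1.5) -> List[int]:
    """Return indices i such that a paragraph break likely precedes line i."""
    # Gap between the bottom of one line and the top of the next (never negative).
    gaps = [max(0, nxt[0] - cur[1]) for cur, nxt in zip(extents, extents[1:])]
    if not gaps:
        return []
    # A gap counts as a break when it is well above the typical line spacing.
    threshold = max(1.0, median(gaps) * factor)
    return [i + 1 for i, gap in enumerate(gaps) if gap > threshold]


# Made-up extents: three tightly spaced lines, a larger gap, then two more lines.
extents = [(10, 30), (35, 55), (60, 80), (100, 120), (125, 145)]
breaks = paragraph_breaks(extents)
```

Here `breaks` is `[3]`: the only gap above the threshold is the one before the fourth line. The median keeps a single large gap from skewing the baseline, though a page with very few lines would need a different rule.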

While I'm at it, it would also be better to detect "NO REPRODUCTIONS"-style lines on a per-box basis, since these sometimes get merged with dates or attributions.
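A per-box check could look something like the sketch below. It assumes each box's OCR text is available as a separate string, and it uses fuzzy matching from the standard library's `difflib` because OCR output is often slightly garbled. The phrase list, the 0.8 threshold, and the sample strings are all made up for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import List

# Assumed phrase list; the real set of boilerplate phrases is not given here.
BOILERPLATE = ["NO REPRODUCTIONS"]


def normalize(text: str) -> str:
    """Uppercase and collapse every run of non-letters into a single space."""
    return re.sub(r"[^A-Z]+", " ", text.upper()).strip()


def is_boilerplate(box_text: str, min_ratio: float = 0.8) -> bool:
    """True when the text of one box closely matches a boilerplate phrase."""
    cleaned = normalize(box_text)
    return any(
        SequenceMatcher(None, cleaned, phrase).ratio() >= min_ratio
        for phrase in BOILERPLATE
    )


def drop_boilerplate(box_texts: List[str]) -> List[str]:
    """Filter boxes one at a time, before any lines are merged."""
    return [text for text in box_texts if not is_boilerplate(text)]


# Made-up per-box OCR output, including a misread "NO" as "N0".
boxes = ["Photo by a staff photographer", "N0 REPRODUCTIONS", "May 1931"]
kept = drop_boilerplate(boxes)
```

Because the filter runs on individual boxes, the date and attribution survive even when they sit right next to the boilerplate line on the page.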

This would be done in extract_ocropy_text.py.

danvk commented Apr 30, 2015

722041f is an interesting case here. The short line ("east side.") that falls between two paragraphs should be joined to the first paragraph.
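A case like this might be handled by a check along these lines. This is only a sketch: the `Box` type, the tolerances, and the sample coordinates are assumptions, not measurements from 722041f.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical line bounding box in pixels (not an ocropus type)."""
    x0: int  # left
    y0: int  # top
    x1: int  # right
    y1: int  # bottom


def is_paragraph_tail(prev: Box, line: Box,
                      margin_tol: int = 15, gap_factor: float = 0.5) -> bool:
    """Guess whether `line` is the short final line of the paragraph ending at `prev`."""
    line_height = prev.y1 - prev.y0
    # Same left margin as the previous line, so not an indented new paragraph.
    starts_at_margin = abs(line.x0 - prev.x0) <= margin_tol
    # Ends well before the previous line's right edge.
    ends_early = line.x1 < prev.x1 - margin_tol
    # Sits directly below the previous line, with no paragraph-sized gap.
    gap = max(0, line.y0 - prev.y1)
    close_below = gap <= gap_factor * line_height
    return starts_at_margin and ends_early and close_below


# Made-up coordinates: a full line, a short line under it, then an indented line.
prev_line = Box(20, 60, 402, 80)
short_line = Box(20, 85, 110, 105)
next_para = Box(60, 130, 400, 150)

tail = is_paragraph_tail(prev_line, short_line)
new_para = is_paragraph_tail(short_line, next_para)
```

With these values `tail` is `True` and `new_para` is `False`, so the short line would attach to the paragraph above it rather than the one below.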
