pdPageExtractText should support multi-column documents #17

Open
sambitdash opened this issue Nov 14, 2017 · 4 comments

Comments

@sambitdash
Owner

This implementation may need to be reviewed along with #2. Although the two may not overlap exactly, in some cases the implementation logic can be similar.

@Nosferican

Is there currently any way to do this?

@sambitdash
Owner Author

Not really. You can manually inspect the position of every text run and check whether the runs form a column. The specification does not provide any structural hints for this.
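
As a rough illustration of that manual approach, here is a minimal Python sketch that groups text runs into columns by looking for wide horizontal gaps between their bounding boxes. The `TextRun` record, the `split_into_columns` and `extract_text` helpers, and the `min_gap` threshold are all hypothetical names made up for this example; none of them are part of the library, and the bounding boxes are assumed to have been collected beforehand.

```python
from typing import List, NamedTuple


class TextRun(NamedTuple):
    # Hypothetical record for one text run; not a type from the library.
    text: str
    x0: float  # left edge
    y0: float  # bottom edge (PDF coordinates: y grows upward)
    x1: float  # right edge
    y1: float  # top edge


def split_into_columns(runs: List[TextRun], min_gap: float = 20.0) -> List[List[TextRun]]:
    """Group runs into columns separated by gutters at least min_gap wide."""
    if not runs:
        return []
    ordered = sorted(runs, key=lambda r: r.x0)
    columns = [[ordered[0]]]
    right_edge = ordered[0].x1  # rightmost extent of the current column
    for run in ordered[1:]:
        if run.x0 - right_edge >= min_gap:
            # A wide empty strip: assume a new column starts here.
            columns.append([run])
            right_edge = run.x1
        else:
            columns[-1].append(run)
            right_edge = max(right_edge, run.x1)
    return columns


def extract_text(runs: List[TextRun], min_gap: float = 20.0) -> str:
    """Emit text column by column (left to right), top to bottom within each."""
    lines = []
    for column in split_into_columns(runs, min_gap):
        for run in sorted(column, key=lambda r: -r.y1):
            lines.append(run.text)
    return "\n".join(lines)
```

This only works when a clear gutter runs the full height of the page. A full-width heading or figure caption bridges the gap and merges the columns, so a real implementation would likely need to split the page into horizontal bands first.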

@vargonis

On a related note, since the nature of the format means the output of pdPageExtractText is not fully determined, it would be useful to:

  1. Have access to character-level information (font, bounding box and so on).
  2. Document what the word inference and ordering heuristics are.
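
To make concrete what a word inference heuristic can look like, here is a minimal Python sketch of one common approach: split a line of glyphs into words wherever the horizontal gap between neighbours exceeds a fraction of the font size. This is not the library's actual heuristic, which this thread does not describe; the `Glyph` record, the `glyphs_to_words` function, and the `gap_ratio` value are assumptions made up for the example.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    # Hypothetical character-level record; not a type from the library.
    char: str
    x0: float         # left edge of the glyph's bounding box
    x1: float         # right edge of the glyph's bounding box
    font_size: float  # font size in the same units as x0/x1


def glyphs_to_words(glyphs: List[Glyph], gap_ratio: float = 0.25) -> List[str]:
    """Join glyphs on one line into words.

    A new word starts when the gap to the previous glyph is larger than
    gap_ratio * font_size (an assumed threshold, tuned per document).
    """
    words: List[str] = []
    current = ""
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x0):
        if prev is not None and g.x0 - prev.x1 > gap_ratio * g.font_size:
            words.append(current)
            current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words
```

Because the threshold is a guess, two extractors can split the same line differently, which is one reason the output is not fully determined.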

@sambitdash
Owner Author

@vargonis you can use pdPageEvalContent to get the content tree. The content tree has all the bounding-box information at the text-run level.
