Feature: support GetWords() #46

Open
Banyc opened this issue Jul 26, 2021 · 2 comments
Labels
enhancement New feature or request

Banyc commented Jul 26, 2021

Is your feature request related to a problem? Please describe.

We usually query a document word by word, but docnet only supports character-oriented queries. Those are useful because users can build word-oriented queries on top of them. Still, since this is such a common requirement, I believe it would be better implemented in the docnet package itself.

Describe the solution you'd like

I would therefore like a GetWords function that returns a list of words. Each word model would carry its bounding box and its text, just like the character models returned by GetCharacters.
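
To make the request concrete, here is a minimal sketch of the shape such a result could take. It is written in Python for brevity and does not use the docnet API: `Box`, `Char`, `Word`, and `get_words` are hypothetical names, and the sketch assumes the characters arrive in reading order with explicit whitespace characters between words.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Box:
    # Hypothetical bounding box in page coordinates.
    left: int
    top: int
    right: int
    bottom: int


@dataclass(frozen=True)
class Char:
    # Hypothetical stand-in for one character: its text and its box.
    char: str
    box: Box


@dataclass(frozen=True)
class Word:
    # The requested word model: text plus the box enclosing its characters.
    text: str
    box: Box


def union(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses every box in the list."""
    return Box(
        min(b.left for b in boxes),
        min(b.top for b in boxes),
        max(b.right for b in boxes),
        max(b.bottom for b in boxes),
    )


def to_word(chars: List[Char]) -> Word:
    """Join the characters' text and merge their boxes into one word."""
    return Word("".join(c.char for c in chars), union([c.box for c in chars]))


def get_words(chars: Iterable[Char]) -> List[Word]:
    """Split a reading-order character stream into words on whitespace."""
    words: List[Word] = []
    current: List[Char] = []
    for c in chars:
        if c.char.isspace():
            # Whitespace closes the word in progress, if there is one.
            if current:
                words.append(to_word(current))
                current = []
        else:
            current.append(c)
    if current:
        words.append(to_word(current))
    return words
```

Each word's box is simply the union of its character boxes, so callers get the same kind of location information they already get per character.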

Modest-as (Member) commented

GetWords is not supported mainly because PDF documents have no concept of words. We expose all the character information needed to do business-logic-specific clustering and similar processing. I am reluctant to add any sort of clustering to the core library itself because there will always be edge cases, whether from document formatting, text direction, or something else.
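
As an illustration of why such clustering is hard to get right in general, here is a minimal sketch of gap-based grouping, written in Python and not tied to the docnet API. `Glyph`, `cluster_by_gap`, and the `max_gap` threshold are hypothetical, and the sketch assumes a single horizontal, left-to-right line with no space characters in the stream.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Glyph:
    # Hypothetical character with only its horizontal extent.
    char: str
    left: float
    right: float


def cluster_by_gap(glyphs: List[Glyph], max_gap: float) -> List[str]:
    """Group glyphs on one left-to-right line into words.

    A new word starts whenever the horizontal gap between a glyph and the
    previous one exceeds max_gap.
    """
    words: List[str] = []
    current = ""
    prev_right: Optional[float] = None
    for g in sorted(glyphs, key=lambda g: g.left):
        if prev_right is not None and g.left - prev_right > max_gap:
            words.append(current)
            current = ""
        current += g.char
        prev_right = g.right
    if current:
        words.append(current)
    return words


# A letter-spaced "PDF" (gaps of 4) followed by a tightly set "ok" (gap of 1),
# with a gap of 8 between the two.
line = [
    Glyph("P", 0, 8), Glyph("D", 12, 20), Glyph("F", 24, 32),
    Glyph("o", 40, 46), Glyph("k", 47, 53),
]

tight = cluster_by_gap(line, max_gap=2)  # splits the letter-spaced heading
loose = cluster_by_gap(line, max_gap=5)  # keeps the heading together
```

The same glyphs yield different words depending on the threshold, and the sketch ignores vertical or right-to-left text entirely, which is the kind of document-specific decision that is arguably better left to the caller.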

Modest-as added the enhancement label Sep 22, 2021

talrand commented Sep 25, 2021

I also needed to get words and lines of text as they appear in the PDF.

I've created a small library to help with this: https://github.com/talrand/DocnetExtended
