Feature: support GetWords() #46

Banyc · 2021-07-26T09:34:54Z

Is your feature request related to a problem? Please describe.

We usually query a document word by word, but docnet only supports character-oriented queries. Character-oriented queries are great because users can build word-oriented queries on top of them. Still, I believe it would be better if this common requirement were implemented in the docnet package itself.

Describe the solution you'd like

Thus, there is a need for a GetWords function that returns a list of words. Each word model would carry its bounding box and its text, just like the character models returned by GetCharacters.
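To make the request concrete, here is a minimal sketch of the kind of grouping such a function might perform on top of per-character data. It is written in Python purely for illustration and is not docnet's API: the `Box`, `Char`, and `Word` types, the `group_words` function, and the `max_gap` threshold are all hypothetical names. It also assumes the characters arrive in left-to-right reading order and that `top <= bottom` in the coordinate system.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Box:
    """Hypothetical bounding box; assumes top <= bottom."""
    left: float
    top: float
    right: float
    bottom: float


@dataclass
class Char:
    """Hypothetical stand-in for a character with its location box."""
    text: str
    box: Box


@dataclass
class Word:
    """Hypothetical word model: text plus the union of its character boxes."""
    text: str
    box: Box


def group_words(chars: Iterable[Char], max_gap: float = 2.0) -> List[Word]:
    """Group characters (assumed to be in reading order) into words.

    A word ends at a whitespace character, when the horizontal gap to the
    next character exceeds max_gap, or when the next character starts to
    the left of the previous one (treated as a line break).
    """
    words: List[Word] = []
    current: List[Char] = []

    def flush() -> None:
        # Emit the pending characters as one word, then reset the buffer.
        if current:
            words.append(Word(
                text="".join(c.text for c in current),
                box=Box(
                    left=min(c.box.left for c in current),
                    top=min(c.box.top for c in current),
                    right=max(c.box.right for c in current),
                    bottom=max(c.box.bottom for c in current),
                ),
            ))
            current.clear()

    for ch in chars:
        if ch.text.isspace():
            flush()
            continue
        if current:
            prev = current[-1]
            gap = ch.box.left - prev.box.right
            if gap > max_gap or ch.box.left < prev.box.left:
                flush()
        current.append(ch)

    flush()
    return words
```

A simple gap threshold like this only handles horizontal, left-to-right text; rotated text, right-to-left scripts, and unusual spacing would need different rules.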



Modest-as · 2021-08-22T07:53:30Z

The main reason GetWords is not supported is that PDF documents have no concept of words. We expose all the character information one needs to do business-logic-specific clustering and so on. I am reluctant to add any sort of clustering to the core library itself because there will always be edge cases, whether from document formatting, text direction, or something else.

talrand · 2021-09-25T07:02:50Z

I also needed to get words and lines of text as they appear in the PDF.

I've created a small, simple library to help with this: https://github.com/talrand/DocnetExtended

Modest-as added the enhancement New feature or request label Sep 22, 2021
