`pdPageExtractText` should support multi-column documents #17

sambitdash · 2017-11-14T11:18:31Z

This implementation may need to be reviewed along with #2. Although the two may not overlap exactly, in some cases the implementation logic can be similar.

Nosferican · 2020-11-09T22:01:38Z

Is there any way to currently do this?

sambitdash · 2020-11-11T06:30:58Z

Not really. You can manually examine the position of every text run and check whether the runs form a column. The PDF specification does not provide any structural hints for this.
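The manual approach described above can be sketched roughly as follows. This is only an illustration in Python, not PDFIO code: `TextRun`, `split_columns`, `reading_order_text` and the `min_gap` threshold are hypothetical names, and it assumes you have already obtained a left edge, right edge and baseline for each text run from the page content.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical text run: text plus horizontal extent and baseline."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline in PDF coordinates (y grows upward)


def split_columns(runs, min_gap=20.0):
    """Group runs into columns separated by empty gutters >= min_gap wide."""
    if not runs:
        return []
    # Merge the horizontal extents of all runs into disjoint bands.
    intervals = sorted((r.x0, r.x1) for r in runs)
    bands = [list(intervals[0])]
    for x0, x1 in intervals[1:]:
        if x0 - bands[-1][1] < min_gap:
            # Overlaps or nearly touches the current band: extend it.
            bands[-1][1] = max(bands[-1][1], x1)
        else:
            # A wide enough gutter: start a new band (column).
            bands.append([x0, x1])
    # Assign each run to the band that contains it.
    columns = [[] for _ in bands]
    for r in runs:
        for i, (b0, b1) in enumerate(bands):
            if b0 <= r.x0 and r.x1 <= b1:
                columns[i].append(r)
                break
    # Within a column, read top to bottom, then left to right.
    for col in columns:
        col.sort(key=lambda r: (-r.y, r.x0))
    return columns


def reading_order_text(runs, min_gap=20.0):
    """Join run texts column by column, left column first."""
    return "\n".join(r.text for col in split_columns(runs, min_gap) for r in col)


runs = [
    TextRun("L1", 50, 250, 700),
    TextRun("R1", 300, 500, 700),
    TextRun("L2", 50, 240, 680),
    TextRun("R2", 300, 480, 680),
]
print(reading_order_text(runs))
```

A single full-width run, such as a page title, bridges the gutter and collapses everything into one column, so a real implementation would likely need to handle such runs separately.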

vargonis · 2022-11-18T11:28:55Z

On a related note, since by the nature of the format the output of pdPageExtractText is not fully determined, it would be useful to:

- Have access to character-level information (font, bounding box and so on).
- Document what the word inference and ordering heuristics are.

sambitdash · 2022-11-18T11:45:02Z

@vargonis you can use pdPageEvalContent and get the content tree. The content tree has all the bounding box information at a text run level.

sambitdash added the enhancement label Apr 6, 2018

sambitdash mentioned this issue Dec 22, 2019

Extract all Text Objects #83

Closed

sambitdash mentioned this issue Dec 26, 2023

problem extracting text on a two columns layout #112

Closed
