How to extract text and keep its semantics? #10073

Jun711 · 2018-09-13T18:54:20Z

I found an example of how to extract text in a Stack Overflow thread.

This is the example code linked in that thread.

But it just grabs all the text without keeping the semantics. Is there an API method provided by pdf.js to extract text semantically?

Thanks

timvandermeij · 2018-09-13T19:32:45Z

The getTextContent API (see https://github.com/mozilla/pdf.js/blob/master/examples/node/getinfo.js#L45 for a usage example) returns the text content of a single page, but nothing beyond that: it carries no semantic information. This is mainly because text in the PDF format is just a series of glyphs and positions, and in general no further information is included. Tagged PDFs are the exception; we don't support them yet, but we track that support in #6269.
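Since glyph positions are all that is available, the most you can usually do is approximate structure from them. Below is a minimal sketch, not something pdf.js provides: it groups the items returned by getTextContent into visual lines using the x/y offsets stored in each item's transform. The helper names (groupItemsIntoLines, extractLines), the 2-unit tolerance, and joining fragments with a single space are all assumptions made for illustration, and the heuristic will likely misbehave on multi-column or rotated text.

```javascript
// Heuristic sketch (not part of pdf.js): rebuild visual lines from text items.
// Each text item from page.getTextContent() has a `str` and a 6-element
// `transform`; entries 4 and 5 are the x and y offsets in PDF units.
function groupItemsIntoLines(items, yTolerance = 2) {
  const lines = [];
  for (const item of items) {
    // Skip empty or whitespace-only strings and entries without text.
    if (!item.str || !item.str.trim()) continue;
    const x = item.transform[4];
    const y = item.transform[5];
    // Assumption: items whose baselines differ by <= yTolerance share a line.
    let line = lines.find((l) => Math.abs(l.y - y) <= yTolerance);
    if (!line) {
      line = { y, parts: [] };
      lines.push(line);
    }
    line.parts.push({ x, str: item.str });
  }
  // PDF y coordinates grow upward, so a larger y is higher on the page.
  lines.sort((a, b) => b.y - a.y);
  return lines.map((l) =>
    l.parts
      .sort((a, b) => a.x - b.x) // left to right within a line
      .map((p) => p.str)
      .join(" ")
  );
}

// Hypothetical wrapper: `pdfDocument` is an already loaded pdf.js document.
async function extractLines(pdfDocument, pageNumber) {
  const page = await pdfDocument.getPage(pageNumber);
  const { items } = await page.getTextContent();
  return groupItemsIntoLines(items);
}
```

This only recovers line grouping. Headings, paragraphs, and tables would need further guesswork (font size, spacing), because the file itself does not record them unless it is tagged.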

timvandermeij closed this as completed Sep 13, 2018

timvandermeij added the other label Sep 13, 2018
