How to extract text and keep their semantics? #10073

Closed
Jun711 opened this issue Sep 13, 2018 · 1 comment

Jun711 commented Sep 13, 2018

I found an example of how to extract text in a StackOverflow thread.

This is the example code linked in the thread.

However, it just grabs all the text without keeping the semantics. Is there an API method provided by pdf.js to extract text semantically?

Thanks

@timvandermeij
Contributor

The getTextContent API (see https://github.com/mozilla/pdf.js/blob/master/examples/node/getinfo.js#L45 for a usage example) only gives you the text content of a single page, with no further semantics. This is mainly because, in the PDF format, text is just a series of glyphs and positions, and in general no more information is included. The exception is tagged PDFs, which we don't support yet; support for them is tracked in #6269.
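
Since the items returned by getTextContent carry only strings and positions, any structure has to be inferred by the caller. Below is a minimal sketch of one such heuristic, which groups items into visual lines by their vertical position. The helper name `groupItemsIntoLines`, the `yTolerance` parameter, and the sample items are hypothetical and not part of pdf.js. The sketch assumes each item has a `str` field and a six-element `transform` array whose last two entries are the x and y position.

```javascript
// Hypothetical helper, not a pdf.js API: groups text items into visual lines.
// Assumption: each item looks like { str, transform: [a, b, c, d, x, y] }.
function groupItemsIntoLines(items, yTolerance = 2) {
  // PDF coordinates start at the bottom-left, so a larger y is higher on the page.
  const byY = [...items].sort((a, b) => b.transform[5] - a.transform[5]);

  const lines = [];
  for (const item of byY) {
    const y = item.transform[5];
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - y) <= yTolerance) {
      // Close enough vertically to the current line: treat it as the same line.
      last.items.push(item);
    } else {
      lines.push({ y, items: [item] });
    }
  }

  // Within each line, order the items left to right and join their strings.
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((item) => item.str)
      .join(" ")
  );
}

// Made-up sample items standing in for textContent.items.
const sampleItems = [
  { str: "world", transform: [12, 0, 0, 12, 110, 700] },
  { str: "Hello", transform: [12, 0, 0, 12, 72, 700.5] },
  { str: "Second line", transform: [12, 0, 0, 12, 72, 684] },
];

console.log(groupItemsIntoLines(sampleItems));
```

This recovers only line breaks, not headings, paragraphs, or tables. It will also misgroup rotated text and multi-column layouts, and joining with a single space can introduce spacing that was not in the original.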
