Request for examples in the documentation #38

Open
johannspies opened this issue Nov 21, 2018 · 2 comments
Comments

@johannspies

I am struggling to figure out how to use this library to read a PDF as text for Natural Language Processing, as an alternative to

using Taro
Taro.init()
meta, txtdata = Taro.extract(files[1]);

as shown in
https://github.com/aviks/nlp-workshop/blob/master/NLP-in-julia.ipynb

Or can I not use this library instead of Taro (which I cannot compile on Julia 1.0.2)?

@sambitdash
Owner

PDFIO is a somewhat lower-level API than Taro in this respect. It processes a PDF one page at a time, so you may need a few extra lines of code. The piece of code you are looking for is the following:

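using PDFIO

# Extract the text of every page of the PDF at `src` into the file `out`.
# Returns the document info (metadata) dictionary.
function getPDFText(src, out)
    doc = pdDocOpen(src)
    docinfo = pdDocGetInfo(doc)
    open(out, "w") do io
        npage = pdDocGetPageCount(doc)
        for i = 1:npage
            # Pages are handled one at a time
            page = pdDocGetPage(doc, i)
            pdPageExtractText(io, page)
        end
    end
    pdDocClose(doc)
    return docinfo
end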
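If you would rather have the text in memory, the way `Taro.extract` returns it, a variation along these lines may work. This is only a sketch: `extract_pdf_text` is a hypothetical helper name, not part of PDFIO, and it assumes `pdPageExtractText` will write to any `IO`, including an `IOBuffer`. The `try`/`finally` is added so the document is closed even if extraction fails.

```julia
using PDFIO

# Hypothetical helper (not part of PDFIO): returns (docinfo, text) for the PDF at `src`.
function extract_pdf_text(src::AbstractString)
    doc = pdDocOpen(src)
    try
        docinfo = pdDocGetInfo(doc)
        buf = IOBuffer()                      # in-memory sink instead of a file
        for i in 1:pdDocGetPageCount(doc)
            page = pdDocGetPage(doc, i)
            pdPageExtractText(buf, page)      # assumption: accepts any IO
        end
        return docinfo, String(take!(buf))
    finally
        pdDocClose(doc)                       # always release the document
    end
end

# Usage (the path is a placeholder):
# meta, txtdata = extract_pdf_text("sample.pdf")
```

The returned string can then be passed to whatever text-processing package you use.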

If you still face any issues or challenges with the code, please let us know so that we can try to address them.

The library is kept very flexible so that PDF objects can be queried in detail. A summary-level API or samples would definitely help someone get quick tasks done as well. We will keep that in mind and add a few examples and samples to the documentation.

@johannspies
Author

Thanks! That helps.
