Extract information from bytes #300

Open
asciidiego opened this issue Aug 25, 2019 · 6 comments

Comments

@asciidiego

I have downloaded a PDF, but it is not saved as a file yet. How can I use textract to extract the text without actually saving the file?

@jpweytjens
Contributor

What do you mean by "downloaded, but not saved as a file yet"?

Textract requires that you specify the path to the PDF file. So far I have only parsed files that have been saved locally. You might try some of the ideas here, but I don't completely understand what you're trying to do.

@asciidiego
Author

I get the PDFs from an HTTP response. With the body as bytes, I should be able to extract the text from the bytes alone. I do not think it should be necessary to save the PDF as a file, parse it to extract the text, and then delete the created file, when the content was already in memory as a Python variable.

@jpweytjens
Contributor

Currently, textract does not support streams. See also #85, #97 and #99. Perhaps this might be able to help you while we work on support for streams.

@multinucliated

Any progress on byte stream support (file.read())? Or can you suggest another way around this?

@shzy2012

shzy2012 commented Jul 6, 2021

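import tempfile
import textract

# f is an already-open binary file-like object holding the PDF bytes
with tempfile.NamedTemporaryFile(delete=True) as temp:
    temp.write(f.read())
    temp.flush()  # make sure the bytes are on disk before textract opens the file by name
    context = textract.process(temp.name, encoding='utf-8', extension=".pdf")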
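
For bytes that are already in memory (for example an HTTP response body), the same temporary-file idea can be wrapped in a small helper. This is only a sketch under some assumptions: the name `extract_text_from_bytes` and its `process` parameter are made up for illustration and are not part of textract, and the helper relies on `textract.process` accepting a file path as its first argument, as in the snippet above. It closes the temporary file before parsing because a `NamedTemporaryFile` that is still open cannot be reopened by name on Windows.

```python
import os
import tempfile


def extract_text_from_bytes(data, extension=".pdf", process=None):
    """Write in-memory bytes to a temporary file and run a path-based parser on it.

    `process` is a hypothetical hook: any callable that takes a file path.
    When omitted, it falls back to textract.process.
    """
    if process is None:
        import textract  # imported lazily so the helper can be exercised without textract
        process = textract.process

    # delete=False lets us close the file before the parser opens it by name;
    # the suffix gives the temporary file the expected extension.
    temp = tempfile.NamedTemporaryFile(suffix=extension, delete=False)
    try:
        temp.write(data)
        temp.close()  # flushes the data to disk and releases the handle
        return process(temp.name)
    finally:
        temp.close()  # harmless if already closed
        os.unlink(temp.name)  # always remove the temporary file
```

With a `requests`-style response object (an assumption, it does not appear in this thread), the call would look like `extract_text_from_bytes(response.content)`. The file still touches disk briefly, so this is a workaround for the missing stream support rather than true in-memory parsing.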

@uxtt2000

uxtt2000 commented Apr 8, 2023

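import tempfile
import textract

# f is an already-open binary file-like object holding the PDF bytes
with tempfile.NamedTemporaryFile(delete=True) as temp:
    temp.write(f.read())
    temp.flush()  # make sure the bytes are on disk before textract opens the file by name
    context = textract.process(temp.name, encoding='utf-8', extension=".pdf")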

That's the solution. Works like a charm and works in the cloud in a stateless function without any persistent filesystem access!
Thanks @shzy2012!
@jpweytjens: Maybe put this workaround in the docs while streams are not yet supported, as it's really good for cloud-based usage.
Thanks
