PIL.UnidentifiedImageError #239

Open
camipozas opened this issue Aug 24, 2022 · 4 comments

@camipozas
Contributor

Describe the bug
The code behaves differently on my computer than on an AWS EC2 m5.xlarge instance.

Expected behavior
Both environments should behave the same. It works on my computer, but when I run it on the instance it cannot find the images.

AWS Log

Process Process-1:
Traceback (most recent call last):
  File "/opt/build/app/read_contracts.py", line 67, in read_contracts
    text_contract = read_pdf(filepath)
  File "/opt/build/app/read_contracts.py", line 27, in read_pdf
    images_from_path = convert_from_path(pdf_path=pdf,
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 218, in convert_from_path
    images += _load_from_output_folder(
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 517, in _load_from_output_folder
    images.append(Image.open(os.path.join(output_folder, f)))
  File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 3123, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpqo3mn0om/2d473b9f-5b6c-46f0-9220-a4bf51124f6e-03.ppm'

Desktop:

  • OS: Ubuntu, m5.xlarge instance.
  • Version: 22.04

Additional context

Function error

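import tempfile

from pdf2image import convert_from_path
from tqdm import tqdm

# NOTE: image_to_text is not defined in this snippet; it is assumed to be
# an OCR helper defined elsewhere in read_contracts.py.


def read_pdf(pdf):
    """
    It takes a pdf file, converts it to images, and then converts those images to text
    :param pdf: The path to the PDF file you want to convert
    :return: A string with the text of the pdf
    """
    full_text = ''
    # The page images are written to a temporary folder that is deleted on exit,
    # so all OCR work must happen inside this block.
    with tempfile.TemporaryDirectory() as path:
        images_from_path = convert_from_path(pdf_path=pdf,
                                             dpi=350,
                                             output_folder=path)

        for page in tqdm(images_from_path):
            full_text += image_to_text(page, lang='spa')
    return full_text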

I printed the filenames to see if it was a path issue, but they display correctly. I am also using multiprocessing; again, it works locally but not on the instance.
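Since the filenames look right, one thing that could be checked next is whether the files themselves are intact. The sketch below is only a diagnostic idea, not a confirmed cause from this thread: it assumes that an empty .ppm file (for example, if the temporary folder ran out of space) is one way Pillow could fail to identify an image. The helper names `estimate_ppm_bytes`, `find_empty_files`, and `free_bytes` are hypothetical and use only the standard library.

```python
import os
import shutil
import tempfile


def estimate_ppm_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Approximate size of one uncompressed 8-bit RGB PPM page (header ignored)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3  # 3 bytes per pixel


def find_empty_files(folder: str) -> list:
    """Return the sorted names of zero-byte files in folder."""
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
        and os.path.getsize(os.path.join(folder, name)) == 0
    )


def free_bytes(folder: str) -> int:
    """Free space on the filesystem that holds folder."""
    return shutil.disk_usage(folder).free


if __name__ == "__main__":
    tmp = tempfile.gettempdir()
    # Assumption for illustration: US Letter pages (8.5 x 11 in) at the 350 dpi used above.
    per_page = estimate_ppm_bytes(8.5, 11, 350)
    print(f"Estimated bytes per page: {per_page}")
    print(f"Free bytes in {tmp}: {free_bytes(tmp)}")
```

Calling `find_empty_files(path)` inside the `with tempfile.TemporaryDirectory()` block, before the folder is cleaned up, would show whether any page file was written with no content. With several worker processes converting at once, the estimated per-page size multiplies by the number of pages in flight.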

@camipozas
Contributor Author

@jedwards94

@Belval
Owner

Belval commented Sep 3, 2022

Is this only happening with a single PDF? If you run pdftoppm -r 200 -jpeg your_file.pdf out, does it show any warnings?
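To run the same check from inside the application environment (for example on the EC2 instance), the command above can be wrapped in a small script that captures whatever pdftoppm prints to stderr. This is a minimal sketch, assuming pdftoppm is on the PATH; the function names `build_pdftoppm_command` and `run_pdftoppm` are hypothetical, and the flags are exactly those from the command above.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path, out_prefix, dpi=200):
    """Build the argument list for: pdftoppm -r <dpi> -jpeg <pdf> <out_prefix>."""
    return ["pdftoppm", "-r", str(dpi), "-jpeg", pdf_path, out_prefix]


def run_pdftoppm(pdf_path, out_prefix, dpi=200):
    """Run pdftoppm and return (exit code, stderr text) so warnings can be inspected."""
    if shutil.which("pdftoppm") is None:
        raise FileNotFoundError("pdftoppm was not found on PATH")
    result = subprocess.run(
        build_pdftoppm_command(pdf_path, out_prefix, dpi),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is reported, not raised
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        code, warnings = run_pdftoppm(sys.argv[1], "out")
        print(f"exit code: {code}")
        print(warnings or "(no warnings on stderr)")
    else:
        print("usage: python check_pdf.py your_file.pdf")
```

Running it once per PDF would also help answer whether the failure is tied to a single file.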

@asanaa8

asanaa8 commented Mar 8, 2023

Same error as @camipozas.

@camipozas
Contributor Author

@asanaa8 I fixed with this
