PIL.UnidentifiedImageError #239

camipozas · 2022-08-24T18:12:09Z

Describe the bug
The same code behaves differently on my local computer and on an AWS EC2 m5.xlarge instance.

Expected behavior
Both environments should behave the same. The code works on my computer, but when I run it on the instance it fails to open the converted images.

AWS Log

Process Process-1:
Traceback (most recent call last):
  File "/opt/build/app/read_contracts.py", line 67, in read_contracts
    text_contract = read_pdf(filepath)
  File "/opt/build/app/read_contracts.py", line 27, in read_pdf
    images_from_path = convert_from_path(pdf_path=pdf,
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 218, in convert_from_path
    images += _load_from_output_folder(
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 517, in _load_from_output_folder
    images.append(Image.open(os.path.join(output_folder, f)))
  File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 3123, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpqo3mn0om/2d473b9f-5b6c-46f0-9220-a4bf51124f6e-03.ppm'

Desktop

OS: Ubuntu (AWS EC2 m5.xlarge instance)
Version: 22.04

Additional context

Function that raises the error

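import tempfile

from pdf2image import convert_from_path
from tqdm import tqdm

# image_to_text is the reporter's own OCR helper; its definition is not shown in the issue.


def read_pdf(pdf):
    """
    Convert a PDF file to images, then run OCR on each image.
    :param pdf: The path to the PDF file you want to convert
    :return: A string with the text of the PDF
    """
    full_text = ''
    with tempfile.TemporaryDirectory() as path:
        # Pages are written to the temporary folder and opened from there.
        images_from_path = convert_from_path(pdf_path=pdf,
                                             dpi=350,
                                             output_folder=path)

        # OCR must happen inside the with block, before the folder is deleted.
        for page in tqdm(images_from_path):
            full_text += image_to_text(page, lang='spa')
    return full_text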

I printed the filenames to check whether it was a path issue, but they are displayed correctly. I am also using multiprocessing; again, it works locally but not on the instance.
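One thing that may be worth ruling out (an assumption, not something confirmed in this thread): at dpi=350 the default PPM output is uncompressed, so each page takes tens of megabytes in the temporary folder. A file truncated because the disk filled up would be unreadable by PIL. The sketch below uses only the standard library to estimate the space needed and compare it with the free space in the temp directory. The function names and the default US Letter page size are hypothetical and chosen only for illustration.

```python
import shutil
import tempfile
from typing import Optional


def estimate_ppm_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Approximate size of one uncompressed 8-bit RGB PPM page (header ignored)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3  # 3 bytes per pixel (R, G, B)


def has_room_for_pages(n_pages: int, dpi: int, folder: Optional[str] = None,
                       width_in: float = 8.5, height_in: float = 11.0) -> bool:
    """Return True if `folder` has enough free space for n_pages PPM files.

    The page size defaults to US Letter, which is an assumption; real PDFs
    may use other page sizes.
    """
    folder = folder or tempfile.gettempdir()
    needed = n_pages * estimate_ppm_bytes(width_in, height_in, dpi)
    free = shutil.disk_usage(folder).free
    return free >= needed


if __name__ == "__main__":
    per_page = estimate_ppm_bytes(8.5, 11.0, 350)
    print(f"One Letter page at 350 dpi: about {per_page / 1e6:.1f} MB")
    print("Room for 100 pages:", has_room_for_pages(100, 350))
```

When several worker processes convert PDFs at the same time, the estimate should be multiplied by the number of concurrent workers, since each one writes its own pages to the temp directory.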

camipozas · 2022-08-26T13:42:31Z

@jedwards94

Belval · 2022-09-03T19:10:12Z

Is this only happening with a single PDF? If you run pdftoppm -r 200 -jpeg your_file.pdf out does it show any warnings?
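The check suggested above can also be run from Python, which is convenient on a remote instance where the output should end up in the application log. This is a minimal sketch: the helper names are hypothetical, and it assumes pdftoppm (from poppler-utils) is on the PATH and reports its warnings on stderr.

```python
import shutil
import subprocess
from typing import List, Tuple


def build_pdftoppm_cmd(pdf_path: str, out_prefix: str = "out",
                       dpi: int = 200) -> List[str]:
    """Build the same command line as suggested in the comment above."""
    return ["pdftoppm", "-r", str(dpi), "-jpeg", pdf_path, out_prefix]


def run_pdftoppm(pdf_path: str, out_prefix: str = "out",
                 dpi: int = 200) -> Tuple[int, str]:
    """Run pdftoppm and return (exit code, captured stderr text)."""
    if shutil.which("pdftoppm") is None:
        raise FileNotFoundError("pdftoppm not found on PATH (install poppler-utils)")
    result = subprocess.run(
        build_pdftoppm_cmd(pdf_path, out_prefix, dpi),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    # "your_file.pdf" is a placeholder, as in the comment above.
    code, warnings = run_pdftoppm("your_file.pdf")
    print("exit code:", code)
    print(warnings or "no warnings on stderr")
```

Running it on each PDF that fails, both locally and on the instance, would show whether the problem is tied to one file or to the environment.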

asanaa8 · 2023-03-08T17:10:19Z

same error as @camipozas

camipozas · 2023-03-08T17:28:10Z

@asanaa8 I fixed with this
