page.images is empty #449

benbro · 2023-02-20T16:49:22Z

I'm trying to resize only images with large resolution in a document. I've been told that the attached document has such images.
page.images doesn't show me any image. Am I doing something wrong?

test.pdf

from pikepdf import Pdf, PdfImage, Name
doc = Pdf.open('test.pdf')
page = doc.pages[0]
print(list(page.images.keys()))

mara004 · 2023-02-20T17:09:49Z

I think the image is nested in a Form XObject, which is not handled by the .images accessor.
~~But apart from that, the image is not visible on the page and doesn't have high resolution AFAICS.~~

benbro · 2023-02-20T17:20:20Z

Thanks @mara004

What is the correct way to access all images in a document?
I've tried this:

from pikepdf import Pdf, PdfImage, Name
doc = Pdf.open('test.pdf')
for object in doc.objects:
  print('object')
  if getattr(object, "Type", None) == "/XObject" and getattr(object, "Subtype", None) == "/Image":
    print('image')

But getting an error:

Traceback (most recent call last):
    if getattr(object, "Type", None) == "/XObject" and getattr(object, "Subtype", None) == "/Image":
ValueError: pikepdf.Object is not a Dictionary or Stream

benbro · 2023-02-20T17:22:20Z

@mara004 according to this issue mozilla/pdf.js#16073 (comment), 16 0 obj is an image with huge dimensions. Is this image invisible?

mara004 · 2023-02-20T17:26:28Z

There have been various reports about images nested in XObjects in the past. Maybe see #423 (comment)

mara004 · 2023-02-20T17:27:51Z

> @mara004 according to this issue mozilla/pdf.js#16073 (comment) 16 0 obj is an image with huge dimensions. Is this image invisible?

According to PDFium, your actual image is 2x2 pixels, i.e. extremely small (though it is displayed differently).

$ pypdfium2 pageobjects test.pdf --filter image
# Page 1
    image
        Position: (1.036, 50.3844, 77.2091, 100.9194)
        Filters: []
        width: 2
        height: 2
        horizontal_dpi: 1.8904294967651367
        vertical_dpi: 2.8495075702667236
        bits_per_pixel: 1
        colorspace: Indexed
-> Count: 1

-> Total count: 1

mara004 · 2023-02-20T17:29:37Z

And this is the rendered image (again using PDFium for simplicity):

$ pypdfium2 extract-images test.pdf -o out/ --use-bitmap --render

test_1.png

benbro · 2023-02-20T17:34:35Z

This is what I'm getting when using _find_images(page) from #423 (comment)
If I understand it correctly, there is a 2x2 image with a large SMask.
Could the image and/or the smask cause the issue?

[[<pikepdf.Stream(owner=<...>, data=<...>, {
    "/BitsPerComponent": 1,
    "/ColorSpace": [ "/Indexed", "/DeviceRGB", 1, "ÿÿÿ" ],
    "/Height": 2,
    "/Interpolate": False,
    "/Length": 2,
    "/SMask": pikepdf.Stream(owner=<...>, data=<...>, {
        "/BitsPerComponent": 1,
        "/ColorSpace": "/DeviceGray",
        "/Filter": "/FlateDecode",
        "/Height": 4332,
        "/Length": 38424,
        "/Subtype": "/Image",
        "/Type": "/XObject",
        "/Width": 34862
      }),
    "/Subtype": "/Image",
    "/Type": "/XObject",
    "/Width": 2
  })>, <pikepdf.Stream(owner=<...>, data=<...>, {
    "/BitsPerComponent": 1,
    "/ColorSpace": "/DeviceGray",
    "/Filter": "/FlateDecode",
    "/Height": 4332,
    "/Length": 38424,
    "/Subtype": "/Image",
    "/Type": "/XObject",
    "/Width": 34862
  })>]]
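In the dump above the base image is only 2x2, while its `/SMask` is 34862x4332, so a size check has to look at the soft mask as well as the image itself. A rough sketch of such a check; the helper names and the 25-megapixel threshold are arbitrary assumptions. It only relies on `.get()`, so it accepts pikepdf streams as well as plain dicts:

```python
def pixel_count(image):
    """Width * height of an image dictionary (0 if a key is missing)."""
    return int(image.get("/Width", 0)) * int(image.get("/Height", 0))


def largest_pixel_count(image):
    """Pixel count of the image or of its soft mask, whichever is larger."""
    smask = image.get("/SMask")
    smask_pixels = pixel_count(smask) if smask is not None else 0
    return max(pixel_count(image), smask_pixels)


def is_oversized(image, max_pixels=25_000_000):
    # max_pixels is a made-up threshold; tune it for your documents.
    return largest_pixel_count(image) > max_pixels
```

For the image shown above, the soft mask alone accounts for 34862 * 4332 = 151,022,184 pixels.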

mara004 · 2023-02-20T18:52:15Z

Ah, the image is the three arrows, right? Then it's actually visible. I was confused by the coordinates, but it makes sense if they're relative to the Form XObject.

benbro · 2023-02-20T18:54:03Z

Is there a way to detect large images and resize them? What should I check in the images returned from _find_images()?

mara004 · 2023-02-20T19:03:38Z

What do you mean by resizing? Making the image visually smaller, or downsampling it?

mara004 · 2023-02-20T19:05:11Z

In either case, see https://pikepdf.readthedocs.io/en/latest/topics/images.html#replacing-an-image

benbro · 2023-02-20T19:07:15Z

How do I create a PdfImage from the list returned from find_images in #423 (comment)?

mara004 · 2023-02-20T19:09:03Z

Just PdfImage(raw_image) I suppose

benbro · 2023-02-20T19:10:51Z

Thanks, I'll try to downsample and replace the images.

mnmtz · 2023-07-06T09:33:32Z

I have the same issue in different PDF files, where find_images from #423 finds more images than page.images.
Examples are (find_images | page.images):

example_120.pdf
11 images | 9 images

example_043.pdf
13 images | 0 images

(external link because of the size limit)
https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5bUvv?culture=en-us&country=us
120 images | 39 images

example_063.pdf
63 images | 27 images
