page.images is empty #449

Open
benbro opened this issue Feb 20, 2023 · 15 comments

Comments


benbro commented Feb 20, 2023

I'm trying to resize only the high-resolution images in a document. I've been told that the attached document contains such images.
page.images doesn't show me any image. Am I doing something wrong?

test.pdf

from pikepdf import Pdf, PdfImage, Name
doc = Pdf.open('test.pdf')
page = doc.pages[0]
print(list(page.images.keys()))
Contributor

mara004 commented Feb 20, 2023

I think the image is nested in a Form XObject, which is not handled by the .images accessor.
But apart from that, the image is not visible on the page and doesn't have high resolution AFAICS.

Author

benbro commented Feb 20, 2023

Thanks @mara004

What is the correct way to access all images in a document?
I've tried this:

from pikepdf import Pdf, PdfImage, Name
doc = Pdf.open('test.pdf')
for object in doc.objects:
  print('object')
  if getattr(object, "Type", None) == "/XObject" and getattr(object, "Subtype", None) == "/Image":
    print('image')

But I'm getting an error:

Traceback (most recent call last):
    if getattr(object, "Type", None) == "/XObject" and getattr(object, "Subtype", None) == "/Image":
ValueError: pikepdf.Object is not a Dictionary or Stream

Author

benbro commented Feb 20, 2023

@mara004 according to mozilla/pdf.js#16073 (comment), 16 0 obj is an image with huge dimensions. Is this image invisible?

Contributor

mara004 commented Feb 20, 2023

There have been various reports about images nested in XObjects in the past. Maybe see #423 (comment)

Contributor

mara004 commented Feb 20, 2023

> @mara004 according to this issue mozilla/pdf.js#16073 (comment) 16 0 obj is an image with huge dimensions. Is this image invisible?

According to PDFium, your actual image is 2x2 pixels, i.e. extremely small (though it is displayed differently).

$ pypdfium2 pageobjects test.pdf --filter image
# Page 1
    image
        Position: (1.036, 50.3844, 77.2091, 100.9194)
        Filters: []
        width: 2
        height: 2
        horizontal_dpi: 1.8904294967651367
        vertical_dpi: 2.8495075702667236
        bits_per_pixel: 1
        colorspace: Indexed
-> Count: 1

-> Total count: 1
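
The DPI figures in this output can be reproduced from the pixel size and the position box. The sketch below assumes that `Position` is (left, bottom, right, top) in PDF points, with 72 points per inch; `effective_dpi` is a hypothetical helper name.

```python
POINTS_PER_INCH = 72.0


def effective_dpi(pixels, extent_pt):
    """Pixels per inch when `pixels` samples are spread over `extent_pt` points."""
    return pixels / (extent_pt / POINTS_PER_INCH)


# Values copied from the pypdfium2 output above.
left, bottom, right, top = 1.036, 50.3844, 77.2091, 100.9194
h_dpi = effective_dpi(2, right - left)   # approximately 1.8904
v_dpi = effective_dpi(2, top - bottom)   # approximately 2.8495
```

Both values agree with the reported `horizontal_dpi` and `vertical_dpi` to several decimal places, which is what supports the assumed meaning of the box.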

Contributor

mara004 commented Feb 20, 2023

And this is the rendered image (again using PDFium for simplicity):

$ pypdfium2 extract-images test.pdf -o out/ --use-bitmap --render

test_1.png

Author

benbro commented Feb 20, 2023

This is what I get when using _find_images(page) from #423 (comment).
If I understand it correctly, there is a 2x2 image with a large SMask.
Could the image and/or the SMask cause the issue?

[[<pikepdf.Stream(owner=<...>, data=<...>, {
    "/BitsPerComponent": 1,
    "/ColorSpace": [ "/Indexed", "/DeviceRGB", 1, "ÿÿÿ" ],
    "/Height": 2,
    "/Interpolate": False,
    "/Length": 2,
    "/SMask": pikepdf.Stream(owner=<...>, data=<...>, {
        "/BitsPerComponent": 1,
        "/ColorSpace": "/DeviceGray",
        "/Filter": "/FlateDecode",
        "/Height": 4332,
        "/Length": 38424,
        "/Subtype": "/Image",
        "/Type": "/XObject",
        "/Width": 34862
      }),
    "/Subtype": "/Image",
    "/Type": "/XObject",
    "/Width": 2
  })>, <pikepdf.Stream(owner=<...>, data=<...>, {
    "/BitsPerComponent": 1,
    "/ColorSpace": "/DeviceGray",
    "/Filter": "/FlateDecode",
    "/Height": 4332,
    "/Length": 38424,
    "/Subtype": "/Image",
    "/Type": "/XObject",
    "/Width": 34862
  })>]]

Contributor

mara004 commented Feb 20, 2023

Ah, the image is the three arrows, right? Then it's actually visible. I was confused by the coordinates, but it makes sense if they're relative to the Form XObject.

Author

benbro commented Feb 20, 2023

Is there a way to detect large images and resize them? What should I check in the images returned from _find_images()?

Contributor

mara004 commented Feb 20, 2023

What do you mean by resizing? Making the image visually smaller, or downsampling it?

Contributor

mara004 commented Feb 20, 2023

Author

benbro commented Feb 20, 2023

How do I create a PdfImage from the list returned from find_images in #423 (comment)?

Contributor

mara004 commented Feb 20, 2023

Just PdfImage(raw_image) I suppose

Author

benbro commented Feb 20, 2023

Thanks, I'll try to downsample and replace the images.


mnmtz commented Jul 6, 2023

I have the same issue in different PDF files, where find_images from #423 finds more images than page.images.
Examples are (find_images | page.images):

example_120.pdf
11 images | 9 images

example_043.pdf
13 images | 0 images

(external link because of the size limit)
https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5bUvv?culture=en-us&country=us
120 images | 39 images

example_063.pdf
63 images | 27 images
