PdfImage unable to handle image type #464

Open
sim0nx opened this issue Mar 23, 2023 · 3 comments

Comments

sim0nx commented Mar 23, 2023

I am trying to parse various PDF documents and came across one that raises an exception when I try to extract its images.

  File ".../python3.9/site-packages/pikepdf/models/image.py", line 665, in extract_to
    return self._extract_to_stream(stream=stream)
  File ".../python3.9/site-packages/pikepdf/models/image.py", line 611, in _extract_to_stream
    im = self._extract_transcoded()
  File ".../python3.9/site-packages/pikepdf/models/image.py", line 564, in _extract_transcoded
    if self.mode in {'DeviceN', 'Separation'}:
  File ".../python3.9/site-packages/pikepdf/models/image.py", line 270, in mode
    raise NotImplementedError(
NotImplementedError: Not sure how to handle PDF image of this type

The document in question (https://impotsdirects.public.lu/dam-assets/fr/formulaires/pers_physiques/2022/100d-2022.pdf) contains an XFA form and 1+ images.

The properties of the PdfImage in question are as follows:

MAIN_COLORSPACES = {set: 7} {'/CalRGB', '/ICCBased', '/DeviceRGB', '/DeviceCMYK', '/CalCMYK', '/CalGray', '/DeviceGray'}
PRINT_COLORSPACES = {set: 2} {'/DeviceN', '/Separation'}
SIMPLE_COLORSPACES = {set: 4} {'/CalRGB', '/DeviceRGB', '/DeviceGray', '/CalGray'}
bits_per_component = {int} 1
colorspace = {NoneType} None
decode_parms = {list: 0} []
filter_decodeparms = {list: 1} [('/FlateDecode', {})]
filters = {list: 1} ['/FlateDecode']
height = {int} 16
icc = {NoneType} None
image_mask = {bool} True
indexed = {bool} False
is_device_n = {bool} False
is_separation = {bool} False
palette = {NoneType} None
size = {tuple: 2} (16, 16)
width = {int} 16

Would it be possible to implement support for this?

mara004 (Contributor) commented Mar 23, 2023

It looks like your image's colorspace is None, so pikepdf doesn't know how to handle it.
On which page in that long document is this image, anyway?

sim0nx (Author) commented Mar 24, 2023

I am not sure exactly which image or path it is, and I don't know how I would find that out.
The first object-id/generation that is affected is (3,0). It seems like all /Image objects have that same issue for this particular PDF.

I guess setting a color space manually will not work?

jbarlow83 (Member) commented

The image in question is actually a transparency mask that is involved in rendering some other image or some other feature. If you explore the structure of the PDF you may be able to learn how the image is being used.

As @sim0nx suggests, assigning a colorspace of DeviceGray and setting ImageMask to False would allow you to treat the mask as a binary image and export it, as a workaround.

In the next release I will improve support for exporting masks.
