Page rot metadata and size param interact incorrectly in convert_from_path() #272

Open
Crowfunder opened this issue Nov 6, 2023 · 5 comments

Crowfunder commented Nov 6, 2023

Describe the bug
Converting a PDF with the size param, when the PDF's "Page rot" rotation metadata changes its original orientation (90, 270, etc.), forces the scanned pages onto e.g. a horizontal template even though the document is vertical. Any PDF viewer displays the PDF correctly, as a vertical one. As a result, half of the page is cut off and the remainder is squished.

To Reproduce
Steps to reproduce the behavior:

import cv2
import numpy as np
from pdf2image import convert_from_path, pdfinfo_from_path

pdf_path = 'our pdf path'

# Return PDF rotation from its metadata
rotation = pdfinfo_from_path(pdf_path)['Page rot']
print(f'PDF rotation: {rotation}')

# Get the pdf pages' images
images = convert_from_path(pdf_path, 600, size=(1653, 2338))

# Write all page images to files
for i, image in enumerate(images, start=1):
    # PIL images are RGB, while cv2.imwrite expects BGR channel order
    cv2.imwrite(f'page{i}.jpg', cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR))

Expected behavior
Rotation metadata and size param get applied correctly.

Screenshots
An example page from a pdf with rotation

Desktop (please complete the following information):

  • OS: Debian WSL on Win10
  • Version 22

Notes:
I'm well aware that it's probably an issue with Poppler, not with pdf2image, but there may be some workaround, or some info may be gathered here for a Poppler issue.

Theoretically the issue will be resolved if the rotation gets applied to the file permanently, instead of being embedded in metadata.
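
One possible stopgap, sketched below, is to read the "Page rot" value first and swap the requested width and height when the page is rotated by a quarter turn. This is an untested guess at a workaround, not a confirmed fix: the helper name `size_for_rotation` is hypothetical, and it assumes the size tuple is being applied to the unrotated page.

```python
def size_for_rotation(size, rotation):
    """Swap (width, height) when the page is rotated by 90 or 270 degrees.

    Hypothetical helper: assumes the requested size is applied before the
    "Page rot" rotation, which has not been confirmed for pdf2image/Poppler.
    """
    width, height = size
    if int(rotation) % 180 == 90:
        return (height, width)
    return (width, height)


# A page with "Page rot" of 90 gets the dimensions swapped.
print(size_for_rotation((1653, 2338), 90))
# A page with no rotation keeps the requested size.
print(size_for_rotation((1653, 2338), 0))
```

The result could then be passed as `size=` to convert_from_path(), using the rotation read from pdfinfo_from_path(pdf_path)['Page rot'].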

@Crowfunder Crowfunder changed the title Page rot metadata appresolved incorrectly in convert_from_path() Page rot metadata and size param interact incorrectly in convert_from_path() Nov 6, 2023
Owner

Belval commented Nov 6, 2023

Could you try to manually run poppler on the asset? Something like:

pdftoppm -r 200 your_asset.pdf out

As you pointed out this might be an issue with poppler but I'd like to confirm first. You can also try to use pdftocairo and see if the orientation is correct in that case.
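
For running both tools the same way, a small sketch like the following could build the two command lines. The helper name `poppler_command` and the `out` prefix are only illustrative; the `-png` flag is added for pdftocairo because that tool needs an explicit output format.

```python
def poppler_command(tool, pdf_path, out_prefix, dpi=200):
    """Build the argument list for a direct pdftoppm or pdftocairo run.

    Hypothetical helper for manual testing; it only builds the command
    and does not execute it.
    """
    if tool not in ("pdftoppm", "pdftocairo"):
        raise ValueError(f"unsupported tool: {tool}")
    cmd = [tool, "-r", str(dpi)]
    if tool == "pdftocairo":
        # pdftocairo requires an output format option such as -png
        cmd.append("-png")
    cmd += [pdf_path, out_prefix]
    return cmd


print(poppler_command("pdftoppm", "your_asset.pdf", "out"))
print(poppler_command("pdftocairo", "your_asset.pdf", "out"))
```

Either list can be handed to subprocess.run(cmd, check=True) on a machine where Poppler is installed.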

@Crowfunder
Author

pdftoppm -r 200 your_asset.pdf out
This one worked perfectly.

Owner

Belval commented Nov 6, 2023

Ok so the issue is with pdf2image somehow. Can you share the asset?

@Crowfunder
Author

Forgot to mention that pdftocairo works fine.
The PDF in question: https://wormhole.app/kRZQl#5lMmzZ6BtD7RFIGOOaTOsw

tenberg commented Feb 1, 2024

I just ran into a similar issue, also with dpi not being set correctly. Not sure if this helps the debug process, but in my code I decided to do the following:
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=300)

and saw this error in PIL/TiffImagePlugin.py:
ifd[RESOLUTION_UNIT] = 2
ifd[X_RESOLUTION] = dpi[0]
ifd[Y_RESOLUTION] = dpi[1]

which led me to believe dpi should be a 2-element list. So I then tried:
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=[300, 300])

and when I checked the .tif in Preview, the resolution was correct at 300dpi instead of 72.

Just to sum up, I converted an 11 x 8.5 pdf to tiff using the following lines: I removed dpi=300 from convert_from_path and moved it to save as a 2-element list:
page = convert_from_path(f"{working_path}{pdf}", size=(3300, 2550))
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=[300, 300])

Hope this helps.
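
For reference, the size=(3300, 2550) above is just the page size in inches multiplied by the target dpi. A minimal sketch of that arithmetic, with `pixel_size` as a hypothetical helper name:

```python
def pixel_size(width_in, height_in, dpi):
    """Return the (width, height) in pixels for a page measured in inches."""
    return (round(width_in * dpi), round(height_in * dpi))


# An 11 x 8.5 inch page at 300 dpi
print(pixel_size(11, 8.5, 300))  # (3300, 2550)
```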
