PDF.check() does not report warnings when qpdf --check does #515

Open
workflowsguy opened this issue Aug 29, 2023 Discussed in #511 · 1 comment

Comments

@workflowsguy

Discussed in #511

I have encountered several PDFs for which Pdf.check() does not report any warnings, whereas qpdf --check does.

This can be reproduced with the following files, which I believe to be freely downloadable on the Internet:

202108webinar1631038510844.pdf
markdown-cheatsheet-online.pdf
WNDR3700_SM_04JUN2010.pdf

@jbarlow83
Member

Triage:

As a sanity check, the newer QPDF Job API does produce exactly the same behavior as qpdf --check.

from pikepdf import Job

job = Job(['yourapp', '--check', 'issue515.1.pdf'])  # argv-style: program name, then qpdf flags
job.run()

Output

WARNING: issue515.1.pdf: linearized file contains an uncompressed object after a compressed one in a cross-reference stream
WARNING: issue515.1.pdf: first shared object offset mismatch: hint table = 4542437; computed = 4541196
WARNING: issue515.1.pdf: page 0 has shared identifier entries
WARNING: issue515.1.pdf: page 0: shared object 11210: in hint table but not computed list
WARNING: issue515.1.pdf: incorrect offset in outlines table: hint table = 5047191; computed = 5045950
pikepdf: operation succeeded with warnings
checking issue515.1.pdf
PDF Version: 1.6
File is not encrypted
File is linearized

The above means pikepdf is capable of replicating qpdf's behavior.

The errors in the first file, at least, are all linearization issues, and all minor issues that pikepdf/libqpdf can repair in stride without loss of data.

Currently, Pdf.check_linearization() returns False and logs warnings but does not itself report the issues, while Pdf.check() reports the issues that Pdf.check_linearization() actually found. So with current versions of pikepdf, you need to call .check_linearization() first, then .check(). I don't like that the API works this way, with .check_linearization() changing the output of .check(), but that is mostly how libqpdf behaves, and in this case we're mostly just forwarding its behavior.
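
A minimal sketch of that call order, for anyone who needs the workaround today. The helper names check_including_linearization and check_file are hypothetical, not part of pikepdf. The is_linearized guard is an assumption on my part: as far as I know, check_linearization() raises on a file that is not linearized at all, so the sketch skips it in that case.

```python
def check_including_linearization(pdf):
    """Return (linearization_ok, problems) for an open pikepdf.Pdf.

    Hypothetical helper, not part of pikepdf. linearization_ok is None
    when the file is not linearized and the linearization check is skipped.
    """
    linearization_ok = None
    if pdf.is_linearized:
        # Run this first: per the triage above, check() only reports
        # linearization issues after check_linearization() has found them.
        linearization_ok = pdf.check_linearization()
    problems = list(pdf.check())
    return linearization_ok, problems


def check_file(path):
    """Open the PDF at path and run both checks in the order described above."""
    import pikepdf  # imported here so the helper above has no hard dependency

    with pikepdf.open(path) as pdf:
        return check_including_linearization(pdf)
```

Calling check_file('issue515.1.pdf') should then return False for the linearization result along with a non-empty list of problems, if the behavior matches what is described above. I have not verified this against every pikepdf version.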

I am tempted to deprecate .check and .check_linearization, replacing both with something more consistent and Pythonic. Another tempting possibility is to simply drop both methods and recommend the use of the Job API for these checks. (As a separate issue, the Job API needs to be wired into the Python logging system.)

Unfortunately, this can't be improved without a new API or a breaking change to existing APIs.
