PDF.check() does not report warnings when qpdf --check does #515

workflowsguy · 2023-08-29T10:04:10Z

Discussed in #511

I have encountered several PDFs for which Pdf.check() does not report any warnings whereas qpdf --check does.

This can be reproduced with the following files, which I believe to be freely downloadable on the Internet:

202108webinar1631038510844.pdf
markdown-cheatsheet-online.pdf
WNDR3700_SM_04JUN2010.pdf

jbarlow83 · 2023-09-10T06:11:51Z

Triage:

As a sanity check, the newer QPDF Job API does produce exactly the same behavior as qpdf --check.

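from pikepdf import Job
# 'yourapp' stands in for the program name; the remaining items are qpdf CLI arguments
job = Job(['yourapp', '--check', 'issue515.1.pdf'])
job.run()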

Output

WARNING: issue515.1.pdf: linearized file contains an uncompressed object after a compressed one in a cross-reference stream
WARNING: issue515.1.pdf: first shared object offset mismatch: hint table = 4542437; computed = 4541196
WARNING: issue515.1.pdf: page 0 has shared identifier entries
WARNING: issue515.1.pdf: page 0: shared object 11210: in hint table but not computed list
WARNING: issue515.1.pdf: incorrect offset in outlines table: hint table = 5047191; computed = 5045950
pikepdf: operation succeeded with warnings
checking issue515.1.pdf
PDF Version: 1.6
File is not encrypted
File is **linearized**

The above means pikepdf is capable of replicating qpdf's behavior.

The errors in the first file, at least, are all linearization issues, and all minor issues that pikepdf/libqpdf can repair in stride without loss of data.

The current behavior is that Pdf.check_linearization() returns False and logs warnings but does not itself report any issues, while Pdf.check() reports the issues that Pdf.check_linearization() actually found. For current versions of pikepdf, you'd need to call .check_linearization() first, then .check(). I don't like that the API works this way, with .check_linearization() changing the output of .check(), but that is mostly how libqpdf behaves and in this case we're mostly just forwarding its behavior.
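The call order described above could be wrapped roughly as follows. This is only a sketch, assuming a pikepdf release contemporary with this thread in which Pdf.check() and Pdf.check_linearization() both exist; the helper names check_with_linearization and check_file are made up for illustration and are not part of pikepdf. It also assumes check_linearization() should be skipped for files that are not linearized at all.

```python
import io


def check_with_linearization(pdf):
    """Return (linearization_ok, problems) for an open pikepdf.Pdf.

    linearization_ok is None when the file is not linearized at all.
    (Hypothetical helper, not a pikepdf API.)
    """
    linearization_ok = None
    if pdf.is_linearized:
        # check_linearization() writes its report to a stream; capture it
        # in memory here instead of letting it go to stderr.
        report = io.StringIO()
        linearization_ok = pdf.check_linearization(report)
    # Called second, so it can include what the linearization check found.
    problems = list(pdf.check())
    return linearization_ok, problems


def check_file(path):
    """Open `path` with pikepdf and run both checks in the order above."""
    import pikepdf  # imported here so the helper above has no hard dependency

    with pikepdf.open(path) as pdf:
        return check_with_linearization(pdf)
```

With one of the affected files, check_file('issue515.1.pdf') would be expected to return False for the linearization result together with a non-empty list of problems, though the exact messages depend on the pikepdf and libqpdf versions in use.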

I am tempted to deprecate .check and .check_linearization, replacing both with something more consistent and Pythonic. Another tempting possibility is to simply drop both methods and recommend the use of the Job API for these checks. (As a separate issue, the Job API needs to be wired into the Python logging system.)
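If the Job API route were recommended, a thin wrapper might look like the sketch below. The function name job_check and its job_cls parameter are hypothetical and exist only to keep the example testable; Job.exit_code and Job.has_warnings are assumed to behave as in the pikepdf Job documentation. The warning text itself is still printed by libqpdf rather than returned, which is the logging gap mentioned above.

```python
def job_check(path, job_cls=None):
    """Run the equivalent of `qpdf --check path`.

    Returns (exit_code, has_warnings). Hypothetical helper, not a pikepdf API.
    """
    if job_cls is None:
        # Default to the real pikepdf Job class when none is injected.
        from pikepdf import Job as job_cls
    # The first list item is the program name, followed by qpdf CLI arguments.
    job = job_cls(["pikepdf", "--check", str(path)])
    job.run()
    return job.exit_code, job.has_warnings
```

This only exposes whether warnings occurred, not what they were, so it is not a drop-in replacement for the list returned by Pdf.check().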

Unfortunately, this can't be improved without a new API or a breaking change to existing APIs.
