docinfo_from_xmp() fails on reduced precision dates (YYYY / YYYY-MM) #576

Open
devnoname120 opened this issue Apr 7, 2024 · 1 comment
@devnoname120

docinfo_from_xmp() fails when given a reduced precision date (YYYY or YYYY-MM): pikepdf emits a UserWarning and the corresponding DocumentInfo field is not updated.

Investigation

From Adobe's documentation of the XMP Date data type:

A date-time value is represented using a subset of the formats as defined in Date and Time Formats:
YYYY
YYYY-MM
YYYY-MM-DD
YYYY-MM-DDThh:mmTZD
YYYY-MM-DDThh:mm:ssTZD
YYYY-MM-DDThh:mm:ss.sTZD

However, pikepdf uses datetime.fromisoformat(), which, according to Python's documentation, doesn't support the YYYY and YYYY-MM date formats:

classmethod date.fromisoformat(date_string)
Return a date corresponding to a date_string given in any valid ISO 8601 format, with the following exceptions:

  • Reduced precision dates are not currently supported (YYYY-MM, YYYY).
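For illustration, here is a minimal sketch of how the two reduced precision forms could be handled before falling back to datetime.fromisoformat(). The helper name parse_xmp_date is hypothetical and not part of pikepdf. Note that padding to the first day of the period discards the information that the original value was reduced precision.

```python
import re
from datetime import datetime

# Matches only the two reduced precision forms: YYYY and YYYY-MM.
_REDUCED_PRECISION = re.compile(r"(\d{4})(?:-(\d{2}))?")


def parse_xmp_date(value: str) -> datetime:
    """Hypothetical helper (not pikepdf API): parse an XMP date string.

    YYYY and YYYY-MM are padded to the first day of the period;
    everything else is handed to datetime.fromisoformat().
    """
    match = _REDUCED_PRECISION.fullmatch(value)
    if match:
        year = int(match.group(1))
        month = int(match.group(2) or 1)  # a missing month defaults to January
        return datetime(year, month, 1)
    return datetime.fromisoformat(value)


print(parse_xmp_date("2023"))
print(parse_xmp_date("2023-11"))
print(parse_xmp_date("2023-11-05"))
```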

How to reproduce

Example 1

import pikepdf

pdf = pikepdf.new()

with pdf.open_metadata() as pdf_metadata:
    pdf_metadata['xmp:CreateDate'] = '2023'

Output:

project/.venv/lib/python3.12/site-packages/pikepdf/models/metadata.py:529: UserWarning: The DocumentInfo field /CreationDate could not be updated from XMP

Example 2

import pikepdf

pdf = pikepdf.new()

with pdf.open_metadata() as pdf_metadata:
    pdf_metadata['xmp:CreateDate'] = '2023-11'

Output:

project/.venv/lib/python3.12/site-packages/pikepdf/models/metadata.py:529: UserWarning: The DocumentInfo field /CreationDate could not be updated from XMP
@jbarlow83
Member

Surprisingly, both the PDF internal date spec (PDFmark) and XMP allow reduced precision dates.

It looks like using pendulum.Interval would make it possible to round-trip reduced precision dates, with the interval set to the time period covered, e.g. 2023 would become the interval 2023-01-01 (inclusive) through 2024-01-01 (exclusive). That would allow a consistent representation that could distinguish between the year 2023 and 2023-01-01. Then encode_pdf_date and decode_pdf_date would have to learn about pendulum's datetime and interval classes, without breaking backward compatibility.
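A rough sketch of the interval idea, using only the standard library rather than pendulum so it stays self-contained. DateInterval, interval_from_reduced and reduced_from_interval are hypothetical names for illustration; they are not pikepdf or pendulum APIs, and a real implementation would presumably also need to handle time zones.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DateInterval:
    """Hypothetical stand-in for an interval type: [start, end)."""
    start: datetime  # inclusive
    end: datetime    # exclusive


def _next_month(year: int, month: int) -> datetime:
    # December rolls over into January of the following year.
    return datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)


def interval_from_reduced(value: str) -> DateInterval:
    """Turn 'YYYY' or 'YYYY-MM' into the half-open interval it covers."""
    match = re.fullmatch(r"(\d{4})(?:-(\d{2}))?", value)
    if match is None:
        raise ValueError(f"not a reduced precision date: {value!r}")
    year = int(match.group(1))
    if match.group(2) is None:
        return DateInterval(datetime(year, 1, 1), datetime(year + 1, 1, 1))
    month = int(match.group(2))
    return DateInterval(datetime(year, month, 1), _next_month(year, month))


def reduced_from_interval(interval: DateInterval) -> str:
    """Recover 'YYYY' or 'YYYY-MM' from an interval covering a whole year or month."""
    start, end = interval.start, interval.end
    if start == datetime(start.year, 1, 1) and end == datetime(start.year + 1, 1, 1):
        return f"{start.year:04d}"
    if start == datetime(start.year, start.month, 1) and end == _next_month(start.year, start.month):
        return f"{start.year:04d}-{start.month:02d}"
    raise ValueError("interval does not cover exactly one year or one month")


year_2023 = interval_from_reduced("2023")
print(year_2023.start, year_2023.end)
print(reduced_from_interval(year_2023))
```

Because the year 2023 maps to an interval while 2023-01-01 stays a plain datetime, the two remain distinguishable when written back out.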

Arrow and the Python standard library (along with most other software) will render a date like "2023" as "2023-01-01".

I can't say this issue is high priority from my perspective, and it will be fussy, but PRs are welcome if you want to see it tackled sooner.
