Feature: tagged pdf #461

joemarshall · 2023-03-18T06:54:40Z

I spent a bit of time messing with pikepdf to create PDFs with structure tags (and add structure tags to existing PDFs). It's entirely doable with the current API, and I have a proof of concept working okay, but it would be nice to have a wrapper for the StructTreeRoot, just like there is for pages etc.

What the wrapper would do:

Allow addition and deletion of structure elements in the tree.
Allow content node structure elements to be pointed at a page and marked content ID (MCID), and update the ParentTree (a number tree mapping from page/MCID back into the structure tree) accordingly.
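
To make the bookkeeping concrete, here is a minimal pure-Python sketch of the two responsibilities listed above. It is not pikepdf's API and not the code from the proof of concept: `StructElem`, `StructTreeModel` and their methods are hypothetical names, and the parent tree is modelled as a nested dict rather than a real PDF number tree. In an actual PDF, the parent tree is keyed by each page's `StructParents` integer and holds an array indexed by MCID; a page index stands in for that key here.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove() targets the exact node
class StructElem:
    """Hypothetical stand-in for a structure element (not a pikepdf class)."""
    tag: str  # structure type, e.g. "H1" or "P"
    # Children are either StructElem nodes or (page_index, mcid) content references.
    children: list = field(default_factory=list)


class StructTreeModel:
    """Toy model that keeps the structure tree and the reverse mapping in sync."""

    def __init__(self):
        self.root = StructElem("Document")
        # parent_tree[page_index][mcid] -> StructElem that owns that marked content
        self.parent_tree = {}

    def add_element(self, parent, tag):
        elem = StructElem(tag)
        parent.children.append(elem)
        return elem

    def attach_content(self, elem, page_index, mcid):
        """Point elem at a marked content sequence and record the reverse entry."""
        page_map = self.parent_tree.setdefault(page_index, {})
        if mcid in page_map:
            raise ValueError(f"MCID {mcid} on page {page_index} already has a parent")
        elem.children.append((page_index, mcid))
        page_map[mcid] = elem

    def remove_element(self, parent, elem):
        """Detach elem and drop reverse entries for it and all its descendants."""
        parent.children.remove(elem)
        stack = [elem]
        while stack:
            node = stack.pop()
            for child in node.children:
                if isinstance(child, StructElem):
                    stack.append(child)
                else:
                    page_index, mcid = child
                    del self.parent_tree[page_index][mcid]
```

The point of routing every change through one object is that the forward tree and the reverse mapping cannot drift apart, which is the consistency guarantee the wrapper is meant to provide.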

It might also provide a method for inserting an MCID into a content stream and updating the tree at the same time, just to guarantee consistency, but I'm not sure that is needed.

I've done a basic proof of concept in Python, but looking at the library code, I'm guessing that for this to be in the library itself it should probably be C++? I'm not sure when I'd have the time to do any C++ work on this, but you can see my Python code if you want.

joemarshall · 2023-03-18T06:59:19Z

https://gist.github.com/joemarshall/1b4906e49b6f8570e2020af901d944c1#file-pdftags-ipynb

jbarlow83 · 2023-03-18T08:49:20Z

This is quite interesting and aligns well with my current work. Overall, it's a promising proof of concept, and with some work I could definitely integrate something based on this into pikepdf.

pikepdf has a mix of C++ and Python. Many significant higher-level features are entirely in Python, so it's not an issue to accept code that does useful things in Python. C++ is mainly for binding QPDF, and occasionally for improving performance in tight spots.

pikepdf is a library, and in a library we don't want to dictate policy. We let the library's user set policy. For example, it seems like you've determined that only H1, H2 and P are supported and no more than 3 fonts will appear, if I'm reading your code correctly? A library shouldn't be dictating that sort of thing. While it should have sensible defaults, it needs to be more flexible. It would make more sense if insert_marks took as a parameter a mapping of font name and size to a tag. Then, another function would scan the PDF and provide a list of font name and size combinations, so that the library user can use that information to decide on the (font name, size) -> tag mapping (or provide an interface for their own user to decide).
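
As a rough illustration of that split between mechanism and policy, the sketch below scans decoded content stream text for font selections and looks tags up in a caller-supplied mapping. `list_font_usages` and `tag_for` are hypothetical names, not existing pikepdf functions, and the regex is a simplification that would also match text inside string literals; a real implementation would presumably use a proper content stream parser such as `pikepdf.parse_content_stream` instead.

```python
import re

# Matches a font selection operator such as "/F1 12 Tf" (simplified).
_TF_RE = re.compile(r"/([^\s/\[\]()<>]+)\s+(\d+(?:\.\d+)?)\s+Tf\b")


def list_font_usages(content):
    """Return the distinct (font name, size) pairs in order of first appearance."""
    seen = []
    for name, size in _TF_RE.findall(content):
        key = (name, float(size))
        if key not in seen:
            seen.append(key)
    return seen


def tag_for(font, size, mapping, default="P"):
    """Look up the structure tag for a font; the caller owns the mapping (the policy)."""
    return mapping.get((font, size), default)


# Usage: the library reports what it found, the caller decides what it means.
sample = "BT /F1 24 Tf (Title) Tj ET BT /F2 11 Tf (Body) Tj ET BT /F1 24 Tf (More) Tj ET"
usages = list_font_usages(sample)
policy = {("F1", 24.0): "H1"}  # chosen by the library user, not hard-coded
tags = [tag_for(font, size, policy) for font, size in usages]
```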

insert_marks should also anticipate the possibility that marks already exist in the content stream. We ought to scan the existing stream and set the initial MCID number above any existing numbers. Alternatively, detect existing PDF tags and either discard them or refuse to modify the file.
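
A minimal sketch of the "start above any existing numbers" option, assuming the content stream has already been decoded to text. `next_free_mcid` is a hypothetical helper name, and the regex only sees MCIDs written inline in the stream; property lists referenced by name through the page's resources are not handled here.

```python
import re

# Matches an inline marked content ID such as "/MCID 3" inside "<</MCID 3>> BDC".
_MCID_RE = re.compile(r"/MCID\s+(\d+)")


def next_free_mcid(content):
    """Return the smallest MCID strictly greater than every MCID already present."""
    existing = [int(value) for value in _MCID_RE.findall(content)]
    return max(existing) + 1 if existing else 0
```

For a stream that already contains MCIDs 0 and 3, this would hand out 4 as the first new ID, leaving the existing marks untouched.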

pikepdf also provides NumberTree. StructTree involves number trees and should use the existing API where possible.

I prefer pikepdf.Name.Thing and encourage it. e.g. I think struct_root.Type is cleaner than struct_root["/Type"].

It looks like the essential features are covered but it would be worth using e.g. Acrobat's preflight feature to confirm that the tagging is correctly generated.

joemarshall · 2023-03-18T09:34:43Z

Yeah, I think what needs to be in the library is a) the structtree class rewritten properly as a pikepdf style helper class like the page object is. b) code to generate an MCID to put in a stream.

The autotagging stuff in the gist was just for one specific file; I was thinking that level of functionality probably lives outside the library. It's just a demo that you can use this code to generate a tagged PDF (one that works in the tag view in Acrobat Pro, so I think it is to spec).

If you're happy with python code in the library, I don't mind putting in a structtree wrapper class when I next have a bit of time.

joemarshall · 2023-03-18T09:40:56Z

Oh and about MCIDs existing, good point, it should probably catch that and rejiggle the structure tree accordingly - I think MCIDs should really be in numerical order in the content stream, so inserting one in the middle would need to increment the others.
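
If renumbering turns out to be the chosen approach, the content stream side of it could look roughly like this. `shift_mcids` is a hypothetical helper operating on decoded stream text with inline MCIDs only; the matching entries in the page's parent tree array would need to be shifted by the same amount for the two to stay consistent.

```python
import re

# Captures the "/MCID " prefix and the number separately so only the number changes.
_MCID_RE = re.compile(r"(/MCID\s+)(\d+)")


def shift_mcids(content, first, delta=1):
    """Add delta to every MCID >= first, making room for new IDs starting at first."""

    def bump(match):
        value = int(match.group(2))
        if value >= first:
            value += delta
        return match.group(1) + str(value)

    return _MCID_RE.sub(bump, content)
```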
