Feature: tagged pdf #461

Open
joemarshall opened this issue Mar 18, 2023 · 4 comments

@joemarshall

I spent a bit of time experimenting with pikepdf to create PDFs with structure tags (and to add structure tags to existing PDFs). It's entirely doable with the current API, and I have a proof of concept working okay, but it would be nice to have a wrapper for the StructTreeRoot, just like there is for pages etc.

What the wrapper would do:

  1. Allow addition and deletion of structure elements in the tree.
  2. Allow content-node structure elements to be pointed at a page and marked content ID (MCID), and update the ParentTree (a number tree mapping from page/MCID back into the structure tree) accordingly.
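
As a rough illustration of the two responsibilities above, here is a minimal sketch in plain Python. It does not use pikepdf at all: `StructElem`, `StructTree`, `add_element`, `remove_element` and `attach_content` are hypothetical names, and ordinary objects and dicts stand in for the PDF dictionaries and the ParentTree number tree.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class StructElem:
    """Stand-in for a structure element dictionary (hypothetical, not pikepdf API)."""
    tag: str                                   # structure type, e.g. "H1" or "P"
    parent: "StructElem | None" = None
    kids: list = field(default_factory=list)   # child elements or (page_index, mcid) pairs


class StructTree:
    """Keeps the element tree and the reverse (page, MCID) lookup in sync."""

    def __init__(self):
        self.root = StructElem("Document")
        # parent_tree[page_index][mcid] is the element that owns that marked content
        self.parent_tree = {}

    def add_element(self, parent, tag):
        elem = StructElem(tag, parent)
        parent.kids.append(elem)
        return elem

    def attach_content(self, elem, page_index, mcid):
        page_map = self.parent_tree.setdefault(page_index, {})
        if mcid in page_map:
            raise ValueError(f"MCID {mcid} on page {page_index} is already claimed")
        elem.kids.append((page_index, mcid))
        page_map[mcid] = elem

    def remove_element(self, elem):
        if elem.parent is None:
            raise ValueError("cannot remove the root element")
        # Remove descendants first so their reverse entries disappear too.
        for kid in list(elem.kids):
            if isinstance(kid, StructElem):
                self.remove_element(kid)
        # Drop every reverse entry that still points at this element.
        for page_map in self.parent_tree.values():
            for mcid in [m for m, owner in page_map.items() if owner is elem]:
                del page_map[mcid]
        elem.parent.kids.remove(elem)
```

The point of routing every change through one class is that the forward tree and the reverse mapping cannot drift apart; a real wrapper would do the same bookkeeping on the actual PDF objects.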

It might also provide a method for inserting an MCID into a content stream and updating the tree at the same time, just to guarantee consistency, but I'm not sure that is needed.
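
A minimal sketch of what such a combined method could look like, assuming the content to be tagged is already available as a run of content-stream bytes. `mark_content` and `mark_and_register` are hypothetical helpers, and the `parents` list is a stand-in for the page's ParentTree entry (an array indexed by MCID); none of this is existing pikepdf API.

```python
def mark_content(content: bytes, tag: str, mcid: int) -> bytes:
    """Wrap content-stream operators in a BDC ... EMC marked-content sequence."""
    header = f"/{tag} <</MCID {mcid}>> BDC\n".encode("ascii")
    return header + content.rstrip(b"\n") + b"\nEMC\n"


def mark_and_register(content: bytes, tag: str, parents: list):
    """Mark the content and record its owner in one step.

    `parents` stands in for the page's ParentTree array: index i holds the
    owner of MCID i. Here the owner is just the tag string for illustration.
    """
    mcid = len(parents)          # next unused index on this page
    parents.append(tag)
    return mark_content(content, tag, mcid), mcid
```

Because the MCID is allocated and recorded in the same call that writes it into the stream, the two cannot get out of step.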

I've done a basic proof of concept in Python, but looking at the library code, I'm guessing that for this to be in the library itself it should probably be C++? I'm not sure when I'd have the time to do any C++ work on this, but you can see my Python code if you want.

@joemarshall
Author

@jbarlow83
Member

This is quite interesting and aligns well with my current work. Overall, it's a promising proof of concept, and with some effort I could definitely work something based on it into pikepdf.

pikepdf has a mix of C++ and Python. Many significant higher-level features are entirely in Python; it's not an issue to accept code that does useful things in Python. C++ is mainly for binding QPDF and, occasionally, for improving performance in tight spots.

pikepdf is a library, and in a library we don't want to dictate policy. We let the library's user set policy. For example, it seems like you've determined that only H1, H2 and P are supported and that no more than 3 fonts will appear, if I'm reading your code correctly? A library shouldn't be dictating that sort of thing: while it should have sensible defaults, it needs to be more flexible. It would make more sense if insert_marks took as a parameter a mapping of font name and size to a tag. Then another function would scan the PDF and provide a list of font name and size combinations, so that the library user can use that information to decide on the (font name, size) -> tag mapping (or provide an interface for their own user to decide).
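
A minimal sketch of that split, assuming the content stream is available as decoded bytes. `list_fonts` and `tag_for` are hypothetical names, and the regular expression is a shortcut: it sees font resource names such as `F1` rather than real font names, and it would also match text inside string literals. Real code should tokenise the stream, for example with `pikepdf.parse_content_stream`.

```python
import re
from collections import Counter

# Matches "/F1 12 Tf": a font resource name, a size, then the Tf operator.
_TF = re.compile(rb"/([^\s/\[\]<>()]+)\s+([0-9]*\.?[0-9]+)\s+Tf\b")


def list_fonts(content: bytes) -> Counter:
    """Count each (font resource name, size) pair selected with Tf."""
    return Counter(
        (m.group(1).decode("latin-1"), float(m.group(2)))
        for m in _TF.finditer(content)
    )


def tag_for(font: str, size: float, mapping: dict, default: str = "P") -> str:
    """Look up the caller's policy; fall back to a default tag."""
    return mapping.get((font, size), default)
```

The caller inspects the output of `list_fonts`, builds a mapping such as `{("F1", 24.0): "H1"}`, and passes it in, so the tagging policy stays outside the library.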

insert_marks should also anticipate the possibility that marks already exist in the content stream. We ought to scan the existing stream and set the initial MCID number above any existing numbers. Alternatively, detect existing PDF tags and either discard them or refuse to modify the file.
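
A minimal sketch of the scanning step, again over raw content-stream bytes. `next_free_mcid` and `starting_mcid` are hypothetical names, and the `on_existing` parameter is an assumed way to expose the "continue numbering or refuse" choice. The regular expression only finds MCIDs written inline; marks whose property list is referenced by name through the page's Properties resources would need a proper parse.

```python
import re

_MCID_RE = re.compile(rb"/MCID\s+(\d+)")


def next_free_mcid(content: bytes) -> int:
    """Return one more than the highest inline MCID, or 0 if there are none."""
    existing = [int(m.group(1)) for m in _MCID_RE.finditer(content)]
    return max(existing, default=-1) + 1


def starting_mcid(content: bytes, on_existing: str = "offset") -> int:
    """Pick the first MCID to hand out, or refuse if the stream is already marked."""
    first = next_free_mcid(content)
    if first > 0 and on_existing == "refuse":
        raise ValueError("content stream already contains marked content")
    return first
```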

pikepdf also provides NumberTree. StructTree involves number trees and should use the existing API where possible.

I prefer the pikepdf.Name.Thing style and encourage it; e.g. I think struct_root.Type is cleaner than struct_root["/Type"].

It looks like the essential features are covered but it would be worth using e.g. Acrobat's preflight feature to confirm that the tagging is correctly generated.

@joemarshall
Author

Yeah, I think what needs to be in the library is (a) the StructTree class, rewritten properly as a pikepdf-style helper class like the Page object is, and (b) code to generate an MCID to put in a stream.

The autotagging stuff in the gist was just for one specific file; I was thinking that level of functionality probably lives outside the library. It's just a demo that you can use this code to generate a tagged PDF (one that works in the tag view in Acrobat Pro, so I think it is to spec).

If you're happy with Python code in the library, I don't mind putting in a StructTree wrapper class when I next have a bit of time.

@joemarshall
Author

Oh, and about MCIDs already existing: good point. It should probably catch that and rejiggle the structure tree accordingly. I think MCIDs should really be in numerical order in the content stream, so inserting one in the middle would need to increment the others.
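
A minimal sketch of that renumbering, assuming inline MCIDs and a plain list standing in for the page's ParentTree array. `shift_mcids` and `insert_parent` are hypothetical helpers; a real implementation would also have to update any structure elements that refer to the shifted MCIDs.

```python
import re

_MCID_OPERAND = re.compile(rb"(/MCID\s+)(\d+)")


def shift_mcids(content: bytes, start: int, by: int = 1) -> bytes:
    """Increase every inline MCID that is >= start, leaving lower ones alone."""
    def bump(match):
        number = int(match.group(2))
        if number >= start:
            number += by
        return match.group(1) + str(number).encode("ascii")

    return _MCID_OPERAND.sub(bump, content)


def insert_parent(parents: list, index: int, elem) -> None:
    """Make room in the per-page parent array so index i still means MCID i."""
    parents.insert(index, elem)
```

After both calls, the slot at `index` is free in the stream and already owned by the new element in the parent array.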
