Feature: tagged pdf #461
This is quite interesting and aligns well with my current work. Overall, it's an interesting proof of concept, and with some work I could definitely build on it in pikepdf. pikepdf is a mix of C++ and Python. Many significant higher-level features are written entirely in Python, so it's not an issue to accept code that does useful things in Python. C++ is mainly for binding QPDF and, occasionally, for improving performance in tight spots. pikepdf is a library, and in a library we don't want to dictate policy; we let the library's user set policy. For example, it seems you've decided that only H1, H2 and P are supported and that no more than 3 fonts will appear, if I'm reading your code correctly? A library shouldn't be dictating that sort of thing. While it should have sensible defaults, it needs to be more flexible. It would make more sense if
pikepdf also provides NumberTree. StructTree involves number trees and should use the existing API where possible. I prefer
It looks like the essential features are covered, but it would be worth using e.g. Acrobat's preflight feature to confirm that the tagging is generated correctly.
Yeah, I think what needs to be in the library is: a) the StructTree class, rewritten properly as a pikepdf-style helper class like the page object is; b) code to generate an MCID to put in a stream. The autotagging stuff in the gist was just for one specific file; I was thinking that level of functionality probably lives outside the library. It's just a demo that you can use this code to generate a tagged PDF (one that works in the tag view in Acrobat Pro, so I think it is to spec). If you're happy with Python code in the library, I don't mind putting in a StructTree wrapper class when I next have a bit of time. Oh, and about MCIDs already existing: good point, it should probably catch that and rearrange the structure tree accordingly. I think MCIDs should really be in numerical order in the content stream, so inserting one in the middle would need to increment the others.
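To make the renumbering idea concrete, here is a rough sketch in plain Python. It does not use pikepdf at all: `StructElem`, `insert_mcid` and `parent_tree_array` are hypothetical names made up for illustration, and it assumes (as suggested above, not as a statement about the spec) that MCIDs on a page are kept as 0..n-1 in content order.

```python
from dataclasses import dataclass, field


@dataclass
class StructElem:
    """Hypothetical stand-in for a structure element (not a pikepdf class)."""
    tag: str
    mcids: list = field(default_factory=list)


def insert_mcid(elements, position, owner):
    """Give `owner` the MCID `position`, shifting existing MCIDs >= position up by one."""
    for elem in elements:
        elem.mcids = [m + 1 if m >= position else m for m in elem.mcids]
    owner.mcids.append(position)
    owner.mcids.sort()
    return position


def parent_tree_array(elements):
    """Rebuild a per-page lookup: index is the MCID, value is the owning element."""
    by_mcid = {m: e for e in elements for m in e.mcids}
    return [by_mcid[i] for i in range(len(by_mcid))]


# A page with a heading (MCID 0) and a paragraph split in two (MCIDs 1 and 2).
h1 = StructElem("H1", [0])
p = StructElem("P", [1, 2])
h2 = StructElem("H2")
elements = [h1, p, h2]

# Insert a new marked-content sequence between the heading and the paragraph.
insert_mcid(elements, 1, h2)
```

After the call, the paragraph's MCIDs have moved up to 2 and 3 and the new H2 owns MCID 1. In a real file the same shift would also have to be applied to the `/MCID` values inside the page's content stream.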
I spent a bit of time experimenting with pikepdf to create PDFs with structure tags (and to add structure tags to existing PDFs). It's entirely doable with the current API, and I have a proof of concept working okay, but it would be nice to have a wrapper for the StructTreeRoot, just like there is for pages etc.
What the wrapper would do:
It might also provide a method for inserting an MCID into a content stream and updating the tree at the same time, just to guarantee consistency, but I'm not sure that is needed.
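For reference, "inserting an MCID into a content stream" means wrapping the drawing operators in a `BDC` ... `EMC` pair whose property list carries the MCID. A minimal sketch of that step, working on raw bytes rather than through pikepdf: `next_free_mcid` and `wrap_marked_content` are hypothetical helpers, and the regex scan is a naive assumption that would be fooled by `/MCID` appearing inside a string or inline image.

```python
import re

# Naive pattern for MCID entries in an inline property list (illustration only).
_MCID_RE = re.compile(rb"/MCID\s+(\d+)")


def next_free_mcid(content: bytes) -> int:
    """Return one more than the highest MCID already present, or 0 if none."""
    found = [int(m) for m in _MCID_RE.findall(content)]
    return max(found) + 1 if found else 0


def wrap_marked_content(fragment: bytes, tag: str, mcid: int) -> bytes:
    """Wrap content-stream operators in a BDC/EMC marked-content sequence."""
    header = b"/" + tag.encode("ascii") + b" <</MCID " + str(mcid).encode("ascii") + b">> BDC\n"
    return header + fragment.strip() + b"\nEMC\n"


# Existing page content that already has one tagged heading.
existing = b"/H1 <</MCID 0>> BDC\nBT (Title) Tj ET\nEMC\n"

# Tag a new paragraph with the next unused MCID and append it.
mcid = next_free_mcid(existing)
wrapped = wrap_marked_content(b"BT (Body) Tj ET", "P", mcid)
updated = existing + wrapped
```

The MCID chosen here still has to be recorded in the structure tree, which is why doing both in a single method would help keep the two consistent.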
I've done a basic proof of concept in Python, but from looking at the library code I'm guessing that, for this to be in the library itself, it should probably be C++? I'm not sure when I'd have the time to do any C++ work on this, but you can see my Python code if you want.