PDF.js does not recognize semantic markup in PDFs #214

Closed
mattdricker opened this issue Jun 30, 2021 · 5 comments

@mattdricker
Contributor

As reported to us during a meeting with the Ohio State University accessibility team, Hypothesis, which uses PDF.js as its PDF viewer, does not recognize or expose any semantic markup or tagging (the tag tree) that the PDF author may have added. As a result, any such tagging is opaque to screen readers and other adaptive technology tools.

This is a large barrier to meeting the accessibility requirements at OSU, and a considerable gap in our efforts to meet the accessibility needs of all our users.

@mattdricker
Contributor Author

PDF.js appears to be making recent improvements to reading the tag tree.

Corey at OSU has emailed us to report that using the latest PDF.js pre-release v2.9.359 may work quite well.
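
For readers unfamiliar with the term, a tag tree is a hierarchy of structure elements (headings, paragraphs, sections) that assistive technology can navigate. The sketch below walks such a tree and collects the heading roles it finds. The `TagNode` shape (`role`, `children`), the `collectRoles` helper, and the sample data are hypothetical and chosen only for illustration; they are not taken from the PDF.js API.

```typescript
// Hypothetical node shape for a tag tree; not the PDF.js data structure.
interface TagNode {
  role: string;        // structure type, e.g. "H1", "P", "Sect"
  children: TagNode[]; // nested structure elements
}

// Depth-first walk that records every role listed in `wanted`,
// in document order.
function collectRoles(
  node: TagNode,
  wanted: Set<string>,
  out: string[] = []
): string[] {
  if (wanted.has(node.role)) {
    out.push(node.role);
  }
  for (const child of node.children) {
    collectRoles(child, wanted, out);
  }
  return out;
}

// Made-up sample tree for a short tagged document.
const sampleTree: TagNode = {
  role: "Document",
  children: [
    { role: "H1", children: [] },
    { role: "P", children: [] },
    {
      role: "Sect",
      children: [
        { role: "H2", children: [] },
        { role: "P", children: [] },
      ],
    },
  ],
};

const headings = collectRoles(sampleTree, new Set(["H1", "H2"]));
console.log(headings);
```

A viewer that ignores this tree gives a screen reader only a flat run of text, which is the gap described in this issue.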

@robertknight
Member

Thanks for the update, Matt. Per the release notes (https://github.com/mozilla/pdf.js/releases/tag/v2.9.359), there are some significant changes to the rendering of the hidden text layer in this release:

This release features improved text layer rendering (so words and whitespace better match the rendered page)

This has the potential to impact anchoring existing annotations made with Hypothesis, so we need to test this carefully before we can ship this change.
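
To illustrate the concern, here is a minimal sketch of how a whitespace change in the text layer can break an exact quote match, and how a whitespace-insensitive search could still find the quote. This is not how Hypothesis's anchoring works; `findQuoteIgnoringWhitespace` and the sample strings are hypothetical and only show the kind of mismatch that would need testing.

```typescript
// Hypothetical helper: find `quote` in `text` while ignoring all
// whitespace differences. Returns offsets into the original `text`.
function findQuoteIgnoringWhitespace(
  text: string,
  quote: string
): { start: number; end: number } | null {
  // Build a whitespace-free copy of `text`, remembering where each
  // kept character came from in the original string.
  const originalIndex: number[] = [];
  let stripped = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (!/\s/.test(ch)) {
      originalIndex.push(i);
      stripped += ch;
    }
  }

  const needle = quote.replace(/\s+/g, "");
  if (needle.length === 0) {
    return null;
  }

  const idx = stripped.indexOf(needle);
  if (idx === -1) {
    return null;
  }

  // Map the stripped offsets back to offsets in the original text.
  const start = originalIndex[idx] as number;
  const end = (originalIndex[idx + needle.length - 1] as number) + 1;
  return { start, end };
}

// Made-up example: the old text layer dropped one space and doubled another.
const oldLayer = "Semanticmarkup in  PDFs";
const newLayer = "Semantic markup in PDFs";

// A quote selected against the old layer.
const quote = "Semanticmarkup in";

const exactInNew = newLayer.indexOf(quote); // exact match against the new layer
const relaxed = findQuoteIgnoringWhitespace(newLayer, quote);
console.log(exactInNew, relaxed);
```

In this example the exact search fails against the new layer while the relaxed search still locates the passage. Larger differences in extracted text would defeat this simple approach as well.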

@dwhly
Member

dwhly commented Jul 1, 2021

This release features improved text layer rendering

This has been an issue for so long. Completely awesome if this really is a substantial improvement.

Obviously we need to understand the impact any changes would have.

However, assuming that:

  • In PDFs where whitespace was poorly handled before, some of those PDFs would now be better rendered, and
  • In some of these, the changes would be big enough that fuzzy anchoring could no longer gracefully reanchor, and
  • This means a significant number of annotations would now orphan on those pages....

I think the decision should probably be to proceed anyway (assuming there isn't some magic solution, needing implementation, that would allow us both to proceed and to be able to successfully reanchor historic annotations).

We're still in a kind of happy early state where the large majority of annotations are freshly made on documents each semester, and neither students nor teachers are able to return to the ones they've made earlier in a prior course. That will soon change w/ course copy functionality (at some point) allowing teachers to copy forward annotations made as scaffolding on documents they teach regularly, and also any features that allow students to claim and preserve annotations they make during courses.

Obviously w/ > 25 million annotations now, made over the course of 7 years or so, there may be some pain-- but moving towards better tech for the billions of annotations that will follow probably gets the vote.

@mattdricker
Contributor Author

Solved with update to latest PDF.js hypothesis/pdf.js-hypothes.is@0fc20ea
