PDF.js does not recognize semantic markup in PDFs #214

mattdricker · 2021-06-30T18:43:42Z

As reported to us during a meeting with the Ohio State University accessibility team, Hypothesis, which uses PDF.js as its PDF viewer, does not recognize or expose any semantic markup or tagging (the tag tree) that the PDF author may have applied. As a result, any such tagging is invisible to screen readers and other adaptive technology tools.

This is a significant barrier to meeting the accessibility requirements at OSU, and a considerable gap in our efforts to meet the accessibility needs of all our users.

mattdricker · 2021-06-30T18:49:06Z

PDF.js appears to have made recent improvements to reading the tag tree:

Corey at OSU has emailed us to report that using the latest PDF.js pre-release v2.9.359 may work quite well.

robertknight · 2021-07-01T11:11:55Z

Thanks for the update Matt. Per the release notes (https://github.com/mozilla/pdf.js/releases/tag/v2.9.359), there are some significant changes to rendering of the hidden text layer in this release:

> This release features improved text layer rendering (so words and whitespace better match the rendered page)

This has the potential to impact anchoring existing annotations made with Hypothesis, so we need to test this carefully before we can ship this change.
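To illustrate the concern, here is a minimal sketch of why a change in text layer whitespace can break a stored quote. It is not how Hypothesis actually anchors annotations: the helper names (`stripWhitespace`, `matchesExactly`, `matchesIgnoringWhitespace`) and the sample strings are hypothetical and made up for the example. A comparison along these lines could be one ingredient of a test that checks old quotes against the new text layer.

```typescript
// Hypothetical helpers for illustration only; not part of Hypothesis or PDF.js.

/** Remove all whitespace so spacing differences between text layers are ignored. */
function stripWhitespace(text: string): string {
  return text.replace(/\s+/g, "");
}

/** True if the quote occurs verbatim in the page text. */
function matchesExactly(pageText: string, quote: string): boolean {
  return pageText.includes(quote);
}

/** True if the quote occurs in the page text once whitespace is ignored on both sides. */
function matchesIgnoringWhitespace(pageText: string, quote: string): boolean {
  const strippedQuote = stripWhitespace(quote);
  return strippedQuote.length > 0 && stripWhitespace(pageText).includes(strippedQuote);
}

// Made-up example: the same page as extracted by an older and a newer text layer.
const oldTextLayer = "Seman tic markup inPDF files";
const newTextLayer = "Semantic markup in PDF files";

// A quote that was captured against the older text layer.
const storedQuote = "Seman tic markup inPDF";

const exactOld = matchesExactly(oldTextLayer, storedQuote);
const exactNew = matchesExactly(newTextLayer, storedQuote);
const looseNew = matchesIgnoringWhitespace(newTextLayer, storedQuote);

console.log({ exactOld, exactNew, looseNew });
```

In this made-up case the stored quote matches the old text exactly, fails an exact match against the new text, and matches again once whitespace is ignored. Real documents may differ in more than whitespace, so this is only a starting point for testing.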

dwhly · 2021-07-01T14:16:42Z

> This release features improved text layer rendering

This has been an issue for so long. Completely awesome if this really is a substantial improvement.

Obviously we need to understand the impact any changes would have.

However, assuming that:

- in PDFs where whitespace was poorly handled before, some would now be rendered better,
- in some of those, the changes would be big enough that fuzzy anchoring could no longer gracefully re-anchor, and
- as a result, some not insignificant number of annotations would now be orphaned on those pages,

I think the decision should probably be to proceed anyway (assuming there isn't some magic solution, needing implementation, that would allow us both to proceed and to be able to successfully reanchor historic annotations).

We're still in a kind of happy early state where the large majority of annotations are freshly made on documents each semester, and neither students nor teachers are able to return to the ones they've made earlier in a prior course. That will soon change w/ course copy functionality (at some point) allowing teachers to copy forward annotations made as scaffolding on documents they teach regularly, and also any features that allow students to claim and preserve annotations they make during courses.

Obviously w/ > 25 million annotations now, made over the course of 7 years or so, there may be some pain, but moving towards better tech for the billions of annotations that will follow probably gets the vote.

mattdricker · 2021-07-01T15:52:46Z

Internal Slack convos for reference:
https://hypothes-is.slack.com/archives/C8TPC8XMK/p1622039652008500
https://hypothes-is.slack.com/archives/C8TPC8XMK/p1625076568000700

mattdricker · 2021-10-25T18:18:30Z

Solved with the update to the latest PDF.js: hypothesis/pdf.js-hypothes.is@0fc20ea

mattdricker added the accessibility label Jul 1, 2021

mattdricker closed this as completed Oct 25, 2021
