Document Identity Determination? #50

BigBlueHat · 2019-01-14T19:06:50Z

Curious to get thoughts from everyone on whether having document identification determination code would be useful for this project.

By "document identification determination" I mean the process of sorting out which one (or more!) identifiers should be stored as the target.

For instance:

GET /?utm_source=twitter&utm_medium=social HTTP/1.1
Host: example.com

<html>
<head>
  <base href="http://cdn.example.com/">
  <link rel="canonical" href="http://www.example.com/">
  <link rel="latest-version" href="index.html">
  <link rel="working-copy" href="newer.html">
  <link rel="ogp:url" href="https://www.example.com/">
  <link rel="schema:url" href="https://www.example.com/index.html">
</head>

The ?utm_-prefixed query params are typical marketing-bot tracking thingies.
The canonical rel is from https://tools.ietf.org/html/rfc6596
The latest-version and working-copy rels are from https://tools.ietf.org/html/rfc5829
The ogp:url is from http://ogp.me/
The schema:url is from http://schema.org/
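
If a tool wanted to drop the tracking params before storing the retrieved URL, a minimal sketch could look like the following. The function name `strip_utm_params` is hypothetical, and treating every `utm_`-prefixed param as safe to discard is an assumption, not something this project has decided.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_utm_params(url: str) -> str:
    """Return url without any query parameter whose name starts with 'utm_'."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith("utm_")
    ]
    # Rebuild the URL with only the surviving query parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_utm_params("http://example.com/?utm_source=twitter&utm_medium=social"))
```

This only normalizes the retrieved URL; it says nothing about whether that URL and the declared ones identify the same document.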

At some level all (or most) of these are the same (presumably 😉). However, determining their "sameness" is outside of the scope of an annotation tool (I'd reckon), but storing the right one (or more) is mandatory for the annotation to make sense.

What I'm wondering is whether we should provide a basic mechanism for retrieving these identifiers, i.e. detecting which ones are present and what value each might have for the annotation. At the very least, it would be handy to get back a list of all stated identifiers for the current document.
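
To make the "list of all stated identifiers" idea concrete, here is a rough sketch in Python using only the standard library. Everything named here (`IDENTIFIER_RELS`, `stated_identifiers`, the `"retrieved"` label) is hypothetical, and the set of rel values is simply the one from the example above. It collects the identifiers without judging which one is "right".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: only the rel values from the example above are of interest.
IDENTIFIER_RELS = {"canonical", "latest-version", "working-copy", "ogp:url", "schema:url"}


class _LinkCollector(HTMLParser):
    """Records the first <base href> and every <link> with a rel of interest."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.links = []  # (rel, href) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "link":
            # rel is a space-separated list of tokens.
            for rel in (attrs.get("rel") or "").lower().split():
                if rel in IDENTIFIER_RELS:
                    self.links.append((rel, href))


def stated_identifiers(retrieved_url, html):
    """Return (label, absolute URL) pairs: the retrieved URL, then each declared link."""
    collector = _LinkCollector()
    collector.feed(html)
    # Relative hrefs resolve against <base href> when the document declares one.
    base = urljoin(retrieved_url, collector.base_href) if collector.base_href else retrieved_url
    found = [("retrieved", retrieved_url)]
    found += [(rel, urljoin(base, href)) for rel, href in collector.links]
    return found


if __name__ == "__main__":
    page = """<html><head>
      <base href="http://cdn.example.com/">
      <link rel="canonical" href="http://www.example.com/">
      <link rel="latest-version" href="index.html">
    </head>"""
    for label, url in stated_identifiers("http://example.com/?utm_source=twitter", page):
        print(label, url)
```

Note that the relative `latest-version` href resolves against the `<base>` element rather than the retrieved URL, which is one more way the candidates can diverge.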

A real-world scenario (which I just tripped over): W3C Editor's Draft specs, whether served from GitHub URLs or hosted locally, have their future Technical Report (TR) URLs set as the rel="canonical" (which ReSpec injects after the page loads). Consequently, annotating the Verifiable Claims Data Model is hampered if only the canonical URL is stored, because the spec has not yet hit TR.

It's that "other" part of annotation creation that's so fun. 😁

💭's?


tilgovi · 2019-06-23T00:20:24Z

I have long thought that mozilla/fathom would be an interesting tool to use for such things.

Treora added the "discussion" label (Issues without a clear plan for action) · Jul 16, 2020
