Document Identity Determination? #50

Open
BigBlueHat opened this issue Jan 14, 2019 · 2 comments
Labels
discussion: Issues without a clear plan for action

Comments

@BigBlueHat
Member

Curious to get thoughts from everyone on whether having document identification determination code would be useful for this project.

By "document identification determination" I mean the process of sorting out which one (or more!) identifiers should be stored as the target.

For instance:

GET /?utm_source=twitter&utm_medium=social HTTP/1.1
Host: example.com
<html>
<head>
  <base href="http://cdn.example.com/">
  <link rel="canonical" href="http://www.example.com/">
  <link rel="latest-version" href="index.html">
  <link rel="working-copy" href="newer.html">
  <link rel="ogp:url" href="https://www.example.com/">
  <link rel="schema:url" href="https://www.example.com/index.html">
</head>

The ?utm_-prefixed query params are typical marketing-bot tracking thingies.
The canonical rel is from https://tools.ietf.org/html/rfc6596
The latest-version and working-copy rels are from https://tools.ietf.org/html/rfc5829
The ogp:url is from http://ogp.me/
The schema:url is from http://schema.org/
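As a rough illustration of the tracking-param point, here is a minimal sketch in Python (chosen only for brevity, not because the project uses it) that drops `utm_`-prefixed query params from a retrieved URL. The function name `strip_tracking_params` and the `prefix` parameter are hypothetical, and whether stripping is actually the right call is part of the open question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking_params(url: str, prefix: str = "utm_") -> str:
    """Return `url` without query params whose names start with `prefix`.

    Hypothetical helper: treating the `utm_` prefix as "tracking" is an
    assumption taken from the example request, not a standard.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith(prefix)
    ]
    # Rebuild the URL with only the surviving query params.
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = strip_tracking_params(
    "http://example.com/?utm_source=twitter&utm_medium=social"
)
print(cleaned)
```

Non-tracking params are kept in their original order, so a URL like `http://example.com/?page=2&utm_source=x` would keep `page=2`.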

At some level all (or most) of these are the same (presumably 😉). Determining their "sameness" is outside the scope of an annotation tool (I'd reckon), but storing the right one (or more) is mandatory for the annotation to make sense.

What I'm wondering is whether we should provide a basic retrieval mechanism that determines which of these identifiers exist and what value each might have for the annotation. At the very least it would be handy to get back a list of all stated identifiers for the current document.
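To make the "list of all stated identifiers" idea concrete, here is a minimal sketch using only the Python standard library (Python is just for illustration; the project itself may well do this against the live DOM instead). Everything named here (`IDENTIFIER_RELS`, `IdentifierCollector`, `stated_identifiers`) is hypothetical, and the set of rel values is simply the one from the example above. Relative hrefs are resolved against the `<base href>` when one is present.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the rel values from the example; a real tool would
# probably make this configurable.
IDENTIFIER_RELS = {
    "canonical", "latest-version", "working-copy", "ogp:url", "schema:url",
}


class IdentifierCollector(HTMLParser):
    """Collects the first <base href> and every <link> with a known rel."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.links = []  # (rel, raw href) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "link":
            for rel in (attrs.get("rel") or "").lower().split():
                if rel in IDENTIFIER_RELS:
                    self.links.append((rel, href))


def stated_identifiers(html, retrieved_url):
    """Return (source, absolute URL) pairs, starting with the retrieved URL."""
    collector = IdentifierCollector()
    collector.feed(html)
    # The document base is the <base href> resolved against the retrieved URL.
    base = urljoin(retrieved_url, collector.base_href or "")
    found = [("retrieved", retrieved_url)]
    found += [(rel, urljoin(base, href)) for rel, href in collector.links]
    return found


SAMPLE = """<html><head>
  <base href="http://cdn.example.com/">
  <link rel="canonical" href="http://www.example.com/">
  <link rel="latest-version" href="index.html">
  <link rel="working-copy" href="newer.html">
</head>"""

identifiers = stated_identifiers(
    SAMPLE, "http://example.com/?utm_source=twitter&utm_medium=social"
)
for source, url in identifiers:
    print(source, url)
```

The sketch only reports what the document states; it deliberately makes no attempt to decide which identifiers are "the same" or which one should be stored as the target.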

A real-world scenario (which I just tripped over): W3C Editor's Draft specs served from GitHub URLs (or hosted locally) have their future Technical Recommendation (TR) URLs set as the rel="canonical" (which is injected by ReSpec after the page loads). Consequently, annotating the Verifiable Claims Data Model is hampered if only the canonical URL is stored (because it hasn't yet hit TR).

It's that "other" part of annotation creation that's so fun. 😁

💭's?

@tilgovi
Contributor

tilgovi commented Jun 23, 2019

I have long thought that mozilla/fathom would be an interesting tool to use for such things.

@Treora added the discussion label (Issues without a clear plan for action) on Jul 16, 2020
