WIP: Document practices that may benefit from standardisation #54

Open
anjackson opened this issue Jul 3, 2019 · 3 comments

@anjackson

anjackson commented Jul 3, 2019

We have a few cases where different tools are implementing shared use cases in slightly different WARC record structures. The purpose of this issue is to collect information on these variations so we can at least document their usage and prevent any further unnecessary variation. Understanding current usage should also set the stage for standardisation.

Crawl-time rendering artefacts

A number of organisations are now running web browsers during the crawl, and this provides an opportunity to preserve more information about how a site looked at the time it was captured.

| WARC-Type | Content-Type | WARC-Target-URI | Tool |
| --- | --- | --- | --- |
| resource | application/pdf, text/html, image/png | urn:X-wpull:snapshot?url=<ENCODED_URL> | wpull (also stores a WARC-Concurrent-To pointer to a snapshot action metadata record) |
| resource | image/jpeg | screenshot:<CANONICAL_URL> | Brozzler code |
| resource | image/jpeg | thumbnail:<CANONICAL_URL> | Brozzler code |
| resource | image/jpeg | screenshot:<URL> | UKWA code |
| resource | application/pdf | pdf:<URL> | UKWA code |
| resource | image/jpeg | thumbnail:<URL> | UKWA code |
| resource | text/html; charset="utf-8" | imagemap:<URL> | UKWA code |
| resource | application/json | har:<URL> | UKWA code |
| resource | text/html | onreadydom:<URL> | UKWA code |
| resource | image/png | urn:view:<URL> | browsertrix-crawler |
| resource | image/png | urn:fullPage:<URL> | browsertrix-crawler |
| resource | image/jpeg | urn:thumbnail:<URL> | browsertrix-crawler |
| conversion | text/html; charset="utf-8" | <URL> | crocoite* |
| conversion | image/png | <URL> | crocoite* |
| | | | UMBRA? |

*Note that crocoite uses additional record headers to indicate the type of the conversion record, e.g. X-Crocoite-Type: dom-snapshot.
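To illustrate what a consumer of these records has to do today, here is a minimal sketch that splits a WARC-Target-URI into an artefact label and the page URL, using only the prefixes listed in the table above. The function name `split_artefact_uri` is hypothetical, and the sketch assumes that wpull's <ENCODED_URL> is a percent-encoded query value. The bare `screenshot:` and `thumbnail:` prefixes are shared by Brozzler and UKWA, so the tool cannot be inferred from the URI, and crocoite's conversion records use the plain page URL, so they are not recognised here at all.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs

WPULL_PREFIX = "urn:X-wpull:snapshot?"

# (prefix, artefact label) pairs copied from the table above.
PREFIXES = [
    ("urn:view:", "view"),
    ("urn:fullPage:", "fullPage"),
    ("urn:thumbnail:", "thumbnail"),
    ("screenshot:", "screenshot"),
    ("thumbnail:", "thumbnail"),
    ("pdf:", "pdf"),
    ("imagemap:", "imagemap"),
    ("har:", "har"),
    ("onreadydom:", "onreadydom"),
]


def split_artefact_uri(target_uri: str) -> Optional[Tuple[str, str]]:
    """Return (artefact label, page URL), or None if no listed prefix matches."""
    if target_uri.startswith(WPULL_PREFIX):
        # Assumption: <ENCODED_URL> is carried as a percent-encoded "url" parameter.
        urls = parse_qs(target_uri[len(WPULL_PREFIX):]).get("url")
        return ("snapshot", urls[0]) if urls else None
    for prefix, label in PREFIXES:
        if target_uri.startswith(prefix):
            return (label, target_uri[len(prefix):])
    # Plain URLs (e.g. crocoite conversion records) cannot be told apart by URI.
    return None
```

For example, `split_artefact_uri("urn:fullPage:https://example.com/")` yields the label `fullPage` and the page URL, while a plain `https://example.com/` yields `None`.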

Web A/V Capture

| WARC-Type | Content-Type | WARC-Target-URI | Tool |
| --- | --- | --- | --- |
| metadata | application/vnd.youtube-dl_formats+json | metadata://<AUTHORITY_AND_RESOURCE> | wpull, Heritrix3 ExtractorYoutubeDL module, Old Webrecorder |
| metadata | application/vnd.youtube-dl_formats+json;charset=utf-8 | youtube-dl:<CANONICAL_URL> | Brozzler code |
| resource | as found | youtube-dl:<PLAYLIST_INDEX>:<WEBPAGE_URL> | Brozzler code |
| | | | Webrecorder |
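The two Brozzler forms share the `youtube-dl:` prefix, so a reader has to tell them apart from what follows it. The sketch below shows one way to do that, assuming <PLAYLIST_INDEX> is a plain decimal integer (an assumption, not something the table confirms); a URL scheme must begin with a letter, so an all-digit first segment cannot be the start of a URL. The names `YoutubeDlUri` and `parse_youtube_dl_uri` are hypothetical.

```python
from typing import NamedTuple, Optional

PREFIX = "youtube-dl:"


class YoutubeDlUri(NamedTuple):
    playlist_index: Optional[int]  # None for the metadata-record form
    url: str


def parse_youtube_dl_uri(target_uri: str) -> Optional[YoutubeDlUri]:
    """Split a Brozzler-style youtube-dl: URI; return None for other schemes."""
    if not target_uri.startswith(PREFIX):
        return None
    rest = target_uri[len(PREFIX):]
    head, sep, tail = rest.partition(":")
    # Assumption: the playlist index is an ASCII decimal integer.
    if sep and head.isascii() and head.isdigit():
        return YoutubeDlUri(int(head), tail)
    return YoutubeDlUri(None, rest)
```

The `metadata://<AUTHORITY_AND_RESOURCE>` form used by wpull and Heritrix3 is deliberately not handled, so it returns `None`.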

Crawl Logs

At UKWA we consider our crawl logs to be important artefacts, but we don't put them in WARC. Maybe we should?

| WARC-Type | Content-Type | WARC-Target-URI | Tool |
| --- | --- | --- | --- |
| resource | text/plain | urn:X-wpull:log | wpull |
| metadata | application/json | ? | crocoite |
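As a rough illustration of what storing a crawl log in a WARC could look like, here is a minimal sketch that serialises a log as a single uncompressed WARC/1.0 resource record, borrowing the `text/plain` and `urn:X-wpull:log` values from the wpull row above. It is not how wpull itself writes the record; the function name `build_log_record` is hypothetical, and in practice a WARC library and per-record gzip compression would normally be used.

```python
import uuid
from datetime import datetime, timezone


def build_log_record(log_text: str, target_uri: str = "urn:X-wpull:log") -> bytes:
    """Serialise a crawl log as one uncompressed WARC/1.0 resource record."""
    block = log_text.encode("utf-8")
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(block))),  # length of the block in bytes
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record is: version line and headers, a blank line, the block, two CRLFs.
    return head.encode("utf-8") + block + b"\r\n\r\n"
```

The returned bytes can be appended to an existing uncompressed WARC file, for example with `open(path, "ab").write(build_log_record(log_text))`.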

EDIT 2023-10-18: Updated with notes from comments.

@PromyLOPh

For reference, crocoite uses conversion records to store screenshots and DOM snapshots, and metadata records to store log entries; see https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L216 and https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L201 and https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L236

@tw4l

tw4l commented May 11, 2023

Updating with Webrecorder's current practices for screenshots:

Crawl-time rendering artefacts

| WARC-Type | Content-Type | WARC-Target-URI | Tool |
| --- | --- | --- | --- |
| resource | image/png | urn:view:<URL> | browsertrix-crawler |
| resource | image/png | urn:fullPage:<URL> | browsertrix-crawler |
| resource | image/jpeg | urn:thumbnail:<URL> | browsertrix-crawler |

@anjackson

I've attempted to update this with the information you provided, @PromyLOPh @tw4l.
