WIP: Document practices that may benefit from standardisation #54

anjackson · 2019-07-03T09:16:52Z

We have a few cases where different tools are implementing shared use cases in slightly different WARC record structures. The purpose of this issue is to collect information on these variations so we can at least document their usage and prevent any further unnecessary variation. Understanding current usage should also set the stage for standardisation.

Crawl-time rendering artefacts

A number of organisations are now running web browsers during the crawl, and this provides an opportunity to preserve more information about how a site looked at the time it was captured.

WARC-Type	Content-Type	WARC-Target-URI	Tool
`resource`	`application/pdf`, `text/html`, `image/png`	`urn:X-wpull:snapshot?url=<ENCODED_URL>`	wpull (also stores a `WARC-Concurrent-To` pointer to a snapshot action metadata record)
`resource`	`image/jpeg`	`screenshot:<CANONICAL_URL>`	Brozzler code
`resource`	`image/jpeg`	`thumbnail:<CANONICAL_URL>`	Brozzler code
`resource`	`image/jpeg`	`screenshot:<URL>`	UKWA code
`resource`	`application/pdf`	`pdf:<URL>`	UKWA code
`resource`	`image/jpeg`	`thumbnail:<URL>`	UKWA code
`resource`	`text/html; charset="utf-8"`	`imagemap:<URL>`	UKWA code
`resource`	`application/json`	`har:<URL>`	UKWA code
`resource`	`text/html`	`onreadydom:<URL>`	UKWA code
`resource`	`image/png`	`urn:view:<URL>`	browsertrix-crawler
`resource`	`image/png`	`urn:fullPage:<URL>`	browsertrix-crawler
`resource`	`image/jpeg`	`urn:thumbnail:<URL>`	browsertrix-crawler
`conversion`	`text/html; charset="utf-8"`	`<URL>`	crocoite*
`conversion`	`image/png`	`<URL>`	crocoite*
			UMBRA?

*Note that crocoite uses additional record headers to indicate the type of the conversion record, e.g. `X-Crocoite-Type: dom-snapshot`.
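
To illustrate how much the conventions in the table diverge, here is a minimal Python sketch that maps a `WARC-Target-URI` back to the tool and artefact kind using only the prefixes listed above. The function name `classify_target_uri` and the `PREFIXES` list are hypothetical, and the sketch assumes wpull's `<ENCODED_URL>` is percent-encoded. crocoite cannot be identified this way, because its conversion records use the plain page URL.

```python
from urllib.parse import unquote

# (prefix, tool, artefact kind), copied from the table above.
# "Brozzler or UKWA" reflects that both tools share the same prefix.
PREFIXES = [
    ("urn:X-wpull:snapshot?url=", "wpull", "snapshot"),
    ("urn:view:", "browsertrix-crawler", "view"),
    ("urn:fullPage:", "browsertrix-crawler", "fullPage"),
    ("urn:thumbnail:", "browsertrix-crawler", "thumbnail"),
    ("screenshot:", "Brozzler or UKWA", "screenshot"),
    ("thumbnail:", "Brozzler or UKWA", "thumbnail"),
    ("pdf:", "UKWA", "pdf"),
    ("imagemap:", "UKWA", "imagemap"),
    ("har:", "UKWA", "har"),
    ("onreadydom:", "UKWA", "onreadydom"),
]


def classify_target_uri(target_uri):
    """Return (tool, kind, page_url), or None if no known prefix matches."""
    for prefix, tool, kind in PREFIXES:
        if target_uri.startswith(prefix):
            page_url = target_uri[len(prefix):]
            if tool == "wpull":
                # Assumption: wpull percent-encodes the page URL.
                page_url = unquote(page_url)
            return tool, kind, page_url
    return None
```

A reader that wants to find "the screenshot for this page" currently needs a lookup of this kind for every tool, which is the variation this issue is trying to document.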

Web A/V Capture

WARC-Type	Content-Type	WARC-Target-URI	Tool
`metadata`	`application/vnd.youtube-dl_formats+json`	`metadata://<AUTHORITY_AND_RESOURCE>`	wpull, Heritrix3 ExtractorYoutubeDL module, Old Webrecorder
`metadata`	`application/vnd.youtube-dl_formats+json;charset=utf-8`	`youtube-dl:<CANONICAL_URL>`	Brozzler code
`resource`	as found	`youtube-dl:<PLAYLIST_INDEX>:<WEBPAGE_URL>`	Brozzler code
		Webrecorder

Crawl Logs

At UKWA we consider our crawl logs to be important artefacts, but we don't put them in WARC. Maybe we should?

WARC-Type	Content-Type	WARC-Target-URI	Tool
`resource`	`text/plain`	`urn:X-wpull:log`	wpull
`metadata`	`application/json`	?	crocoite
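
As a rough sketch of what putting a crawl log in a WARC could look like, the following Python builds a single uncompressed `resource` record by hand, following the wpull row above. The function name `build_log_record` is hypothetical, the record carries only a minimal set of headers, and a real tool would normally use a WARC library and add digests and per-record compression.

```python
import uuid
from datetime import datetime, timezone


def build_log_record(log_text, target_uri="urn:X-wpull:log"):
    """Serialise a crawl log as one WARC/1.0 resource record (bytes)."""
    block = log_text.encode("utf-8")
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(block))),
    ]
    # Version line, header lines and the blank separator all end in CRLF.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record is terminated by two CRLFs after the block.
    return head.encode("utf-8") + block + b"\r\n\r\n"
```

Switching to the crocoite approach would mean changing `WARC-Type` to `metadata` and `Content-Type` to `application/json`; the target URI used in that case is not recorded in the table.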

EDIT 2023-10-18: Updated with notes from comments.


PromyLOPh · 2019-07-06T13:50:42Z

For reference, crocoite uses `conversion` records to store screenshots and DOM snapshots, and `metadata` records to store log entries; see https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L216 and https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L201 and https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L236

tw4l · 2023-05-11T13:26:54Z

Updating with Webrecorder's current practices for screenshots:

Crawl-time rendering artefacts

WARC-Type	Content-Type	WARC-Target-URI	Tool
`resource`	`image/png`	`urn:view:<URL>`	browsertrix-crawler
`resource`	`image/png`	`urn:fullPage:<URL>`	browsertrix-crawler
`resource`	`image/jpeg`	`urn:thumbnail:<URL>`	browsertrix-crawler

anjackson · 2023-10-18T07:39:47Z

I've attempted to update this with the information you provided, @PromyLOPh @tw4l.

ikreymer mentioned this issue Oct 4, 2023

Improved Text Extraction, stored to WARC webrecorder/browsertrix-crawler#403

Closed
