We have a few cases where different tools are implementing shared use cases in slightly different WARC record structures. The purpose of this issue is to collect information on these variations so we can at least document their usage and prevent any further unnecessary variation. Understanding current usage should also set the stage for standardisation.
Crawl-time rendering artefacts
A number of organisations are now running web browsers during the crawl, and this provides an opportunity to preserve more information about how a site looked at the time it was captured.
| WARC-Type | Content-Type | WARC-Target-URI | Notes |
| --- | --- | --- | --- |
| `resource` | `application/pdf`, `text/html`, `image/png` | `urn:X-wpull:snapshot?url=<ENCODED_URL>` | Also stores a `WARC-Concurrent-To` pointer to a snapshot action metadata record |
| `resource` | `image/jpeg` | `screenshot:<CANONICAL_URL>` | |
| `resource` | `image/jpeg` | `thumbnail:<CANONICAL_URL>` | |
| `resource` | `image/jpeg` | `screenshot:<URL>` | |
| `resource` | `application/pdf` | `pdf:<URL>` | |
| `resource` | `image/jpeg` | `thumbnail:<URL>` | |
| `resource` | `text/html; charset="utf-8"` | `imagemap:<URL>` | |
| `resource` | `application/json` | `har:<URL>` | |
| `resource` | `text/html` | `onreadydom:<URL>` | |
| `resource` | `image/png` | `urn:view:<URL>` | |
| `resource` | `image/png` | `urn:fullPage:<URL>` | |
| `resource` | `image/jpeg` | `urn:thumbnail:<URL>` | |
| `conversion` | `text/html; charset="utf-8"` | `<URL>` | |
| `conversion` | `image/png` | `<URL>` | |
Note that crocoite uses additional record headers to indicate the type of the conversion record, e.g. `X-Crocoite-Type: dom-snapshot`.
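To illustrate what this variation means for anyone reading these records back, here is a minimal sketch of a helper that maps a `WARC-Target-URI` from the list above to the page URL it refers to. The function name `split_target_uri` is hypothetical, the prefix list only covers the patterns collected in this issue, and it assumes `<ENCODED_URL>` in the wpull form is percent-encoded.

```python
from urllib.parse import unquote

# wpull puts the page URL in a query parameter; assumed to be percent-encoded.
WPULL_SNAPSHOT = "urn:X-wpull:snapshot?url="

# Prefix-style patterns collected in this issue (not an agreed vocabulary).
PREFIXES = (
    "urn:fullPage:",
    "urn:thumbnail:",
    "urn:view:",
    "onreadydom:",
    "screenshot:",
    "thumbnail:",
    "imagemap:",
    "har:",
    "pdf:",
)


def split_target_uri(target_uri):
    """Return (kind, page_url); kind is None when no known prefix matches."""
    if target_uri.startswith(WPULL_SNAPSHOT):
        encoded = target_uri[len(WPULL_SNAPSHOT):]
        return "urn:X-wpull:snapshot", unquote(encoded)
    for prefix in PREFIXES:
        if target_uri.startswith(prefix):
            # The kind is the prefix without its trailing colon.
            return prefix.rstrip(":"), target_uri[len(prefix):]
    return None, target_uri


print(split_target_uri("urn:fullPage:https://example.com/"))
```

The `conversion` records use the plain page URL as their target, so a helper like this cannot tell them apart from ordinary captures; that distinction has to come from `WARC-Type` and any tool-specific headers.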
Web A/V Capture
| WARC-Type | Content-Type | WARC-Target-URI |
| --- | --- | --- |
| `metadata` | `application/vnd.youtube-dl_formats+json` | `metadata://<AUTHORITY_AND_RESOURCE>` |
| `metadata` | `application/vnd.youtube-dl_formats+json;charset=utf-8` | `youtube-dl:<CANONICAL_URL>` |
| `resource` | | `youtube-dl:<PLAYLIST_INDEX>:<WEBPAGE_URL>` |
Crawl Logs
At UKWA we consider our crawl logs to be important artefacts, but we don't put them in WARC. Maybe we should?
| WARC-Type | Content-Type | WARC-Target-URI |
| --- | --- | --- |
| `resource` | `text/plain` | `urn:X-wpull:log` |
| `metadata` | `application/json` | |
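As a rough picture of what the first variant above could look like on disk, here is a minimal sketch that serialises a crawl log as a WARC/1.1 `resource` record using only the Python standard library. The helper name `build_resource_record` and the sample log line are hypothetical, and the output is a single uncompressed record; real tools usually gzip each record and may add further headers such as digests.

```python
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri, content_type, payload):
    """Serialise one uncompressed WARC/1.1 resource record as bytes."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", now),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.1\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    head += "\r\n"  # blank line separates the header block from the payload
    # Each record is terminated by two CRLFs after the payload.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Hypothetical log content, for illustration only.
log_text = "example crawl log line\n".encode("utf-8")
record = build_resource_record("urn:X-wpull:log", "text/plain", log_text)
print(record.decode("utf-8"))
```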
EDIT 2023-10-18: Updated with notes from comments.