Add WARC support #128

stuartyeates · 2024-03-17T21:08:29Z

WARC support would be great. It's used by at-scale web archives across the world as the standard file format for web archiving. More information at https://en.wikipedia.org/wiki/WARC_(file_format)

Most Linux distros ship wget, recent versions of which can generate one flavour of WARC file using the --warc-file=file argument.

mxmlnkn · 2024-03-19T09:07:42Z

Thanks for your suggestion. It seems that libarchive has supported WARC since 2014, but when I tried to mount a WARC file with archivemount or fuse-archive, the mount point was empty. If libarchive works in general, then adding a libarchive backend would also implement this, but the problem with archivemount doesn't bode well. The performance with libarchive wouldn't be optimal anyway because its interface is not designed for random access.

By default, wget also compresses each record individually with gzip, which is very well-behaved for random access via rapidgzip. It should be fast and the index should be small.
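To illustrate why record-wise gzip compression is friendly to random access, here is a minimal sketch using only the Python standard library (not rapidgzip). It assumes the file is a plain concatenation of gzip members, one per record; the two "records" are made-up stand-ins, and `member_offsets` and `read_member` are hypothetical helper names.

```python
import gzip
import zlib

# Made-up stand-ins for WARC records, each compressed as its own gzip member.
records = [b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n",
           b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"]
archive = b"".join(gzip.compress(record) for record in records)


def member_offsets(data: bytes) -> list[int]:
    """Find the start offset of every gzip member by decompressing once."""
    offsets = []
    offset = 0
    while offset < len(data):
        offsets.append(offset)
        decompressor = zlib.decompressobj(wbits=31)  # 31: expect a gzip wrapper
        decompressor.decompress(data[offset:])
        # unused_data holds everything after the end of the current member.
        offset = len(data) - len(decompressor.unused_data)
    return offsets


def read_member(data: bytes, offset: int) -> bytes:
    """Decompress only the member that starts at the given offset."""
    return zlib.decompressobj(wbits=31).decompress(data[offset:])


offsets = member_offsets(archive)
second_record = read_member(archive, offsets[1])
```

The index here is just one integer per record, and reading the second record never touches the compressed bytes of the first.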

The Common Crawl dataset is also served as warc.gz and would be a very strong use case for performant access to this format.

The format itself looks simple enough. It is reminiscent of TAR in that way. For example, I tried wget "http://www.archiveteam.org/" --warc-file="archiveteam" and the excerpt looks like this:

WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2024-03-19T08:28:23Z
WARC-Record-ID: <urn:uuid:f58a9100-9f87-4d55-bcde-26ab3e6f24e3>
WARC-Filename: warc.warc.gz
WARC-Block-Digest: sha1:UBNRIW2HYDQJPWRCF62TBARPPVGRURDX
Content-Length: 230

software: Wget/1.21.3 (linux-gnu)
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
robots: classic
wget-arguments: "http://www.archiveteam.org/" "--warc-file=warc" 



WARC/1.0
WARC-Type: request
WARC-Target-URI: <http://www.archiveteam.org/>
Content-Type: application/http;msgtype=request
WARC-Date: 2024-03-19T08:28:24Z
WARC-Record-ID: <urn:uuid:5b442677-ad1a-4e4e-9c9d-c4b3df9714ca>
WARC-IP-Address: 213.184.85.58
WARC-Warcinfo-ID: <urn:uuid:f58a9100-9f87-4d55-bcde-26ab3e6f24e3>
WARC-Block-Digest: sha1:FVU7DNIUKG52UCVEDHDW6HFHPAN5VBXK
Content-Length: 134

GET / HTTP/1.1
Host: www.archiveteam.org
User-Agent: Wget/1.21.3
Accept: */*
Accept-Encoding: identity
Connection: Keep-Alive



WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:d6066ba8-7de3-40c4-b6eb-3802e16b7052>
WARC-Warcinfo-ID: <urn:uuid:f58a9100-9f87-4d55-bcde-26ab3e6f24e3>
WARC-Concurrent-To: <urn:uuid:5b442677-ad1a-4e4e-9c9d-c4b3df9714ca>
WARC-Target-URI: <http://www.archiveteam.org/>
WARC-Date: 2024-03-19T08:28:24Z
WARC-IP-Address: 213.184.85.58
WARC-Block-Digest: sha1:PZCODAG3UOXR5KOPYW6I2CTKYOF5GEOJ
WARC-Payload-Digest: sha1:WKTQ7MWYCDFBHGCYQLTJQHZTTGECH2B6
Content-Type: application/http;msgtype=response
Content-Length: 939

HTTP/1.1 301 Moved Permanently
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100
content-type: text/html
content-length: 707
date: Tue, 19 Mar 2024 08:28:23 GMT
server: LiteSpeed
location: https://www.archiveteam.org/

<!DOCTYPE html>
<html style="height:100%">
<head>
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no" />
<title> 301 Moved Permanently
</title></head>
<body style="color: #444; margin:0;font: normal 14px/20px Arial, Helvetica, sans-serif; height:100%; background-color: #fff;">
<div style="height:auto; min-height:100%; ">     <div style="text-align: center; width:800px; margin-left: -400px; position:absolute; top: 30%; left:50%;">
        <h1 style="margin:0; font-size:150px; line-height:150px; font-weight:bold;">301</h1>
<h2 style="margin-top:20px;font-size: 30px;">Moved Permanently
</h2>
<p>The document has been permanently moved.</p>
</div></div></body></html>

It has all the necessary information such as date, URI, and content length. So yeah, similar to TAR, we could simply collect the offsets for each entry and jump to it. This would avoid parsing the archive from the beginning, which would have to be done with a libarchive backend.
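A rough sketch of that offset collection for an uncompressed WARC stream could look like the following. It is not ratarmount code: `index_warc` and `make_record` are hypothetical names, the sample records are synthetic, and it assumes every record carries a valid Content-Length header.

```python
import io


def index_warc(stream):
    """Return (content_offset, content_length, headers) for each record."""
    entries = []
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.strip():
            continue  # blank lines that terminate the previous record
        if not line.startswith(b"WARC/"):
            raise ValueError(f"Expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        entries.append((stream.tell(), length, headers))
        stream.seek(length, io.SEEK_CUR)  # skip the content without reading it
    return entries


def make_record(record_type: str, body: bytes) -> bytes:
    """Build a tiny synthetic record for demonstration purposes only."""
    header = f"WARC/1.0\r\nWARC-Type: {record_type}\r\nContent-Length: {len(body)}\r\n\r\n"
    return header.encode() + body + b"\r\n\r\n"


bodies = [b"GET / HTTP/1.1\r\n\r\n", b"HTTP/1.1 200 OK\r\n\r\nHello"]
sample = make_record("request", bodies[0]) + make_record("response", bodies[1])
entries = index_warc(io.BytesIO(sample))
```

With the offsets stored, reading an entry is a single seek plus a read of `content_length` bytes, the same way it works for TAR members.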

However, this dump already shows multiple problems. Do you have any opinion, expectation, or precedent as to how the mounted view should look?

- The WARC records contain a target URI, which could be used as the mount file path, e.g., www.archiveteam.org. But in this case www.archiveteam.org would be both a redirect and an HTML file. And if more URIs were crawled, it would also have to be a folder...
  - I guess URIs that should be folders could be remapped to something like index.html, assuming the content is HTML, which might not always be the case.
  - In order to make both the redirect and the HTML itself available at the mount point, the file versions API could be used, i.e., access the redirect via: mounted/www.archiveteam.org/index.html.versions/1.
- How to handle the protocol in the URI? Simply strip it? Put it into subfolders, e.g., mounted/https, mounted/http, and mounted/metadata? The metadata:// URIs especially necessitate a feature like this. Alternatively, metadata records could simply not be shown.
- How to handle the WARC requests? Simply drop them? Show them via the file versions API?
- What about the other metadata such as WARC-Record-ID? Should these be exposed via the FUSE mount somehow? How? It might be possible to return them as POSIX extended attributes. I have never used those but they seem to be free-form and supported by FUSE.
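To make one of these options concrete, here is a sketch of the "protocol as subfolder, folder-like URIs become index.html" mapping. Nothing here is decided or implemented; `uri_to_mount_path` is a hypothetical name, query strings are ignored, and the metadata:// URI is only an example input.

```python
import posixpath
from urllib.parse import urlsplit


def uri_to_mount_path(uri: str) -> str:
    """Map a WARC-Target-URI to a relative path below the mount point."""
    parts = urlsplit(uri.strip("<>"))  # wget writes the URI in angle brackets
    scheme = parts.scheme or "unknown"
    path = parts.path.lstrip("/")
    if not path or path.endswith("/"):
        # Assumption: folder-like URIs are remapped to index.html.
        path += "index.html"
    return posixpath.join(scheme, parts.netloc, path)


examples = [
    uri_to_mount_path("<http://www.archiveteam.org/>"),
    uri_to_mount_path("https://github.com/mxmlnkn/ratarmount/issues/128"),
    uri_to_mount_path("metadata://gnu.org/software/wget/warc/MANIFEST.txt"),
]
```

This does not resolve the case where a URI without a trailing slash is both a page and a prefix of other URIs, nor does it distinguish the redirect from the final HTML; those would still need something like the file versions API.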

I guess that these conceptual problems are the reason why archivemount and fuse-archive don't work.

Alternatively, each WARC record could simply be exposed as a file named after its index, counting from 0, or maybe even better, after its WARC record UUID. The mount point would then contain no hierarchy and possibly hundreds of thousands of files with cryptic names. The URI would also be exposed via the extended file attributes if that works. This would save a lot of complexity and assumptions on the ratarmount side. Would that be an option for you?

stuartyeates · 2024-03-19T19:44:23Z

I don't have all the answers, but:
(a) the most widely known/used interface to WARC files is the wayback machine, so any choice you make that roughly corresponds to what they do will at least be understood by most WARC-aware users. For example the URL of this page, as harvested this morning is: https://web.archive.org/web/20240319193506/https://github.com/mxmlnkn/ratarmount/issues/128#issuecomment-2006417057
(b) If there are multiple possible use cases, is there potential to expose both the single directory and the file hierarchy?
(c) I'll ask around what people want. See https://cloudisland.nz/@stuartyeates/112124114323588585
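As a sketch of what point (a) could mean for a mount layout, the following turns a record's WARC-Date and target URI into a path shaped like the Wayback Machine URL above (a 14-digit timestamp followed by the original URL). The function name is hypothetical, and whether ratarmount would adopt such a layout is an open question.

```python
from datetime import datetime


def wayback_style_path(warc_date: str, uri: str) -> str:
    """Build 'web/<YYYYMMDDhhmmss>/<original URI>' from WARC header values."""
    # Assumes the second-precision UTC format that wget writes to WARC-Date.
    timestamp = datetime.strptime(warc_date, "%Y-%m-%dT%H:%M:%SZ")
    return f"web/{timestamp:%Y%m%d%H%M%S}/{uri.strip('<>')}"


path = wayback_style_path("2024-03-19T08:28:24Z", "<http://www.archiveteam.org/>")
```

One side effect of this layout is that several captures of the same URI get distinct paths, which sidesteps part of the versioning question.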

jackdos · 2024-03-22T09:24:39Z

There are well-established index formats for WARCs that do what you're describing, i.e., collecting offsets for the various pieces of content. They are the basis of how the wayback machine works (the technology, not to be confused with the Internet Archive, a service using similar technology). CDX indexes are one text-based way of doing this, although I know that there was also a BDB format in use at some point. Webrecorder have a tool for generating these: CDXJ-Indexer.
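For readers who have not seen such an index, here is a simplified sketch of what a CDXJ-style line carries: a sortable URL key, a timestamp, and a JSON object with the file name, offset, and length. It does not use CDXJ-Indexer, the key is only a rough approximation of real SURT canonicalization, and the offset, length, and file name values are made up.

```python
import json
from datetime import datetime
from urllib.parse import urlsplit


def surt_like_key(uri: str) -> str:
    """Reverse the host labels so that sorting groups URLs by domain."""
    parts = urlsplit(uri)  # assumes an http(s) URI with a host name
    host = ",".join(reversed(parts.hostname.lower().split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{host}){path}"


def cdxj_like_line(uri, warc_date, offset, length, filename):
    """Format one index line: '<key> <timestamp> <json>'."""
    timestamp = datetime.strptime(warc_date, "%Y-%m-%dT%H:%M:%SZ")
    fields = {"url": uri, "offset": str(offset), "length": str(length),
              "filename": filename}
    return f"{surt_like_key(uri)} {timestamp:%Y%m%d%H%M%S} {json.dumps(fields)}"


# Hypothetical offset and length values, for illustration only.
line = cdxj_like_line("http://www.archiveteam.org/", "2024-03-19T08:28:24Z",
                      1234, 567, "archiveteam.warc.gz")
```

Because the lines are plain sorted text, a lookup is a binary search for the key followed by a seek to the recorded offset in the named file.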

You might also want to check out the concept of WACZ which bundles the index and the warc(s) into a single zip file.

The thing I would caution to bear in mind is that WARCs don't generally contain traditional file system resources. It was probably true in the early days of the web that websites were reflections of some physical filesystem layout on a server, served largely as static content, but that hasn't been true for quite some time. Websites today are more like applications. What you're getting in a WARC (at least to the extent that you're using them as Web-ARChives and not as a generic content + metadata container, which I know some people do) is a full set of requests and responses made when crawling a particular site. Some of those requests are for resources that you could map onto a filesystem-like structure, but lots of them aren't, so it would be worth bearing in mind what those resources even mean in the context of a mount point like this.

Hope this helps.

mxmlnkn · 2024-03-31T17:13:11Z

While working on #109 / #130, I have reached a state that can mount WARC files with libarchive. Without any special treatment, the file hierarchy for hello-world.warc provided by libarchive looks like this:

python3 ratarmount.py -f -d 3 tests/hello-world.warc mounted
tree mounted

Output:

mounted
└── warc-specifications
    └── primers
        └── web-archive-formats
            └── hello-world.txt

3 directories, 1 file

And the file contents of hello-world.txt:

HTTP/1.1 200 OK
Server: GitHub.com
Content-Type: text/plain; charset=utf-8
Last-Modified: Wed, 08 Jul 2015 21:53:08 GMT
Access-Control-Allow-Origin: *
Expires: Wed, 08 Jul 2015 22:05:13 GMT
Cache-Control: max-age=600
Content-Length: 13
Accept-Ranges: bytes
Date: Wed, 08 Jul 2015 21:55:13 GMT
Via: 1.1 varnish
Age: 0
Connection: keep-alive
X-Served-By: cache-lcy1127-LCY
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1436392513.648949,VS0,VE165
Vary: Accept-Encoding

Hello World

Trying to mount the test file created with wget "http://www.archiveteam.org/" --warc-file="archiveteam" results in:

mounted

0 directories, 0 files

Adding debug output also shows nothing and there seem to be no errors, i.e., libarchive behaves as if the file were an empty archive. I'd have to check the libarchive implementation source to see why this happens. Maybe it is because, as @jackdos said, none of the WARC records can be mapped onto a filesystem-like structure.
