
Creating and viewing WARC web archives in Servo

Alan Jeffrey edited this page Feb 23, 2018 · 4 revisions

Why?

WARC web archives are a de facto standard for archiving web content. They are the storage format of the Internet Archive's Wayback Machine and are supported by the Library of Congress.

Servo can use web archives for repeatable performance testing against real-world web content, in particular sites with a large amount of third-party ad content. Web archives can be replayed through an HTTP proxy that has no live access to the internet, which provides a stable platform for performance benchmarking.

This note describes how to record and replay web archives, using Servo as the web engine.

The Servo WARC tests use this workflow to produce a dashboard of Servo performance on archived web content.

Installing

The pywb tools support recording and playback of web archives. They can be run using Python 2 or 3, and can be installed using pip (inside a virtualenv if preferred):

pip install git+https://github.com/ikreymer/pywb.git

Docs are at https://pywb.readthedocs.io.

Creating a web archive

To create a web archive, first initialize pywb using wb-manager:

wb-manager init archives

Then start the http server, giving it access to the live internet, and asking it to record and index an archive:

wayback --live --record --autoindex

You can now browse the web, and pages will be recorded in your archives. For example (from the servo build directory):

./mach run -r http://localhost:8080/archives/record/https://nytimes.com/

Once enough of the page is visible, you can quit Servo and the wayback server. This should have created an archive file:

ls collections/archives/archive
rec-TIMESTAMP-MACHINE.warc.gz
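
To check what was actually captured, one option is to list the URLs stored in the archive. The following is a minimal sketch and not part of pywb: the helper name list_warc_uris is hypothetical, and it assumes the archive is a gzip-compressed WARC file whose records carry the standard WARC-Target-URI header. It is a plain text search, so a payload that happens to contain such a line would also be listed.

```shell
# list_warc_uris FILE
# Hypothetical helper (not a pywb command): print the distinct target URIs
# found in a gzip-compressed WARC file, one per line.
list_warc_uris() {
    gzip -dc "$1" |                      # decompress every gzip member in the file
        tr -d '\r' |                     # WARC header lines end in CRLF; drop the CR
        grep -a '^WARC-Target-URI: ' |   # -a: treat binary payloads as text
        sed 's/^WARC-Target-URI: //' |   # keep only the URI
        sort -u                          # request/response pairs share a URI
}
```

For example: list_warc_uris collections/archives/archive/rec-TIMESTAMP-MACHINE.warc.gz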

Replaying a web archive

To replay a web archive already in your collection, first start the wayback server:

wayback

Then view the archived content:

./mach run -r http://localhost:8080/archives/https://nytimes.com/
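
The record and replay URLs used so far share one pattern: the wayback server address, the collection name, an optional record/ segment, and then the original URL. A small sketch of that pattern follows. It assumes the port 8080 seen in the examples, and the function name wayback_url is hypothetical, not a pywb command.

```shell
# wayback_url MODE COLLECTION URL [PORT]
# Hypothetical helper: MODE is "record" or "replay".
# Prints the URL to pass to Servo.
wayback_url() {
    mode=$1
    collection=$2
    url=$3
    port=${4:-8080}   # assumed default, matching the examples in this note
    case $mode in
        record) printf 'http://localhost:%s/%s/record/%s\n' "$port" "$collection" "$url" ;;
        replay) printf 'http://localhost:%s/%s/%s\n' "$port" "$collection" "$url" ;;
        *) echo "wayback_url: unknown mode: $mode" >&2; return 1 ;;
    esac
}
```

For example: ./mach run -r "$(wayback_url replay archives https://nytimes.com/)"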

To replay a web archive recorded elsewhere, first add it to your collection:

wb-manager add archives some-other-archive.warc.gz

Replaying a web archive as an http proxy

Replaying archives through rewritten URLs, as in the previous section, is simple, but has some problems:

  • It relies on URL rewriting to add the http://localhost:8080/archives/ prefix to every loaded resource. Rewriting does not catch everything, in particular URLs constructed dynamically by JavaScript.

  • Any URLs that are not rewritten will be fetched from the live internet, so content is not guaranteed to be the same between runs.

  • Since rewritten URLs are all under localhost, Servo considers them all same-origin and executes them in the same content thread, losing much of the benefit of concurrency.
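
The same-origin problem in the last point can be seen by comparing origins before and after rewriting. In this sketch the helper origin_of and the host ads.example.com are made up for illustration; only the http://localhost:8080/archives/ prefix comes from the setup described here.

```shell
# origin_of URL
# Hypothetical helper: print the origin (scheme://host[:port]) of a URL.
origin_of() {
    printf '%s\n' "$1" | sed -E 's#^([A-Za-z][A-Za-z0-9+.-]*://[^/]+).*#\1#'
}

prefix=http://localhost:8080/archives/

# Print "original origin -> origin after rewriting" for two different sites.
for url in https://nytimes.com/ https://ads.example.com/ad.js; do
    printf '%s -> %s\n' "$(origin_of "$url")" "$(origin_of "$prefix$url")"
done
```

The two sites start with different origins, but after rewriting both report http://localhost:8080.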

Fortunately, pywb also supports replaying via an HTTP proxy, which removes the need for URL rewriting, since all content is delivered via the proxy.

wayback --proxy archives

Now, as well as serving URL-rewritten content on localhost, the wayback server also acts as an HTTP proxy, serving the original content without any URL rewriting. Unfortunately, a few extra steps are needed before Servo can view this content:

  • Servo does not support HTTP proxies, so it needs to be proxified. On Linux this can be done with the proxychains command (installed on Debian-based systems with apt-get install proxychains).

  • Any https content is served using a certificate signed with a key stored in proxy-certs/pywb-ca.pem. This certificate must be added as a root certificate for Servo.

To run Servo with proxychains, first create a proxychains.conf file:

[ProxyList]
http 127.0.0.1 8080

then run:

proxychains ./mach run -r --certificate-path proxy-certs/pywb-ca.pem https://nytimes.com/

This will view the archived content without any URL rewriting.

Unfortunately, there is a gotcha. Content can be proxied over HTTP in two ways, using CONNECT or using GET; proxychains uses CONNECT, but pywb only supports GET for plain http content. Fortunately, https content is only ever proxied using CONNECT, so this technique works for https content (which these days is most of the web).
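
The difference between the two proxying styles shows up in the first line a client sends to the proxy. The sketch below only prints those request lines for a given URL; the function names are hypothetical and the code does not talk to pywb or proxychains. It assumes the URL has no explicit port, so the default port for the scheme is used.

```shell
# proxy_get_line URL
# Request line for GET-style proxying: the full URL is sent to the proxy.
proxy_get_line() {
    printf 'GET %s HTTP/1.1\n' "$1"
}

# proxy_connect_line URL
# Request line for CONNECT-style proxying: only host:port is sent, and the
# proxy is asked to open a tunnel to it.
proxy_connect_line() {
    case $1 in
        https://*) port=443 ;;
        http://*)  port=80 ;;
        *) echo "proxy_connect_line: unsupported URL: $1" >&2; return 1 ;;
    esac
    host=${1#*://}     # strip the scheme
    host=${host%%/*}   # strip the path
    printf 'CONNECT %s:%s HTTP/1.1\n' "$host" "$port"
}
```

For http://example.com/, a GET-style client sends "GET http://example.com/ HTTP/1.1", while a tunnelling client such as proxychains sends a request of the general shape "CONNECT example.com:80 HTTP/1.1" (possibly with an IP address in place of the host name).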

If Servo gains native support for HTTP proxies, this should become a non-issue, but for now we're stuck only being able to test https content.

Recording a web archive as an http proxy

To record an archive while running as an http proxy:

wayback --proxy archives --live --proxy-record --autoindex

then run Servo with an http proxy as before.
