Creating and viewing WARC web archives in Servo
WARC web archives are a de facto standard for archiving web content. They are the storage format for the Internet Archive Wayback Machine and are supported by the Library of Congress.
Servo can make use of web archives for repeatable performance testing against real-world web content, in particular sites which have a large amount of third-party ad content. Web archives can be played back through an HTTP proxy which does not have live access to the internet, which provides a stable platform for performance benchmarking.
This note describes how to record and replay web archives, using Servo as the web engine.
This technique is used by the Servo WARC tests to produce a dashboard of Servo performance on archived web content.
The pywb tools support recording and playback of web archives. They can be run using Python 2 or 3, and can be installed using pip (inside a virtualenv if preferred):
pip install git+https://github.com/ikreymer/pywb.git
Docs are at https://pywb.readthedocs.io.
To create a web archive, first initialize pywb using wb-manager:
wb-manager init archives
Then start the http server, giving it access to the live internet, and asking it to record and index an archive:
wayback --live --record --autoindex
You can now browse the web, and pages will be recorded in your archives. For example (from the servo build directory):
./mach run -r http://localhost:8080/archives/record/https://nytimes.com/
Once enough of the page is visible, you can quit servo and the wayback server. This should have created an archive file:
ls collections/archives/archive
rec-TIMESTAMP-MACHINE.warc.gz
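As a rough sanity check on the recording, the response records in the new archive can be counted. This is only a sketch: it assumes gzip and a grep that accepts -a are available, and count_warc_responses is a hypothetical helper name, not part of pywb. It matches header lines textually, so a payload line that happens to start with the same text would be counted too.

```shell
# Hypothetical helper (not part of pywb): count the "response" records
# in a gzipped WARC file by matching the WARC-Type header line.
# -a makes grep treat the decompressed stream as text even though
# record payloads may contain binary data.
count_warc_responses() {
    gzip -dc "$1" | grep -a -c '^WARC-Type: response'
}
```

For example, count_warc_responses collections/archives/archive/rec-TIMESTAMP-MACHINE.warc.gz should print a non-zero number if any pages were captured.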
To replay a web archive already in your collection, first start the wayback server:
wayback
Then view the archived content:
./mach run -r http://localhost:8080/archives/https://nytimes.com/
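The record and replay URLs above differ only in the record/ path segment, so a test script could build them from the original URL. The following is a minimal sketch; the variable and function names are hypothetical, and the host, port and collection name are simply those used in the examples above.

```shell
# Assumed values, taken from the examples in this note.
WAYBACK_BASE="http://localhost:8080"
COLLECTION="archives"

# URL that fetches a page from the live web and records it.
record_url() {
    printf '%s/%s/record/%s\n' "$WAYBACK_BASE" "$COLLECTION" "$1"
}

# URL that replays a page from the archive, with URL rewriting.
replay_url() {
    printf '%s/%s/%s\n' "$WAYBACK_BASE" "$COLLECTION" "$1"
}
```

With these defined, ./mach run -r "$(replay_url https://nytimes.com/)" is equivalent to the replay command above.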
To replay a web archive recorded elsewhere, first add it to your collection:
wb-manager add archives some-other-archive.warc.gz
Replaying archives this way is simple, but has some problems:
- It relies on URL rewriting to add the http://localhost:8080/archives/ prefix to any loaded resources, which does not catch everything, in particular URLs dynamically constructed using JavaScript.
- Any URLs that are not rewritten will be fetched from the live internet, so not all content is the same between runs.
- Since URLs are rewritten to be under localhost, they are all considered same-origin by Servo, so will all be executed in the same content thread, losing a lot of the benefit of concurrency.
Fortunately, pywb also supports replaying via an HTTP proxy, which removes the need for URL rewriting, since all content is delivered via the proxy:
wayback --proxy archives
Now, as well as serving URL-rewritten content on localhost, the wayback server is also acting as an HTTP proxy, serving the original content without any URL rewriting. Unfortunately there are some steps to get Servo to view this content:
- Servo does not have support for HTTP proxies, so needs to be proxified. On Linux this can be done with the proxychains command (installed in Debian-based systems by apt-get install proxychains).
- Any https content is served using a certificate signed with a key stored in proxy-certs/pywb-ca.pem. This certificate must be added as a root certificate for Servo.
To run Servo with proxychains, first create a proxychains.conf file:
[ProxyList]
http 127.0.0.1 8080
then run:
proxychains ./mach run -r --certificate-path proxy-certs/pywb-ca.pem https://nytimes.com/
This will view the archived content without any URL rewriting.
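The configuration step can be scripted so that a test harness always starts from the same proxy settings. The following is a sketch only: write_proxychains_conf is a hypothetical helper name, and it assumes proxychains picks up a proxychains.conf from the directory it is run in.

```shell
# Hypothetical helper: write the minimal proxychains.conf shown above
# into the given directory (default: the current directory).
write_proxychains_conf() {
    _conf_dir="${1:-.}"
    cat > "$_conf_dir/proxychains.conf" <<'EOF'
[ProxyList]
http 127.0.0.1 8080
EOF
}
```

Calling write_proxychains_conf from the Servo build directory, then running the proxychains command above, reproduces the manual steps.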
Unfortunately, there is a gotcha. Plain http content can be proxied in two ways, using CONNECT or using GET; proxychains uses CONNECT, but pywb only supports GET for such content. Fortunately, https content is always proxied using CONNECT, so this technique works for https content (which these days is most of the web).
At some point, Servo may get native support for HTTP proxies, at which point this should become a non-issue, but for now we're stuck only being able to test https content.
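Given this restriction, a benchmark script may want to drop plain http URLs from its input before replaying through the proxy. A minimal sketch, with hypothetical function names, might look like this:

```shell
# Hypothetical helper: succeeds only for URLs that the proxychains + pywb
# setup described above can replay, i.e. https URLs.
replayable_via_proxychains() {
    case "$1" in
        https://*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Read URLs from stdin, one per line, and print only the replayable ones.
filter_replayable() {
    while IFS= read -r url; do
        if replayable_via_proxychains "$url"; then
            printf '%s\n' "$url"
        fi
    done
}
```

The check is purely on the URL scheme; it does not verify that the page is actually present in the archive.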
To record an archive while running as an http proxy:
wayback --proxy archives --live --proxy-record --autoindex
then run Servo with an http proxy as before.