harvard-lil/wacz-exhibitor

wacz-exhibitor 🏛️

Experimental proxy and wrapper boilerplate for safely and efficiently embedding Web Archives (.warc, .warc.gz, .wacz) into web pages.

This implementation:

  • Wraps Webrecorder's <replay-web-page> client-side playback technology.
  • Serves, proxies and caches web archive files using NGINX.
  • Allows for two-way communication between the embedding website and the embedded archive using post messages.

<!-- Safely embedding "archive.wacz" on https://example.com: -->
<iframe
  src="https://wacz.example.com/?source=archive.wacz&url=https://what-was-archived.ext/path"
  sandbox="allow-scripts allow-forms allow-same-origin"
>
</iframe>

See also: Live Demo, Blog post



Concept

"It's a wrapper"

wacz-exhibitor serves an HTML document containing a pre-configured instance of <replay-web-page>, Webrecorder's client-side web archive playback system, pointing at a proxied version of the requested WARC/WACZ file.

For security reasons, playback only starts if this HTML document is embedded in a cross-origin <iframe>: the <iframe> needs both allow-scripts and allow-same-origin, a combination that would expose a same-origin embedding page to XSS.
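
The embedding condition described above could be checked from inside the served document along the following lines. This is a minimal sketch and not necessarily how wacz-exhibitor implements it: the function name `isEmbeddedCrossOrigin` is hypothetical, and the check relies on the browser behavior that reading `location.href` on a cross-origin parent window throws a `SecurityError`.

```javascript
// Hypothetical helper, not part of wacz-exhibitor's documented API.
// Returns true only when `win` is framed by a document from another origin.
function isEmbeddedCrossOrigin(win) {
  // A top-level window is its own parent: not embedded at all.
  if (win.parent === win) {
    return false;
  }

  try {
    // Readable parent location means the parent is same-origin.
    void win.parent.location.href;
    return false;
  } catch (err) {
    // Browsers throw a SecurityError when the parent is cross-origin.
    return true;
  }
}

// In a browser, the page would call: isEmbeddedCrossOrigin(window)
```

A page using such a check would refuse to start playback whenever it returns false.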

We recommend hosting wacz-exhibitor on a subdomain of the embedding website to avoid third-party cookie limitations:

www.example.com -> Has iframes pointing at wacz.example.com
wacz.example.com -> Hosts wacz-exhibitor

"It's a proxy"

wacz-exhibitor pulls and serves the requested archive file in the form <replay-web-page> requires: correct Content-Type header, support for range requests, CORS resolution and Content Security Policy.

The requested web archive file can be sourced from either:

  • The local /archives/ folder. This is where the server will look first.
  • A remote location the server will proxy from, defined in nginx.conf.



Routes

/?source=X&url=Y

Role

Serves an HTML document containing an instance of <replay-web-page>, pointing at a proxied archive file.

Must be embedded in a cross-origin <iframe>, preferably on the same parent domain to avoid third-party cookie limitations.

Methods

GET, HEAD

Query parameters

| Name | Required? | Description |
| --- | --- | --- |
| source | Yes | Filename of the .warc, .warc.gz or .wacz file. Can contain a path, but cannot be a URL. The file must be present either in the /archives/ folder or on the remote server defined in nginx.conf. |
| url | No | URL of a page within the archive to display. |
| ts | No | Timestamp of the page to retrieve. Can be either a YYYYMMDDHHMMSS-formatted string or a millisecond timestamp. |
| embed | No | <replay-web-page>'s embed mode. Can be set to replayonly to hide its UI. |
| deepLink | No | <replay-web-page>'s deepLink mode. |
| noSandbox | No | If set, removes the sandbox from the <replay-web-page> iframe. May be necessary for certain playbacks, e.g. cross-browser compatible playbacks of PDFs. |
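
The query parameters listed above can be assembled programmatically rather than by hand. The sketch below is illustrative only: `buildExhibitorUrl` and `toTimestamp14` are hypothetical helper names, the timestamp is assumed to be expressed in UTC, and the result assumes the served page decodes its query string in the standard way.

```javascript
// Hypothetical helper: formats a Date as a 14-digit YYYYMMDDHHMMSS string.
// Assumption: the timestamp is expressed in UTC.
function toTimestamp14(date) {
  // "2022-08-16T16:25:27.000Z" -> "20220816162527000" -> first 14 digits
  return date.toISOString().replace(/\D/g, "").slice(0, 14);
}

// Hypothetical helper: builds a wacz-exhibitor URL from the documented parameters.
function buildExhibitorUrl(base, params) {
  const allowed = ["source", "url", "ts", "embed", "deepLink", "noSandbox"];

  if (!params.source) {
    throw new Error("`source` is required");
  }

  const out = new URL(base);
  for (const name of allowed) {
    if (params[name] !== undefined) {
      // searchParams.set() percent-encodes the value
      out.searchParams.set(name, String(params[name]));
    }
  }
  return out.href;
}

const src = buildExhibitorUrl("https://wacz.example.com/", {
  source: "archive.wacz",
  url: "https://what-was-archived.ext/path",
  ts: toTimestamp14(new Date(Date.UTC(2022, 7, 16, 16, 25, 27))),
  embed: "replayonly",
});
console.log(src);
```

The resulting string can be assigned to the `src` attribute of the embedding <iframe>.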

Examples

<!-- On https://*.domain.ext: -->
<iframe
  src="https://wacz.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path"
  sandbox="allow-scripts allow-forms allow-same-origin allow-downloads"
>
</iframe>

/*.[wacz|warc|warc.gz]

Role

Pulls, caches and serves a given .warc, .warc.gz or .wacz file, with full support for range requests.

Will first look for the given path and file in the local /archives/ folder and, if it is not found there, will try to proxy it from the remote server defined in nginx.conf.
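
One way to verify that a deployment honors range requests on this route is to ask for the first bytes of an archive and inspect the response. This is a rough sketch under stated assumptions: `probeRangeSupport` and `parseContentRange` are hypothetical names, and when called cross-origin from a browser the Content-Range header may not be readable unless the server exposes it.

```javascript
// Parses a Content-Range value such as "bytes 0-99/12345".
// Returns null when the header is missing or not in that form.
function parseContentRange(value) {
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(value ?? "");
  if (!match) {
    return null;
  }
  return {
    start: Number(match[1]),
    end: Number(match[2]),
    total: match[3] === "*" ? null : Number(match[3]),
  };
}

// Requests the first 100 bytes of an archive file.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function probeRangeSupport(archiveUrl, fetchImpl = fetch) {
  const response = await fetchImpl(archiveUrl, {
    headers: { Range: "bytes=0-99" },
  });
  return {
    // 206 Partial Content indicates the range was honored
    partial: response.status === 206,
    // May be null if the header is absent or not exposed cross-origin
    range: parseContentRange(response.headers.get("Content-Range")),
  };
}

// Usage (hypothetical URL):
// probeRangeSupport("https://wacz.example.com/archive.wacz").then(console.log);
```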



Deployment

This project consists of a single Dockerfile derived from the official NGINX Docker image, which can be deployed on any Docker-compatible machine.

Example

The following example describes the process of deploying wacz-exhibitor on fly.io, a platform-as-a-service provider.

  1. Edit nginx.conf, following the comments starting with EDIT: in that file.
  2. Install the flyctl client and sign in, if not already done.
  3. Initialize and deploy the project by running the flyctl launch command (use flyctl deploy for subsequent deploys).
  4. wacz-exhibitor is now live and visible on the fly.io dashboard.
  5. We highly recommend setting up a custom domain and SSL certificate. This can be done directly from the fly.io dashboard. Ideally, the target domain should be a subdomain of the website on which wacz-exhibitor iframes are going to be embedded: for example, www.domain.ext embedding an <iframe> from wacz.domain.ext.



Local development

Example: Running wacz-exhibitor locally using Docker

docker build . -t wacz-exhibitor-local
docker run --rm -p 8080:8080 wacz-exhibitor-local
# wacz-exhibitor is now accessible at http://localhost:8080

Shortcut: start-dev.sh

Development Sandbox

A minimal sandbox is available to test embedding wacz-exhibitor <iframe>s in webpages.

You may edit sandbox/index.html to make it point to a specific web archive file and run the following command to start the sandbox:

# Assuming: wacz-exhibitor is running on port 8080 ...
bash start-sandbox.sh
# The sandbox is now accessible at http://localhost:8000



Communicating with the embedded archive

wacz-exhibitor allows the embedding website to communicate with the embedded archive playback using post messages. All messages coming from a wacz-exhibitor <iframe> come with a waczExhibitorHref property, helping identify the sender.

This feature can be used to build interactive experiences using web archive files.

Messages interpreted by the wacz-exhibitor <iframe>

wacz-exhibitor will look for the following properties in messages coming from the embedding website and react accordingly:

| Property name | Expected value | Description |
| --- | --- | --- |
| updateUrl | String | If provided, will replace the current url parameter of <replay-web-page>. |
| updateTs | Number | If provided, will replace the current ts parameter of <replay-web-page>. |
| getCollInfo | Boolean | If provided, will send a post message back with <replay-web-page>'s collInfo object, containing meta information about the currently-loaded archive. |
| getInited | Boolean | If provided, will send a post message back with the current value of <replay-web-page>'s inited property, indicating whether or not the service worker is ready. |
| overrideElementAttribute | HTMLAttributeOverride | If provided, will look for the element with the specified CSS selector inside <replay-web-page> and, if found, apply the requested HTML attribute to it. If the element is not found, will send a post message back reporting "status": "timed out", along with a copy of the original message's data. |
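
Because each property above expects a specific type, the embedding page may want to validate a message before posting it. The following is a sketch only: `buildExhibitorMessage` is a hypothetical helper, and since the shape of HTMLAttributeOverride is not described here, it is merely checked to be an object.

```javascript
// Expected JavaScript type for each property listed in the table above.
// Assumption: HTMLAttributeOverride is an object; its fields are not validated.
const MESSAGE_TYPES = new Map([
  ["updateUrl", "string"],
  ["updateTs", "number"],
  ["getCollInfo", "boolean"],
  ["getInited", "boolean"],
  ["overrideElementAttribute", "object"],
]);

// Hypothetical helper: returns a copy of `fields` after checking names and types.
function buildExhibitorMessage(fields) {
  const message = {};
  for (const [name, value] of Object.entries(fields)) {
    const expected = MESSAGE_TYPES.get(name);
    if (expected === undefined) {
      throw new TypeError(`Unknown property: ${name}`);
    }
    if (typeof value !== expected || value === null) {
      throw new TypeError(`${name} must be of type ${expected}`);
    }
    message[name] = value;
  }
  return message;
}

const message = buildExhibitorMessage({
  updateUrl: "https://what-was-archived.ext/new-path",
  getInited: true,
});
console.log(message);
```

The returned object can then be passed to `postMessage()` on the <iframe>'s `contentWindow`.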

Messages hoisted from <replay-web-page>

wacz-exhibitor will forward to the embedding website every post message sent by <replay-web-page>'s service worker.

The most common example is the following, which is sent during navigation within an archive:

{
  "waczExhibitorHref": "https://wacz.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path",
  "url": "https://what-was-archived.ext/new-path/",
  "view": "pages",
  "ts": "20220816162527"
}
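
The `ts` value in a navigation message like the one above is a 14-digit string, which the embedding page may want to turn into a `Date`. A minimal sketch, assuming the timestamp is expressed in UTC; `parseTimestamp14` is a hypothetical helper name.

```javascript
// Hypothetical helper: converts "YYYYMMDDHHMMSS" into a Date.
// Assumption: the timestamp is in UTC. Returns null for any other format.
function parseTimestamp14(ts) {
  const match = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(ts);
  if (!match) {
    return null;
  }
  const [year, month, day, hour, minute, second] = match.slice(1).map(Number);
  // Date.UTC() expects a zero-based month
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

const captured = parseTimestamp14("20220816162527");
console.log(captured.toISOString());
```

Calling `getTime()` on the result yields a millisecond timestamp.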

Example: Intercepting messages from a wacz-exhibitor <iframe>

// Assuming: there's only 1 <iframe class="wacz-exhibitor">  
const playback = document.querySelector("iframe.wacz-exhibitor");

window.addEventListener("message", (event) => {
  // This message bears data and comes from the `wacz-exhibitor` <iframe>
  if (event?.data && event.source === playback.contentWindow) {
    console.log(event);
  }
});
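
When a page embeds more than one wacz-exhibitor <iframe>, the listener needs to work out which one sent a given message. The sketch below matches on `event.source`; `findSenderIframe` is a hypothetical helper name, not something wacz-exhibitor provides.

```javascript
// Hypothetical helper: returns the <iframe> whose window sent `event`,
// or null if the message came from somewhere else.
function findSenderIframe(event, iframes) {
  for (const iframe of iframes) {
    if (iframe.contentWindow === event.source) {
      return iframe;
    }
  }
  return null;
}

// Usage in a browser:
// const playbacks = document.querySelectorAll("iframe.wacz-exhibitor");
// window.addEventListener("message", (event) => {
//   const sender = findSenderIframe(event, playbacks);
//   if (sender && event.data) {
//     console.log(sender.src, event.data);
//   }
// });
```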

Example: Sending a message to a wacz-exhibitor <iframe>

// Assuming: there's only 1 <iframe class="wacz-exhibitor">  
const playback = document.querySelector("iframe.wacz-exhibitor");
const playbackOrigin = new URL(playback.src).origin;

playback.contentWindow.postMessage(
  {"updateUrl": "https://what-was-archived.ext/new-path"},
  playbackOrigin
);
