archivetheweb/archiver

Archiver

Archive the Web is an open-source website archiving tool that lets you set up automated archiving, with the archives stored on Arweave. Our mission at Archive the Web is to create a decentralized backup of the world wide web together.

Website can be found here.

How it works

In its basic form, this application crawls a website up to a specified depth, records every interaction with the website's servers and every resource loaded in a WARC file, and uploads the result to the Arweave network.
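
The "up to a specified depth" part of the crawl can be pictured as a breadth-first traversal that stops expanding links once a hop limit is reached. The following is a minimal sketch under stated assumptions, not the archiver's actual crawler: the function name `crawl_to_depth` and the in-memory `links` map (standing in for "fetch a page and extract its links") are hypothetical.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first traversal of a link graph, limited to `max_depth` hops from `start`.
/// `links` is a hypothetical in-memory stand-in for fetching a page and extracting its links.
fn crawl_to_depth(
    links: &HashMap<&str, Vec<&str>>,
    start: &str,
    max_depth: usize,
) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut order: Vec<String> = Vec::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();

    visited.insert(start.to_string());
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        // A real crawler would load the page here and record the traffic.
        order.push(url.clone());

        // Pages at the depth limit are visited, but their links are not followed.
        if depth == max_depth {
            continue;
        }
        if let Some(children) = links.get(url.as_str()) {
            for child in children {
                // `insert` returns false for URLs that were already queued.
                if visited.insert(child.to_string()) {
                    queue.push_back((child.to_string(), depth + 1));
                }
            }
        }
    }
    order
}
```

With a depth of 0 only the starting page is visited; each extra level of depth adds the pages linked from the previous level, and every page is visited at most once.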

Archive format

WARC 1.1 is the format chosen for this application. It is an international standard used by many archives, which allows for composable applications.

We rely heavily on Webrecorder's pywb toolkit to capture all requests between our browser and the website's servers to output a WARC file.
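
To give a feel for the format, here is a rough sketch of how a single WARC 1.1 record is laid out: a version line, named header fields, a blank line, the payload, and two trailing CRLFs. In this project the records are produced by pywb, so the code below is only an illustration. The function `warc_record` is hypothetical, and the caller is assumed to supply the record ID and timestamp.

```rust
/// Build one WARC 1.1 record as raw bytes.
/// Hypothetical helper for illustration; `record_id` (for example a `urn:uuid:` URI)
/// and `date` (a UTC timestamp) are passed in to keep the sketch dependency-free.
fn warc_record(
    warc_type: &str,
    target_uri: &str,
    record_id: &str,
    date: &str,
    content_type: &str,
    block: &[u8],
) -> Vec<u8> {
    let fields = [
        ("WARC-Type", warc_type.to_string()),
        ("WARC-Target-URI", target_uri.to_string()),
        ("WARC-Date", date.to_string()),
        // Record IDs are URIs wrapped in angle brackets.
        ("WARC-Record-ID", format!("<{}>", record_id)),
        ("Content-Type", content_type.to_string()),
        // Content-Length counts only the bytes of the block.
        ("Content-Length", block.len().to_string()),
    ];

    let mut out: Vec<u8> = Vec::new();
    out.extend_from_slice(b"WARC/1.1\r\n");
    for (name, value) in fields.iter() {
        out.extend_from_slice(format!("{}: {}\r\n", name, value).as_bytes());
    }
    out.extend_from_slice(b"\r\n"); // blank line ends the header
    out.extend_from_slice(block);
    out.extend_from_slice(b"\r\n\r\n"); // two CRLFs end the record
    out
}
```

A WARC file is a concatenation of such records, which is part of what makes archives from different tools composable.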

Arweave

The permaweb

Data added to Arweave is replicated among hundreds or thousands of computers, or "miners", making it resilient and easily retrievable. To save data permanently, the Arweave network charges an upfront fee, the "endowment fee". The fee is set so that it incentivizes miners to keep storing the data for at least 200 years, and it is calculated from conservative estimates of how storage prices fall over time. For more information, please check their yellow paper.
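
The intuition behind an upfront fee covering a very long storage period is that, if the yearly cost of storage keeps falling, the total cost over many years stays bounded. The sketch below illustrates only that idea. It is not Arweave's actual pricing formula, and the function name and parameters are hypothetical.

```rust
/// Total cost of storing data for `years`, assuming the yearly cost starts at
/// `first_year_cost` and falls by the fraction `annual_decline` each year.
/// Illustrative only; not Arweave's pricing formula.
fn total_storage_cost(first_year_cost: f64, annual_decline: f64, years: u32) -> f64 {
    let mut yearly = first_year_cost;
    let mut total = 0.0;
    for _ in 0..years {
        total += yearly;
        // Next year's storage is cheaper by the assumed decline rate.
        yearly *= 1.0 - annual_decline;
    }
    total
}
```

Under these assumptions the total approaches `first_year_cost / annual_decline` as the number of years grows, so a more conservative (smaller) decline rate leads to a larger upfront fee.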

Warp Contract

A Warp contract (a smart contract on Arweave) is used to update the current state of the archive. Currently, it is where archivers register and where anyone can create an "Archiving Request" to be fulfilled by an archiver.

Warp contract address: dD1DuvgM_Vigtnv4vl2H1IYn9CgLvYuhbEWPOL-_4Mw

How to run

First, ensure you have an Arweave wallet with AR in it. Also, make sure you fund your Bundlr account with sufficient AR on the Bundlr node of your choice (the default is node1).

Second, make sure that the wallet file is stored at the path ./archiver/.secret/wallet.json.

Third, make sure to register as an archiver. More info to come.

Vanilla

  1. Run git submodule update

  2. Ensure you have redis running on port 6379

  3. Install Google Chrome (latest stable release)

  4. Install pywb by running pip3 install pywb

  5. Run cd archiver && cargo run. To get debug output, add RUST_LOG=debug to your environment variables.
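
Before starting the archiver, it can help to check two of the prerequisites above: that the wallet file is in place and that something is listening on the Redis port. The following is a small sketch of such a check using only the Rust standard library. The function names are hypothetical and are not part of the archiver itself.

```rust
use std::net::{SocketAddr, TcpStream};
use std::path::Path;
use std::time::Duration;

/// True if a regular file exists at `path` (for example the wallet key file).
fn wallet_present(path: &str) -> bool {
    Path::new(path).is_file()
}

/// True if something accepts TCP connections on 127.0.0.1:`port` within one second.
/// This only shows that the port is open, not that the service is really Redis.
fn local_port_open(port: u16) -> bool {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    TcpStream::connect_timeout(&addr, Duration::from_secs(1)).is_ok()
}

/// Collect every problem found instead of stopping at the first one.
fn preflight(wallet_path: &str, redis_port: u16) -> Vec<String> {
    let mut problems = Vec::new();
    if !wallet_present(wallet_path) {
        problems.push(format!("wallet file not found at {}", wallet_path));
    }
    if !local_port_open(redis_port) {
        problems.push(format!("nothing is listening on port {}", redis_port));
    }
    problems
}
```

Called from the repository root as `preflight("./archiver/.secret/wallet.json", 6379)`, an empty result suggests both prerequisites are in place.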

Using Docker

  1. Run git submodule update

  2. Run docker-compose up