internetarchive/arch

ARCH

Archives Research Compute Hub

About

Web application for distributed compute analysis of Archive-It web archive collections.

Building

Backend

Production

  • sbt "prod/clean" "prod/assembly" "prod/assemblyPackageDependency"

Docker

  1. Create a config (config/config.json) for your Docker setup, e.g., by copying the included template: cp config/docker.json config/config.json
  2. Set up a data directory with the following subdirectories: cache, collections, in, logging, out, tmp
  3. Build the container: docker build --no-cache -t arch .
  4. Run the container (example; replace the host paths with your own data, source, and logging directories): docker run -it --rm -p 54040:54040 -p 12341:12341 -v "/home/nruest/Projects/au/sample-data/ars-cloud:/data" -v "/home/nruest/Projects/au/arch:/app" -v "/home/nruest/Projects/au/sample-data/ars-cloud/logging:/logging" arch

The web application will be available at http://localhost:12341/ait, and the Apache Spark interface at http://localhost:54040.
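The data directory from step 2 can be created in one pass. This is a minimal sketch, not part of the repository: the DATA_DIR variable and its default location (arch-data under the current directory) are assumptions, and only the six subdirectory names come from the list above.

```shell
#!/bin/sh
set -eu

# Hypothetical location for the data directory; override by exporting DATA_DIR.
DATA_DIR="${DATA_DIR:-$PWD/arch-data}"

# Create the subdirectories named in step 2 (mkdir -p is a no-op if they exist).
for sub in cache collections in logging out tmp; do
  mkdir -p "$DATA_DIR/$sub"
done

# List the result so the layout can be checked by eye.
ls -1 "$DATA_DIR"
```

The example run command in step 4 mounts a directory like this at /data and its logging subdirectory at /logging, so the same path can be passed to the -v options there.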

For the data/input directory, an example directory structure looks like this:

├── in
│   ├── 13529
│   │   └── arcs
│   ├── 13709
│   │   └── arcs
│   ├── 14462
│   │   └── arcs
│   │       ├── ARCHIVEIT-14462-CRAWL_SELECTED_SEEDS-JOB1214854-SEED2299797-20200624234136833-00000-h3.warc.gz
│   │       ├── ARCHIVEIT-14462-CRAWL_SELECTED_SEEDS-JOB1214854-SEED2299798-20200624234136479-00000-h3.warc.gz
│   │       ├── ARCHIVEIT-14462-CRAWL_SELECTED_SEEDS-JOB1214854-SEED2299799-20200624234136645-00000-h3.warc.gz
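The listing above suggests a layout of in/<collection ID>/arcs/<WARC files>, with the collection ID matching the number after ARCHIVEIT- in each file name. Under that assumption, the following sketch copies WARC files into place. The function names (collection_id, stage_warc) and the DATA_DIR default are hypothetical helpers for illustration and are not part of ARCH.

```shell
#!/bin/sh
set -eu

# Hypothetical location for the data directory; override by exporting DATA_DIR.
DATA_DIR="${DATA_DIR:-$PWD/arch-data}"

# Print the numeric collection ID from a file name such as
# ARCHIVEIT-14462-CRAWL_...warc.gz; return 1 if the name does not match.
collection_id() {
  name=$(basename "$1")
  case "$name" in
    ARCHIVEIT-*-*) ;;
    *) return 1 ;;
  esac
  rest=${name#ARCHIVEIT-}   # drop the fixed prefix
  cid=${rest%%-*}           # keep everything before the next hyphen
  case "$cid" in
    ''|*[!0-9]*) return 1 ;;  # reject empty or non-numeric IDs
  esac
  printf '%s\n' "$cid"
}

# Copy one WARC file into in/<collection ID>/arcs under DATA_DIR.
# Files whose names carry no collection ID are skipped with a warning.
stage_warc() {
  src=$1
  if id=$(collection_id "$src"); then
    mkdir -p "$DATA_DIR/in/$id/arcs"
    cp -- "$src" "$DATA_DIR/in/$id/arcs/"
  else
    echo "skipping $src: no collection ID in file name" >&2
  fi
}

# Stage every file given on the command line.
for f in "$@"; do
  stage_warc "$f"
done
```

Saved as, say, stage-warcs.sh (a hypothetical name), it could be invoked as sh stage-warcs.sh /path/to/*.warc.gz. Copying rather than moving leaves the source files untouched, which seems the safer default for archival data.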

Frontend

See webapp/src/README.md for information about building the web application.

Citing ARCH

How to cite ARCH in your research:

Helge Holzmann, Nick Ruest, Jefferson Bailey, Alex Dempsey, Samantha Fritz, Peggy Lee, and Ian Milligan. 2022. ABCDEF: the 6 key features behind scalable, multi-tenant web archive processing with ARCH: archive, big data, concurrent, distributed, efficient, flexible. In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries (JCDL '22). Association for Computing Machinery, New York, NY, USA, Article 13, 1–11. https://doi.org/10.1145/3529372.3530916

Your citations help to further the recognition of open-source tools for scientific inquiry, assist in growing the web archiving community, and acknowledge the efforts of contributors to this project.

License

AGPL v3

Open-source, not open-contribution

Similar to SQLite, ARCH is open source but closed to contributions.

The complexity of this project means that even simple changes can break many other moving parts in our production environment. However, community involvement, bug reports, and feature requests are warmly welcomed.

Acknowledgments

This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, York University Libraries, Start Smart Labs, and the Faculty of Arts at the University of Waterloo.

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
