This repository has been archived by the owner on May 31, 2020. It is now read-only.

archivesunleashed/auk-notebooks


Archives Unleashed Cloud: Jupyter Notebooks


Jupyter notebooks to assist in creating additional analysis and visualizations of Archives Unleashed Cloud derivatives.


The following article provides a nice overview:

Deschamps, Ryan, Ruest, Nick, Lin, Jimmy, Fritz, Samantha, Milligan, Ian. The Archives Unleashed Notebook: Madlibs for Jumpstarting Scholarly Exploration. Proceedings of the 2019 IEEE/ACM Joint Conference on Digital Libraries (JCDL 2019), June 2019, Urbana-Champaign, Illinois.

Requirements

We suggest using Anaconda Distribution or Docker.

Usage

Anaconda is a Python distribution with a built-in package manager that helps you find and install packages and their dependencies, including many of the most popular ones used in data science. To run the Jupyter Notebook in an Anaconda environment, run the following:

Local (Anaconda)

git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
pip install -r requirements.txt
python -m nltk.downloader punkt vader_lexicon stopwords
jupyter notebook

Docker

Docker is a container platform that bundles an application together with its dependencies, so once the image is built it works out of the box. There are two ways to run the Jupyter Notebook via Docker: pull the prebuilt image from Docker Hub, or build the image locally.

Docker Hub

docker run --rm -it -p 8888:8888 archivesunleashed/auk-notebooks

Docker Locally

git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
docker build -t auk-notebook .
docker run --rm -it -p 8888:8888 auk-notebook

This repository comes with sample data. To use your own Archives Unleashed Cloud data instead, mount the host directory that holds it over the container's data directory:

docker run --rm -it -p 8888:8888 -v "/path/to/own/data:/home/jovyan/data" auk-notebook

Note: You must grant the within-container notebook user or group (NB_UID or NB_GID) write access to the host directory (e.g., sudo chown 1000 /some/host/folder/for/work).
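Before mounting a host directory, it can help to check whether the notebook user would be able to write to it. The following is a minimal sketch, not part of this repository: `can_write` is a hypothetical helper that inspects only the classic owner/group/other permission bits, and the UID/GID values you pass in are whatever your container actually uses (the note above uses 1000 as the user ID).

```python
import os
import stat


def can_write(path: str, uid: int, gid: int) -> bool:
    """Roughly decide whether a user with this uid/gid may write to path.

    Assumption: only the standard permission bits are inspected. ACLs,
    root privileges and supplementary groups are ignored.
    """
    info = os.stat(path)
    if info.st_uid == uid:
        # The user owns the path, so the owner write bit applies.
        return bool(info.st_mode & stat.S_IWUSR)
    if info.st_gid == gid:
        # The user's group owns the path, so the group write bit applies.
        return bool(info.st_mode & stat.S_IWGRP)
    # Otherwise fall back to the "other" write bit.
    return bool(info.st_mode & stat.S_IWOTH)
```

For example, `can_write("/path/to/own/data", uid=1000, gid=100)` would report whether a user with those IDs has write permission; the group ID 100 here is an assumption, so substitute the NB_GID your container runs with. If the result is `False`, adjust ownership or permissions on the host as described in the note.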

Types of Visualizations

There are several types of visualizations that you can produce in the Jupyter Notebook. A total of 14 outputs can be generated.

  • Domain Analysis: Provides information about what has been crawled (e.g. which domains) and how often.
  • Text Analysis: Highlights the frequency of words through various filters including domain and year.
  • Sentiment Analysis: Visualizes sentiment scores by domain and year.
  • Network Analysis: Shows the connections and relationships among websites through network graph layouts.
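To give a feel for what domain analysis involves, here is a minimal sketch using only the Python standard library. It is not the notebooks' own code: the `RECORDS` list is made-up illustrative data, and the `(crawl_date, url)` layout is an assumption that may not match the derivative files in the data directory.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical (crawl_date, url) records, for illustration only.
RECORDS = [
    ("20140612", "http://www.example.org/news"),
    ("20140915", "http://example.org/about"),
    ("20140915", "https://blog.example.com/post/1"),
    ("20150103", "http://www.example.org/"),
]


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_counts(records):
    """Count how many records were captured for each domain."""
    return Counter(domain_of(url) for _, url in records)


def domain_counts_by_year(records):
    """Count records per (year, domain), taking the year from the date prefix."""
    return Counter((date[:4], domain_of(url)) for date, url in records)


if __name__ == "__main__":
    for domain, count in domain_counts(RECORDS).most_common():
        print(domain, count)
```

A `Counter` keyed by domain, or by `(year, domain)`, is the kind of table that can then be handed to a plotting library to show which domains were crawled and how often.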

Additional Notes

This repository also uses the Jupyter Docker Stacks, which provide several helpful options for customizing the container environment.

License

This application is available as open source under the terms of the Apache License, Version 2.0.

Resources

The example dataset in the data directory was created with the Archives Unleashed Cloud, and is drawn from the B.C. Teachers' Labour Dispute (2014), collected by the University of Victoria Libraries. We are grateful that they've allowed us to use this material. The full-text derivative file is a random sample (37,000 lines) of the complete file because of GitHub file size limitations.

If you use this material, please cite it along the following lines:

  • Archives Unleashed Project. (2018). Archives Unleashed Toolkit (Version 0.17.0). Apache License, Version 2.0.
  • University of Victoria Libraries, B.C. Teachers' Labour Dispute (2014), Archive-It Collection 4867, https://archive-it.org/collections/4867.

Acknowledgments

This work is primarily supported by the Andrew W. Mellon Foundation. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
