Archive.org collection transfer scripts

The provided shell and Python scripts use the Internet Archive Python and command-line interface to transfer files of selected types from Internet Archive collections to a Hadoop cluster (HDFS).

The transfer happens in two steps:

A list of all files to transfer is created (create_filelist.py).
The files are transferred while the script keeps track of those already transferred, so the process can be restarted at any time and continues where it stopped (download_files.py).
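The restart behaviour described above can be illustrated with a minimal sketch. It is not taken from download_files.py: the log file name transferred.log and the helper names load_done, mark_done and pending are hypothetical, and the sketch assumes that progress is recorded as one file name per line.

```python
import os
import tempfile

def load_done(log_path):
    """Return the set of file names already recorded as transferred."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as log:
        return {line.strip() for line in log if line.strip()}

def mark_done(log_path, name):
    """Append one transferred file name to the progress log."""
    with open(log_path, "a") as log:
        log.write(name + "\n")

def pending(filelist, log_path):
    """Return the files from filelist that are not yet in the progress log."""
    done = load_done(log_path)
    return [name for name in filelist if name not in done]

# Demo with hypothetical file names: one file is transferred, then the
# process is "restarted" and only the remaining files are returned.
filelist = ["a.warc.gz", "b.warc.gz", "c.cdx"]
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "transferred.log")  # hypothetical name
    first_run = pending(filelist, log_path)
    mark_done(log_path, "a.warc.gz")
    after_restart = pending(filelist, log_path)
```

Recording a file only after it has reached its destination means that an interrupted transfer is repeated on the next run rather than silently skipped.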

The downloaded files are sorted into separate folders according to their file types.
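One way to derive such a folder layout is sketched below. It assumes that the file type is determined by the file name suffix and that the folder is named after that suffix; the actual scripts may choose folders differently. The names KNOWN_SUFFIXES, filetype_folder and target_path are hypothetical.

```python
import posixpath

# Assumption: the file types of interest, identified by file name suffix.
KNOWN_SUFFIXES = (".warc.gz", ".cdx.gz", ".cdx")

def filetype_folder(filename):
    """Return the folder name for a file, based on its suffix."""
    for suffix in KNOWN_SUFFIXES:
        if filename.endswith(suffix):
            return suffix.lstrip(".")
    return "other"

def target_path(hdfs_root, filename):
    """Build the destination path: <root>/<file type folder>/<file name>."""
    name = posixpath.basename(filename)
    return posixpath.join(hdfs_root, filetype_folder(name), name)

example = target_path("/user/holzmann/ia1", "item1/foo.warc.gz")
```

posixpath is used instead of os.path because HDFS paths use forward slashes regardless of the local operating system.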

During the transfer, a specified number of files is downloaded into a local staging directory and then copied to HDFS as a batch (default 10, max_staging in download_files.py).
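The batching idea can be sketched as follows. Only the name max_staging and its default of 10 come from the description above; the helpers batches and put_command are hypothetical, and copying with `hdfs dfs -put` is an assumption about how the files reach HDFS. The sketch only builds the command and does not run it.

```python
def batches(items, max_staging=10):
    """Yield consecutive lists of at most max_staging items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == max_staging:
            yield batch
            batch = []
    if batch:
        # The last batch may be smaller than max_staging.
        yield batch

def put_command(local_paths, hdfs_dir):
    """Build (but do not execute) a command that copies one batch to HDFS."""
    return ["hdfs", "dfs", "-put", *local_paths, hdfs_dir]

# Demo with 23 hypothetical file names.
names = ["file%02d.warc.gz" % i for i in range(23)]
sizes = [len(batch) for batch in batches(names)]
command = put_command(["/tmp/staging/a.cdx"], "/user/example/cdx")
```

Copying in batches bounds the disk space needed in the staging directory to roughly max_staging files at a time.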

Usage

First, install https://github.com/jjjake/internetarchive, which provides the required ia command.

Next, modify download.sh to include your paths and required file types. download.sh calls the Python scripts and should be used to start the transfer process.

`download.sh`

To be called with ./download.sh <COLLECTION_NAME>.
E.g., ./download.sh ArchiveIt-Collection-1234

`create_filelist.py`

Called by download.sh: ./create_filelist.py <COLLECTION_NAME> <GLOB_PATTERNS>.
E.g., ./create_filelist.py ArchiveIt-Collection-1234 *.cdx *.cdx.gz *.warc.gz
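How glob patterns like these select files can be shown with Python's standard fnmatch module. This is a sketch of the matching only, not the code of create_filelist.py; the helper matches_any and the file names are hypothetical.

```python
from fnmatch import fnmatchcase

def matches_any(filename, patterns):
    """Return True if the file name matches at least one glob pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)

patterns = ["*.cdx", "*.cdx.gz", "*.warc.gz"]
# Hypothetical file names as they might appear in a collection item.
names = ["a.warc.gz", "a.cdx", "a.cdx.gz", "a_meta.xml", "b.warc"]
selected = [name for name in names if matches_any(name, patterns)]
```

fnmatchcase compares case-sensitively on every platform, so the result does not depend on the operating system.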

`download_files.py`

Called by download.sh: ./download_files.py <COLLECTION_NAME> <LOCAL_STAGING_PATH> <HDFS_PATH>
E.g., ./download_files.py ArchiveIt-Collection-1234 /mnt/ephemeral0/holzmann/tmp /user/holzmann/ia1
