Skip to content

ross-spencer/sumfolder1

Repository files navigation

Logo for sumfolder1

sumfolder1 is a utility for use within the archival and digital preservation community to generate checksums for file system directories, and to generate an overall "collection" checksum for a given set of files.

Why?

Conventionally, checksums exist for files, they do not exist for directories. They have no payload that can be summed together to calculate a digest/checksum.

If it were possible to create checksums for folders or a global checksum for a collection of objects, it would become possible to ask the following:

Archival questions

  • What is the collection checksum for a given set of files and folders?
  • What is the checksum for a given folder?
  • Given a collection of objects online, am I looking at an authentic listing?
  • Have I downloaded a collection in its entirety?

Structural questions

  • Is file/folder hash(x) included in the collection set?
  • Is file/folder hash(y) (non-existent) part of the entire set?
  • Is file hash(x) part of folder(y) where the collection has arbitrary depth?
  • Where are duplicate checksums located within a collection?

Forensics questions

  • Has a digital object been removed from the collection?
  • Did the collection contain at least one empty directory?

How?

Given a set of file paths and existing checksums it is possible to compute a checksum for a folder by creating a checksum of the given checksums.

Given checksum 1) 7c1f9f9a4d0ce9a72ee63f37a1b7f694 and checksum 2) aececec0bc3f515039aec9e60c413cd3 an MD5 can be computed as: 82f9e9a4305714fffdd7932783980cbc.

We can see this illustrated for a small collection as follows:

πŸ“ folder_1 82f9e9a4305714fffdd7932783980cbc
    πŸ“„ checksum_1 7c1f9f9a4d0ce9a72ee63f37a1b7f694
    πŸ“„ checksum_2 aececec0bc3f515039aec9e60c413cd3

If we follow this approach through an entire directory structure we can create checksums for all sub-directories and for the collection as a whole.

Reference set

A reference set is provided with this repository: reference set.

We can iterate through the directory tree to create sets of directory checksums and a collection checksum: 52b94608dc70813aa88dae01176dc73b.

The reference set then looks as follows:

πŸ“ collection 93778c524035d5d3e429a2fe43b7700a
   πŸ“„ file_0001 14118ff9ad0344decb37960809b2f17a
   πŸ“„ file_0000 8cfda2609b880a553759cd6200823f3b
   πŸ“„ file_0002 a4501ee1a5c711ea9db78a800a24e830
   πŸ“ sub_dir_1 82301616d7e24f474dbe21de93af0a34
      πŸ“„ file_empty d41d8cd98f00b204e9800998ecf8427e
      πŸ“„ file_0003 dc7f828c5fe622925181d06edada350f
      πŸ“„ file_0004 e3d90a4bf14a9b355f0e69ba08df522d
      πŸ“ sub_1_dir_1 1c7ba27edf1356d097a3f568032430c2
         πŸ“„ file_0005 637a3fb7da1ab61d10e96336d9758416
   πŸ“ sub_dir_2 1ccb49edc4e873f1a8affd4bad5e9b90
   πŸ“ sub_dir_3 2a60541cede91a36e9dc5bab7a97dd6e
      πŸ“ sub_3_empty_1 db9d848b4f83ff3cb3faa4df0a59e3e1
         πŸ“ sub_3_empty_2 1ccb49edc4e873f1a8affd4bad5e9b90
   πŸ“ sub_dir_4 272d45767d534335163f220c1d40e559
      πŸ“„ file_0006 2b43227486ec8744cd5d4c955d269743
      πŸ“„ file_0007 c5a1973a70e08bf1eee13b8090f790ad
      πŸ“„ file_0008 fdffe4dd2d39c7d9986dbf5c6ec5ad39
   πŸ“ sub_dir_5 d818d29b75f89a9b5d8d1c5a4c70dbbb
      πŸ“ sub_5_dir_1 82f9e9a4305714fffdd7932783980cbc
         πŸ“„ file_0009 7c1f9f9a4d0ce9a72ee63f37a1b7f694
         πŸ“„ file_0010 aececec0bc3f515039aec9e60c413cd3
   πŸ“ sub_dir_6 74be16979710d4c4e7c6647856088456
      πŸ“„ file_empty d41d8cd98f00b204e9800998ecf8427e

Reference implementation

The reference implementation for sumfolder1 does the following:

From the lowest sub-directory in the tree:

  1. Check for sub-directories and add the checksums for these to a hash digest in alphabetical order by checksum.
  2. For files in the directory add these to the hash digest in alphabetical order by checksum.
  3. Create a digest for the list of checksums.

Repeat, processing each folder backwards up to the top level.

NB. If a folder is completely empty it is assigned a constant value chosen in the code: 2600_EMPTY_DIRECTORY. This evaluates to an MD5 value of 1ccb49edc4e873f1a8affd4bad5e9b90.

Merkle trees

The concept I have used here is based on Merkle trees and a loose understanding of techniques used in the block-chain and in the source control system GitHub.

A good video summary of Merkle trees can be found on YouTube:

And a Python tutorial I found useful in starting this work:

The technique required for a directory tree is a little more convoluted than that of a Merkle tree which uses binary nodes and evaluates checksums from left to right. I believe the implementation used for sumfolder1 is more closely aligned to that of a "Radix Tree" or "Patricia Tree", however, this is to be explored more.

NB. A merkle tree can be used in its context for performance; sumfolder1 does not yet have a performance use-case.

Terminology

The reference implementation introduces some terminology that helps with understanding the approach:

  • Active-tree: the side of a directory tree that we're querying about a given hash.
  • Non-active-tree: the tree at root node (Rn+1) that do not contain the digital object that we're querying.
  • Root-node (Rn): the name of the top-level node, i.e. collection folder. This is either artificially created for a set of directories all at the same level, or exists as a function of the given collection set.

New folder attributes

Folder objects need to be given additional attributes to enable the algorithm to work.

  • Folder-depth, so directories can be grouped and distinguished from one-another by level in the hierarchy.
  • Hash, the goal of this tool is to enable a hash to be calculated for an entire collection.

Sensitivity

I am trying to make this code as portable as possible, i.e. while it works with DROID-style reports today, it might also work with other checksum-based outputs tomorrow. Additionally, to be able to compare folder structures, this utility may also work with DROID-style reports later on in a transfer workflow; at which point, folders and files may have been renamed, but their content remains consistent.

To calculate a single folder checksum we currently do the following:

  • If there are folders in the directory, order their hashes alphabetically and add to a list.
  • File checksums are then ordered alphabetically and added to the end of the list.
  • The checksums are then summed together to create a new folder-level checksum.

DROID

sumfolder1 uses the DROID format identification report to generate folder level checksums.

DROID can be found at The National Archives UK website:

DROID in Siegfried

sumfolder1 can also be used with DROID compatible reports created by Siegfried using a command such as follows:

sf --hash=md5 --droid <collection_folder>

DROID as an inspiration

File format reports provide a means of statically analyzing collections of digital objects. A DROID report satisfies the pre-conditions required to create reliable folder- and collection-level checksums for digital collections:

  • A collection is static, i.e. unlikely to change.
  • Digital objects within the collection have checksums.

NB: A collection need not be static to be analyzed but it is not the primary use-case of this utility.

More information about the different uses for a file-format identification report can be found in my paper in the Code4Lib journal.

Writing about sumfolder1

I wrote a blog describing the utility on the OPF website.

Installation

sumfolder1 is available on pypi and can be installed as follows:

pip install -U sumfolder1

Usage

sumfolder1 has the following usage instructions:

usage: sumfolder1.py [-h] [--csv CSV] [--demo] [--ref] [-v]

Calculate checksums for folders in a collection of objects using a DROID format
identification report

options:
  -h, --help          show this help message and exit
  --csv CSV           Single DROID CSV to read.
  --demo              Run demo queries and output a tree to demo.txt
  --ref, --reference  Write reference set to stdout.
  -v, --version       Return version information.

Demo output

sumfolder1's demo output can be invoked as follows:

python sumfolder1 --demo

JSON will be output to stdout describing a handful of queries generated using the reference collection.

An visualization of the collection tree will be output (for demo purposes) to stderr.

Use with a DROID csv

Given a DROID csv the tool can be invoked as follows:

python sumfolder1 --csv <droid_csv_file>

Outputting the reference CSV

A reference CSV can be output to stdout. Ideally it is piped to some other file using a command such as follows:

python sumfolder1 --ref > <output_file>

Previous work

Previous work in this area.

  • Check out direct-dedupe-1 from Stefana Breitwieser which I was recently made aware of via the BitCurator Forum 2024 and provides a shell script to calculate checksums for sub-directories providing a very pragmatic way to help users dedupe at the folder level.

License

This work is license using: GNU GENERAL PUBLIC LICENSE Version 3.