Digestif

An aid for creating hash digests of large files

Synopsis

Digestif lets you create fast checksums of large files by
skipping sections of the file. It was created with compressed media
files in mind, which generally have such a high information density
that we can get away with a checksum that doesn’t actually consider all
the bits. Someday I’d like to understand the likelihood-of-collision
implications for specific compression algorithms (mp3, h.264, xvid, et al.),
but right now I’m going to settle for guessing at where “good enough for me”
might lie.

One side effect of this approach is that the corruption-detecting nature of
digests is, of course, lost. This is really an inescapable artifact of the
problem we’re trying to solve: when hashing a really large file on modern
hardware, the biggest bottleneck is streaming 5-10 gigs off of the disk, not
the checksumming itself. By looking at less data, we speed up the hash process
immensely, at the cost of being blind to corruption in the sections we skip.
Because the purpose I have in mind for this tool is identity checking, not
corruption detection, that trade-off is fine with me.
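
To make the approach concrete, here is a minimal Ruby sketch of the sampling
idea: seek to regular offsets, read a fixed-size chunk at each, and feed only
those chunks (plus the file length) into a standard digest. The chunk size,
stride, and choice of MD5 below are illustrative assumptions, not digestif’s
actual defaults or implementation.

require 'digest'

# Hash only a sample of fixed-size chunks taken at regular offsets.
# CHUNK_SIZE and STRIDE are arbitrary illustrative values.
CHUNK_SIZE = 64 * 1024          # bytes read at each sample point
STRIDE     = 16 * 1024 * 1024   # distance between sample points

def sparse_digest(path)
  digest = Digest::MD5.new
  size   = File.size(path)
  File.open(path, 'rb') do |f|
    offset = 0
    while offset < size
      f.seek(offset)                 # jump to the next sample point
      chunk = f.read(CHUNK_SIZE)
      digest.update(chunk) if chunk
      offset += STRIDE
    end
  end
  digest.update(size.to_s)           # mix in the length so files that share samples but differ in size hash differently
  digest.hexdigest
end

puts sparse_digest(ARGV[0])

With these example numbers, reading 64 KiB out of every 16 MiB touches well
under one percent of the file, which is where the speedup comes from.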

Installation

gem install digestif

Usage

It works just like md5 on the command line, but only on files, not on
streaming data (you can’t seek a stream).

digestif some_large_file

Since this program is designed specifically to get around the cost of hashing
whole files, and it relies on seeking to do so, it didn’t make sense for me to
invest in making streams work.

For a detailed look at the options, see

digestif --help

Motivation

I wrote digestif to solve a problem for a media catalogue I was working on.
I wanted a filename-independent way to evaluate whether or not a file was in
the catalogue yet, but the files were so large that streaming the whole file
off of the hard drive was too slow for the response time I was hoping for.
(For the interested: I was getting 5 gigs hashed with md5 in about 2.4
minutes.)

Author

Copyright 2011 Andrew Roberts
