Theory of Algorithms Project

Y4S2 Theory of Algorithms Project

Description

A C program that calculates & outputs the SHA-512 digest of a given input file, in accordance with the NIST FIPS 180-4 specification.

Folder Structure

Below is a summary of this repository's contents.

$ tree -P *.[ch]
.
├── src
│   ├── main.c
│   ├── main.h
│   ├── sha512.c
│   ├── sha512.h
│   ├── utils.c
│   └── utils.h
└── tests
    ├── files
    └── test_sha512.c

Files	Description
`src/main.[ch]`	Main program entry point.
`src/sha512.[ch]`	SHA-512 implementation.
`src/utils.[ch]`	Utility functions.
`tests/test_sha512.c`	Test entry point.
`tests/files`	Test resource files.

Compilation

Prerequisites

First make sure you have the following installed on your system.

gcc
make

Both of these can be installed on Debian-based systems by running the following:

$ sudo apt install build-essential

After cloning the repository, you can then compile and run the program as follows:

$ make
$ ./sha512 FILE [OPTIONS]

Options

Option	Description
`-h, --help`	Print program usage information.
`-c, --compare <file>`	Compare another file's hash against the input file's.
`-v, --verify <hash>`	Verify the input file has the given hash.

Example Output

$ ./sha512 input.txt
ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f

Installation

To install the executable for system-wide use, you could run:

$ make
$ sudo make install

Uninstalling

Alternatively to uninstall, run the following:

$ sudo make uninstall

Testing

Tests are located in the tests/ directory and can be compiled & executed by running the following command from the repository's root:

$ make test

What is SHA-512 & Why is it Important?

The Secure Hash Algorithm (SHA) functions were originally developed by the US National Institute of Standards and Technology (NIST) and published in 1993 as SHA-0, but subsequently revised with the release of SHA-1 in 1995 [1]. SHA-2 is the successor to SHA-1 and is a family of 4 hash functions, with the most complex being SHA-512, so-called because it produces an output 512 bits in length [1].

SHA-512 is an example of a cryptographic hash function. That is, a function which can read in a message and compute a fixed-length string (or digest) which can be seen as a unique representation of that message, much like a fingerprint [2]. The motivation for hash functions like SHA-512 over digital signature schemes like RSA, lies in their ability to produce a short, fixed-length digest for messages of arbitrary length. In other words, even if we had a message that is hundreds of megabytes in size, it should still be relatively fast to compute its digest. This is in contrast to algorithms like RSA where the length of the plaintext is limited [2].

An important characteristic of hash functions like SHA-512 is they are highly sensitive to all input bits [2]. This means that even minor alterations to the message will result in a vastly different digest being produced. This is called the avalanche effect [3]. For example, below the input "abc" produces a SHA-256 output that is markedly distinct from the string "abb", despite only a single letter being modified.

As a result of this property, hash functions like SHA-512 have found important application in the creation of digital signatures and in verifying the integrity of messages, e.g. by providing a way to ensure digital files have not been modified [1]. Just some of the many other real-world use cases of hash functions include the Git revision control system, which uses them to identify files being tracked in a repository [1]. Meanwhile SHA-512's sibling algorithm, SHA-256, is used in popular authentication and encryption protocols, including SSL, TLS, IPsec, SSH, and PGP [4]. In Linux systems, SHA-256 is also used for hashing passwords, while cryptocurrencies like Bitcoin use it for verifying transactions [4].

Q & A

Why can't we reverse the SHA512 algorithm to retrieve the original message from a hash digest?

One of the main distinctions between hashing and encryption is reversibility. Encryption algorithms typically require both an input as well as a key in order to generate the output (or ciphertext). This process is reversible as anyone able to obtain the key will also be able to decrypt the ciphertext and thereby view the original message [5]. In contrast, SHA-512, like all algorithms specified in NIST FIPS 180-4 are "one-way hash functions" [6]. It is therefore not required for the specific hash algorithm used to be kept secret because the digest cannot be converted back to its original form [5]. However, while it is not technically possible to reverse a hash in order to retrieve the original message, the original message may still be retrieved as hash algorithms like SHA-512 are susceptible to collisions – where two messages produce the same hash.

A central characteristic of any secure hash function is that of "preimage resistance", also referred to as "one-wayness". This means that for any given hash, it must be computationally infeasible to find a message that will produce the same hash output [2]. Meanwhile, another property of hash functions is "second preimage resistance", also referred to as (weak) collision resistance. This means that for hash algorithms like SHA-512, it should be infeasible to create two distinct plaintext messages that produce the exact same digest [2]. These are both crucial characteristics of hash algorithms as they are widely used for verifying digital files. If a bad actor were able to feasibly construct a message that produces the same hash output as a given file, they could perform substitution attacks, replacing the file with another malicious file that produces the same digest [2].

It is not possible to design a hash function that is fully resistant to collisions. This is because the input to a hash function is theoretically infinite, while the digest will always be the same length. For this reason, it must be true that multiple inputs will inevitably produce the same hash value [2]. This is illustrated by the Pigeonhole Principle, which states that if a pigeon loop were to contain 100 birds but only 99 holes, at least one of the holes must be occupied by two birds [2]. Because every hash algorithm produces an output with a fixed number of bits (n), there are 2ⁿ possible digests that can be computed by that algorithm, while there is an infinite number of possible inputs [2] (in the case of SHA-512 there are 2⁵¹² possible outputs). In other words, there are an infinite number of possible inputs that can produce the same output. Because of this, even given unlimited computing power it is not possible to retrieve the exact message that produced a given hash value as there are so many inputs that could produce the same hash [1]. It would of course be possible to determine some possible messages that produce that hash (i.e. collisions), given enough time and resources, but it would not be possible to determine precisely what input was fed to the algorithm originally [1].

Can you design an algorithm that, given enough time, will find input messages that give each of the possible 512-bit strings?

As mentioned previously, secure hash algorithms should be "preimage resistant". A preimage of a given hash digest (H), is any message (M) that when run through the hash algorithm will produce the value of H [1]. In other words, a preimage is any message wherein Hash(M) == H [1]. Preimages are distinct from collisions, which is where two distinct input messages produce the same hash.

If given a hash value, one can search for preimages using a brute force technique which involves continuously checking the hash output of different messages until eventually finding a message that produces the same output as the target hash. The below C code is adapted from the pseudo-code shown on page 174 of [1]. Here, the find_preimage() function takes a SHA-512 hash (called target) and will iteratively attempt to find the SHA-512 hash of a random plaintext message (m). It will then use strcasecmp() to check if the hash of that message (hash_m) equals the hash value passed into the function. If so, the function will return the matching input message (m), or otherwise continue looping until a match is eventually found.

#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <stdbool.h>

/* Finds a preimage of `target` using brute-force. */
char* find_preimage(const char* target)
{
    while (true) {
        char* m = some_message();
        char* hash_m = sha512(m);

        if (strcasecmp(hash_m, target) == 0) {
            return m; // Found preimage
        }
    }
}

If given enough time, and provided with each of the 512-bit strings starting from "0" * 128, up to "f" * 128, this function will return an input message for each of the possible 512-bit strings.

In the code below, the for loop (in which the find_preimage() function is called), executes once for each SHA-512 string in the targets[] array. On each iteration, once a preimage is found, the result is subsequently printed to the screen.

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <math.h>

int main(void)
{
    // Array of all possible SHA-512 output strings (shortened for brevity)
    char* targets[] = {
        "0000...",
        // ......,
        "ffff...",
    };

    // The total number of hashes that can be produced by SHA-512 (2^512)
    uint64_t num_hashes = (uint64_t)pow(2, 512);

    // For each hash in targets[], try and brute-force a preimage
    for (int i = 0; i < num_hashes; i++) {
        char* result = find_preimage(targets[i]);
        puts(result);
    }

    return EXIT_SUCCESS;
}

However, this sort of brute force attack wherein we try as many possible inputs in order to find a message corresponding to a given digest would require 2^L evaluations, where L is the length of the digest (for SHA-512 that is 512 bits) [6]. Such a technique can therefore be considered wildly impractical due to the length of time it would take to successfully find even just a single input message, as explained in [7]. (Although note that that particular example involved collision attacks on SHA-256; preimage attacks on SHA-512 would take a considerably longer amount of time.)

How difficult is it to find a hash digest beginning with at least twelve zeros?

The occurrence of leading zeros in hash digests has particular significance in the area of mining cryptocurrencies, like Bitcoin.

In Bitcoin mining, individuals called "miners" compete to find a new block in the blockchain. Each block contains a header, which in turn contains a target hash. The mining process involves adding a random value (called a "nonce") to the header of the current block and then calculating its SHA-256 hash [8]. This procedure is repeated using a brute-force technique wherein the value of the nonce is continuously altered until eventually the hash digest produced is smaller than the value of the target hash for that block [8,9].

Because in the Bitcoin blockchain it is a requirement that each block take about 10 minutes to be mined [9], in order to ensure this remains constant the so-called difficulty level is periodically adjusted by either increasing or decreasing the number of leading zeros in the target hash of a given block [9]. The entire mining process is significantly difficult due to both the length of time and the amount of processing power required, an issue that becomes more apparent as more leading zeros are present in the target hash.

In [10] the author uses a Python script in order to demonstrate how the time to generate an acceptable hash rapidly increases as more leading zeros are added to the target hash. Using just a laptop, generating a hash with 7 leading zeros took about 16 minutes and 675 million iterations. At the time that article was written, the Bitcoin difficulty level required 19 leading zeros, with miners collectively calculating 117 Exahashes per second (one Exahash is 1,000,000,000,000,000,000) [10]. Today Bitcoin miners are collectively calculating at a rate of over 163 Exahashes per second (see below) [11].

When [12] was written in 2014, a successful hash would have needed to start with 17 zeros, meaning just one out of 1.4x10²⁰ hashes would be successful. Finding such a hash would be more difficult than finding a particular grain of sand out of all the grains of sand on Earth [12]. Today, at the time of writing it is even harder as the latest recorded Bitcoin block had a target hash consisting of 20 leading zeros and a difficulty level of about 21.866 trillion [13].

00000000000000000000edfc2d32a0d6302c200b34d195afa1588d4f61789e5a

Although as mentioned previously it takes an average of 10 minutes for one of the miners in the network to discover a block, calculating a hash like the one above requires significant computational power and miners therefore generally require specialised hardware in order to be successful [10].

References

Jean-Philippe Aumasson. Serious Cryptography: A Practical Introduction to Modern Encryption. Chapter 6 – Hash Functions. No Starch Press, 2017. ISBN: 978-1-59327-826-7.
Christof Paar and Jan Pelzl. Understanding Cryptography: A Textbook for Students and Practitioners. Chapter 11 – Hash Functions. Springer Science & Business Media, 2009, ISBN: 978-3-642-04101-3. DOI: 10.1007/978-3-642-04101-3.
Secure Hash Algorithms. Brilliant.org.
SHA-256 Algorithm Overview. SolarWinds MSP. 12 Sep, 2019.
Hashing explained: Why it's your best bet to protect stored passwords. CSO Online. Lucian Constantin. Jan 13, 2021.
Quynh H. Dang. Secure Hash Standard. Federal Inf. Process. Stds. (NIST FIPS) - 180-4. Gaithersburg, MD: US Department of Commerce, National Institute of Standards and Technology, Aug. 2015. DOI: 10.6028/NIST.FIPS.180-4.
Why haven't any SHA-256 collisions been found yet?. user47922. StackExchange Cryptography. May 29, 2017.
How Bitcoin mining really works. Subhan Nadeem. freeCodeCamp. May 31, 2018.
Introduction to Blockchain. Zaid Khaishagi. Jun 16, 2018.
Bitcoin Proof of Work. Bob Peers. May 11, 2020.
Total Hash Rate (TH/s). Accessed March 25, 2021. Blockchain.com.
Mining Bitcoin with pencil and paper: 0.67 hashes per day. Ken Shirriff's blog. Ken Shirriff. Sep, 2014.
Block 676248. Blockchain.com. March 25, 2021.

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
.vscode		.vscode
src		src
tests		tests
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.vscode

.vscode

src

src

tests

tests

.gitignore

.gitignore

Makefile

Makefile

README.md

README.md

Repository files navigation

Theory of Algorithms Project

Description

Folder Structure

Compilation

Prerequisites

Options

Example Output

Installation

Uninstalling

Testing

What is SHA-512 & Why is it Important?

Q & A

Why can't we reverse the SHA512 algorithm to retrieve the original message from a hash digest?

Can you design an algorithm that, given enough time, will find input messages that give each of the possible 512-bit strings?

How difficult is it to find a hash digest beginning with at least twelve zeros?

References

About

Releases

Packages

Languages

daniel-keogh/theory-algos-project

Folders and files

Latest commit

History

Repository files navigation

Theory of Algorithms Project

Description

Folder Structure

Compilation

Prerequisites

Options

Example Output

Installation

Uninstalling

Testing

What is SHA-512 & Why is it Important?

Q & A

Why can't we reverse the SHA512 algorithm to retrieve the original message from a hash digest?

Can you design an algorithm that, given enough time, will find input messages that give each of the possible 512-bit strings?

How difficult is it to find a hash digest beginning with at least twelve zeros?

References

About

Topics

Resources

Stars

Watchers

Forks

Languages