Skip to content

This is a command line huffman compressor/decompressor, which can read text files, generate Huffman trees for those files, encode files into compressed "huffman files", and decompress huffman files into text files. Written in C++

etrizzo/Huffman

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Author: Emily Rizzo

This is a command line huffman compressor/decompressor, which can read text files, generate Huffman trees for those files,
encode files into compressed "huffman files", and decompress huffman files into text files.


Contents of README:
I.      HUFFMAN.CPP - STRUCTURE                     //structure of the program
II.     HUFFMAN.CPP - CLASSES                       //classes contained in the program
III.    HUFFMAN.CPP - FUNCTIONS                     //important functions
IV.     COMPRESSED FILE FORMAT                      //format for the compressed huffman files generated by the program
V.      CHANGING DEFAULT EXTENSIONS/DIRECTORIES     //variables to change the extensions and directories for the files read
                                                        and outputted by this program

===========================
I. HUFFMAN.CPP - STRUCTURE
===========================
main() runs a while loop, which takes in and executes commands from cin.

These commands are:

run <filename>          //runs encode and decode on <filename>
                            (ex: "run file.txt" would run encode and decode on "TXT_DIR/file.txt")
encode <filename>       //only encodes <filename>
decode <filename>       //only decodes <filename>
print                   //prints the huffman tree for the last file to be encoded
help                    //prints a brief summary of all commands
files                   //prints a list of all default files submitted with this assignment, and a brief
                            description of each
exit                    //exits the program


After exiting, main() runs a cleanup function to free memory allocated by the tree.

===========================
II. HUFFMAN.CPP - CLASSES
===========================
    == NODE CLASS ==
        - name (string)         : character or string stored in the node)
        - frequency (string)    : number of times the node occurs in the file
        - code (string)         : bit code generated by the huffman tree (path from root where L adds '0' and R adds '1')
        - left (Node)           : left child of the node
        - right (Node)          : right child of the node
        - isLeaf (bool)         : true if the node is a leaf (leaves are single characters in huffman trees)

        = void toString()
            - outputs node in format " name  ||  freq  || code  "
        = vector<Node *> postOrder()
            - recursively performs post order traversal with current node as the root.
              Returns vector of nodes in post order.
              Also generates/updates "code" for each node traversed.
              Called by tree root to generate codes, or retrieve vector of nodes in post order for encoding.

    == TREE CLASS ==
        - vector<Node *> freqs      : collection of leaves in the tree, with frequencies. Generated when reading a text file.
        - vector<Node *> huffList   : temporary storage used to generate internal nodes and tree structure
                                        by calling Huffman()
        - Node * root               : root node of the tree

        = Node* contains(string name)
            - searches tree for a leaf node with name "name", and returns it if it exists.
                if no such node is in the tree, creates a new node(name) and pushes it to freqs and huffList.

        = void addChar(string name)
            - increments the frequency of the node with name "name", either by retrieving and increment existing node,
                or creating a new node if "name" is new.

        = void Huffman()
            - constructs huffman tree using huffList. Called when all leaves have been added to huffList.
                constructs tree by repeatedly combining two lowest frequency nodes (a) and (b) in huffList
                into a new node (ab), with frequency a.frequency + b.frequency
                When huffList.size() == 1, the remaining node is the root of the tree.
                Then runs postOrder() on the root to determine the bit codes for all nodes in the tree.
        = string Encode()
            - returns an encoding of a post-order traveral of the huffman tree with the format outlined in
                    COMPRESSED FILE FORMAT section below.
              EOF character is encoded as byte "11111111" or 255 as an unsigned char, which is not on the ASCII chart


===========================
III. HUFFMAN.CPP - FUNCTIONS
===========================
Important functions in huffman.cpp are as follows:

    ===  Tree makeTree(string fileName)  ===
        -   Generates and returns a huffman tree for the text file "fileName" by:
                1. adding a node to t for each character in the file, and finding the correct frequencies for each char by
                    reading through the file once
                2. calling t.Huffman() to generate the tree structure and bit codes for the nodes.
            The tree can then be used to encode/compress the characters in the text file.
            When all characters in the file have been read, adds an "EOF" node with frequency 1, to generate EOF bit code.

    ===  long encode(string fileRoot, Tree t)  ===
        -   Encodes/compresses the file "TXT_DIR + fileRoot + TXT_EXT", using the tree t.
            The compressed file is written to the file "HUF_DIR + fileRoot + HUF_EXT"
            Compression is done by converting the string of characters in the file to a string representation of the
                bit codes determined by the tree, and then that string of bits are compressed into characters.
            This function returns the number of bytes written to the compressed file, for comparison with the original file.

    ===  long decode(string fileRoot)  ===
        -   Decodes/decompresses the file "HUF_DIR + fileRoot + HUF_EXT" by first decoding an encoded huffman tree, then
                using that tree to decode the remainder of the file (see COMPRESSED FILE FORMAT).
            The decoded file is written to the file "OUT_DIR + fileROOT + OUT_EXT"
            Decompression is done by converting the string of characters in the file to a string representation of all bits
                in the file, then calling decodeBytes() on that string of bits.
            This function returns the number of bytes written to the decompressed file, for comparisson with the compressed file.

    ===  string decodeBytes(string bytes)  ===
        -   Decodes a string of bytes by first calling decodeTree(bytes) to generate a huffman tree,
                and then using that tree to decode the remainder of the file.
            Because the final byte of the file may have been padded with 0's, decodeBytes stops reading when the EOF
                character is encountered

    ===  Tree decodeTree(string bytes)  ===
        -   Decodes and returns a huffman tree from bytes by:
                1. Pulling the number of leaves (numLeaves) in the tree from the first byte
                2. Generate tree stucture by reading through bits and adding nodes to a stack as follows:
                    - if a '1' is encountered, read the next 8 bits to get the char represented by that leaf. Push this leaf
                      onto the stack. Increment a counter for the number of leaves read (leavesFound).
                    - if a '0' is encountered, pop two nodes off the stack and combine to form an internal node. Push this
                      node onto the stack.
                    When the leavesFound = numLeaves (all leaves have been read), AND the stack contains only one node (the
                        root), the tree has been fully decoded.
                3. Call postOrder() on the root to generate the bit codes for each node in the tree before returning the tree.



===========================
IV. COMPRESSED FILE FORMAT
===========================

This program encodes huffman-compressed files in the following format:

<numLeaves> <encoded tree> <compressed file> <EOF>

where:

<numLeaves>             ==  a single byte representing the number of leaves in the huffman tree.
<encoded tree>          ==  a post-order traversal of the tree, encoded in the following way:
                            - Leaves are represented as a '1' followed by the 8 bits representing the character in that leaf.
                            - Internal nodes are represented as a '0', with no additional bits
<compressed file>       ==  the compressed text from the original file, where each character from the original file has
                            been converted to the bit code determined by the huffman tree
<EOF>                   ==  a single 'character' added to the end of the file to signify when to stop reading bits -
                                the final byte is padded to 8 bits with trailing 0's, which could be read incorrectly without
                                the EOF.


===========================
V. CHANGING DEFAULT EXTENSIONS/DIRECTORIES
===========================

Default directory locations and extensions can be modified in the file huffman.cpp, by changing the global variables:

TXT_EXT     -- the input file extension (by default ".txt")
HUF_EXT     -- the extension added to the root filename for compressed files (by default "-HUF.txt");
OUT_EXT     -- the extension added to the root filename for decompressed files (by default "-o.txt");

TXT_DIR     -- directory of source files (by default "TXT-files/")
HUF_DIR     -- directory for files compressed by the program (by default "HUF-files/")
OUT_DIR     -- directory for files decompressed by the program (by default "OUT-files/")

About

This is a command line huffman compressor/decompressor, which can read text files, generate Huffman trees for those files, encode files into compressed "huffman files", and decompress huffman files into text files. Written in C++

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages