General Lempel–Ziv–Welch (LZW) Compression with 12-Bit Word Size

Author: Stephen Tse <[redacted]@cmu.edu>

This is an implementation of the Lempel–Ziv–Welch (LZW) lossless data compression algorithm, capable of processing any file types (text files, binaries, etc). As you may probably know already, LZW is the first adaptive compression algorithm that introduced the idea of progressive model learning to data compression. As a result, LZW eliminates the need to include a codeword model for each transmission of compressed data, and it normally yields better compression rate than algorithms that must generate a complete model beforehand, such as Huffman coding, due to more accurate modeling. LZW is widely used in Unix compress, GIF, TIFF, V.42bis modem, etc.

Note that the algorithm performs best on large amount of data that haven't been compressed before, or contain a lot of repetitive data patterns. If you try to use it on compressed data types (e.g. mp4 files), you'll probably notice that the size of the compressed file is large than that of the original one. After all, Shannon's theorem is unbeatable even for the greatest compression algorithm! You may be able to further compress your file a bit by using more sophisticated LZW variants such as Deflate/zlib (LZ77 variant + Huffman), but nothing more: there's a theoretical limit on how much a given information can be compressed.

How to Use

# compile
javac LZWCompression.java

# compress
java LZWCompression c [original_file_name] [compressed_file_name]

# de-compress
java LZWCompression d [compressed_file_name] [decompressed_file_name]

Implementation Notes

All the data structures required by this program are self-implemented except arrays. Note that this implementation uses linked list for queues, hash tables, etc. as per project requirement, but this is not the ideal data structure since the implementation involves a large number of add / pop operations. Change it to array based lists if you want a performance boost, and trade memory usage for some (slightly) more performance gains if you load and encode the entire file into memory first before compression / de-compression.
This implementation compresses data into 12-bit words (or data chunks). The codeword model will simply be cleared and start anew when the 12-bit word size is used up. As a result, it doesn't behave well on small amount of data especially with high variance (you may observe a larger file size after compression). This will generally be improved on larger data set (with smaller variance). To improve compression rate on datasets with many distinctive patterns, consider modifying the code to increase the word size; to improve compression rate on general cases (don't know the data pattern beforehand), consider measuring compression effectiveness on the fly and use it to decide when to reset the codeword model instead.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
HashTable.java		HashTable.java
LICENSE		LICENSE
LZWCompression.java		LZWCompression.java
README.md		README.md
SinglyLinkedList.java		SinglyLinkedList.java

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

HashTable.java

HashTable.java

LICENSE

LICENSE

LZWCompression.java

LZWCompression.java

README.md

README.md

SinglyLinkedList.java

SinglyLinkedList.java

Repository files navigation

General Lempel–Ziv–Welch (LZW) Compression with 12-Bit Word Size

How to Use

Implementation Notes

About

Releases

Packages

Languages

License

Stephen-X/General-LZW-Compression

Folders and files

Latest commit

History

Repository files navigation

General Lempel–Ziv–Welch (LZW) Compression with 12-Bit Word Size

How to Use

Implementation Notes

About

Topics

Resources

License

Stars

Watchers

Forks

Languages