lz77 compression algorithm implemented #708

proof185 · 2024-04-30T00:46:10Z

No description provided.

codecov-commenter · 2024-04-30T00:48:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.81%. Comparing base (20c92d7) to head (3992a4a).

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #708      +/-   ##
==========================================
+ Coverage   94.78%   94.81%   +0.02%     
==========================================
  Files         297      298       +1     
  Lines       22149    22259     +110     
==========================================
+ Hits        20994    21104     +110     
  Misses       1155     1155


TruongNhanNguyen · 2024-04-30T01:58:38Z

Please describe the PR and the problem it addresses; this helps reviewers understand what the PR does. Could you explain in detail how you solve the problem? Please also add docstrings to explain your code.

TruongNhanNguyen · 2024-04-30T03:08:11Z

The overall logic is good, but there are some points to improve in the implementation:

- Break the lz77_encode function down into multiple helper functions:
  - The main lz77_encode function delegates the task of finding the longest match to a helper function called find_longest_match.
  - find_longest_match iterates over the input string to find the longest match at a given index.
  - Another helper function, get_match_length, calculates the length of the match between two substrings.
  Breaking the functionality into smaller, focused functions makes the implementation easier to understand and maintain. Each function has a single responsibility, which makes the code more modular.
- Add module-level documentation and comments for functions and structs. This helps readers understand the purpose and usage of each component.
- Add a Token struct to encapsulate the information about each token generated during compression.
- Implement PartialEq for Token so that tokens can be compared easily, which is useful for testing and general usage.
- Use a macro to define tests with different inputs and expected outputs. This reduces code duplication and improves readability.
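The Token and PartialEq points above can be sketched as follows. This is a minimal illustration, not the code proposed in this PR: it assumes that deriving `PartialEq` (rather than writing the `impl` by hand) is acceptable, and the `pub` fields are an assumption made only so the snippet stands alone.

```rust
/// One LZ77 token: a back-reference into already-seen text plus the
/// literal character that follows the referenced run.
/// Deriving `PartialEq` compares the three fields in order, which is the
/// same result a hand-written field-by-field `eq` would give.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    /// How far back the match starts (0 means "no match").
    pub distance: usize,
    /// How many characters the match covers.
    pub length: usize,
    /// The literal character emitted after the match.
    pub next_char: char,
}

impl Token {
    /// Small constructor so call sites read `Token::new(d, l, c)`.
    pub fn new(distance: usize, length: usize, next_char: char) -> Self {
        Token {
            distance,
            length,
            next_char,
        }
    }
}
```

With the derive in place, `assert_eq!` can compare whole `Vec<Token>` values directly in tests, since `Debug` and `PartialEq` are both available.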

TruongNhanNguyen · 2024-04-30T03:14:19Z

src/compression/lz77.rs

+#[cfg(test)]
+mod test {
+    use super::*;
+
+    #[test]
+    fn test_lz77_encode() {
+        let res = lz77_encode("");
+        assert_eq!(res, []);
+
+        let res = lz77_encode("A");
+        let expected: Vec<(usize, usize, char)> = vec![(0, 0, 'A')];
+        assert_eq!(res, expected);
+
+        let res = lz77_encode("AA");
+        let expected: Vec<(usize, usize, char)> = vec![(0, 0, 'A'), (0, 0, 'A')];
+        assert_eq!(res, expected);
+
+        let res = lz77_encode("AAAABBBCCDAA");
+        let expected: Vec<(usize, usize, char)> = vec![
+            (0, 0, 'A'),
+            (1, 1, 'A'),
+            (3, 1, 'B'),
+            (1, 1, 'B'),
+            (0, 0, 'C'),
+            (1, 1, 'D'),
+            (10, 1, 'A'),
+        ];
+        assert_eq!(res, expected);
+
+        let res = lz77_encode("Rust-Trends");
+        let expected: Vec<(usize, usize, char)> = vec![
+            (0, 0, 'R'),
+            (0, 0, 'u'),
+            (0, 0, 's'),
+            (0, 0, 't'),
+            (0, 0, '-'),
+            (0, 0, 'T'),
+            (0, 0, 'r'),
+            (0, 0, 'e'),
+            (0, 0, 'n'),
+            (0, 0, 'd'),
+            (0, 0, 's'),
+        ];
+        assert_eq!(res, expected);
+    }
+
+    #[test]
+    fn test_lz77_decode() {
+        let res = lz77_decode(vec![]);
+        assert_eq!(res, "");
+        let res = lz77_decode(vec![(0, 0, 'A')]);
+        assert_eq!(res, "A");
+        let res = lz77_decode(vec![(0, 0, 'A'), (0, 0, 'A')]);
+        assert_eq!(res, "AA");
+        let res = lz77_decode(vec![
+            (0, 0, 'A'),
+            (1, 1, 'A'),
+            (3, 1, 'B'),
+            (1, 1, 'B'),
+            (0, 0, 'C'),
+            (1, 1, 'D'),
+            (10, 1, 'A'),
+        ]);
+        assert_eq!(res, "AAAABBBCCDAA");
+        let res = lz77_decode(vec![
+            (0, 0, 'R'),
+            (0, 0, 'u'),
+            (0, 0, 's'),
+            (0, 0, 't'),
+            (0, 0, '-'),
+            (0, 0, 'T'),
+            (0, 0, 'r'),
+            (0, 0, 'e'),
+            (0, 0, 'n'),
+            (0, 0, 'd'),
+            (0, 0, 's'),
+        ]);
+        assert_eq!(res, "Rust-Trends");
+    }
+}


Suggested change

#[cfg(test)]
mod tests {
    use super::*;

    macro_rules! test_lz77_encode_decode {
        ($($name:ident: $input:expr, $expected:expr,)*) => {
            $(
                #[test]
                fn $name() {
                    let input_string = $input;
                    let expected_tokens = $expected
                        .iter()
                        .map(|&(distance, length, next_char)| Token::new(distance, length, next_char))
                        .collect::<Vec<_>>();
                    let encoded_tokens = lz77_encode(&input_string);
                    let decoded_string = lz77_decode(encoded_tokens.clone());
                    assert_eq!(input_string, decoded_string);
                    assert_eq!(encoded_tokens, expected_tokens);
                }
            )*
        };
    }

    test_lz77_encode_decode! {
        empty_string: "", [],
        single_character: "A", [(0, 0, 'A')],
        repeated_characters: "AA", [(0, 0, 'A'), (0, 0, 'A')],
        mixed_characters: "AAAABBBCCDAA", [
            (0, 0, 'A'),
            (1, 1, 'A'),
            (3, 1, 'B'),
            (1, 1, 'B'),
            (0, 0, 'C'),
            (1, 1, 'D'),
            (10, 1, 'A'),
        ],
        alphanumeric_characters_with_dash: "Rust-Trends", [
            (0, 0, 'R'),
            (0, 0, 'u'),
            (0, 0, 's'),
            (0, 0, 't'),
            (0, 0, '-'),
            (0, 0, 'T'),
            (0, 0, 'r'),
            (0, 0, 'e'),
            (0, 0, 'n'),
            (0, 0, 'd'),
            (0, 0, 's'),
        ],
    }
}

TruongNhanNguyen · 2024-04-30T03:50:46Z

src/compression/lz77.rs

+pub fn lz77_decode(tokens: Vec<(usize, usize, char)>) -> String {
+    let mut result = String::new();
+    for token in tokens {
+        if token.0 != 0 {
+            let start = result.len() - token.0;
+            let length = token.1;
+            let substring: String = result.chars().skip(start).take(length).collect();
+            result += &substring;
+        };
+        result.push(token.2);
+    }
+    result
+}


Suggested change

/// Decompresses a vector of tokens generated by the LZ77 algorithm and returns the original input string.
///
/// # Arguments
///
/// * `tokens` - A vector of tokens generated by the LZ77 compression algorithm.
///
/// # Returns
///
/// The original input string before compression.
pub fn lz77_decode(tokens: Vec<Token>) -> String {
    let mut result = String::new();
    for token in tokens {
        if token.distance != 0 {
            let start = result.len() - token.distance;
            let length = token.length;
            let substring: String = result.chars().skip(start).take(length).collect();
            result += &substring;
        };
        result.push(token.next_char);
    }
    result
}
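One thing worth noting about the decoder above: `result.len()` counts bytes while `chars().skip(start)` counts characters, so the two agree only for ASCII input, and a `distance` larger than the decoded text underflows the subtraction. The sketch below is one possible way to handle both, not the code from this PR. The name `lz77_decode_checked`, the `Option` return type, and the `pub` fields on `Token` are assumptions made for illustration.

```rust
/// Stand-in for the `Token` struct discussed in this review
/// (fields made `pub` here only so the sketch is self-contained).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub distance: usize,
    pub length: usize,
    pub next_char: char,
}

/// Hypothetical decoder that measures distances in characters and
/// returns `None` instead of panicking on an out-of-range distance.
pub fn lz77_decode_checked(tokens: &[Token]) -> Option<String> {
    // Decode into a Vec<char> so that indices are character positions.
    let mut out: Vec<char> = Vec::new();
    for token in tokens {
        if token.distance != 0 {
            // `checked_sub` yields None when the token points before the start.
            let start = out.len().checked_sub(token.distance)?;
            // Copy one character at a time; `out` grows as we copy, so the
            // read index always stays behind the write position.
            for offset in 0..token.length {
                let c = out[start + offset];
                out.push(c);
            }
        }
        out.push(token.next_char);
    }
    Some(out.into_iter().collect())
}
```

Taking `&[Token]` instead of `Vec<Token>` is a further assumption; it lets a caller decode without cloning the token vector, which the test macro in this thread currently does via `encoded_tokens.clone()`.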

TruongNhanNguyen · 2024-04-30T08:24:16Z

src/compression/lz77.rs

+pub fn lz77_encode(input: &str) -> Vec<(usize, usize, char)> {
+    let mut tokens = Vec::new();
+    let mut index = 0;
+
+    while index < input.len() {
+        let mut best_match = (0, 0);
+        let candidate = &input[index..input.len() - 1];
+
+        for i in 0..index {
+            let search_box = &input[i..index];
+            let match_length = candidate
+                .chars()
+                .zip(search_box.chars())
+                .take_while(|&(a, b)| a == b)
+                .count();
+
+            if match_length > best_match.1 {
+                best_match = (index - i, match_length);
+            }
+        }
+
+        if best_match.1 > 0 {
+            tokens.push((
+                best_match.0,
+                best_match.1,
+                input.chars().nth(index + best_match.1).unwrap(),
+            ));
+            index += best_match.1 + 1;
+        } else {
+            tokens.push((0, 0, input.chars().nth(index).unwrap()));
+            index += 1;
+        }
+    }
+    tokens
+}


Suggested change

//! This module provides an implementation of the LZ77 compression algorithm.
//!
//! LZ77 is a lossless data compression algorithm that replaces repeated occurrences
//! of data with references to a single copy of that data existing earlier in the uncompressed data stream.
//! It achieves compression by replacing repeated occurrences of data with references to a dictionary
//! that contains the position and length of the previous occurrence of that data.
//!
//! # References
//!
//! - [LZ77 Compression Algorithm](https://en.wikipedia.org/wiki/LZ77_and_LZ78)

/// ## Token struct
///
/// The `Token` struct represents a token generated during the LZ77 compression process.
/// It consists of three fields: `distance`, `length`, and `next_char`.
///
/// - `distance`: Distance to the previous occurrence of the matched substring.
/// - `length`: Length of the matched substring.
/// - `next_char`: The next character in the input stream after the matched substring.
#[derive(Debug, Clone, Copy)]
pub struct Token {
    distance: usize,
    length: usize,
    next_char: char,
}

impl Token {
    fn new(distance: usize, length: usize, next_char: char) -> Self {
        Token {
            distance,
            length,
            next_char,
        }
    }
}

impl PartialEq for Token {
    fn eq(&self, other: &Self) -> bool {
        self.distance == other.distance
            && self.length == other.length
            && self.next_char == other.next_char
    }
}

/// Compresses a given input string using the LZ77 algorithm and returns a vector of tokens.
///
/// # Arguments
///
/// * `input` - The input string to be compressed.
///
/// # Returns
///
/// A vector of tokens representing the compressed input string.
pub fn lz77_encode(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut index = 0;

    while index < input.len() {
        let longest_match = find_longest_match(input, index);

        if longest_match.1 > 0 {
            tokens.push(Token::new(
                longest_match.0,
                longest_match.1,
                input.chars().nth(index + longest_match.1).unwrap(),
            ));
            index += longest_match.1 + 1;
        } else {
            tokens.push(Token::new(0, 0, input.chars().nth(index).unwrap()));
            index += 1;
        }
    }
    tokens
}

/// Finds the longest match for the current index in the input string.
///
/// This function searches for the longest match of the substring starting from
/// the current index within the input string. It iterates through the sliding
/// window to find the longest matching sequence.
///
/// # Arguments
///
/// * `input` - The input string to search within.
/// * `index` - The current index in the input string to start the search from.
///
/// # Returns
///
/// A tuple containing the distance to the previous occurrence of the matched substring
/// and the length of the matched substring.
fn find_longest_match(input: &str, index: usize) -> (usize, usize) {
    let candidate = &input[index..input.len() - 1];
    let mut longest_match = (0, 0);

    for i in 0..index {
        let search_box = &input[i..index];
        let match_length = get_match_length(candidate, search_box);

        if match_length > longest_match.1 {
            longest_match = (index - i, match_length);
        }
    }
    longest_match
}

/// Determines the length of the match between two substrings.
///
/// This function calculates the length of the match between the candidate
/// substring and the substring within the sliding window.
///
/// # Arguments
///
/// * `candidate` - The substring being compared from the current index.
/// * `search_box` - The substring within the sliding window to compare with.
///
/// # Returns
///
/// The length of the matching sequence between the candidate and search_box substrings.
fn get_match_length(candidate: &str, search_box: &str) -> usize {
    candidate
        .chars()
        .zip(search_box.chars())
        .take_while(|&(a, b)| a == b)
        .count()
}
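The encoder above slices `input` by byte offsets (`&input[index..input.len() - 1]`) and then passes the same index to `chars().nth(...)`, which counts characters. These agree for ASCII, which covers every test case in this thread, but on multi-byte input the slice can land inside a character and panic. A minimal sketch of one way around that, assuming it is acceptable to collect the input into a `Vec<char>` first, is shown below. The names `lz77_encode_chars`, `find_longest_match_chars`, and `match_length` are hypothetical and not part of this PR.

```rust
/// Stand-in for the `Token` struct from the suggestion above
/// (fields made `pub` here only so the sketch is self-contained).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub distance: usize,
    pub length: usize,
    pub next_char: char,
}

/// Length of the common prefix of two character slices.
fn match_length(candidate: &[char], search_box: &[char]) -> usize {
    candidate
        .iter()
        .zip(search_box)
        .take_while(|(a, b)| a == b)
        .count()
}

/// Same search as `find_longest_match`, but every index is a character
/// position. The caller must ensure `index < chars.len()`.
fn find_longest_match_chars(chars: &[char], index: usize) -> (usize, usize) {
    // Exclude the final character so a literal always follows any match.
    let candidate = &chars[index..chars.len() - 1];
    let mut longest = (0, 0);
    for i in 0..index {
        let len = match_length(candidate, &chars[i..index]);
        if len > longest.1 {
            longest = (index - i, len);
        }
    }
    longest
}

/// Hypothetical character-indexed variant of `lz77_encode`.
pub fn lz77_encode_chars(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut index = 0;
    while index < chars.len() {
        let (distance, length) = find_longest_match_chars(&chars, index);
        // With no match, (distance, length) is (0, 0) and this emits a
        // plain literal, so both branches of the original collapse into one.
        tokens.push(Token {
            distance,
            length,
            next_char: chars[index + length],
        });
        index += length + 1;
    }
    tokens
}
```

Collecting into a `Vec<char>` costs one extra allocation, but it also replaces the repeated `input.chars().nth(...)` calls, each of which walks the string from the start, with constant-time indexing.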

proof185 added 2 commits April 30, 2024 00:09

- Update mod.rs (6de73c7)
- lz77 implemented (3992a4a)

proof185 requested review from imp2002 and vil02 as code owners April 30, 2024 00:46

TruongNhanNguyen reviewed Apr 30, 2024
