Anagrams

A Quick Take on anagram finding in Rust.

The requirements for the specific task are discussed here. In essense, make a program, that takes a dictionary and a word as arguments and find all the anagrams of the word in the dictionary

Usage

Compatiblity Note: the program will exit with an error if the dictionary is not in UTF-8 encoding and contains non-ASCII characters.

Executing existing Linux binary

The project includes a linux binary (compiled under ubuntu 18.04) int the .bin dir. It also includes the standard "lemmad.txt" file, specified in the requirements.

NB! Included binary only works under linux! Also, if you are using older CPU architectures and the binary fails, then use bin/anagrams_compatiblity executable (newer than Intel 4. gen Core, e.g. "Haswell" should be fine) instead of the bin/anagrams.

Searching for anagrams for a word is done with the following command:

bin/anagrams dicts/lemmad.txt asi
> 4587,ais,asi,isa,sai

where the first argument is the dictionary and the other one is the word to search for.

To use a word with spaces as args do the following:

bin/anagrams dicts/lemmad.txt "aGu isAEEtall"
> 4674,Augeiase tall

When working with huge dictionarys that do not fit to memory (e.g tens of gigabytes), use the streaming mode with the command line flag '-b':

bin/anagrams dicts/megagigalarge.txt "aGu isAEEtall" -b

The output is the execution time in microseconds and the found anagrams as written in requirements.

Compiling from source (any platform)

install rust:

curl https://sh.rustup.rs -sSf | sh

build the project:

cargo build --release

run the executable: (built into the ./target/release folder).

./target/release/anagrams dicts/lemmad.txt anagramm

(or with any other desired parameters)

About the solution:

Algorithm

The naive approach would be, to compare the length word to every string in the dictionary. If it matches sort both strings and validate that they are equal. That however is far from optimal as comparison based sorting has a minimal complexitiy of O(n log n).

Instead we could "abuse" the fact that the possible number of characters is limited and use something similar to counting sort O(n).

E.g. We could create an "alphabet vector" for the first string (containing all the possible characters one could encounter in the dictionary). Then:

Check that the strings have equal length
If true, then traverse the first string adding 1 to the positions of each letter in the vector.
Traverse the 2nd string substracting 1 from the vector for each letter in the string. When encountering a zero return false immediately as the current letter in the 2nd string has more occurrences than in the 1st one
once the entire string is traversed the strngs must be equal, return true

Relevant Javascript code:

// Example code in javascript for simplicity
function isAnagram(word, candidate) {
  const counts = new Array(letters.length).fill(0);
  // add letters of the search word
  for(let letter of word) {
    counts[indexes[letter]] += 1;
  }
  // subtract the letters of the candidate word
  for(let letter of candidate) {
    const idx = indexes[letter];
    const count = memo[idx];
    if(count === 0) return false 
    counts[idx] -= 1;
  }
  return true;
}

Optimizations

Due to the specifics of the task, we can make some rather obvious "optimizations".

As we have a fixed "search word" we don't need the entire alphabet, we can just create a vector containing all the symbols of only that word. If candidate word doesn't have one of these, return false immediately.
We can precalculate the count vector, as it's constantly reused.

That algorithm is closer to:

//Example code in javascript for simplicity
function isAnagram(letterCounts, candidate) {
  const counts = [...letterCounts]; //clone

  for(let letter of candidate) {
    const idx = indexes[letter];
    const count = memo[idx];
    if(count === 0) return false 
    counts[idx] -= 1;
  }
  return true;
}

This has the algorithm settled. The rest is implementation details and I/O stuff (as this excercise is majorly I/O bound).

Why RUST?

Rust is a systems programming language that makes fearless concurrency relatively easy. It's lightweight, has no carbage collection and has modern build-, packaging- and unit-testing tools included out of the box.

On top of that RUST allows pretty simple Thread level Parallelism (via rayon) and Instruction Level Parallelism (via SIMD libraries, like numeric-array) without changing your code.

As the anagram problem is embarrassingly parallel, it probably makes sense to use at least thread-level concurrency.

For instance: the function for streaming the file and finding anagrams sequencially is the following

    let (letter_indexes, letter_counts) = precalc_letter_data(word);

    let mut results: Vec<String> = Vec::new();

    let file = File::open(path)
        .expect("Something went wrong reading the file");

    for line in BufReader::new(file).lines() {
        let candidate = &line?;
        if candidate.len() == word.len() && is_anagram(candidate, &letter_counts,  &letter_indexes) {
            results.push(candidate.clone());
        }

    }
    Ok(results)
}

The only change needed to make to make this run on all the logical processors available (e.g. 12 threads on a modern 6-core i7) is the following:

pub fn find_anagrams_parallel(word: &str, path: &str) -> Vec<String> {

    let (letter_indexes, letter_counts) = precalc_letter_data(word);

    let contents = read_to_string(path)
        .expect("Something went wrong reading the file");

    contents
        .par_lines() // <- this is all the parallelization magic there is
        .filter(|candidate|  {
            candidate.len() == word.len() && is_anagram(candidate, &letter_counts,  &letter_indexes)
        })
        .map(|candidate| candidate.to_string().clone())
        .collect()
}

The only real change is replacing .iter() with .par_iter() and Rayon does the rest! :)

Although one must note, that the parallel doesn't stream. Rather it loads the entire dictionary to memory in the beginning. This is required for two reasons:

The vast majority of time goes to fetching the dictonary. When buffering there is no real benefit from using multiple cores.
Rayon doesn't have a good answer to the problem yet (and probably won't have, until async/await is added to rust).

Rust specific optimizations:

Rayon for parallelizing
HashBrown instead of system hashmaps (uses google's swiss-tables and is considerably more optimized)
[WIP] numeric-arrays or small-vectors instead of system vectors
Enabling Link Time Optimizations in Cargol.toml - No real benefit
mark some methods for inlining - No real benefit
Tired forcing target-native (to allow SIMD on modern processors) - No real benefit.
Tried jemalloc instead of system-allocator - reverted, No benefit

Some benchmarking on the development computer:

The reference version written in node (streaming) finds anagrams of 'Agu isaeetall' in ~37648 microseconds
The rust (in-memory and parallel) version takes: ~4471 microseconds

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
bin		bin
dicts		dicts
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bin

bin

dicts

dicts

src

src

.gitignore

.gitignore

Cargo.lock

Cargo.lock

Cargo.toml

Cargo.toml

README.md

README.md

Repository files navigation

Anagrams

Usage

Executing existing Linux binary

Compiling from source (any platform)

About the solution:

Algorithm

Optimizations

Why RUST?

Rust specific optimizations:

Some benchmarking on the development computer:

About

Releases

Packages

Languages

kaitk/anagrams

Folders and files

Latest commit

History

Repository files navigation

Anagrams

Usage

Executing existing Linux binary

Compiling from source (any platform)

About the solution:

Algorithm

Optimizations

Why RUST?

Rust specific optimizations:

Some benchmarking on the development computer:

About

Topics

Resources

Stars

Watchers

Forks

Languages