freq

A commandline tool that counts the number of word occurrences in an input.

(Embedded tweet from James Munns on Twitter.)

This is just a placeholder repository for now. Please create issues for feature requests and collaboration.

Usage

Commandline

echo "b a n a n a" | freq

0.16666667 - 1 - b
0.33333334 - 2 - n
0.5 - 3 - a
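
Each output line shows the word's relative frequency (count divided by the total number of words), its absolute count, and the word itself. In the example above, b occurs once among six words, hence 1/6 ≈ 0.1667.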

Library

use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Count word frequencies in the given input file.
    let frequencies = freq::count("fixtures/sample.txt")?;
    println!("{:?}", frequencies);
    Ok(())
}
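
If you only need the basic idea rather than the full feature set, counting words boils down to a HashMap keyed by word. Below is a minimal, self-contained sketch using only the standard library; it is an illustration, not the actual implementation of this crate:

use std::collections::HashMap;

// Count how often each whitespace-separated, lowercased word occurs.
// (Illustrative only; the real tool aims to do much more, see the feature list below.)
fn count_words(input: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in input.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_words("b a n a n a");
    let total: usize = counts.values().sum();
    for (word, count) in &counts {
        // Relative frequency, absolute count, word -- same shape as the CLI output above.
        println!("{} - {} - {}", *count as f32 / total as f32, count, word);
    }
}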

Features

  • Ignore words (regex pattern) [issue 5]
  • Different output formats (plaintext, JSON)
  • freq.toml configuration file
  • Filter stopwords (similar to NLTK's stopwords)
  • Performance (SIMD support, async execution)
  • Recursion support
  • Allow skipping files
  • Allow specifying ignored words in a separate file
  • Generate "heat bars" for words like shell-hist does
  • Split report by file/folder (sort of like sloc does for code)
  • Choose language for stopwords (--lang fr)
  • Format output (e.g. justify counts a la uniq -c)
  • Interactive mode (shows stats while running) (--interactive)
  • Calculate TF-IDF score in a multi-file scenario (see the sketch after this list)
  • Limit the output to the top N words (e.g. --top 3)
  • Ignore hidden files (beginning with .)
  • Minimize number of allocations
  • No-std support?
  • Ignore "words" only consisting of special characters, e.g. ///
  • Multiple files as inputs
  • Glob input patterns
  • If directory is given, walk contents of folder recursively (walker)
  • Verbose output (show the currently analyzed file, etc.)
  • Library usage
  • https://github.com/jonhoo/evmap
  • Automated abstract generation with Luhn's algorithm (issue #1)
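
For the TF-IDF item above: the standard formula is tf-idf(t, d) = tf(t, d) · ln(N / df(t)), where tf(t, d) is the relative frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. A rough sketch of how that could be computed over per-file counts (the function and type names here are hypothetical, not part of the freq API):

use std::collections::HashMap;

// Hypothetical TF-IDF over per-file word counts; not part of the freq API.
// `docs` maps a file name to that file's word counts.
fn tf_idf(docs: &HashMap<String, HashMap<String, usize>>) -> HashMap<(String, String), f64> {
    let n_docs = docs.len() as f64;

    // Document frequency: in how many files does each word appear?
    let mut df: HashMap<&str, usize> = HashMap::new();
    for counts in docs.values() {
        for word in counts.keys() {
            *df.entry(word.as_str()).or_insert(0) += 1;
        }
    }

    let mut scores = HashMap::new();
    for (file, counts) in docs {
        let total: usize = counts.values().sum();
        for (word, count) in counts {
            let tf = *count as f64 / total as f64;              // term frequency in this file
            let idf = (n_docs / df[word.as_str()] as f64).ln(); // inverse document frequency
            scores.insert((file.clone(), word.clone()), tf * idf);
        }
    }
    scores
}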

Idea contributors:

Similar tools

tot-up

A similar tool written in Rust with nice graphical output: https://github.com/payload/tot-up

uniq

A basic version can be built from standard Unix tools:

curl -L 'https://github.com/mre/freq/raw/main/README.md' | tr -cs '[:alnum:]' "\n" | grep -vEx 'and|or|for|a|of|to|an|in' | sort | uniq -c | sort

This works, but it is not very extensible for most users, and it lacks most of the features listed above.

Lucene

Lucene has all the bells and whistles, but there is no official CLI interface, and it requires a full Java installation.

wordcount

Output format: word<tab>freq

Nice and simple, but it doesn't exclude stopwords and has no regex support. https://github.com/juditacs/wordcount

word-frequency

A Haskell-based approach that includes features like a minimum word length or a minimum number of occurrences in a text. https://github.com/cbzehner/word-frequency

What else?

There must be more tools out there. Can you help me find them?

License

Dual-licensed under Apache-2.0 (LICENSE-APACHE) and MIT (LICENSE-MIT).
