
# ENT - view string metrics and entropy of arbitrary files

Available as a command line tool, a .NET library, or a web application (coming)

author: lo sauer 2011-12; www.lsauer.net
website: https://github.com/lsauer/entropy
license: MIT license
description: quickly plot entropy information and string metrics of arbitrary files or strings from console input/output

Screencasts and example plots:

  • Analysing a file
  • Cross-platform usage
  • Plots as vector graphics (default: .svg), from the first screencast
  • Analysing a Twitter live stream

  • Windows Usage:
    > type file1.ext file2.ext file3.ext | ent -b 2.15
    
    > echo "teststringdata" | ent -b 2.15 -s
  • Linux, Mac Usage:
    $ cat file1.ext file2.ext file3.ext | ent -b 2.15
    
    $ echo "sometextdata" | ent -b 64 -s

### Background

In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message (a specific instance of the random variable), in units of bits. (See: Entropy (information theory))
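For reference, this is the standard textbook definition (not taken from the tool's source): the Shannon entropy of a random variable X with outcome probabilities p_i, and the b-ary variant that the -b parameter refers to, can be written as

    % Shannon entropy in bits (log base 2)
    H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

    % b-ary entropy: the same sum in log base b (cf. the -b parameter)
    H_b(X) = -\sum_{i=1}^{n} p_i \log_b p_i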

A file with high entropy shows few repeated patterns and is typically already compressed or otherwise optimized. Such a highly compressed data stream will show an entropy of greater than 5 bits per byte for log base 2.
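To illustrate this claim, here is a minimal, hypothetical C# sketch (independent of the tool's actual implementation) that computes the byte-frequency entropy of two buffers: a repetitive buffer comes out near 0 bits per byte, while a (pseudo-)random one approaches the maximum of 8 bits per byte:

    using System;
    using System.Linq;

    class EntropyDemo
    {
        // Zero-order (byte-frequency) entropy of a buffer, in units of log base `logBase`.
        static double Entropy(byte[] data, double logBase = 2.0)
        {
            var counts = new int[256];
            foreach (var b in data) counts[b]++;
            double n = data.Length;
            return counts.Where(c => c > 0)
                         .Sum(c => { double p = c / n; return -p * Math.Log(p, logBase); });
        }

        static void Main()
        {
            var repetitive = Enumerable.Repeat((byte)'A', 4096).ToArray(); // highly compressible
            var random = new byte[4096];
            new Random(42).NextBytes(random);                              // hard to compress further
            Console.WriteLine($"repetitive: {Entropy(repetitive):F2} bits/byte"); // ~0.00
            Console.WriteLine($"random:     {Entropy(random):F2} bits/byte");     // close to 8.00
        }
    }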

### Purpose

The purpose of this tool is to quickly analyze arbitrary files, for instance biological sequence files or serialized JSON data. Naturally, researchers have more powerful investigation options at their disposal, such as the R statistical language, MATLAB, or Wolfram Mathematica.

Ad hoc investigation, however, is much faster with a dedicated command line tool, along with the features the console environment provides, such as autocompletion (pressing Tab). There is virtually no startup time, and plots can be output as web-compatible vector graphics.

### Usage

    Usage: shantropy [<filename1> <fname2>...1st param!] [-f fromBy] [-t toBy] [-o <outfile>] [-h help]
    [-e efficiency] [-m 1,2.. 1st,2nd order markov] [-b base-alphabet]
    [-w width plot] [-h height plot] [-z zoom%] [-fp fileposition]
    [-p plot permutation entropy] [-s <string> as last param!]
    Press CTRL+C or Q to Quit!
    Press PAGEUP / PAGEDOWN to zoom in or out of the file
    Press LEFT / RIGHT Arrow to navigate to the next or previous file-segment

### Parameters

  • -m 0 zero-order Markov source: the default (practically identical with the Shannon entropy when the log base is 2)
  • -m 1 first-order Markov source: the number of linked (conditioning) characters is one (see the sketch after the parameter notes below)
  • -m 2 second-order Markov source, e.g. ent -m 2
  • -m <n> nth-order Markov source
  • -b <decimal> "b-ary entropy": a different log base can be set, e.g. ent -b 2,15 ; the default is 256 for ASCII; use 64 for literature text
  • -s <stringdata> arbitrary string passing: -s must be passed as the last argument!
  • -w <int> width of the plot
  • -h <int> height of the plot
  • -f, -t <int> define a file segment in bytes (from/to); both are optional
  • -z file-segment zoom in percent
  • -fp file-segment position (0-n)
  • -o <outfile> plot data to the given file; creates the file or appends to it
  • use ent .... > myfile.out to capture the entire console output
  • use ent .... > mygraphics.svg to plot to an SVG file
  • -e plot the efficiency of the data
  • -p plot and compute the permutation entropy (sketched below as well)

note: files have to be passed as the first arguments. To calculate metrics for several files, list them in sequence, e.g. ent explain.nfo markdownsharp-20100703-v113.7z -b 3,6
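To make the -m and -p options more concrete, below is a minimal, hypothetical C# sketch (not the tool's actual implementation): a first-order Markov (conditional) entropy over bytes, and a permutation entropy over ordinal patterns; the logBase parameter plays the role of -b:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class EntropySketch
    {
        // First-order Markov entropy H(X_t | X_{t-1}): the uncertainty of the next
        // byte given the previous one, in units of log base `logBase` (cf. -m 1, -b).
        public static double FirstOrderMarkovEntropy(byte[] data, double logBase = 2.0)
        {
            if (data.Length < 2) return 0.0;
            var pairCounts = new int[256, 256];   // counts of (previous, next) byte pairs
            var prevCounts = new int[256];        // counts of the previous byte alone
            for (int i = 1; i < data.Length; i++)
            {
                pairCounts[data[i - 1], data[i]]++;
                prevCounts[data[i - 1]]++;
            }
            double n = data.Length - 1;
            double h = 0.0;
            for (int prev = 0; prev < 256; prev++)
            {
                if (prevCounts[prev] == 0) continue;
                for (int next = 0; next < 256; next++)
                {
                    int c = pairCounts[prev, next];
                    if (c == 0) continue;
                    double pJoint = c / n;                          // P(prev, next)
                    double pCond = (double)c / prevCounts[prev];    // P(next | prev)
                    h -= pJoint * Math.Log(pCond, logBase);
                }
            }
            return h;
        }

        // Permutation entropy of order m: the entropy of the distribution of ordinal
        // patterns (rank orderings) observed in sliding windows of length m (cf. -p).
        public static double PermutationEntropy(byte[] data, int order = 3, double logBase = 2.0)
        {
            var counts = new Dictionary<string, int>();
            for (int i = 0; i + order <= data.Length; i++)
            {
                // The ordinal pattern is the permutation that sorts the window.
                var pattern = string.Join(",", Enumerable.Range(0, order)
                                                         .OrderBy(k => data[i + k]));
                counts[pattern] = counts.TryGetValue(pattern, out var c) ? c + 1 : 1;
            }
            double n = counts.Values.Sum();
            return counts.Values.Sum(c => { double p = c / n; return -p * Math.Log(p, logBase); });
        }

        static void Main()
        {
            var text = System.Text.Encoding.ASCII.GetBytes("abababababababab");
            // ~0 bits: given the previous byte, the next byte is fully determined.
            Console.WriteLine($"first-order Markov: {FirstOrderMarkovEntropy(text):F3}");
            // 1 bit: only two ordinal patterns occur, with equal frequency.
            Console.WriteLine($"permutation:        {PermutationEntropy(text):F3}");
        }
    }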

### Todo:

  • make and use a console-argument hash map or struct for the parameters
  • code cleanup

### Fixes

  • slow -> fixed: the ReadLine loop was replaced by ReadAll; up to 200x speedup
  • navigation of the file for chunked data processing
  • incorrect results -> fixed: for text files, set -b 64 to get meaningful results

### Example

Example for a typical info (.nfo) file: the ordinate (y-axis) shows the entropy and the abscissa (x-axis) shows the file-segment position in percent.

Result: the text is highly compressible and clearly shows structure.

0,60 |                                  ▓▓▓▓▓▓▓▓▓      ▓▓▓▓▓
0,54 |                                  ▓▓▓▓▓▓▓▓▓    ▓▓▓▓▓▓▓    ▓▓
0,48 |                                  ▓▓▓▓▓▓▓▓▓    ▓▓▓▓▓▓▓    ▓▓▓
0,42 |           ▓    ▓▓  ▓             ▓▓▓▓▓▓▓▓▓▓   ▓▓▓▓▓▓▓    ▓▓▓
0,36 |           ▓    ▓▓▓▓▓    ▓        ▓▓▓▓▓▓▓▓▓▓   ▓▓▓▓▓▓▓    ▓▓▓
0,30 |           ▓▓   ▓▓▓▓▓    ▓ ▓      ▓▓▓▓▓▓▓▓▓▓   ▓▓▓▓▓▓▓    ▓▓▓
0,24 | ▓       ▓ ▓▓ ▓ ▓▓▓▓▓  ▓ ▓ ▓   ▓ ▓▓▓▓▓▓▓▓▓▓▓   ▓▓▓▓▓▓▓ ▓  ▓▓▓
0,18 | ▓       ▓▓▓▓ ▓ ▓▓▓▓▓  ▓▓▓ ▓▓  ▓ ▓▓▓▓▓▓▓▓▓▓▓ ▓ ▓▓▓▓▓▓▓ ▓  ▓▓▓
0,12 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓ ▓ ▓▓▓▓▓▓▓ ▓  ▓▓▓
0,06 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓ ▓▓ ▓▓▓
0,00 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓ ▓▓▓▓▓▓
------------------------------------------------------------------
     0%        16%        33%        50%        66%        83%

### Useful links:

### Case studies:

Fork it on GitHub: https://github.com/lsauer/entropy

Have fun! In fact, don't use this program for anything else yet...
