-
Notifications
You must be signed in to change notification settings - Fork 0
/
histogram
518 lines (414 loc) · 15.1 KB
/
histogram
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
#!/usr/bin/perl
use strict;
=pod
=head1 NAME
histogram - output a bar-graph representing frequency of inputs
=head1 SYNOPSIS
Normal operation:
B<histogram> [I<options>] ['I<awk-program>'] [<input-file> ...]
Provide summary usage information concerning options:
B<histogram> -h
Provide detailed usage information concerning options:
B<histogram> -H
=head1 DESCRIPTION
Creates a histogram of the input data, using an awk-defined filter,
and outputs to the tty a horizontal bar-graph representing the histogram.
After all I<options> are processed, the first non-option argument is
tested to see if it is a file that exists; if not, it is treated as an
I<awk-program> and is provided to B<awk>. Otherwise, the I<awk-program>
provided to awk is simply C<{print}>, so that awk behaves like B<cat>. All
arguments, including any awk-specific options and input-files, are
passed as arguments to awk. Standard-input is also passed to awk. The
output from awk is then read in by B<histogram>, and each input line is
grouped and counted. After all inputs have been consumed, B<histogram>
outputs a bar graph, where each output line represents one of the input
lines and the frequency of that line's occurrence in the input stream.
Unless otherwise specified (B<-w>), the entire width of the TTY is used,
with the graph for the item of highest frequency using the full line
width. If an item is found only once, its frequency will be represented
by at least one glyph. A threshold (B<-t>) can be set so that
only items which appear I<n> times in the input will appear in the output.
The output is by default sorted in alphanumeric order by the input line,
or optionally by frequency (B<-q>) of occurrence. This output order
can be reversed (B<-r>). Optionally, the output can display an exact
count (B<-c>) beside each graph. Unless otherwise specified (with either
B<-x> or B<-k>), the keys are right-aligned to a column which is the lesser of
the longest key-length or 20% of the TTY width. If the longest key is
greater than this 20%, the output for each item is split onto two lines,
one for the key and the other for the graph, but one-line output can be
forced (B<-x>, B<-1>). The graph will be scaled from 1 to the maximum
frequency, but it can be also scaled from the minimum frequency found
to the maximum (B<-m>). The maximum can be treated as the total number
of items found and scaled accordingly; in this case, a graph representing
this total is shown for comparison purposes (B<-T>).
An I<awk-program> looks like this:
<expression> { I<program> }
See the B<awk> manual for how to formulate I<expression> and
I<program>. It's usually as simple as a regular expression. See
L<EXAMPLES> below.
=head1 OPTIONS
Many, some of which are passed to awk. See the output of either -h or -H.
The input-files, if any, are passed to awk; if none are provided, awk
will use B<histogram>'s standard-input. The options -f, -F, -v, and -W are
typical awk options; B<histogram> always passes these arguments on to awk.
=head1 ENVIRONMENT
=over 5
=item AWKPATH
The path of the awk program to be used.
=item COLUMNS
The width of the TTY, usually changed by the shell upon SIGWINCH, but can
be overridden on the command line with B(<-w>).
=back
=head1 RETURN VALUE
Zero on help and if something was output and no error.
One if there was no input from awk to be processed (grep-like behavior).
Otherwise, the error-code from awk is given.
=head1 SEE ALSO
awk
=cut
sub usage {
my $detail = (shift || 0);
local $0;
$0 =~ s:^.*/::g;
print <<"USAGE";
Usage: $0 [<options>] [<awk-options>] ['<awk-filter>'] [<input-files>]
Outputs to the tty a simple horizontal bar graph, each bar representing
the number of occurences of unique items found in the input files or stream.
Where <awk-options> are anything awk recognizes, the awk-filter is the
usual '/pattern/ {program}' syntax of awk, and any input-files (or standard
input if none given).
USAGE
print <<"SUMMARY"
Option summary. Use -H for detailed help.
-t N|N% -N -N% set threshold to N or N%
-q -r sort by freQuency, and/or Rerverse the order
-w N set output width
-k N set maximum key width
-T -c -m -1 -x other output options -- run $0 with -H
SUMMARY
if $detail == 0 ;
print <<'DETAIL'
Counting options:
-t N Threshold - an item must be present N times
-t N% Threshold - an item must be present in at least N% of total
-N shortcut for -t N (where N is an integer)
-N% shortcut for -t N% (where N is a decimal number)
An item must be present at least once, so -t 1 means nothing,
and thus -1 is used below for the "single-line" option.
Output options:
-T Act as if the total number of items is an item;
the last output line is a graph representing the total;
all graphs are scaled to this one.
-c Show counts at end of each graph
-m Scale graph from min to max - default is 1 to max
-q Sort by freuQency - default is sort by keyname (ignoring case)
-r reverse sort order
Display options:
-w N display width (or use COLUMNS environment variable or 80).
-k N display k characters of key - default is term-width minus 60
-1 Force output to use a single line for each item, and set the key-
width to 20% of -w, truncating longer keys. This is useful because
if -k is not specified and if the largest key is greater than the
default key-width, output is separated onto two lines.
-x Like -1, but keys are not right-aligned, but flush-left and never
truncated. Scale the graph as if each key had 0 characters.
This is useful for outputting to a file for further processing.
Typical AWK options:
-F<r> Fields are separated by the regular expression <r>. By default,
awk uses any whitespace to separate fields.
Use \\\\t for tab-separated.
-v <var>=<value>
Assign <var> a <value>. <var> is then used in the script and
replaced with <value>. For instance:
awk -v pi=3.14 '{ print pi*$1*$1}'
would print the area of a circle whose radius is the first field
of input.
DETAIL
if $detail == 1 ;
print "\n";
0;
}
###############################################################################
###############################################################################
## INTERNAL DOCUMENTATION
###############################################################################
###############################################################################
# Program structure
# 1. Parse options and handle environment variables
# a. Extract and validate options we use internally
# b. Handle awk options and command line arguments; validate
# c. Handle TTY-related environment settings and options
# 2. Read input lines and store keys/counts in hash
# Track max key-width and total item-count
# Close input file handle
# 3. Scale data based on input key-width, max frequency,
# TTY width, and user-specified options.
# 4. Output
#
# #1a# - Parse Options
# for various reasons, the optargs module won't work here.
# so we build our own. To make 'strict' mode happy, we
# specify the possible variables with "our"
#
our($opt_c,$opt_q,$opt_r,$opt_1,$opt_m,$opt_x,$opt_T,$opt_h,$opt_H);
our($opt_t,$opt_w,$opt_k,$opt_n);
while ($ARGV[0]) {
$_=$ARGV[0];
if ( /^-(c|q|r|1|m|x|T|h|H)$/ ) {
eval "\$opt_$1 = 1";
}
elsif ( /^-(\d+|[0-9]+(?:\.[0-9]*)?%)$/ ) {
$opt_t = $1;
}
elsif ( /^-(w|k|t|n)$/ ) {
eval "\$opt_$1 = q($ARGV[1])";
shift @ARGV;
}
else {
last;
}
shift @ARGV;
}
# User needs help. Give it and exit.
if ($opt_h) { &usage; exit 0; }
if ($opt_H) { &usage(1); exit 0; }
# Warn about incompatible options
if (defined $opt_x && ( defined $opt_k || defined $opt_1 )) {
print STDERR "$0: WARNING: Option -x overrides key-width settings (-k,-1)\n";
}
#
# #1b# - Now handle awk arguments ...
#
# define variables for this part -- again for strict mode
our $awkcommand;
my ($AWKPATH,@awkfiles,@awkopts,$awkfilter);
my $uses_external_script;
# Get the path to the awk that will be used.
$AWKPATH=($ENV{'AWKPATH'} or "awk");
#
# #1b.1# - find awk options and arguments on command line
#
# First, assume the arguments at the end are filenames -- if they exist.
# If not, leave them alone.
while (-f $ARGV[-1]) {
push @awkfiles, pop @ARGV;
}
# first, any awk options. Handle Multipart options too
while ($ARGV[0] =~ /^-/) {
$_=shift @ARGV;
push @awkopts,qq('$_');
if ( /^--$/ ) {
last;
}
#elsif (/^--\w+/) { } # do nothing
#elsif (/^-F.+/) { } # do nothing
elsif (/^-f$/) {
$uses_external_script=1;
push @awkopts,"'".(shift @ARGV)."'";
}
elsif ( /^-(v|W|m[fr]|F)$/ ) {
push @awkopts,"'".(shift @ARGV)."'";
}
}
#
# #1b.2# - Build awk command
#
# Finally, the remaining arggument is the filter/program. if not, use "{print}"
# so that awk acts basically like "cat". If external program is specified with
# -f option, then don't use any filter (unless one is also provided).
# lastly, put single-quotes around any filter so that it's handled properly by
# the shell.
$awkfilter=$ARGV[0];
$awkfilter="{print}" if (!defined $ARGV[0] && !$uses_external_script);
$awkfilter="'$awkfilter'" if defined $awkfilter;
# build the entire command string
$awkcommand=join(" ",$AWKPATH,@awkopts,$awkfilter,@awkfiles);
#
# #1b.3# - Validate awk command
#
# Now open the awkcommand with perl's pipe-open.
# Do it now in to detect failure ASAP.
open(FILTER,$awkcommand." |") || do {
print "$0: ERROR: Failed to run awk command: $!\n";
print " ".$awkcommand."\n";
exit ($? >> 8);
};
#
# #1c# - Handle TTY environment and options
#
# Process internal values from other arguments, such as terminal width.
# Again, define variables to make "strict" happy.
my ($termwidth, $keywidthmax, $lines, $cwidth, $scale, $use_two_lines);
$termwidth = (defined $opt_w ? $opt_w : ($ENV{'COLUMNS'} || 80));
if ($opt_1 && !defined $opt_k) { $opt_k = $termwidth / 5; }
$keywidthmax = (defined $opt_k ? $opt_k : ($termwidth - 60));
# $cwidth defined once max-count is known
#
# #2# - read output from awk and collect statistics
# each line is the key, and its value is the count
# track max and min.
#
my $min = 1;
my $max = 1;
my $kwidth = 0;
my %h;
while (<FILTER>) {
chop;
++$h{$_};
$max = $h{$_} if ($h{$_} > $max);
if (!defined $opt_x && !defined $opt_t && length($_) > $kwidth) {
$kwidth = length($_);
}
}
$lines=$.; # $. goes away once we do a close. close() needed to get $?
close (FILTER);
exit $?>>8 if ($?); # exit if awk failed
if ($lines == 0) {
print STDERR "$0: WARNING: No output from awk\n";
exit 1;
}
#
# #3# - post-processing
#
# #3a# - threshold handling
# Go through all entries and remove those that don't meet the threshold.
# reset kwidth to match those that make the cut.
#
if ($opt_t) {
$kwidth = 0;
if (substr($opt_t,-1,1) eq "%") {
# User provided a percentage. Determine actual based on known max
$opt_t = ($max + 0.0) * substr($opt_t,0,length($opt_t)-1) / 100.0;
}
while ( my($key,$val) = each(%h) ) {
if ($opt_t > $val) {
delete $h{$key}
}
elsif (!defined $opt_x && length($key) > $kwidth) {
# recalc kwidth again, since we removed keys
$kwidth = length($key);
}
}
}
#
# #3b# - output scaling
# Find min for scaling -- cannot do this in the read-loop
# start with the known max and find the lowest of the values
# (separate loop from opt_t loop to avoid messy code)
#
if ($opt_m) {
$min = $max;
while ( my($key,$val) = each(%h) ) {
$min = $val if ($val < $min);
}
}
if (defined $opt_x) {
# If opt_x, set kwidth to 0 for $scale to work properly
# then subtract 1 to negate the offset used for the colon
$kwidth = -1;
}
elsif (defined $opt_k) {
# if -k was specified, its value is in keywidthmax
$kwidth = $keywidthmax;
}
elsif ($kwidth > $keywidthmax) {
# it won't fit on one line, so separate into two.
$kwidth=0;
$use_two_lines=1;
}
# if -T, the max is the total
$max = $lines if $opt_T;
# for proper scaling, adjust max and min downward by min.
$max -= $min;
$cwidth = (defined $opt_c ? length("".$lines) + 3 : 0);
$scale= ($termwidth - 1 - $kwidth - $cwidth) / $max;
# |
# for the colon ------^
$kwidth=1 if $kwidth==0; # kwidth needs to be at least 1 from now on.
#
# #4# - output graph
#
foreach (&do_sort(\%h, $opt_r, $opt_q)) {
if ($opt_x) {
print $_,":";
}
elsif ($use_two_lines) {
print $_,":\n";
}
else {
printf "%*s:",$kwidth,substr($_,0,$kwidth);
}
my $graphlen = int(0.9999 + $scale * ($h{$_} - $min));
print "-" x ($graphlen ? $graphlen : 1);
print " (".$h{$_}.")" if $opt_c;
print "\n";
}
if ($opt_c || $opt_T) {
printf "%*s:%s",$kwidth,"(TOTAL)",($opt_T ? "-" x int($scale * $max) : "");
printf " (%d)",$lines if $opt_c;
print "\n";
}
exit 0;
#
# Internal Sort routine
# Arg1 -- ref to hash containing histogram key/count pairs
# Arg2 -- whether output should be reversed (1) or not (0)
# Arg3 -- sort by frequency (1) or key (0)
#
sub do_sort {
my $rhash = shift;
my $opt_r = shift;
my $opt_q = shift;
if ($opt_q) {
# sort by frequency of ocurrence
# (optimize for spbeed, not code size)
if ($opt_r) {
return sort { ($rhash->{$b} <=> $rhash->{$a} || lc($rhash->{$b}) cmp lc($rhash->{$a})) } keys %$rhash;
} else {
return sort { ($rhash->{$a} <=> $rhash->{$b} || lc($rhash->{$a}) cmp lc($rhash->{$b})) } keys %$rhash;
}
}
else {
# sort by key (lexigraphical order)
if ($opt_r) {
return sort { ($b <=> $a || lc($b) cmp lc($a)) } keys %$rhash;
} else {
return sort { ($a <=> $b || lc($a) cmp lc($b)) } keys %$rhash;
}
}
}
=head1 EXAMPLES
=over 3
=item Group error messages from /var/log/messages.
grep 'error' /var/log/messages | histogram
=item Use histogram's built-in awk filter to histogram the hours of those messages:
grep 'error' /var/log/messages | histogram '{ print substr($3,1,2) "h"; }'
=item The above command, where the graph is sorted by decreasing frequency, not the hour.
grep 'error' /var/log/messages | histogram -q -r '{ print substr($3,1,2) "h"; }'
=item Just like the first command, now grouped by hostname; sort & truncate hostnames to 10 characters.
histogram -q -k 10 '/error/ { print $4 }' /var/log/messages
=item Same as the previous example, but include a graph representing the total items found:
histogram -q -k 10 -T '/error/ { print $4 }' /var/log/messages
=item Same as before, but don't truncate or align the keys.
histogram -q -T -x '/error/ { print $4 }' /var/log/messages
=back
=head1 VERSION
Version 1.2.1
=head1 AUTHOR & COPYRIGHT
Otheus <otheus@gmail.com>
=head1 LICENSE
Licensed under GNU Public License (GPL) 2.0 or greater.
=cut
## CHANGELOG
## 0.1 initial version
##
## 0.2 force minimum of 1 dash; right-flush keys to column
## Try sorting with int, then alnum
##
## 1.0 add option handling for sorting, threshold, display parameters.
##
## 1.1 other fixes, better awk option handling
##
## 1.2 expanded help
##
## 1.2.1 bug fix (-q option)