AWK Reference

Program Structure

BEGIN {}       # executed once, optional
/pattern/ { }  # executed for each input line matching pattern
END {}         # executed once, optional

Where pattern can be:

  • /regular expression/

  • relational expression ($2 > $1)

  • pattern-matching expression ($2 ~ /test/, $2 !~ /test/)

Patterns can be complex:

  • pattern && pattern - logical AND

  • pattern || pattern - logical OR

  • ! pattern - logical NOT

  • pattern ? pattern : pattern - conditional operator, like in C

  • pattern1, pattern2 - a range pattern, matches all input records starting with a record that matches pattern1, and continuing until a record that matches pattern2, inclusive.
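As a rough illustration of the pattern forms listed above, the sketch below runs a range pattern and a compound pattern over the same input. The START/END markers, the sample lines and the shell variable names are made up for the example.

```shell
# Hypothetical sample input: a block delimited by START and END markers.
input='header
START
alpha
END
footer'

# Range pattern: every record from the first /START/ match through the next /END/ match.
range_out=$(printf '%s\n' "$input" | awk '/START/, /END/ { print NR ": " $0 }')
echo "$range_out"

# Compound pattern: skip the first line and any marker line.
compound_out=$(printf '%s\n' "$input" | awk 'NR > 1 && !/START|END/ { print }')
echo "$compound_out"
```

The first command prints records 2 through 4 (both markers included), the second prints alpha and footer.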

Fields

Each input record is split into fields, which are accessible via the $N variables:

$0 - the whole record (line)
$1, $2, ... - fields

The default separator is whitespace: any run of spaces and tabs, with leading and trailing whitespace ignored.

Example:

echo a b c d | awk '{ print $1 $2 }'

ab

Any expression that evaluates to an integer can be used as a field number. The following outputs c:

echo a b c d | awk 'BEGIN { one = 1; two = 2 }
{ print $(one + two) }'

c

The field separator can be changed with the -F command-line flag: awk -F: uses ':' as the separator, and awk -F"\t" … uses a tab.

Can also be changed inside the script by setting FS variable:

echo a,b,c,d | awk 'BEGIN { FS="," }
{ print $2 }'

b

The field separator can be a regular expression:

echo a_b:c d | awk 'BEGIN { FS="[_: ]" }
{ print $1 "-" $2 "-" $3 "-" $4}'

a-b-c-d
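The -F flag can be sketched in the same way. The passwd-style line and the tab-separated line below are invented sample data, and the shell variable names are only there to capture the output.

```shell
# -F: splits on colons; $NF is the last field.
colon_out=$(echo 'root:x:0:0:root:/root:/bin/sh' | awk -F: '{ print $1, $NF }')
echo "$colon_out"

# With a tab as the separator, a field may contain spaces.
tab_out=$(printf 'first field\tsecond field\n' | awk -F'\t' '{ print $2 }')
echo "$tab_out"
```

The first command prints root /bin/sh, the second prints second field.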

Pattern Matching

echo '1

test' | awk '
    /[0-9]+/ { print "That is an integer" }
    /[A-Za-z]+/ { print "This is a string" }
    /^$/ { print "This is a blank line" }
    { print }
'
That is an integer
1
This is a blank line

This is a string
test

We can match the specific field (by default each string is split into fields by space):

echo '1 test description
     2 script description' | awk '
$2 ~ /script/ { print $1 ", " $3 }'

2, description

Reverse the meaning of the rule by using bang-tilde (!~): $2 !~ /script/.

echo '1 test description
      2 script description' | awk '
$2 !~ /script/ { print $1 ", " $3 }'

1, description

It is possible to use comparison operators too, for example NF == 6 { print $1, $6 } will make sure that we have 6 fields before printing them:

echo '1 2 3 4 5 6
      1 2 3
      1 2 3 4 5' | awk '
NF == 6 { print $1, $6 }'

1 6

More complex expressions can be used as well, for example NR > 1 && (NF >= 2 || $1 ~ /\t/).
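A compound condition of this kind might look as follows in practice. The table-like input, the field-count threshold and the "#" prefix check are assumptions chosen for the example, not part of the original notes.

```shell
# Skip the header (NR > 1), then keep rows that either have at least
# three fields or whose first field starts with "#".
complex_out=$(printf 'name qty price\napple 3 1.5\nbroken\n#note\npear 2 0.5\n' |
  awk 'NR > 1 && (NF >= 3 || $1 ~ /^#/) { print NR, $1 }')
echo "$complex_out"
```

Record 1 is excluded by NR > 1 and record 3 ("broken") fails both alternatives inside the parentheses, so only records 2, 4 and 5 are printed.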

Variables And Expressions

There are two types of constants: string or numeric ("red" or 1).

Variables:

  • assignment: name = value

  • name is case sensitive

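  • default value of an uninitialized variable is the empty string, which counts as zero in a numeric context

  • each variable has a string and a numeric value

    • strings that are not numbers evaluate to zero

The usual arithmetic operators are available: +, -, *, /, % and ^. The assignment operators are =, +=, -=, *=, /=, %= and ^=, plus ++ and -- (both prefix and postfix).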
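The string-versus-number behaviour and the assignment operators can be checked with a short BEGIN-only program. This is a sketch: the variable names and values are arbitrary.

```shell
ops_out=$(awk 'BEGIN {
  n = "3 apples"          # leading numeric prefix: numeric value is 3
  s = "apples"            # not a number: numeric value is 0
  print n + 1, s + 1
  x = 5; x += 2; x *= 3   # compound assignment: (5 + 2) * 3
  print x, x % 4, 2 ^ 10
  i = 1
  a = i++                 # postfix: a gets the old value, i becomes 2
  b = ++i                 # prefix: i becomes 3 first, b gets the new value
  print a, b, i
}')
echo "$ops_out"
```

The three output lines are 4 1, then 21 1 1024, then 1 3 3.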

echo '1

2' | awk '
# Count blank lines.
/^$/ {
    ++x  # Default value is 0, so we don't initialize x, just start incrementing
}
END {
    print x
}'

1

Average calculation:

echo 'john 85 92 78 94 88
andrea 89 90 75 90 86
jasper 84 88 80 92 84' | awk '
# average five grades
{ total = $2 + $3 + $4 + $5 + $6
avg = total / 5
print $1, avg }'

john 87.4
andrea 86
jasper 85.6

We can use an expression as the pattern that selects records, for example:

echo 'john 10 15
andrea 5 3
jasper 2 20' | awk '
    # print only lines where $2 + $3 > 20
    $2 + $3 > 20 { print $1 " " $2+$3}
'

john 25
jasper 22

Strings

A string must be quoted in an expression.

Strings are concatenated by writing them next to each other; a space between them is enough:

# Assigns “HelloWorld” to the variable z.
z = "Hello" "World"

Strings can make use of the escape sequences:

  • \a Alert character, usually ASCII BEL character

  • \b Backspace

  • \f Formfeed

  • \n Newline

  • \r Carriage return

  • \t Horizontal tab

  • \v Vertical tab

  • \ddd Character represented as 1 to 3 digit octal value

  • \xhex Character represented as hexadecimal value

  • \c Any literal character c (e.g., \" for ")

echo a_b:c d | awk 'BEGIN { FS="[_: ]" }
{ print $1 "\v" $2 "\t" $3 "\"" $4}'

a
 b      c"d

System Variables

  • FS - input field separator (space by default)

    • Note: usually FS is assigned in the BEGIN block, but it can be changed anywhere; the new FS value takes effect on the next line (not on the current line)

  • OFS - output field separator (space by default)

  • NF - number of fields (so { print $NF } outputs last field)

    • Note: NF is mutable, can be changed (as well as $0 or fields)

  • RS - record separator, default is newline

  • ORS - output record separator, newline by default

  • NR - current record number

  • FILENAME - current file name

  • FNR - current record number in current file (useful when there are many files)

  • CONVFMT - printf-style number-to-string conversion format, "%.6g" by default

    • Used when we do str = (5.5 + 3.2) " is a nice value"

  • OFMT - printf style number-to-string conversion when number is printed

    • Used when we do print 5.5

  • ARGC - the number of command line arguments (does not include options to awk)

  • ARGIND - the index in ARGV of the current file being processed (gawk only).

  • ARGV - array of command line arguments indexed from 0 to ARGC - 1.

    • Dynamically changing the contents of ARGV can control the files used for data.

  • ENVIRON - array of environment variables.

See more in man awk.
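Two of the variables above are easy to misread, so here is a small sketch: OFS only affects $0 once the record is rebuilt, and FNR restarts per file while NR does not. The files one.txt and two.txt are throwaway samples created in a temporary directory for this demo.

```shell
# Assigning to any field (even $1 = $1) rebuilds $0 with OFS between fields.
ofs_out=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')
echo "$ofs_out"

# NR keeps counting across files, FNR restarts at 1 for each file.
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/one.txt"
printf 'c\n' > "$workdir/two.txt"
counts_out=$(cd "$workdir" && awk '{ print FILENAME, NR, FNR }' one.txt two.txt)
echo "$counts_out"
rm -rf "$workdir"
```

The first command prints a-b-c. The second prints one line per record; the last one reads two.txt 3 1.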

In gawk, the SYMTAB variable is an array whose indices are the names of all currently defined global variables and arrays in the program. The array may be used for indirect access to read or write the value of a variable:

foo = 5
SYMTAB["foo"] = 4
print foo    # prints 4

The isarray() function may be used to test if an element in SYMTAB is an array. You may not use the delete statement with the SYMTAB array.

Example - average calculation with auto-numbering:

echo 'john 85 92 78 94 88
andrea 89 90 75 90 86
jasper 84 88 80 92 84' | awk '
# We will have tabs as output fields separator.
BEGIN { OFS = "\t" }
# average five grades
{
  total = $2 + $3 + $4 + $5 + $6
  avg = total / 5
  print NR ".", $1, avg
}
END {
  print ""
  print NR, "records processed."
}'

1.      john    87.4
2.      andrea  86
3.      jasper  85.6

3       records processed.

Processing Multiline Records

echo 'John Robinson
Boston MA 01760

Phyllis Chapman
Amesbury MA 01881' | awk '
# set field separator to a newline and record separator to the empty string
BEGIN { FS = "\n"; RS = "" }
{ print $1, $NF}'

John Robinson Boston MA 01760
Phyllis Chapman Amesbury MA 01881

Also split the output to multiple lines:

echo 'John Robinson
Boston MA 01760

Phyllis Chapman
Amesbury MA 01881' | awk '
# set field separator to a newline and record separator to the empty string
BEGIN { FS = "\n"; RS = ""; OFS = "\n"; ORS = "\n\n" }
{ print $1, $NF}'

John Robinson
Boston MA 01760

Phyllis Chapman
Amesbury MA 01881

Special Processing: Based On Row Number And Using next And exit

We can use expression like NR == 1 to apply special rule for the first record. Inside that rule we can use next to skip following rules:

echo '1000
125	 Market	 -125.45
126	 Hardware Store	 -34.95156' | awk '
BEGIN { FS="\t" }

# First line is the initial balance.
NR == 1 {
    balance=$1;
    print "Initial balance: ", balance;
    next  # get the next record and start over (do not proceed with next rule)
}
# Update balance.
{ balance += $3 }
# Show the result.
END { print "Final balance: ", balance }'

Initial balance:  1000
Final balance:  839.598

The next statement causes the next line to be read and resumes execution from the top of the script.

The nextfile statement stops current file processing and moves to the next file.

The exit statement exits the main loop and passes control to the END section (it stops execution immediately if used inside END or if there is no END section). exit takes an optional expression as an argument, which is used as the script's exit status; by default the exit status is 0.
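A possible use of exit together with an exit status is sketched below. The ERROR marker, the found flag and the status value 2 are arbitrary choices for the example.

```shell
# Stop at the first record containing ERROR and report it through the exit status.
status=0
printf 'ok\nERROR disk full\nok\n' | awk '
  /ERROR/ { found = 1; exit }     # leave the main loop, control passes to END
  END     { if (found) exit 2 }   # exit inside END sets the final status
' || status=$?
echo "exit status: $status"
```

The shell sees status 2 here; without a matching record the END rule does nothing and the status stays 0.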

Similar example with interesting trick to remove header and footer (source: https://stackoverflow.com/a/7148801/4612064). Here we extract a list of file names from the 7z l output which looks like this:

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Listing archive: output/folder/7z_1.7z

--
Path = output/folder/7z_1.7z
Type = 7z
Solid = -
Blocks = 0
Physical Size = 141
Headers Size = 141

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2017-11-10 17:33:18 ....A            0            0  (E).txt
2017-11-10 17:33:18 ....A            0            0  (J) [!].txt
2017-11-10 17:33:18 ....A            0            0  (J).txt
2017-11-10 17:33:18 ....A            0            0  (U) [!].txt
2017-11-10 17:33:18 ....A            0            0  (U).txt
------------------- ----- ------------ ------------  ------------------------
                                     0            0  5 files, 0 folders

And the awk script to get only file names:

/----/ {p = (p + 1) % 2; next}
$NF == "Name" {pos = index($0,"Name")}
p {print substr($0,pos)}

Initially p is zero, so the last rule (the one that prints) does not fire. The second rule calculates the position where the file name starts (by checking the position of "Name" in the header). Once we meet the first "----", the p value becomes 1 (1 % 2 = 1) and we start processing file names. When we get to the next "----", the p value becomes 0 (2 % 2 = 0) and we stop processing.

Formatting Output: printf

Syntax:

printf ( format-expression [, arguments] )

The parentheses are optional.

Format specifiers:

  • c ASCII character

  • d Decimal integer

  • i Decimal integer. (Added in POSIX)

  • e Floating-point format ([-]d.precisione[+-]dd)

  • E Floating-point format ([-]d.precisionE[+-]dd)

  • f Floating-point format ([-]ddd.precision)

  • g e or f conversion, whichever is shortest, with trailing zeros removed

  • G E or f conversion, whichever is shortest, with trailing zeros removed

  • o Unsigned octal value

  • s String

  • u Unsigned decimal value

  • x Unsigned hexadecimal number. Uses a-f for 10 to 15

  • X Unsigned hexadecimal number. Uses A-F for 10 to 15

  • % Literal %

A format expression can take three optional modifiers following “%” and preceding the format specifier:

%-width.precision format-specifier

  • width - numeric value giving the minimum field width; the contents will be right-justified, use '-' to get left-justification.

    • echo '5' | awk '{ printf("%20s", $1) }' prints 5 preceded by 19 spaces

    • echo '5' | awk '{ printf("%-20s", $1) }' prints 5 followed by 19 spaces

  • precision:

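    • for e, E and f conversions - the number of digits to the right of the decimal point; for g and G - the maximum number of significant digits; for integer conversions - the minimum number of digits;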

    • for string values - the maximum number of characters that will be printed.

echo '3.1415' | awk '{ printf("%.3g", $1) }'

3.14

The default number format (the default value of OFMT and CONVFMT) is %.6g.

Width and precision can be specified dynamically:

echo '3.1415' | awk '{ printf("%*.*g", 5, 3, $1) }'

 3.14
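The effect of the modifiers is easier to see with visible delimiters around each field. In this sketch the square brackets and the sample values are only there to make the padding visible.

```shell
printf_out=$(awk 'BEGIN {
  printf("[%5d]\n", 42)          # right-justified in a 5-column field
  printf("[%-5d]\n", 42)         # left-justified in a 5-column field
  printf("[%.2f]\n", 3.14159)    # two digits after the decimal point
  printf("[%.3s]\n", "abcdef")   # at most three characters of the string
}')
echo "$printf_out"
```

The four lines come out as [   42], [42   ], [3.14] and [abc].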

Passing Parameters Into a Script

Variables can be passed using var=value parameters:

awk 'script' var=value inputfile

For example:

$ awk -f scriptfile high=100 low=60 datafile
# Use env variable as value:
$ awk '{ ... }' directory=$cwd file1 ...
# Use `pwd` output as value:
$ awk '{ ... }' directory=`pwd` file1 ...

It is possible to use command-line parameters to define system variables:

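$ awk '{ print NR, $0 }' OFS='. ' names

Note: an important restriction on command-line parameters is that they are not available in the BEGIN procedure. BEGIN is evaluated before the input is read, and each assignment is processed only when awk reaches it in the argument list, just before the file that follows it is opened.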

awk 'BEGIN {
  # Here `n` is not set.
  print "Begin: " n
}
{
  # Will print "Reading the first file" for each line in `test` file.
  if (n == 1) print "Reading the first file"
  # Will print "Reading the second file" for each line in `test2` file.
  if (n == 2) print "Reading the second file"
}' n=1 test n=2 test2

The -v option specifies parameters that are assigned before the program starts and are therefore available in BEGIN:

# The -v option must be specified before the script itself.
awk -v n=1 'BEGIN {
  # prints "Begin: 1"
  print "Begin: " n
}'

The -v option can be used for system variables too (here we set RS): awk -F"\n" -v RS="" '{ print }' …​.

echo 'test
test

test2
test2' | awk -F"\n" -v RS="" -v n=1 '{
    # We use new line as field separator and
    # empty line as record separator
    print n, $1, "-", $2
}'

1 test - test
1 test2 - test2
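The difference between the two ways of passing a variable can be observed by running one program both ways. The program text is kept in a shell variable (prog, a name chosen for this sketch) so that both runs use exactly the same code.

```shell
prog='BEGIN { print "BEGIN sees [" n "]" } { print "main sees [" n "]" }'

# Assignment as an operand: processed after BEGIN, before the input is read.
operand_out=$(echo 'x' | awk "$prog" n=5)
echo "$operand_out"

# Assignment with -v: already set when BEGIN runs.
v_out=$(echo 'x' | awk -v n=5 "$prog")
echo "$v_out"
```

In the first run BEGIN prints empty brackets and the main rule prints [5]; in the second run both print [5].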

Awk also provides the system variables ARGC and ARGV, similar to C.

Conditional Statements

if ( expression )
  action1
[else
  action2 ]

if ( expression ) action1 ; [else action2 ]

if (avg >= 90) grade = "A"
else if (avg >= 80) grade = "B"
else if (avg >= 70) grade = "C"
else if (avg >= 60) grade = "D"
else grade = "F"

Conditional operator:

expr ? action1 : action2
grade = (avg >= 65) ? "Pass" : "Fail"

Looping

# While loop
while ( condition )
  action

i = 1
while ( i <= 4 ) {
  print $i
  ++i
}

# Do loop
do
  action
while ( condition )

do {
  ++x
  print x
} while ( x <= 4 )

# For loop
for ( set_counter ; test_counter ; increment_counter )
  action

for ( i = 1; i <= NF; i++ )
  print $i

Prompt the user for a number and calculate factorial:

awk '# factorial: return factorial of user-supplied number
  BEGIN {
    # prompt user; use printf, not print, to avoid the newline
    printf("Enter number: ")
  }
  # check that user enters a number
  $1 ~ /^[0-9]+$/ {
    # assign value of $1 to number & fact
    number = $1
    if (number == 0)
      fact = 1
    else
      fact = number
    # loop to multiply fact*x until x = 1
    for (x = number - 1; x > 1; x--)
      fact *= x
    printf("The factorial of %d is %g\n", number, fact)
    # exit -- saves user from typing CTRL-D.
    exit
  }
  # if not a number, prompt again.
  { printf(" \nInvalid entry. Enter a number: ")
}' -

Loops support break (exit the loop) and continue (start the next iteration).
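A minimal sketch of both statements in one loop; the input numbers and the rule "skip negatives, stop at zero" are invented for the example.

```shell
loop_out=$(echo '3 -1 4 0 5 9' | awk '{
  for (i = 1; i <= NF; i++) {
    if ($i < 0) continue   # skip negative values, go on with the next field
    if ($i == 0) break     # stop the loop at the first zero
    sum += $i
  }
  print "sum:", sum
}')
echo "$loop_out"
```

Only 3 and 4 are added before the zero ends the loop, so this prints sum: 7.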

Arrays

array [ subscript ] = value
student_avg[NR] = avg
...
END {
  for ( x = 1; x <= NR; x++ )
    class_avg_total += student_avg[x]
  class_average = class_avg_total / NR
}

All arrays are associative - the index can either be a string or a number.

# grade = "A", "B", "C", "D"
++class_grade[grade]
...
# To iterate the array we can use `for (item in array)` loop.
for (letter_grade in class_grade)
  # We also pipe output to "sort".
  print letter_grade ":", class_grade[letter_grade] | "sort"

To iterate the array we can use for (item in array) loop and to test for membership we can use if (item in array).

Multidimensional arrays don’t have to be rectangular, unlike in C and C++:

a[1] = 5
a[2][1] = 6
a[2][2] = 7
file_array[NR, i] = $i
file_array[2, 4]

Note: with the comma syntax, multidimensional arrays are simulated: all indices are concatenated together, separated by the value of the system variable SUBSEP (by default "\034", an unprintable character). The x[1][2] syntax (true arrays of arrays) is a gawk extension:

$ awk 'BEGIN { x[1][2] = 2; print x[1][2]; }'
2

$ awk 'BEGIN { x[1,2] = 2; print x[1,2]; }'
2

$ awk 'BEGIN { x[1,2] = 2; print x["1" "\034" "2"]; }'
2

The multidimensional array syntax is also supported in testing for array membership: if ((i, j) in array).

Looping over a multidimensional array is the same as with one-dimensional arrays: for (item in array). The split( ) function can then be used to access the individual subscript components: split(item, subscr, SUBSEP).
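Recovering the individual subscripts could look like the sketch below. The array name grid and its contents are hypothetical; the output is piped through sort because the iteration order of for (key in array) is not defined.

```shell
multi_out=$(awk 'BEGIN {
  grid[1, "x"] = 10
  grid[2, "y"] = 20
  for (key in grid) {
    split(key, part, SUBSEP)          # part[1] and part[2] are the two subscripts
    print part[1], part[2], grid[key]
  }
}' | sort)
echo "$multi_out"
```

After sorting, the lines are 1 x 10 and 2 y 20.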

The split function can be used to create arrays:

n = split(string, array, separator)
where:
  n - number of items in the array
  string - the string to split
  array - the array (function output)
  separator - delimiter to use when splitting the string
z = split($1, array, " ")
for (i = 1; i <= z; ++i)
  print i, array[i]

Remove an item from the array:

delete array [subscript]
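Membership tests and delete can be combined as in this sketch (the array name count and its keys are arbitrary). Note that the in operator does not create the element it tests for, whereas merely referencing count["z"] would.

```shell
arr_out=$(awk 'BEGIN {
  count["a"] = 1
  count["b"] = 2
  if ("a" in count) print "a is present"
  delete count["a"]
  if (!("a" in count)) print "a was deleted"
  if (!("z" in count)) print "z was never created"
  n = 0
  for (k in count) n++              # count the remaining elements
  print "elements left:", n
}')
echo "$arr_out"
```

The last line reports one remaining element, because only count["b"] survives.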

An array of command-line parameters:

BEGIN { for (x = 0; x < ARGC; ++x)
  print ARGV[x]
  print ARGC
}

Standard Functions

Math:

  • cos(x) - cosine of x (x is in radians).

  • exp(x) - e to the power x.

  • int(x) - truncated value of x.

  • log(x) - natural logarithm (base-e) of x.

  • sin(x) - sine of x (x is in radians).

  • sqrt(x) - square root of x.

  • atan2(y,x) - arctangent of y/x in the range -π to π.

  • rand( ) - pseudo-random number r, where 0 <= r < 1.

  • srand(x) - establishes new seed for rand( ). If no seed is specified, uses time of day. Returns the old seed.
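A few of the math functions in action, as a sketch. The seed value 42 is arbitrary, and since the random sequence differs between awk implementations the example only checks that rand( ) stays within its documented range.

```shell
math_out=$(awk 'BEGIN {
  print int(3.9), int(-3.9)        # truncation toward zero
  printf("%.4f\n", atan2(0, -1))   # pi, rounded to four decimals
  print sqrt(16), exp(0), log(1)
  srand(42)                        # fixed seed: repeatable within one implementation
  r = rand()
  print (r >= 0 && r < 1)          # 1 when r is in the documented range
}')
echo "$math_out"
```

The output lines are 3 -3, then 3.1416, then 4 1 0, then 1.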

Strings:

  • length(s) - length of string s or length of $0 if no string is supplied.

  • index(s,t) - position of substring t in string s or zero if not present.

    • pos = index("Mississippi", "is")

  • split(s,a,sep) - parses string s into elements of array a using field separator sep; returns number of elements. If sep is not supplied, FS is used. Array splitting works the same way as field splitting.

  • substr(s,p,n) - returns substring of string s at beginning position p up to a maximum length of n. If n is not supplied, the rest of the string from p is used.

    • awk 'BEGIN { print substr("707-555-1111", 5) }' prints 555-1111

    • awk 'BEGIN { print substr("707-555-1111", 1, 3) }' prints 707

  • tolower(s) - translates all uppercase characters in string s to lowercase and returns the new string.

  • toupper(s) - translates all lowercase characters in string s to uppercase and returns the new string.

  • sprintf("fmt",expr) - uses printf format specification for expr.

  • match(s,r) - either the position in s where the regular expression r begins, or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH.

  • gsub(r,s,t) - globally substitutes s for each match of the regular expression r in the string t. Returns the number of substitutions.

    • If t is not supplied, defaults to $0, so by default it works on current input line.

  • sub(r,s,t) - substitutes s for first match of the regular expression r in the string t. Returns 1 if successful; 0 otherwise. If t is not supplied, defaults to $0.
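The string functions above can be tried on a single word. This is a sketch: the copies t and u exist only so that sub and gsub do not modify the original string s.

```shell
str_out=$(awk 'BEGIN {
  s = "Mississippi"
  print index(s, "is"), length(s)
  t = s; n = gsub(/ss/, "SS", t)    # replaces every match, returns the count
  print n, t
  u = s; sub(/i/, "I", u)           # replaces the first match only
  print u
  if (match(s, /p+/)) print RSTART, RLENGTH
  print toupper(substr(s, 1, 4))
}')
echo "$str_out"
```

This prints 2 11, then 2 MiSSiSSippi, then MIssissippi, then 9 2, then MISS.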

An example of match usage:

echo 'test
match' | awk '
  # match -- print the part of the line that matches the pattern
  # (only for lines that match the pattern)
  match($0, pattern) {
    # extract string matching pattern using
    # starting position and length of string in $0
    # print string
    print substr($0, RSTART, RLENGTH)
}' pattern="ma"

ma

The match() function returns 0 if the pattern is not found, and a non-zero value (RSTART) if it is found, which allows the return value to be used as a condition, as in the match($0, pattern) rule above.

In gawk there are additional functions:

  • gensub(r, s, h, t) - if h is a string starting with g or G, globally substitutes s for r in t. Otherwise, h is a number: substitutes for the h’th occurrence. Returns the new value, t is unchanged. If t is not supplied, defaults to $0.

    • It improves gsub / sub: it is possible to replace the Nth occurrence, and the source string is not changed - the result is returned instead.

    • The pattern can have subpatterns delimited by parentheses. For example, it can have /(part) (one|two|three)/. Within the replacement string, a backslash followed by a digit N represents the text that matched the Nth sub-pattern: echo part two | gawk '{ print gensub(/(part) (one|two|three)/, "\\2", "g") }' prints two

  • systime( ) - returns the current time of day in seconds since the Epoch (00:00 a.m., January 1, 1970 UTC).

  • strftime(format, timestamp) - Formats timestamp (of the same form returned by systime()) according to format. Default format - similar to the date command, default timestamp - current time.

echo 'TeSt' | awk '
  # lower - change upper case to lower case
  # note: we could use `tolower` to convert the case.
  #
  # initialize strings
  BEGIN {
    upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lower = "abcdefghijklmnopqrstuvwxyz"
  }
  # for each input line
  {
    # see if there is a match for all caps
    while (match($0, /[A-Z]+/))
      # get each cap letter
      for (x = RSTART; x < RSTART+RLENGTH; ++x) {
        CAP = substr($0, x, 1)
        CHAR = index(upper, CAP)
        # substitute lowercase for upper, we don't provide third
        # parameter to `gsub`, so it acts on the input ($0).
        gsub(CAP, substr(lower, CHAR, 1))
      }
      # print record
      print $0
}'

test

User Functions

function name (parameter-list) {
  statements
  return expression
}
function insert(STRING, POS, INS) {
    before_tmp = substr(STRING, 1, POS)
    after_tmp = substr(STRING, POS + 1)
    return before_tmp INS after_tmp
}
# "Hello" -> "HellXXo"
print insert($1, 4, "XX")

Note: variables declared inside the function are global (available outside the function). To make them local, we need to define them as parameters (and don’t use these parameters when we are calling the function):

function insert(STRING, POS, INS, before_tmp, after_tmp) {
    ...
}
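The scoping rule can be verified with two nearly identical functions. The names leaky, tidy, tmp and tmp2 are invented for this sketch; the extra spaces before tmp2 in the parameter list are only a common convention for marking locals.

```shell
scope_out=$(awk '
function leaky(v)          { tmp = v * 2; return tmp }     # tmp is global
function tidy(v,    tmp2)  { tmp2 = v * 2; return tmp2 }   # tmp2 is local
BEGIN {
  leaky(5)
  tidy(5)
  print "tmp=[" tmp "] tmp2=[" tmp2 "]"
}')
echo "$scope_out"
```

After both calls the global tmp holds 10, while tmp2 is still unset outside the function, so the line reads tmp=[10] tmp2=[].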

Note: there are some pre-defined "external" functions, under /usr/share/awk on my system:

$ ls /usr/share/awk
assert.awk      ftrans.awk   inplace.awk   ord.awk           readable.awk  shellquote.awk
bits2str.awk    getopt.awk   join.awk      passwd.awk        readfile.awk  strtonum.awk
cliff_rand.awk  gettime.awk  libintl.awk   processarray.awk  rewind.awk    walkarray.awk
ctime.awk       group.awk    noassign.awk  quicksort.awk     round.awk     zerofile.awk

To use external functions, pass the path to the source using -f flag:

awk -f myscript.awk -f /usr/share/awk/ctime.awk input.txt

getline - Read the Data From Files and Pipes

The getline function is used to read another line of input. It is similar to next, but it doesn’t pass the control back to the top of the script.

It reads the line and returns:

  • 1 - if it was able to read a line.

  • 0 - if it encounters the end-of-file.

  • -1 - if it encounters an error.

echo 'first
test
second' | awk '
/test/ {
  getline # get next line
  print $1 # print $1 of new line.
}'

second

The getline can also be used to read data from a file or a pipe:

# Read lines from the file "data" and print them.
while ( (getline < "data") > 0 )
  print
# Read from standard input (prompt the user to enter the name):
BEGIN {
  printf "Enter your name: "
  getline < "-"
  print
}
# We can also assign the data we read to the variable:
BEGIN {
  printf "Enter your name: "
  # Here we assign the input to `name` variable
  getline name < "-"
  print name
}
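The file-reading loop can be turned into a complete, runnable sketch as follows. The temporary file stands in for the "data" file, and the variable names file, line and n are assumptions of the example.

```shell
datafile=$(mktemp)
printf 'one\ntwo\nthree\n' > "$datafile"

getline_out=$(awk -v file="$datafile" 'BEGIN {
  # The parentheses matter: the return value of getline is compared with 0.
  while ((getline line < file) > 0)
    n++
  close(file)
  print n " lines, last: " line
}')
echo "$getline_out"
rm -f "$datafile"
```

Comparing with > 0 rather than using the bare return value keeps the loop from spinning forever when getline returns -1 on an unreadable file. At end-of-file the variable line keeps its last value, so this prints 3 lines, last: three.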

It is possible to pipe output of a command to getline:

awk '# getname - print users fullname from /etc/passwd
  BEGIN {
    # `who am i` outputs single string, user name is the first word
    "who am i" | getline
    name = $1
    FS = ":"
  }
  # compare the login name field exactly
  $1 == name { print $5 }
' /etc/passwd

# subdate.awk -- replace @date with todays date
/@date/ {
  "date +'%a., %h %d, %Y'" | getline today
  gsub(/@date/, today)
}
{ print }

The close() function closes open files and pipes. It takes a single argument: the same expression that was used to create the pipe, character for character:

close("who am i")

Using close we free the resources; we can use the same command more than once; if we are using output pipe (like some processing of $0 | "sort > tmpfile"), we need to do close("sort > tmpfile") before using the tmpfile (for example in getline < "tmpfile"):

{ some processing of $0 | "sort > tmpfile" }
END {
  close("sort > tmpfile")
  while ((getline < "tmpfile") > 0) {
    do more work
  }
}
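The outline above can be filled in as a runnable sketch. The temporary file replaces tmpfile, and the input words and the "sorted:" prefix are made up for the example.

```shell
tmpfile=$(mktemp)

sorted_out=$(printf 'pear\napple\nfig\n' | awk -v tmp="$tmpfile" '
  { print $1 | ("sort > " tmp) }   # every record goes to the same sort process
  END {
    close("sort > " tmp)           # same string as above; waits for sort to finish
    while ((getline line < tmp) > 0)
      print "sorted:", line
    close(tmp)
  }')
echo "$sorted_out"
rm -f "$tmpfile"
```

Without the first close( ), the file could still be empty when getline starts reading, because sort writes its output only after its input pipe is closed. The result is three lines: sorted: apple, sorted: fig, sorted: pear.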

Output to Files and Pipes

It is possible to redirect output to the file:

print "a =", a, "b =", b, "max =", (a > b ? a : b) > "data.out"

Similarly, the output can be redirected to the pipe:

print | command
awk 'BEGIN { print "test example" | "wc -w" }'
2
echo "test example" | awk '{ print | "wc -w" }'
2

system() - Execute System Commands

The system( ) function executes a command supplied as an expression. It does not make the output of the command available within the program for processing. It returns the exit status of the command that was executed.

BEGIN {
  if (system("mkdir test") != 0)
    print "Command Failed"
}

The command output goes to the script output:

echo 'test' | awk '
{
  # print the line using `echo`
  system("echo " $0)
}'

test

We can check the command result:

# test returns 1 if file does not exist (and 0 if exists).
if (system("test -r " file)) {
    print file " not found"
}
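The return value of system( ) can be inspected directly, as in this sketch. It assumes the standard true and false utilities are available on the PATH.

```shell
sys_out=$(awk 'BEGIN {
  ok  = system("true")     # command succeeds: exit status 0
  bad = system("false")    # command fails: non-zero exit status
  print "true:", ok
  print "false is non-zero:", (bad != 0)
}')
echo "$sys_out"
```

This prints true: 0 followed by false is non-zero: 1.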

Reference Sources
