Unix Text Processing Command Reference

Nathan Schneider, 2013-01-29

This is intended as a quick reference for text processing commands built into Unix. It is terse and not necessarily comprehensive—YMMV.

Suggestions? Contact the author or submit a pull request.

Notes about the commands below:

None of these commands actually modify the input files; rather, they manipulate input text and produce output text, typically writing to standard output.
ALLCAPS indicates a metavariable.
The descriptions are selective. For more comprehensive documentation of options, see the command’s man page. For tutorials and examples, search the Web.

Some tutorials and references:

Ken Church’s Unix™ for Poets
Jim Notwell’s Introduction to Text-Processing
Na-Rae Han’s Command-line Magic
Advanced Bash-Scripting Guide: Text Processing Commands
GNU Coreutils
- for Mac OS X: coreutils, sed (BSD implementations are built-in on OS X)

More powerful tools for advanced text processing operations:

AWK
pyp, grep/sed/AWK for the Python-inclined
- Python tips and tricks

No input stream

`yes`

Repeats a line (by default, y) infinitely.

yes LINE | head -n 10 repeats LINE 10 times.

One or more input files/streams

`cat`

Concatenate the input files together in sequence.

-s: suppress/squeeze multiple consecutive blank lines
-n: number all lines (cf. nl)
-b: number non-blank lines
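
A small sketch of the options above; the sample text and variable names are made up for illustration:

```shell
# Scratch file with a run of three blank lines (hypothetical sample data).
tmp=$(mktemp)
printf 'a\n\n\n\nb\n' > "$tmp"

# -s squeezes the run of blank lines down to a single blank line.
squeezed=$(cat -s "$tmp")
printf '%s\n' "$squeezed"

# -n numbers every line, -b only the non-blank ones; count the numbered lines.
all_numbered=$(cat -n "$tmp" | grep -c '[0-9]')
nonblank_numbered=$(cat -b "$tmp" | grep -c '[0-9]')
echo "$all_numbered $nonblank_numbered"

rm -f "$tmp"
```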

`zcat`

Like cat, but for gzipped files.

`tac`

Like cat, in reverse: lines are printed in reverse order. (GNU but not BSD.)

To print the last line of (contiguous) groups sharing all but the first 2 fields in common: tac FILE | uniq -f 2 | tac
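
To see what that pipeline does, here is a sketch on invented data in which the third field acts as the group label. It assumes tac is installed (GNU coreutils):

```shell
# Lines 1-2 share the trailing field "x", lines 3-4 share "y" (hypothetical data).
# tac reverses the input, uniq -f 2 keeps the first line of each group
# (which is the last line in the original order), and tac restores the order.
out=$(printf '1 a x\n2 b x\n3 c y\n4 d y\n' | tac | uniq -f 2 | tac)
printf '%s\n' "$out"
```

This should leave one line per group: 2 b x, then 4 d y.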

`wc`

Counts lines/words/bytes in the specified file(s), individually and in total. By default, displays lines, then words, then bytes.

-l: count lines
-w: count words
-c: count bytes
-m: count characters (differs from -c for multibyte encodings such as UTF-8)
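
A quick sketch of the individual counters on a made-up two-line file. Reading from standard input keeps the filename out of the output, and tr strips the padding that some implementations add:

```shell
tmp=$(mktemp)
printf 'one two\nthree\n' > "$tmp"   # hypothetical sample: 2 lines, 3 words, 14 bytes

lines=$(wc -l < "$tmp" | tr -d ' ')
words=$(wc -w < "$tmp" | tr -d ' ')
bytes=$(wc -c < "$tmp" | tr -d ' ')
echo "$lines lines, $words words, $bytes bytes"

rm -f "$tmp"
```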

Typically a single input stream

Encoding

`file`

Determines the encoding of a text file, or indicates that the argument is a directory or pipe.

`iconv`

Converts the encoding of a text file.

iconv -f ISO-8859-1 -t UTF-8 FILE converts from ISO-8859-1 to UTF-8

Filtering/extracting by position

`cut`

Extracts fields from a file, based on delimiters or character positions.

cut -f1 FILE retrieves the first (tab-separated) column from the file
cut -d' ' -f1,3 FILE retrieves the first and third space-separated tokens from each line
cut -d'
' -f20-30 FILE (with a literal line break as the delimiter) supposedly retrieves the 20th-30th lines of the file, though this doesn’t seem to work in OS X. Equivalently: head -n 30 FILE | tail -n 11
-s to omit lines without any delimiter
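
The following sketch exercises the field options on invented sample files; the file contents and variable names are illustrative only:

```shell
tabbed=$(mktemp)
spaced=$(mktemp)
printf 'a\tb\tc\nd\te\tf\n' > "$tabbed"
printf 'x y z\nno-delimiter\n' > "$spaced"

# Default delimiter is a tab: take the first column.
first=$(cut -f1 "$tabbed")

# Space-delimited fields 1 and 3; a line without any delimiter passes through whole.
picked=$(cut -d' ' -f1,3 "$spaced")

# With -s, lines that contain no delimiter are dropped.
strict=$(cut -d' ' -s -f1,3 "$spaced")

printf '%s\n---\n%s\n---\n%s\n' "$first" "$picked" "$strict"
rm -f "$tabbed" "$spaced"
```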

`head`, `tail`

Extracts a certain amount of text from the beginning or end of a file.

If multiple files are matched by the argument(s), a header indicating the filename will be displayed.

-n N: number of lines to retrieve (default: 10)
-c N: number of characters (bytes) to retrieve
tail -n +N, tail -c +N: N indicates an offset relative to the beginning of the file; the rest of the file after that offset will be extracted
head -n -N, head -c -N: offset relative to the end of the file (GNU but not BSD implementation)
head -n 100 FILE | tail -n 1 retrieves the 100th line of the file
tail -f FILE monitors the end of the file, writing to stdout as the file is appended to
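
As a sketch of how head and tail combine to pick out line ranges, using a made-up ten-line file whose lines are simply the numbers 1 to 10:

```shell
tmp=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$tmp"

# The 3rd line: first three lines, then the last of those.
third=$(head -n 3 "$tmp" | tail -n 1)

# Everything from the 8th line onward.
from_eighth=$(tail -n +8 "$tmp")

# Lines 4 through 6: first six lines, then the last three of those.
middle=$(head -n 6 "$tmp" | tail -n 3)

printf '%s\n---\n%s\n---\n%s\n' "$third" "$from_eighth" "$middle"
rm -f "$tmp"
```

In general, lines M through N can be had with head -n N FILE | tail -n $((N - M + 1)).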

Filtering/extracting by content

`uniq`

Filters out duplicate lines of input.

-c: prefix each line with a count
-i: case-insensitive
-f N: ignore the first N (whitespace-separated) fields of each line
-s N: ignore the first N characters of each line
-w N: ignore all but the first N characters of each line
Note: If some parts of the line are ignored, the kept and discarded lines may differ. The first line with a given “key” will be the one that is kept.
-d, -u: show only repeated lines or only non-repeated lines, respectively
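
One point worth illustrating is that uniq only collapses adjacent duplicates, which is why it is usually paired with sort. A sketch on invented input (the sed call just strips the padding in front of the counts):

```shell
# Adjacent duplicates only: the trailing "a" is counted separately.
adjacent=$(printf 'a\na\nb\na\n' | uniq -c | sed 's/^ *//')

# Sorting first brings all duplicates together.
total=$(printf 'a\na\nb\na\n' | sort | uniq -c | sed 's/^ *//')

# -f 1 ignores the first field; the first line of each group is the one kept.
by_key=$(printf '1 x\n2 x\n3 y\n' | uniq -f 1)

printf '%s\n---\n%s\n---\n%s\n' "$adjacent" "$total" "$by_key"
```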

`grep` + friends

Searches text by regular expression.

-i: case-insensitive
-o: only show the matched part of the line (if multiple matches on an input line, these will be on separate output lines)
-w: match only whole words
-l: list only files in which matches were found
-r: recursive
-n: include matching lines and line numbers
-v (--invert-match): filter out matches
-c (--count): give counts of the matches within each file instead of the matches themselves
-E or egrep: extended regex syntax: unescaped +, ?, |, (, ), {, and } serve as operators
-F or fgrep: literal string matching (no regexes)
-H: always show the filename when displaying matches; -h: suppress the filename
zgrep searches gzip-compressed files
bzgrep searches bzip2-compressed files
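
A sketch of several of these options on a small invented file; the sample sentences are hypothetical:

```shell
tmp=$(mktemp)
printf 'The cat sat\nconcatenate\nA dog\nCat and cat\n' > "$tmp"

# -w matches whole words only; -n prefixes the line number.
whole=$(grep -n -w 'cat' "$tmp")

# -c counts matching lines (not individual matches); -i ignores case.
count=$(grep -c -i 'cat' "$tmp")

# -o prints each match on its own line.
matches=$(grep -o -i -w 'cat' "$tmp")

# -v keeps the lines that do NOT match.
rest=$(grep -v 'cat' "$tmp")

# -E enables unescaped alternation and grouping.
either=$(grep -c -E '(cat|dog)' "$tmp")

printf '%s\n---\n%s\n---\n%s\n---\n%s\n---\n%s\n' "$whole" "$count" "$matches" "$rest" "$either"
rm -f "$tmp"
```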

Augmenting

`nl`

Adds line numbers to a file. (Cf. cat -n.) Options control formatting and counting of the line numbers, including:

-v STARTNUM: initial counter value (default: 1)
-i INCREMENT (default: 1)
-w WIDTH: number of characters to be occupied by line numbers (default: 6)
-s SEP: separator to follow every line number (default: tab)
-b t: number non-empty lines (default); -b a: number all lines; -b pREGEX: number lines matching the regular expression pattern REGEX
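
A sketch of the formatting options, with arbitrary values chosen for the start number, increment, width, and separator:

```shell
# Start at 10, step by 5, 3-character number column, ": " as separator.
numbered=$(printf 'x\ny\n' | nl -v 10 -i 5 -w 3 -s ': ')
printf '%s\n' "$numbered"

# Default (-b t) skips the empty line; -b a numbers it too.
default_count=$(printf 'a\n\nb\n' | nl | grep -c '[0-9]')
all_count=$(printf 'a\n\nb\n' | nl -b a | grep -c '[0-9]')
echo "$default_count $all_count"
```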

Reordering

`sort`

Sorts the input lines.

PUNCTUATION/SPECIAL CHARS MAY BE IGNORED depending on the value of the LC_COLLATE environment variable
-f: ignores case
-n: numeric sort (compares by “string numerical value”, so 9 sorts before 10)
-g: general numeric sort
-i: ignores nonprinting characters
-r: reverse
-u: unique
-k: sort key (field offsets)
-t: field delimiter

To sort by the first column, keeping only the last record for each: tac FILE | sort -k1,1 -u
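
The difference between the default and numeric orderings, and the use of -t with -k, can be sketched as follows. The input values are made up, and LC_ALL=C is set where the collation order matters:

```shell
# Default sort compares as strings, so "100" sorts before "9".
lexical=$(printf '10\n9\n100\n' | LC_ALL=C sort)

# -n compares by numeric value.
numeric=$(printf '10\n9\n100\n' | sort -n)

# -t sets the field delimiter; -k2,2n sorts numerically on the second field only.
by_field=$(printf 'x:10\ny:9\nz:100\n' | sort -t: -k2,2n)

# -u drops duplicate lines.
distinct=$(printf 'b\na\nb\n' | LC_ALL=C sort -u)

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$lexical" "$numeric" "$by_field" "$distinct"
```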

`shuf`

Randomly permutes the input lines. (GNU but not BSD)

`rev`

Reverses each line of the file.

Rearranging (changing spacing)

`expand`, `unexpand`

Convert tabs to spaces and vice versa.

`fold`

Wrap input text so no line is more than a specified number of characters wide.

-w WIDTH: wrap lines in a file to be no more than WIDTH characters long (default: 80)
-s: breaks lines at word spaces
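
A sketch of the effect of -s, using an invented line and an arbitrary width of 6. Line counts are compared rather than the exact text, since the wrapped lines keep their trailing spaces:

```shell
# Without -s, lines are cut at exactly 6 characters, even mid-word.
hard=$(printf 'aaa bbb ccc ddd\n' | fold -w 6 | wc -l | tr -d ' ')

# With -s, each break falls after the last space that fits, so words stay whole.
soft=$(printf 'aaa bbb ccc ddd\n' | fold -s -w 6 | wc -l | tr -d ' ')

echo "hard wrap: $hard lines, word wrap: $soft lines"
```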

`column`

Formats text into columns (by default, based on whitespace delimiters).

Dividing

`split`

split FILE OUTPREFIX breaks up the input into smaller files by size.

Output files are named by the specified prefix and some number of lowercase alphabetic characters (configurable with -a; defaults to 2, i.e. aa, ab, etc.). In some implementations -d can be provided to request decimal rather than alphabetic suffixes. One of the following may be provided to determine the splitting behavior:

-l NUMLINES: number of lines in each output file (default: 1000)
-b BYTESIZE: size of each output file; BYTESIZE can even be in kilobytes (10k) or megabytes (10m)
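
A sketch of splitting by line count in a scratch directory; the prefix part_ and the five-line input are arbitrary choices for illustration:

```shell
dir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 > "$dir/input"
original=$(cat "$dir/input")

# Two lines per piece: part_aa (1,2), part_ab (3,4), part_ac (5).
split -l 2 "$dir/input" "$dir/part_"
pieces=$(ls "$dir" | grep -c '^part_')
last_piece=$(cat "$dir/part_ac")

# Because the suffixes sort alphabetically, cat with a glob reassembles the input.
rejoined=$(cat "$dir"/part_*)

echo "$pieces pieces; last piece contains: $last_piece"
rm -rf "$dir"
```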

`csplit`

Breaks up the input into smaller files by content.

Main arguments are the file, followed by one or more patterns indicating split points. Each pattern may be a line number, a regexp (optionally with a line offset), or a number of lines followed by {REPEATS} to indicate REPEATS blocks of the specified number of lines. The line matching the pattern begins a new output file. Output files are numbered with decimal digits.

-f OUTPREFIX (default: xx)
-n NUMDIGITS (default: 2)
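
A sketch of splitting at lines that match a pattern, on an invented file with two "# " headings. The prefix sec_ is arbitrary, -s silences the byte counts csplit normally prints, and {1} repeats the pattern one additional time:

```shell
dir=$(mktemp -d)
printf 'intro\n# one\na\n# two\nb\n' > "$dir/input"

# Pieces: sec_00 = text before the first heading, sec_01 and sec_02 = one section each.
(cd "$dir" && csplit -s -f sec_ input '/^# /' '{1}')

pieces=$(ls "$dir" | grep -c '^sec_')
first=$(cat "$dir/sec_00")
second=$(cat "$dir/sec_01")

printf '%s pieces\n%s\n---\n%s\n' "$pieces" "$first" "$second"
rm -rf "$dir"
```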

Replacing

`tr`

Translates characters, e.g. lowercasing text in a file or replacing newlines with spaces.
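
A few common uses, sketched on invented input. The -s (squeeze), -c (complement), and -d (delete) options are standard tr flags not described above:

```shell
# Lowercase via POSIX character classes.
lower=$(printf 'Hello World\n' | tr '[:upper:]' '[:lower:]')

# Replace newlines with spaces (note the trailing space left by the final newline).
oneline=$(printf 'a\nb\nc\n' | tr '\n' ' ')

# One word per line: turn every run of non-letters into a single newline.
tokens=$(printf 'Hi, there!\n' | tr -sc 'A-Za-z' '\n')

# Delete characters outright.
digits_removed=$(printf 'a1b2c3\n' | tr -d '0-9')

printf '%s\n%s\n%s\n%s\n' "$lower" "$oneline" "$tokens" "$digits_removed"
```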

`sed`

Text substitution by regular expression matching.

sed -E 's/K\.? ?V?\.? ?/K/g' FILE replaces all matches of the pattern in the file (-E is needed for the unescaped ? to act as an operator)
sed '/^$/d' FILE filters out blank lines of the file
Depending on the implementation, sed may or may not support backslash-denoted characters and character classes such as \t and \s. (These are supported in GNU but not BSD implementations; the POSIX class [[:space:]] is the portable alternative.) Tabs and newlines can be entered as literals.
Unless -E is specified, the plus operator and parenthesized subexpressions MUST HAVE BACKSLASH ESCAPES when using sed/grep: e.g. \(.*\).\+
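
The escaping rule can be sketched by writing the same substitution in both syntaxes. The key=value input is invented, and the basic-syntax version uses * rather than \+ to stay portable:

```shell
# Basic syntax: capture groups need backslash-escaped parentheses.
basic=$(printf 'key=value\n' | sed 's/\([a-z]*\)=\([a-z]*\)/\2=\1/')

# Extended syntax (-E): parentheses and + work unescaped.
extended=$(printf 'key=value\n' | sed -E 's/([a-z]+)=([a-z]+)/\2=\1/')

# Deleting blank lines works the same way in either syntax.
noblank=$(printf 'a\n\nb\n' | sed '/^$/d')

printf '%s\n%s\n%s\n' "$basic" "$extended" "$noblank"
```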

Two input streams

Call them FILE1 and FILE2, respectively.

`diff`

Compare two files line-by-line. (Cf. diff3 for three files.) Can also compare directories.

-i: ignore case
-w: ignore whitespace
-B: ignore blank lines
-I REGEX: ignore differences among lines that all match the regular expression pattern REGEX
-x PATTERN: when comparing directories, exclude files whose names match the shell pattern PATTERN
-r: recursively compare subdirectories
-s: report identical files (file comparison only)
-u: unified diff display format
-y: side-by-side display (cf. sdiff)
--suppress-common-lines

Other options are useful when comparing code, e.g. -p, -F, -D, and -E.
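
A sketch comparing two invented three-line files that differ only in the case of one letter. diff exits with status 1 when the files differ, so the calls are guarded for scripts that run under set -e:

```shell
dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$dir/f1"
printf 'a\nB\nc\n' > "$dir/f2"

# Default output format: line 2 changed.
changed=$(diff "$dir/f1" "$dir/f2" || true)
printf '%s\n' "$changed"

# With -i the case difference is ignored, so diff reports success.
if diff -i "$dir/f1" "$dir/f2" > /dev/null; then
  ignoring_case=same
else
  ignoring_case=different
fi

# -u produces unified output; each hunk starts with an @@ line.
unified_hunks=$(diff -u "$dir/f1" "$dir/f2" | grep -c '^@@' || true)

echo "ignoring case: $ignoring_case; unified hunks: $unified_hunks"
rm -rf "$dir"
```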

`sdiff`

Compare files side-by-side, as with diff -y. Many of the diff options are available; -s is short for --suppress-common-lines.

`comm`

Displays lines common or unique to two SORTED files, organized into columns according to their commonality.

-1 to suppress the first column (lines only in FILE1)
-2 to suppress the second column (lines only in FILE2)
-3 to suppress the third column (lines common to both files)
-i for case-insensitive comparison (BSD but not GNU)
if the files are sorted, join -t'
' FILE1 FILE2 is much faster than comm -1 -2 FILE1 FILE2
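
Suppressing two of the three columns gives set-style operations. A sketch on two invented, already sorted word lists:

```shell
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/f1"
printf 'banana\ncherry\ndate\n' > "$dir/f2"

both=$(comm -12 "$dir/f1" "$dir/f2")         # intersection
only_first=$(comm -23 "$dir/f1" "$dir/f2")   # lines only in the first file
only_second=$(comm -13 "$dir/f1" "$dir/f2")  # lines only in the second file

printf '%s\n---\n%s\n---\n%s\n' "$both" "$only_first" "$only_second"
rm -rf "$dir"
```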

`join`

Merges the lines of two sorted text files based on the presence of a common field.

-t CHAR to indicate a delimiter (by default, it is any sequence of whitespace). For instance, join -t'
' FILE1 FILE2 lists all common lines, assuming the two files are sorted.
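
A sketch of joining on the first field, with invented data sorted on that field. The second pair of files uses a comma delimiter via -t:

```shell
dir=$(mktemp -d)
printf 'a 1\nb 2\nc 3\n' > "$dir/f1"
printf 'a x\nc z\nd w\n' > "$dir/f2"

# Only keys present in both files ("a" and "c") appear in the output.
joined=$(join "$dir/f1" "$dir/f2")

printf 'a,1\nb,2\n' > "$dir/g1"
printf 'a,x\nb,y\n' > "$dir/g2"
joined_csv=$(join -t, "$dir/g1" "$dir/g2")

printf '%s\n---\n%s\n' "$joined" "$joined_csv"
rm -rf "$dir"
```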

Typically two or more input streams

`paste`

Combines/merges lines of multiple files.

paste FILE1 FILE2 … joins corresponding lines with tabs (vertically)
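paste -s FILE1 FILE2 … joins horizontally: all lines of each file are merged into a single tab-separated output line, one output line per input file (so corresponding lines of the input files line up as columns)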
paste -d'\n' FILE1 FILE2 … interleaves lines of the files
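
The three modes can be compared side by side on two invented three-line files:

```shell
dir=$(mktemp -d)
printf '1\n2\n3\n' > "$dir/f1"
printf 'a\nb\nc\n' > "$dir/f2"

# Default: corresponding lines joined by a tab.
side_by_side=$(paste "$dir/f1" "$dir/f2")

# -s: each file becomes one tab-separated line.
serial=$(paste -s "$dir/f1" "$dir/f2")

# Newline as the delimiter interleaves the files line by line.
interleaved=$(paste -d'\n' "$dir/f1" "$dir/f2")

printf '%s\n---\n%s\n---\n%s\n' "$side_by_side" "$serial" "$interleaved"
rm -rf "$dir"
```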

Three input streams

`diff3`

Like diff, but for three files.
