The tiny AWK program I recently wrote1 is the following. It prints each line after having passed it through a regex replacement2.
grep -R TRIGGER . | awk '{gsub(/:.*/," ")}1'
This program looks really obscure to me. What does the {...}1
mean.
How does gsub
know what to match against?
Turns out that AWK3 is quite a large language - at least if you go by the size of the official documentation. The goals of this post is to cover just enough of it for you to be comfortable writing small programs. If you just want to know how the example above works you can jump directly to the Explaining the example section.
tl;dr AWK runs actions against streams of textual data. The stream is split into records according to a specified record separator and each record is split into fields according to the field separator. An AWK program is a sequence of pattern-action statements. If the pattern holds the action is executed against the current record.
Table of contents
- Invoking AWK
- Key concepts
- Program structure
- Patterns
- Actions
- Useful built-in variables & functions
- Explaining the example
- Further reading
Invoking AWK
Imagine you have the following AWK program in a file named program.awk
.
{print $4}
And the following text in a file named text.txt
.
This is the first line
This is the second line
This is the third line
Then you can invoke awk
like this
awk -f program.awk text.txt
# first
# second
# third
However, I almost exclusively use AWK as part of a pipe, which looks like this
cat text.txt | awk '{print $4}'
Key concepts
Before we start looking at the concrete syntax let’s get an overview of the key concepts.
Records The text stream is split into a sequence of records
according to the record separator. By default this is set to be
newlines but you can change it by setting the RS
variable - in the
following example the -v
flag is used to set the value of the RS
variable before the AWK program is executed.
# The following prints b d f separated by newlines
echo "a b,c d,e f" | awk -v RS=, '{print $2}'
Fields Each record is split into fields according to the field
separator which can be changed by setting the FS
variable or by
using the -F
option. The default field separator is space. You can
read more about it here.
echo "a,b,c" | awk -F, '{print $2}'` # prints b
echo "a,b,c" | awk -v FS=, '{print $2}' # prints b
Rules A rule consists of a pattern and an action, either of which (but not both) may be omitted. The action is executed if the pattern matches the current input record. An action consists of one or more awk statements.
With the main concepts in place let’s have a look at how to write an AWK program.
Program structure
As mentioned above an AWK program consists of a sequence of pattern-action statements which are called rules. Concretely that looks like this.
pattern { action }
...
pattern { action }
Either the pattern or the action can be omitted. If the pattern is
omitted the empty pattern is used. If the action is omitted the
default action is used which is {print $0}
, that is, it will print
the current record.
Each record is tested against all of rules. Let’s have a look at how to define patterns.
Patterns
There are five different kinds of patterns
-
Empty pattern: This pattern is used if you omitted a pattern in your rule. It matches every record.
-
Regular expression pattern: These allow you to use regular expressions to define your patterns.
-
Expression pattern: If the value of the expression is non-zero or non-null the pattern is considered a match.
-
Range pattern: These patterns consists of two patterns separated by a comma. The first pattern denotes the start of the range and the second pattern denotes the end of the range.
-
Built in special patterns. These patterns are built into AWK. The two most commonly used are
BEGIN
andEND
. They allow you to run actions before the first and after the last record is read. This is useful to initialize variables, perform cleanup, and such.
Have a look at an example of each of the patterns below.
# The empty pattern
{print "The empty pattern is always true: " $0}
# Regular expression pattern
/second/ {print "A line matched the regex: " $0 }
# An expression pattern
NR == 1 { print "The first line: " $0}
# A range pattern consisting of two expression patterns
NR == 2, NR == 3 { print "In range as NR is " NR ": " $0}
# Special patterns
BEGIN { print "Before everything else" }
END { print "After everything else" }
If we run that on the example input from before we get the following output.
Before everything else
The empty pattern is always true: This is the first line
The empty pattern is always true: This is the second line
A line matched the regex: This is the second line
In range as NR is 2: This is the second line
The empty pattern is always true: This is the third line
In range as NR is 3: This is the third line
After everything else
Actions
The purpose of an action is to tell awk what to do once a match for the pattern is found.
An action consists of one or more AWK statements, enclosed in braces. The statements are separated by newlines or semicolons.
There are six kinds of statements. This is where it becomes apparent that AWK is a full-blown programming language, so I’ll keep it short.
-
Expressions. Calling functions and assigning values to variables.
-
Control statements. Your usual control-flow constructs like
if
,for
,while
, etc. -
Compound statements. Simply used to group statements together, which you’ll most likely need if you use any control statements. It’s simply a pair of braces with statements separated by newlines or semicolons.
-
Input statements. These statements allow you to read input from stdin or files.
-
Output statements. These are used to print to stdout. You’ll already seen the use of
print
in all of the examples so far. -
Deletion statements. These are for deleting array elements.
Here’s an examples I made up to make it concrete. The program counts the total number of words in lines that contain six or more words. The Wikipedia page on AWK also has a few sample programs that might be useful.
BEGIN {
skipped = 0
long_lines = 0
count = 0
}
NF <= 5 { skipped++ }
NF > 5 {
long_lines++
count += NF
}
END {
print "Skipped " skipped " lines as they were too short"
print "Found " long_lines " with a total of " count " words!"
}
Have a look at the manual if you want to know more about actions.
Useful built-in variables & functions
You can find the full list of built-in variables here. These are the ones that you’ll most likely need.
FS
: The field separatorRS
: The record separatorNF
: The number of fields in the current recordNR
: The number of the current record, starts at 1.
You can find the full list of built-in functions here.
-
gsub(regexp, replacement [, target])
Thetaget
is optional, if left out it will default to$0
. An example would begsub(/my-regex/, "replacement", $3)
-
length([string])
.
Computes the length of the input string. The argument is optional, it will default to$0
Explaining the example
Alright, let’s try to dissect the example to see if we can make any sense of it.
{gsub(/:.*/," ")}1
This AWK program consists of two rules. {gsub(/:.*/," ")}
and 1
.
The first rule has omitted the pattern and as such it uses the empty
pattern. The second rule has omitted the action which means AWK will
use the default action. Additionally the taget
argument to gsub
has been omitted which means that it will default to $0
. Let’s make
all of this explicit.
{gsub(/:.*/," ", $0)}
1 { print $0 }
Now that everything is explicit it’s a bit more clear that AWK will run the regular expression substitution on the current record and then print it.
Further reading
Everything you could possibly want to know can be found somewhere in The GNU AWK User’s Guide.
If you want something that’s smaller than the manual but contains more information than this post I’d suggest having a look AWK - A Tutorial and Introduction
The Wikipedia page on AWK is quite good as well.