These are my notes as I work my way through Effective awk Programming, 4th Ed.
/li/ { print $0 }
length($0) > 80 # print long lines
NF > 0 # print lines with at least one field
$6 == "Nov" { sum += $5 } END { print sum } # total size of files last modified in November (ls -l input)
# max line length
{ if( length($0) > max ) max = length($0) }
END { print max }
# n.b. pipe the input through the expand command first so that tabs are counted as spaces
BEGIN { for( i=0; i<=7; i++ ) print int(101*rand()) }
ls -l files | awk '{ x += $5 } END { print x }'
ls -l files | awk '{ x += $5 } END { print x/1024 }'
awk -F: '{ print $1 }' /etc/passwd
awk 'END { print NR }'
awk 'NR % 2 == 0' # print even numbered lines
Awk processes all rules in order. If more than one rule matches, awk will execute all of them, possibly printing a line more than once.
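A minimal sketch of this, using two made-up patterns and two lines of sample input. The first line matches both rules, so it is printed twice; the second matches only the first rule.

```shell
# Every rule is tested against every record, in the order written.
out=$(printf 'alpha\nbeta\n' | awk '
    /a/  { print "rule 1:", $0 }
    /al/ { print "rule 2:", $0 }
')
printf '%s\n' "$out"
```

If only the first matching rule should fire, ending its action with `next` skips the remaining rules for that record.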
options
-F fs
--field-separator fs
-f source-files
--file source-file
-v var=val
--assign var=val
(can be used more than once)
-W gawk-opt
POSIX convention for implementation specific args
-- end of options
-b
--characters-as-bytes
-c
--traditional
-C
--copyright # print GPL
-dfile
--dump-variables=file
-Dfile
--debug=file
-e program-text
--source program-text
-E file
--exec file
var=val disallowed
should be used with #!
-g
--gen-pot
generate GNU gettext portable object template
-h
--help
-i source-file
--include source-file
equivalent to @include
source is not loaded if already loaded,
whereas -f includes every time
(like include vs include_once in PHP)
-l ext
--load ext
load .so extension
-Lvalue
--lint=value
-M
--bignum
use arbitrary precision arithmetic
-n
--non-decimal-data
enable interpretation of octal and hex
-N
--use-lc-numeric
use locale's decimal separator
-ofile
--pretty-print=file
-O
--optimize
-pfile
--profile=file
Enable profiling
-P
--posix
operate in strict POSIX mode
-r
--re-interval
allow interval expressions in regexes
(this is gawk's default behaviour)
-S
--sandbox
disable system() and redirections
-t
--lint-old
warn about constructs not available in original awk
-V
--version
Args
command line args of the form var=val do variable assignment rather than naming an input file
command line args available via ARGV
ARGIND (gawk) is the index in ARGV of the file currently being processed
-v assignments happen before BEGIN
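A small sketch of the timing difference between the two kinds of assignment. The variable names `a` and `b` are made up for the example, and `-` is used as the input operand so that the data comes from stdin.

```shell
# a is set with -v, so BEGIN already sees it.
# b=2 is an operand: it is assigned only when awk reaches it in ARGV,
# which is after BEGIN and just before the next input (here "-", stdin).
out=$(printf 'x\n' | awk -v a=1 '
    BEGIN { print "BEGIN a=" a " b=" b }
          { print "main a=" a " b=" b }
' b=2 -)
printf '%s\n' "$out"
```

In BEGIN, `b` is still unset and prints as an empty string; by the time the first record is read it holds 2.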
env vars
AWKPATH for awk source
AWKLIBPATH for -l
GAWK_MSEC_SLEEP
GAWK_READ_TIMEOUT
GAWK_SOCK_RETRIES
POSIXLY_CORRECT
Regexes
Char classes
| Class | Meaning |
|---|---|
| [:alnum:] | Alphanumeric characters |
| [:alpha:] | Alphabetic characters |
| [:blank:] | Space and TAB characters |
| [:cntrl:] | Control characters |
| [:digit:] | Numeric characters |
| [:graph:] | Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both) |
| [:lower:] | Lowercase alphabetic characters |
| [:print:] | Printable characters (characters that are not control characters) |
| [:punct:] | Punctuation characters (characters that are not letters, digits, control characters, or space characters) |
| [:space:] | Space characters (such as space, TAB, and formfeed, to name a few) |
| [:upper:] | Uppercase alphabetic characters |
| [:xdigit:] | Characters that are hexadecimal digits |
for ascii, use [\x00-\x7F]
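One thing that is easy to trip over: a class name is only valid inside a bracket expression, so the usable form is `[[:digit:]]`, not `[:digit:]` on its own. A short sketch with made-up sample lines (this assumes an awk that supports POSIX classes; some very old awks do not):

```shell
# The outer [...] is the bracket expression; [:alpha:] / [:digit:] sit inside it.
out=$(printf 'abc\na1\n42\n' | awk '
    /^[[:alpha:]]+[[:digit:]]$/ { print "mixed:", $0 }
    /^[[:digit:]]+$/            { print "number:", $0 }
')
printf '%s\n' "$out"
```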
Equiv classes and collation symbols
[.ch.] treats ch differently than c followed by h
[=e=] matches e.g. é,ë as well as plain e
These are useful in non-English locales.
Greedy
+ and * are always greedy: awk picks the leftmost-longest match, and there are no non-greedy quantifiers.
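Two quick checks of greedy matching, as a sketch with made-up input. `match()` shows how much text `a+` takes, and `sub()` shows the classic surprise with `.*`.

```shell
# match() reports where the match starts (RSTART) and how long it is (RLENGTH).
span=$(printf 'xaaay\n' | awk '{ match($0, /a+/); print RSTART, RLENGTH }')

# .* runs to the last ">" it can reach, so the whole line is replaced.
tag=$(printf '<b>x</b>\n' | awk '{ sub(/<.*>/, "TAG"); print }')

printf '%s\n%s\n' "$span" "$tag"
```

To stop at the first `>`, the usual workaround is a negated bracket expression such as `<[^>]*>`.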
Dynamic (computed) regexes: a string used where a regex is expected
BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp { print }
here the string in digits_regexp is used as the regex /[[:digit:]]+/, so the second line is treated as
$0 ~ /[[:digit:]]+/ { print }
gawk specific
\s,\S,\w,\W match like in Perl
\<,\> match begin and end of word
\y empty string at beginning or end of a word
\B opposite of \y
\` empty string at start of buffer (same as ^)
\' empty string at end of buffer (same as $)
IGNORECASE (gawk only), when set to a nonzero value, makes all regex matching case-insensitive. The portable alternative is to fold the case first:
tolower($0) ~ /[a-f]/ { print }
Predefined Variables 1
| Variable | Description |
|---|---|
| NR | total number of records so far |
| FNR | number of records in current file so far |
| RS | record separator, a newline by default; "\0" works in gawk but is not portable; setting RS to the empty string means records are separated by one or more blank lines |
| RT | record separator matched -- if RS is a regex, this is the actual matched string, set to null string if no RS matches |
| NF | number of fields in current record - decrementing this number throws away fields |
| FS | Field separator (not IFS), set FS="\n" to treat whole record as a single field |
| OFS | Output field separator |
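A sketch of the RS = "" (paragraph mode) behaviour from the table, with made-up input. In this mode a newline also acts as a field separator, so a multi-line record is still split into fields across its lines.

```shell
# Two records separated by blank lines; the first spans two lines.
out=$(printf 'a b\nc\n\n\nd\n' | awk '
    BEGIN { RS = "" }
    { print NR ": NF=" NF ", last=" $NF }
')
printf '%s\n' "$out"
```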
Note that if no fields are modified, $0 is the record exactly as it was read, with its original field separators.
If you assign to a field, e.g. $1 = $1, awk rebuilds $0 with OFS between the fields instead of whatever separated them in the input.
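A minimal sketch of the $1 = $1 trick, with a made-up input line containing runs of spaces. The first print shows the record untouched; the second shows it rebuilt with OFS.

```shell
out=$(printf 'a  b   c\n' | awk '
    BEGIN { OFS = "-" }
    { print; $1 = $1; print }   # before and after the rebuild
')
printf '%s\n' "$out"
```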
Fields
| Variable | Description |
|---|---|
| $0 | whole record |
| $1 | first field |
| $n | n'th field (counting from 1) |
| $NF | last field (NF == number of fields in record) |
| $var | use contents of variable var to select field |
| $(2*2) | evaluate expression 2*2 and use that as field number - note that a negative field number will terminate the program |
As an example of $var
BEGIN { x = 3 }
{ print $x }
is equivalent to
{ print $3 }
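A sketch combining the variable and expression forms from the table on one made-up record. The variable name `n` is just for illustration.

```shell
out=$(printf 'a b c d\n' | awk '{ n = 2; print $n, $(n + 1), $NF, $(NF - 1) }')
printf '%s\n' "$out"
```

The parentheses matter: `$` binds tighter than arithmetic, so `$NF-1` means "the last field, minus 1", not "the second-to-last field".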
Field contents can be changed. e.g
{ $3 = "bobbins"; print $0 }
turns
drwxrwxr-x 28 john john 4096 Sep 24 2022 anaconda3
into
drwxrwxr-x 28 bobbins john 4096 Sep 24 2022 anaconda3
Note that arithmetic can be used:
{ $2 += 400; print $0 }
turns the ls -l line into
drwxrwxr-x 428 john john 4096 Sep 24 2022 anaconda3
(the 28 has turned into 428)
Set FIELDWIDTHS (gawk) to a space-separated list of widths, e.g. "10 12 14", to parse records as fixed-width fields with no delimiters.
PROCINFO["FS"] tells you what sort of field splitting is happening (FS or FIELDWIDTHS for example)
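A sketch of fixed-width splitting with made-up widths and input. FIELDWIDTHS and PROCINFO are gawk extensions, so the example calls gawk explicitly and is skipped when gawk is not installed.

```shell
if command -v gawk >/dev/null 2>&1; then
    out=$(printf 'abcdefghi\n' | gawk '
        BEGIN { FIELDWIDTHS = "3 2 4" }   # three fields: 3, 2 and 4 characters wide
        { print PROCINFO["FS"] ": " $1 "|" $2 "|" $3 }
    ')
    printf '%s\n' "$out"
fi
```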
Content based splitting
For example
BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
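will parse simple csv data, including quoted fields that contain commas. FPAT is gawk-specific: it describes what a field looks like rather than what separates fields.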
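A sketch of that FPAT on one made-up CSV line. Note that the quotes stay part of the field, and that this pattern does not cope with empty fields. gawk is called explicitly because FPAT is an extension.

```shell
if command -v gawk >/dev/null 2>&1; then
    out=$(printf '%s\n' 'Smith,"12 High St, Leeds",UK' | gawk '
        BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
        { print NF; print $2 }   # the comma inside the quotes does not split the field
    ')
    printf '%s\n' "$out"
fi
```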
getline
For example
seq 20 | awk '{ x = $0; getline; y = $0; printf "%s -- %s\n",x,y }'
results in
1 -- 2
3 -- 4
5 -- 6
7 -- 8
9 -- 10
11 -- 12
13 -- 14
15 -- 16
17 -- 18
19 -- 20
You can use getline var to get the next line into a variable. So
seq 20 | awk '{ getline x; getline y; print $0,x,y }'
results in
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
19 20 18
(so getline var does not change $0. On the last line the second getline hits end of input and fails, so y still holds the stale 18.)
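getline returns 1 on success, 0 at end of input and -1 on error, so testing the return value avoids reusing a stale variable. A sketch with an odd number of input lines (the name `partner` is made up):

```shell
out=$(printf '1\n2\n3\n4\n5\n' | awk '{
    if ((getline partner) > 0) print $0, partner
    else                        print $0, "(no partner)"
}')
printf '%s\n' "$out"
```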
getline from file
getline < file # gets a line from a file
Silly example
<kx.sh awk '{ x = $0; getline < ".bashrc"; print NR, x, "--", $0 }'
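which prints "n a -- b" for each pair of corresponding lines, where n is the record number, a is the line from the input file kx.sh and b is the line from .bashrc. Note that getline < file sets $0 but leaves NR alone.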
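A self-contained variant of the same idea, as a sketch. It uses a temporary side file instead of .bashrc, reads into a variable so $0 is left alone, and checks the return value in case the side file is shorter than the main input. The names `side` and `other` are made up.

```shell
side=$(mktemp)
printf 'x\ny\n' > "$side"

out=$(printf 'a\nb\nc\n' | awk -v side="$side" '{
    if ((getline other < side) > 0) print NR, $0, "--", other
    else                             print NR, $0, "-- (side file exhausted)"
}')
rm -f "$side"
printf '%s\n' "$out"
```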
You can use pipes. Another silly example:
cat code_blocks_install.sh | awk '{
cmd = "ls rs";
while((cmd | getline) > 0) print;
close(cmd) }'
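A smaller sketch of the same pattern that needs no input file. It runs a made-up command once from BEGIN and reads its output line by line into a variable.

```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)   # > 0 stops on end of output and on error
        print "got:", line
    close(cmd)                         # lets the same command be run again later
}')
printf '%s\n' "$out"
```

In the example above the command sits in the main rule, so it is rerun for every input record; that only works because of the close(cmd).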
Use |& for coprocesses
print "some query" |& "db_server"
"db_server" |& getline
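Since db_server is only a placeholder, here is a runnable sketch that uses sort as the coprocess instead. Coprocesses are gawk-only, so the example calls gawk explicitly. sort produces nothing until it sees end of input, so the write side has to be closed before reading.

```shell
if command -v gawk >/dev/null 2>&1; then
    out=$(gawk 'BEGIN {
        cmd = "sort"
        print "pear"  |& cmd
        print "apple" |& cmd
        close(cmd, "to")                    # send EOF to sort, keep the read side open
        while ((cmd |& getline line) > 0)
            print "sorted:", line
        close(cmd)
    }')
    printf '%s\n' "$out"
fi
```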