These are my notes as I work my way through Effective awk Programming, 4th Ed.

/li/ { print $0 }
length($0) > 80    # print long lines
NF > 0             # print lines with at least one field
$6 == "Nov" { sum += $5 }


# max line length
{ if( length($0) > max ) max = length($0) }
END { print max }
# n.b. pipe the input through the expand command first so tabs are counted as spaces

BEGIN { for( i=0; i<=7; i++ ) print int(101*rand()) }

ls -l files | awk '{ x += $5 } END { print x }'
ls -l files | awk '{ x += $5 } END { print x/1024 }'

awk -F: '{ print $1 }' /etc/passwd
awk 'END { print NR }'
awk 'NR % 2 == 0' # print even numbered lines

Awk processes all rules in order. If more than one rule matches, awk will execute all of them, possibly printing a line more than once.
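
A quick check of this, using toy input of my own (not from the book): both patterns match "beta", so that line comes out once per matching rule.

```shell
# Two rules; a line is printed once for every rule whose pattern matches it.
out=$(printf 'alpha\nbeta\ngamma\n' | awk '
    /a/ { print "has-a:", $0 }
    /b/ { print "has-b:", $0 }
')
printf '%s\n' "$out"
```

"beta" appears twice in the output, once from each rule, in rule order.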

Options

-F fs
--field-separator fs

-f source-file
--file source-file

-v var=val
--assign var=val
(can be used more than once)

-W gawk-opt
POSIX convention for implementation specific args

-- end of options

-b 
--characters-as-bytes

-c
--traditional

-C
--copyright # print GPL

-dfile
--dump-variables=file

-Dfile
--debug=file

-e program-text
--source program-text

-E file
--exec file
var=val disallowed
should be used with #!

-g
--gen-pot
generate GNU gettext portable object template

-h
--help

-i source-file
--include source-file
equivalent to @include
source is not loaded if already loaded,
whereas -f includes every time
(like include vs include_once in PHP)

-l ext
--load ext
load .so extension

-Lvalue
--lint=value

-M
--bignum
use arbitrary precision arithmetic

-n
--non-decimal-data
enable interpretation of octal and hex

-N
--use-lc-numeric
use locale's decimal separator

-ofile
--pretty-print=file

-O
--optimize

-pfile
--profile=file
Enable profiling

-P
--posix
operate in strict POSIX mode

-r
--re-interval
allow interval expressions in regexes
(this is gawk's default behaviour)

-S
--sandbox
disable system() and redirections

-t
--lint-old
warn about constructs not available in original awk

-V
--version

Args

command line args of the form var=val do variable assignment rather than naming an input file

command line args are available via ARGV; ARGIND (gawk) is the index in ARGV of the file currently being processed

-v assignments happen before BEGIN
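
A small sketch of the timing difference. The variable names v and w and the input are made up for illustration; the trailing - just names standard input explicitly.

```shell
# v is set with -v, so it is already visible in BEGIN.
# w is a var=val operand, which awk only processes when it reaches
# that operand in the argument list, i.e. after BEGIN has run.
out=$(printf 'line\n' | awk -v v=early '
    BEGIN { print "BEGIN:", "v=" v, "w=" w }
          { print "main:",  "v=" v, "w=" w }
' w=late -)
printf '%s\n' "$out"
```

In BEGIN, w is still empty; by the time the first record is read it holds "late".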

env vars

AWKPATH for awk source
AWKLIBPATH for -l
GAWK_MSEC_SLEEP
GAWK_READ_TIMEOUT
GAWK_SOCK_RETRIES
POSIXLY_CORRECT

Regexes

Char classes

Class	Meaning
[:alnum:]	Alphanumeric characters
[:alpha:]	Alphabetic characters
[:blank:]	Space and TAB characters
[:cntrl:]	Control characters
[:digit:]	Numeric characters
[:graph:]	Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both)
[:lower:]	Lowercase alphabetic characters
[:print:]	Printable characters (characters that are not control characters)
[:punct:]	Punctuation characters (characters that are not letters, digits, control characters, or space characters)
[:space:]	Space characters (such as space, TAB, and formfeed, to name a few)
[:upper:]	Uppercase alphabetic characters
[:xdigit:]	Characters that are hexadecimal digits

for ascii, use [\x00-\x7F]

Equiv classes and collation symbols

[.ch.] is a collating symbol: it matches the multi-character collating element ch as a single unit, as opposed to c followed by h

[=e=] matches e.g. é,ë as well as plain e

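These are useful in non-English locales, though as far as I can tell gawk's regex library only recognizes the character classes, not collating symbols or equivalence classes.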

Greedy

+ and * are greedy: awk takes the leftmost, longest match. There are no non-greedy quantifiers.
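
To see the longest-match behaviour, a sketch with made-up input. The negated character class is the usual workaround when the shortest match is wanted.

```shell
# .* grabs as much as it can, so the match runs to the LAST X.
greedy=$(echo 'aXbXc' | awk '{ sub(/a.*X/, "-"); print }')
# A negated class cannot cross an X, so the match stops at the FIRST X.
short=$(echo 'aXbXc' | awk '{ sub(/a[^X]*X/, "-"); print }')
printf '%s\n%s\n' "$greedy" "$short"
```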

Dynamic regexes (a string used as a regex)

BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp { print }

here the string in digits_regexp is used as the regex /[[:digit:]]+/ so that the second line is treated as

$0 ~ /[[:digit:]]+/ { print }

gawk specific

\s,\S,\w,\W match like in Perl
\<,\> match begin and end of word
\y empty string at beginning or end of a word
\B opposite of \y
\` empty string at start of buffer (same as ^)
\' empty string at end of buffer (same as $)

Setting IGNORECASE (gawk only) to a non-zero value makes all regex matching case insensitive. Otherwise, lowercase the input before matching:

tolower($0) ~ /[a-f]/ { print }

Predefined Variables 1

Variable	Description
NR	total number of records so far
FNR	number of records in current file so far
RS	record separator, a newline by default; using "\0" works in gawk but is not portable; setting RS to the empty string means records are separated by one or more blank lines
RT	record terminator (gawk): the text that actually matched RS for the current record, which matters when RS is a regex; set to the null string if no RS matched
NF	number of fields in current record; decrementing it throws away the trailing fields and rebuilds $0
FS	Field separator (not IFS), set `FS="\n"` to treat whole record as a single field
OFS	Output field separator
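
A sketch of the RS="" (paragraph mode) behaviour from the table, on toy input of my own. In this mode a newline also acts as a field separator, so each line of a paragraph shows up as its own field here.

```shell
# Two "paragraphs" separated by blank lines; each becomes one record.
out=$(printf 'a\nb\n\n\nc\nd\n' | awk '
    BEGIN { RS = "" }
    { print NR ": " $1 "," $2 " (NF=" NF ")" }
')
printf '%s\n' "$out"
```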

Note that if no fields are modified, $0 is the record exactly as it was read, with its original separators. Assigning to any field, even a no-op such as $1=$1, rebuilds $0 with the fields joined by OFS rather than FS.
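
A sketch of the rebuild, with a made-up record and OFS. It also shows what assigning a smaller value to NF does.

```shell
out=$(echo 'a:b:c' | awk -F: -v OFS=- '{
    print            # untouched: $0 still has the input separators
    $1 = $1; print   # no-op assignment, but $0 is rebuilt using OFS
    NF = 2;  print   # shrinking NF drops the last field and rebuilds $0
}')
printf '%s\n' "$out"
```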

Fields

Variable	Description
$0	whole record
$1	first field
$n	n'th field (counting from 1)
$NF	last field (NF == number of fields in record)
$var	use contents of variable var to select field
$(2*2)	evaluate expression `2*2` and use that as field number; note that a negative field number is a fatal error and terminates the program

As an example of $var

BEGIN { x = 3 }
{ print $x }

is equivalent to

{ print $3 }
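
The same idea with an expression as the field number. The input and the variable n are my own illustration. $(NF - 1) is a handy way to get the second-to-last field.

```shell
# $n uses the value of n; $(...) evaluates the expression first.
out=$(echo 'a b c d' | awk '{ n = 2; print $n, $(n + 1), $(NF - 1), $NF }')
printf '%s\n' "$out"
```

The parentheses matter: $NF - 1 would mean "the last field, minus one".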

Field contents can be changed, e.g.

{ $3 = "bobbins"; print $0 }

turns

drwxrwxr-x 28 john john    4096 Sep 24  2022 anaconda3

into

drwxrwxr-x 28 bobbins john 4096 Sep 24 2022 anaconda3

Note that arithmetic can be used:

{ $2 += 400; print $0 }

turns the ls -l line into

drwxrwxr-x 428 john john 4096 Sep 24 2022 anaconda3

(the 28 has turned into 428)

Set FIELDWIDTHS (gawk only) to a space-separated list of widths, e.g. "10 12 14", to parse records as fixed-width fields with no delimiters.

PROCINFO["FS"] tells you which kind of field splitting is in effect (e.g. "FS", "FIELDWIDTHS" or "FPAT")
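
A minimal sketch of fixed-width splitting. It needs gawk, and the widths and input are made up.

```shell
# Split a 10-character record into fields of width 3, 3 and 4,
# then report which splitting mode gawk says it is using.
out=$(printf 'abcdefghij\n' | gawk '
    BEGIN { FIELDWIDTHS = "3 3 4" }
    { print $1 "|" $2 "|" $3, PROCINFO["FS"] }
')
printf '%s\n' "$out"
```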

Content based splitting

For example

BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }

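will parse simple csv data, where each field is either a run of non-comma characters or a double-quoted string (which may itself contain commas). It does not cope with empty fields or with quotes embedded inside a quoted field.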
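
A sketch of that FPAT in action. It needs gawk, and the sample address record is only an illustration.

```shell
# The quoted third field contains a comma but stays in one piece.
out=$(printf '%s\n' 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown' | gawk '
    BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
    { for (i = 1; i <= NF; i++) printf "$%d = <%s>\n", i, $i }
')
printf '%s\n' "$out"
```

The surrounding quotes are kept as part of the field, so they need stripping separately if unwanted.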

getline

For example

seq 20 | awk '{ x = $0; getline; y = $0; printf "%s -- %s\n",x,y }'

results in

1 -- 2
3 -- 4
5 -- 6
7 -- 8
9 -- 10
11 -- 12
13 -- 14
15 -- 16
17 -- 18
19 -- 20

You can use getline var to get the next line into a variable. So

seq 20 | awk '{ getline x; getline y; print $0,x,y }'

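results in

1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
19 20 18

(so getline var does not change $0; on the last record the second getline hits end of input, so y keeps its previous value, 18).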
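
Since getline returns 1 on success, 0 at end of input and -1 on error, it seems safer to test the result before trusting the variable. A sketch, with the variable name nxt being my own choice:

```shell
# Pair up lines; with an odd number of lines the last one has no partner.
out=$(seq 5 | awk '{
    if ((getline nxt) > 0)
        print $0, nxt
    else
        print $0, "(unpaired)"
}')
printf '%s\n' "$out"
```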

getline from file

getline < file      # gets a line from a file

Silly example

<kx.sh awk '{ x = $0; getline < ".bashrc"; print NR, x, "--", $0 }'

which prints the record number followed by "a -- b", where a is a line of the input file kx.sh and b is the corresponding line of .bashrc. (getline < file sets $0 and NF but leaves NR alone.)
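
A variant of the same idea that reads the side file into a variable, so $0 is left alone, and checks for the side file running out. The temp file, its contents and the names side and other are all made up for this sketch.

```shell
tmp=$(mktemp)
printf 'x\ny\n' > "$tmp"      # side file with only two lines

out=$(printf 'a\nb\nc\n' | awk -v side="$tmp" '{
    # getline var < file returns 0 once the file is exhausted
    if ((getline other < side) > 0)
        print NR, $0, "--", other
    else
        print NR, $0, "-- (side file exhausted)"
}')
rm -f "$tmp"
printf '%s\n' "$out"
```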

You can use pipes, another silly example

cat code_blocks_install.sh | awk '{ 
    cmd = "ls rs"; 
    while((cmd | getline) > 0) print; 
    close(cmd) }'
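
The same loop, reduced to a self-contained sketch that reads the command's output into a variable instead of $0. The echo command is just a stand-in for a real one.

```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    # each getline reads one line of the command output into line
    while ((cmd | getline line) > 0)
        n++
    close(cmd)   # without close, re-running cmd would not restart it
    print n, "lines, last was", line
}')
printf '%s\n' "$out"
```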

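Use |& for coprocesses (gawk only). Data flows both ways; if the coprocess needs to see end of input before it answers, close the write end first with close(cmd, "to").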

print "some query" |& "db_server"
"db_server" |& getline