title: awk notes
tags: awk

These are my notes as I work my way through Effective Awk 4th Ed.

```awk
/li/ { print $0 }

length($0) > 80              # print long lines

NF > 0                       # print lines with at least one field

$6 == "Nov" { sum += $5 }

# max line length
{ if( length($0) > max ) max = length($0) }
END { print max }
# n.b. expand command to expand tabs to spaces

BEGIN { for( i=0; i<=7; i++ ) print int(101*rand()) }
```

```
ls -l files | awk '{ x += $5 } END { print x }'
ls -l files | awk '{ x += $5 } END { print x/1024 }'
awk -F: '{ print $1 }' /etc/passwd
awk 'END { print NR }'
awk 'NR % 2 == 0'            # print even-numbered lines
```

Awk processes all rules in order. If more than one rule matches, awk executes all of them, possibly printing a line more than once.

# args

```
-F fs            --field-separator fs
-f source-file   --file source-file
-v var=val       --assign var=val         (can be used more than once)
-W gawk-opt      POSIX convention for implementation-specific args
--               end of options
-b               --characters-as-bytes
-c               --traditional
-C               --copyright              # print GPL
-dfile           --dump-variables=file
-Dfile           --debug=file
-e program-text  --source program-text
-E file          --exec file              like -f, but command-line var=val assignments are disallowed; intended for #! scripts
-g               --gen-pot                generate GNU gettext portable object template
-h               --help
-i source-file   --include source-file    equivalent to @include; a source is not loaded if already loaded, whereas -f includes every time (like include vs include_once in PHP)
-l ext           --load ext               load .so extension
-Lvalue          --lint=value
-M               --bignum                 use arbitrary-precision arithmetic
-n               --non-decimal-data       enable interpretation of octal and hex in input data
-N               --use-lc-numeric         use locale's decimal separator
-ofile           --pretty-print=file
-O               --optimize
-pfile           --profile=file           enable profiling
-P               --posix                  operate in strict POSIX mode
-r               --re-interval            allow interval expressions in regexes (this is gawk's default behaviour)
-S               --sandbox                disable system() and redirections
-t               --lint-old               warn about constructs not available in original awk
-V               --version
```

## Args

Command-line args of the form `var=val` do variable assignment rather than naming an input file.

Command-line args are available via `ARGV`.

`ARGIND` is the index in `ARGV` of the current input file.

`-v` assignments happen before the BEGIN rule runs.

## env vars

```
AWKPATH             search path for awk source files
AWKLIBPATH          search path for -l extensions
GAWK_MSEC_SLEEP
GAWK_READ_TIMEOUT
GAWK_SOCK_RETRIES
POSIXLY_CORRECT
```

# Regexes

## Char classes

```csv
sep=|
csvhead:
Class|Meaning
[:alnum:]|Alphanumeric characters
[:alpha:]|Alphabetic characters
[:blank:]|Space and TAB characters
[:cntrl:]|Control characters
[:digit:]|Numeric characters
[:graph:]|Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both)
[:lower:]|Lowercase alphabetic characters
[:print:]|Printable characters (characters that are not control characters)
[:punct:]|Punctuation characters (characters that are not letters, digits, control characters, or space characters)
[:space:]|Space characters (such as space, TAB, and formfeed, to name a few)
[:upper:]|Uppercase alphabetic characters
[:xdigit:]|Characters that are hexadecimal digits
```

For ASCII, use `[\x00-\x7F]`.

## Equiv classes and collation symbols

```
[.ch.]    treats ch differently than c followed by h
```

```
[=e=]     matches e.g. é,ë as well as plain e
```

These are useful in non-English locales.

## Greedy

By default `+` and `*` are greedy.
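A small sketch of what that means in practice; the sample string and regexp below are just for illustration:

```awk
# Greedy matching: /a+/ grabs the longest run of a's it can.
BEGIN {
    s = "foo aaa bar"
    if (match(s, /a+/))
        print substr(s, RSTART, RLENGTH)   # prints "aaa", not "a"
}
```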
## Dynamic regexps

```
BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp { print }
```

Here the contents of `digits_regexp` are used as the regexp `/[[:digit:]]+/`, so the second line is treated as

```
$0 ~ /[[:digit:]]+/ { print }
```

## gawk specific

```
\s,\S,\w,\W   match like in Perl
\<,\>         match beginning and end of a word
\y            empty string at the beginning or end of a word
\B            opposite of \y
\`            empty string at start of buffer (same as ^)
\'            empty string at end of buffer (same as $)
```

Setting `IGNORECASE` makes all regexes case-insensitive. Otherwise, lowercase the input yourself:

```
tolower($0) ~ /[a-f]/ { print }
```

## Predefined Variables 1

```csv
sep=-
vert
cols=2
csvhead:
Variable - Description
NR - total number of records so far
FNR - number of records in the current file so far
RS - record separator, a newline by default; setting RS to "\0" works in gawk but is not portable; setting RS to the empty string means records are separated by one or more blank lines
RT - the record separator actually matched; if RS is a regex, this is the matched text; set to the null string if nothing matched
NF - number of fields in the current record; decrementing it throws away fields
FS - field separator (not IFS); set FS="\n" to treat the whole record as a single field
OFS - output field separator
```

Note that if no fields are modified, `$0` is the record exactly as read, with the input field separators intact. If you assign to any field (even `$1 = $1`), `$0` is rebuilt with `OFS` between fields rather than `FS`.

## Fields

```csv
sep=-
vert
cols=2
csvhead:
Variable - Description
$0 - whole record
$1 - first field
$n - n'th field (counting from 1)
$NF - last field (NF == number of fields in record)
$var - use the contents of variable var to select the field
$(2*2) - evaluate the expression `2*2` and use the result as the field number; note that a negative field number terminates the program
```

As an example of `$var`,

```awk
BEGIN { x = 3 }
{ print $x }
```

is equivalent to

```awk
{ print $3 }
```

Field contents can be changed, e.g.

```awk
{ $3 = "bobbins"; print $0 }
```

turns

```
drwxrwxr-x 28 john john 4096 Sep 24 2022 anaconda3
```

into

```
drwxrwxr-x 28 bobbins john 4096 Sep 24 2022 anaconda3
```

Note that arithmetic can be used:

```awk
{ $2 += 400; print $0 }
```

turns the `ls -l` line into

```
drwxrwxr-x 428 john john 4096 Sep 24 2022 anaconda3
```

(the 28 has turned into 428)

Set `FIELDWIDTHS` to a sequence of numbers, e.g. `10 12 14`, to parse records as fixed-width fields with no delimiters.

`PROCINFO["FS"]` tells you what sort of field splitting is in effect (`FS` or `FIELDWIDTHS`, for example).

### Content based splitting

For example

```
BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
```

will parse CSV data, including quoted fields.

## getline

For example

```
seq 20 | awk '{ x = $0; getline; y = $0; printf "%s -- %s\n",x,y }'
```

results in

```
1 -- 2
3 -- 4
5 -- 6
7 -- 8
9 -- 10
11 -- 12
13 -- 14
15 -- 16
17 -- 18
19 -- 20
```

You can use `getline var` to read the next line into a variable. So

```
seq 20 | awk '{ getline x; getline y; print $0,x,y }'
```

results in

```
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
19 20 18
```

(so `getline var` does not change `$0`; the last line repeats 18 because the final `getline y` fails and y keeps its previous value).

### getline from files and commands

```
getline < file    # gets a line from a file
```

Silly example, reading a command's output line by line:

```
awk 'BEGIN { cmd = "ls -l"; while ((cmd | getline) > 0) print; close(cmd) }'
```

Use `|&` for coprocesses

```
print "some query" |& "db_server"
"db_server" |& getline
```
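A runnable sketch of the two-way pipe, with `sort` standing in for the database server (the coprocess command here is just an assumption for illustration; `|&` and the two-argument `close` are gawk extensions):

```awk
# Feed a few lines to a coprocess, close the write side so it sees EOF,
# then read its sorted output back.
BEGIN {
    cmd = "sort"
    print "banana" |& cmd
    print "apple"  |& cmd
    print "cherry" |& cmd
    close(cmd, "to")                 # close only the write end of the coprocess
    while ((cmd |& getline line) > 0)
        print "got:", line
    close(cmd)
}
```

Closing the "to" end first matters for filters like `sort` that cannot produce output until they have seen all of their input; a truly interactive server would answer line by line instead.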