title: awk notes
tags: awk

These are my notes as I work my way through Effective Awk 4th Ed.

```awk
/li/ { print $0 }

length($0) > 80              # print long lines

NF > 0                       # print lines with at least one field

$6 == "Nov" { sum += $5 }

# max line length
{ if( length($0) > max ) max = length($0) }
END { print max }
# n.b. expand command to expand tabs to spaces

BEGIN { for( i=0; i<=7; i++ ) print int(101*rand()) }
```

```
ls -l files | awk '{ x += $5 } END { print x }'
ls -l files | awk '{ x += $5 } END { print x/1024 }'
awk -F: '{ print $1 }' /etc/passwd
awk 'END { print NR }'
awk 'NR % 2 == 0'            # print even-numbered lines
```

Awk processes all rules in order. If more than one rule matches, awk executes all of them, possibly printing a line more than once.

# args

```
-F fs            --field-separator fs
-f source-file   --file source-file
-v var=val       --assign var=val         (can be used more than once)
-W gawk-opt      POSIX convention for implementation-specific args
--               end of options
-b               --characters-as-bytes
-c               --traditional
-C               --copyright              # print GPL
-dfile           --dump-variables=file
-Dfile           --debug=file
-e program-text  --source program-text
-E file          --exec file              like -f, but command-line var=val assignments are disallowed; intended for #! scripts
-g               --gen-pot                generate GNU gettext portable object template
-h               --help
-i source-file   --include source-file    equivalent to @include; a source is not loaded if already loaded, whereas -f includes every time (like include vs include_once in PHP)
-l ext           --load ext               load .so extension
-Lvalue          --lint=value
-M               --bignum                 use arbitrary-precision arithmetic
-n               --non-decimal-data       enable interpretation of octal and hex in input data
-N               --use-lc-numeric         use locale's decimal separator
-ofile           --pretty-print=file
-O               --optimize
-pfile           --profile=file           enable profiling
-P               --posix                  operate in strict POSIX mode
-r               --re-interval            allow interval expressions in regexes (this is gawk's default behaviour)
-S               --sandbox                disable system() and redirections
-t               --lint-old               warn about constructs not available in original awk
-V               --version
```

## Args

Command-line args of the form `var=val` do variable assignment rather than naming an input file.

Command-line args are available via `ARGV`.

`ARGIND` is the index in `ARGV` of the current input file.

`-v` assignments happen before the BEGIN rule runs.

## env vars

```
AWKPATH             search path for awk source files
AWKLIBPATH          search path for -l extensions
GAWK_MSEC_SLEEP
GAWK_READ_TIMEOUT
GAWK_SOCK_RETRIES
POSIXLY_CORRECT
```

# Regexes

## Char classes

```csv
sep=|
csvhead:
Class|Meaning
[:alnum:]|Alphanumeric characters
[:alpha:]|Alphabetic characters
[:blank:]|Space and TAB characters
[:cntrl:]|Control characters
[:digit:]|Numeric characters
[:graph:]|Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both)
[:lower:]|Lowercase alphabetic characters
[:print:]|Printable characters (characters that are not control characters)
[:punct:]|Punctuation characters (characters that are not letters, digits, control characters, or space characters)
[:space:]|Space characters (such as space, TAB, and formfeed, to name a few)
[:upper:]|Uppercase alphabetic characters
[:xdigit:]|Characters that are hexadecimal digits
```

For ASCII, use `[\x00-\x7F]`.

## Equiv classes and collation symbols

```
[.ch.]    treats ch differently than c followed by h
```

```
[=e=]     matches e.g. é,ë as well as plain e
```

These are useful in non-English locales.

## Greedy

By default `+` and `*` are greedy.
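A small sketch of what that means in practice; the sample string and regexp below are just for illustration:

```awk
# Greedy matching: /a+/ grabs the longest run of a's it can.
BEGIN {
    s = "foo aaa bar"
    if (match(s, /a+/))
        print substr(s, RSTART, RLENGTH)   # prints "aaa", not "a"
}
```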
## Dynamic regexps

```
BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp { print }
```

Here the contents of `digits_regexp` are used as the regexp `/[[:digit:]]+/`, so the second line is treated as

```
$0 ~ /[[:digit:]]+/ { print }
```

## gawk specific

```
\s,\S,\w,\W   match like in Perl
\<,\>         match beginning and end of a word
\y            empty string at the beginning or end of a word
\B            opposite of \y
\`            empty string at start of buffer (same as ^)
\'            empty string at end of buffer (same as $)
```

Setting `IGNORECASE` makes all regexes case-insensitive. Otherwise, lowercase the input yourself:

```
tolower($0) ~ /[a-f]/ { print }
```

## Predefined Variables 1

```csv
sep=-
vert
cols=2
csvhead:
Variable - Description
NR - total number of records so far
FNR - number of records in the current file so far
RS - record separator, a newline by default; setting RS to "\0" works in gawk but is not portable; setting RS to the empty string means records are separated by one or more blank lines
RT - the record separator actually matched; if RS is a regex, this is the matched text; set to the null string if nothing matched
NF - number of fields in the current record; decrementing it throws away fields
FS - field separator (not IFS); set FS="\n" to treat the whole record as a single field
OFS - output field separator
```

Note that if no fields are modified, `$0` is the record exactly as read, with the input field separators intact. If you assign to any field (even `$1 = $1`), `$0` is rebuilt with `OFS` between fields rather than `FS`.

## Fields

```csv
sep=-
vert
cols=2
csvhead:
Variable - Description
$0 - whole record
$1 - first field
$n - n'th field (counting from 1)
$NF - last field (NF == number of fields in record)
$var - use the contents of variable var to select the field
$(2*2) - evaluate the expression `2*2` and use the result as the field number; note that a negative field number terminates the program
```

As an example of `$var`,

```awk
BEGIN { x = 3 }
{ print $x }
```

is equivalent to

```awk
{ print $3 }
```

Field contents can be changed, e.g.

```awk
{ $3 = "bobbins"; print $0 }
```

turns

```
drwxrwxr-x 28 john john 4096 Sep 24 2022 anaconda3
```

into

```
drwxrwxr-x 28 bobbins john 4096 Sep 24 2022 anaconda3
```

Note that arithmetic can be used:

```awk
{ $2 += 400; print $0 }
```

turns the `ls -l` line into

```
drwxrwxr-x 428 john john 4096 Sep 24 2022 anaconda3
```

(the 28 has turned into 428)

Set `FIELDWIDTHS` to a sequence of numbers, e.g. `10 12 14`, to parse records as fixed-width fields with no delimiters.

`PROCINFO["FS"]` tells you what sort of field splitting is in effect (`FS` or `FIELDWIDTHS`, for example).

### Content based splitting

For example

```
BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
```

will parse CSV data, including quoted fields.

## getline

For example

```
seq 20 | awk '{ x = $0; getline; y = $0; printf "%s -- %s\n",x,y }'
```

results in

```
1 -- 2
3 -- 4
5 -- 6
7 -- 8
9 -- 10
11 -- 12
13 -- 14
15 -- 16
17 -- 18
19 -- 20
```

You can use `getline var` to read the next line into a variable. So

```
seq 20 | awk '{ getline x; getline y; print $0,x,y }'
```

results in

```
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
19 20 18
```

(so `getline var` does not change `$0`; the last line repeats 18 because the final `getline y` fails and y keeps its previous value).

### getline from files and commands

```
getline < file    # gets a line from a file
```

Silly example, reading a command's output line by line:

```
awk 'BEGIN { cmd = "ls -l"; while ((cmd | getline) > 0) print; close(cmd) }'
```

Use `|&` for coprocesses

```
print "some query" |& "db_server"
"db_server" |& getline
```
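A runnable sketch of the two-way pipe, with `sort` standing in for the database server (the coprocess command here is just an assumption for illustration; `|&` and the two-argument `close` are gawk extensions):

```awk
# Feed a few lines to a coprocess, close the write side so it sees EOF,
# then read its sorted output back.
BEGIN {
    cmd = "sort"
    print "banana" |& cmd
    print "apple"  |& cmd
    print "cherry" |& cmd
    close(cmd, "to")                 # close only the write end of the coprocess
    while ((cmd |& getline line) > 0)
        print "got:", line
    close(cmd)
}
```

Closing the "to" end first matters for filters like `sort` that cannot produce output until they have seen all of their input; a truly interactive server would answer line by line instead.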