Awk is a command-line tool for slicing and processing text.

Good guide on awk here. Another good tutorial here. Always wrap the awk program in single quotes so the shell does not expand $0, $1 and so on. $0 means the current line.

Example Data

Date        Open        High        Low         Close       Volume     Adj Close
2016-03-24  98.639999   98.849998   97.07       98.360001   10646900   98.360001
2016-03-23  99.75       100.389999  98.809998   99.589996   8292300    99.589996
2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500    99.839996
2016-03-21  101.150002  102.099998  99.50       101.059998  9562900    101.059998
2016-03-18  100.50      102.410004  100.010002  101.120003  15437300   101.120003

Read in data that is not whitespace separated

By default awk splits columns on runs of spaces or tabs. Specify another separator with the -F flag:

cat mydata.csv | awk -F, '{print $5}'     # split columns on commas

Can also specify a regex for -F:

cat mydata.csv | awk -F'[,-]' '{print $3, "--", $0}'
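
To see how a regex separator renumbers the columns, here is a minimal sketch. The CSV row is made up for illustration and is not taken from a real mydata.csv:

```shell
# One CSV row; the date contains hyphens, the rest is comma separated.
row='2016-03-24,98.63,98.84,97.07,98.36,10646900,98.36'

# -F'[,-]' splits on either character, so the date becomes three fields:
# $1=year, $2=month, $3=day, and the volume moves from $6 to $8.
fields=$(printf '%s\n' "$row" | awk -F'[,-]' '{print NF, $1, $3, $8}')
echo "$fields"   # 9 2016 24 10646900
```

Keep in mind that a hyphen separator would also split negative numbers, so this trick suits data like the above where only the date contains hyphens.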

Print 2nd Column

cat mydata.tsv | awk '{print $2}'

Print 1st, 6th, 5th Column

Commas between values insert the output separator (a space by default):

cat mydata.tsv | awk '{print $1, $6, $5}'

Print CSV or another format

A separator in quotes, such as a comma, is printed literally (values placed next to each other are concatenated):

cat mydata.tsv | awk '{print $1 "," $6}'

The proper way to specify the output separator in AWK is OFS:

cat mydata.tsv | awk 'BEGIN {OFS=","} {print $1, $6}'
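
A small sketch of when OFS takes effect, using a made-up row standing in for one line of mydata.tsv:

```shell
# Hypothetical whitespace-separated row.
row='2016-03-24  98.63  98.84  97.07  98.36  10646900  98.36'

# OFS is used wherever print has a comma between its arguments.
picked=$(printf '%s\n' "$row" | awk 'BEGIN {OFS=","} {print $1, $6}')
echo "$picked"    # 2016-03-24,10646900

# A bare print outputs $0 unchanged; assigning to any field ($1=$1)
# makes awk rebuild $0 with OFS between all fields.
whole=$(printf '%s\n' "$row" | awk 'BEGIN {OFS=","} {$1=$1; print}')
echo "$whole"     # 2016-03-24,98.63,98.84,97.07,98.36,10646900,98.36
```

The $1=$1 idiom is a handy way to convert a whole line from one separator to another.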

Match Text

Can be used as an alternative to grep.

echo -e "Test\nMatch" | awk '/Test/'

Regex Match Text

Can be used as an alternative to grep.

echo -e "2015-04-05\n2016-07-08\n2015-12-31" | awk '/^2015-/'

Regex Match Text on a column

Prints only the lines where the given column matches the regex:

cat mydata.tsv | awk '$1 ~ /^2015-/'
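
The difference from a whole-line match can be sketched with two made-up rows, where the second contains 2015 in a later column rather than in the date:

```shell
# Two illustrative rows (not from the example data).
data=$(printf '%s\n' '2015-04-05 10' '2016-07-08 2015')

# A bare /regex/ is tested against the whole line ($0): both rows match.
anywhere=$(printf '%s\n' "$data" | awk '/2015/')

# $1 ~ /regex/ is tested against column 1 only: just the first row matches.
first_col=$(printf '%s\n' "$data" | awk '$1 ~ /^2015-/')

echo "$first_col"   # 2015-04-05 10
```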

Everything but matched Text

Can be used as an alternative to grep -v.

echo -e "Test\nMatch" | awk '! /Test/'

Comparing Values

$2 == 124.47   # equality
$2 != 124.47   # inequality

$2 > 124.47    # greater than
$2 >= 124.47   # greater than or equal
$2 < 124.47    # smaller than
$2 <= 124.47   # smaller than or equal

$2 ~ /^10.$/   # regex match
$2 !~ /^10.$/  # regex negated match  -- this one might be new

Logical Operators

$1 ~ /^2015/ && $6 > 20000000  # and -- high volume in 2015
$6 < 1000000 || $6 > 20000000  # or  -- low or high volume
! /^2015/                      # not -- not in 2015
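
As a sketch of how comparisons and logical operators combine, the following runs against the example data from the top of this page, written to a temporary file (the thresholds are arbitrary):

```shell
# The example data, saved to a throwaway file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Date        Open        High        Low         Close       Volume     Adj Close
2016-03-24  98.639999   98.849998   97.07       98.360001   10646900   98.360001
2016-03-23  99.75       100.389999  98.809998   99.589996   8292300    99.589996
2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500    99.839996
2016-03-21  101.150002  102.099998  99.50       101.059998  9562900    101.059998
2016-03-18  100.50      102.410004  100.010002  101.120003  15437300   101.120003
EOF

# NR > 1 skips the header row; keep days that closed above 100
# or traded more than 10 million shares.
busy=$(awk 'NR > 1 && ($5 > 100 || $6 > 10000000) {print $1}' "$tmp")
echo "$busy"
# 2016-03-24
# 2016-03-21
# 2016-03-18
rm -f "$tmp"
```

The NR > 1 guard matters here: a non-numeric field such as the header text "Close" is compared as a string, and as a string it sorts after "100", so the header row would otherwise slip through.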

Built-in variables

For one file:

  • NR: number of records (lines) processed since AWK started
  • NF: the number of fields (columns) on the current line

On multiple files:

  • FNR: see NR above, but resets to 1 when it hits a new file
  • FILENAME: the name of the current file being processed
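
A quick sketch showing all four variables side by side, using two throwaway files (the file names and contents are arbitrary):

```shell
# Two small temporary files.
dir=$(mktemp -d)
printf 'a b c\nd e\n' > "$dir/one.txt"
printf 'f\n' > "$dir/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
report=$(cd "$dir" && awk '{print FILENAME, NR, FNR, NF}' one.txt two.txt)
echo "$report"
# one.txt 1 1 3
# one.txt 2 2 2
# two.txt 3 1 1
rm -rf "$dir"
```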

User defined variables

There is no need to declare variables in AWK; they are created on first use.

Print a running match count before each result:

cat mydata.tsv | awk '/^2015-/ {count++; print count, $0}'

What do variables initialize to?

awk 'BEGIN {print x + 2}'          # => 2
awk 'BEGIN {x = x + 2; print x}'   # => 2
awk 'BEGIN {print x}'              # => <blank> -- empty string, really

Initialise variables on the command line (with -v):

cat mydata.tsv | awk -v col=6 '{print $col}'

The same works for OFS (the separator to print with):

cat mydata.tsv | awk -v OFS=, '{print $1, $6}'
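
Since the awk program sits in single quotes, -v is also the usual way to hand a shell variable to awk. A minimal sketch, where the variable name wanted and its value are made up for illustration:

```shell
# A shell variable chosen at run time (hypothetical value).
wanted=3

# -v copies the shell value into the awk variable col before the program
# runs, so the single-quoted program needs no shell interpolation.
value=$(printf 'a b c d\n' | awk -v col="$wanted" '{print $col}')
echo "$value"   # c
```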

Special Patterns - Print at Start & End

If you want to print a header or footer, you can use BEGIN or END. BEGIN is triggered before processing any line. END is triggered after all lines are processed.

Example:

cat mydata.tsv | awk 'BEGIN {print "REPORT for XXXX"} {print}'
cat mydata.tsv | awk '{print} END {print NR}'
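
BEGIN, a main rule and END can be combined in one program. The sketch below prints a header, echoes each line while accumulating a sum, and prints an average as the footer; the three input rows are made up so the arithmetic is easy to check:

```shell
# Illustrative rows: date and closing price.
closes=$(printf '%s\n' '2016-03-24 98' '2016-03-23 100' '2016-03-22 102')

summary=$(printf '%s\n' "$closes" | awk '
BEGIN {print "date close"}          # header, before any input is read
      {sum += $2; print}            # every line: accumulate and echo
END   {print "average", sum / NR}   # footer, after the last line
')
echo "$summary"
# date close
# 2016-03-24 98
# 2016-03-23 100
# 2016-03-22 102
# average 100
```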

Multiple Conditions

cat mydata.tsv | awk '/^2016-03-24/ {print} $4 == 96.43 {print}' # Will print both lines that match
# Can also do the same this way:
cat mydata.tsv | awk '/^2016-03-24/; $4 == 96.43'

If a line matches both conditions, it is printed twice. To print it only once, use the next keyword, which skips the remaining rules for that line:

cat mydata.tsv | awk '/^2016-03-24/ {print; next} $4 == 97.07 {print}'
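
The effect of next can be checked by counting output lines. This sketch feeds in the single 2016-03-24 row from the example data, which satisfies both conditions:

```shell
row='2016-03-24  98.639999   98.849998   97.07       98.360001   10646900   98.360001'

# Without next, both rules fire on the same line, so it is printed twice.
twice=$(printf '%s\n' "$row" |
  awk '/^2016-03-24/ {print} $4 == 97.07 {print}' | awk 'END {print NR}')

# With next, awk moves on to the next input line after the first rule matches.
once=$(printf '%s\n' "$row" |
  awk '/^2016-03-24/ {print; next} $4 == 97.07 {print}' | awk 'END {print NR}')

echo "$twice $once"   # 2 1
```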

Arrays

To sum volume by year:

cat mydata.csv | awk -F'[,-]' '{volume[$1] += $8} END { for(year in volume) print year, volume[year]}'

Explanation:

  • Split on commas or hyphens
  • Accumulate the volume ($8) under the year ($1)
  • The volume array (a dictionary) is created automatically, as we use it that way
  • At the END print each year and volume sum
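
The steps above can be tried on a few made-up CSV rows with small volumes, so the sums are easy to verify by hand:

```shell
# Illustrative rows: date,open,high,low,close,volume,adj close.
data=$(printf '%s\n' \
  '2015-12-31,1,2,0,1,100,1' \
  '2016-03-24,1,2,0,1,200,1' \
  '2016-03-23,1,2,0,1,300,1')

# for (key in array) visits keys in no guaranteed order,
# so the result is piped through sort to make it stable.
totals=$(printf '%s\n' "$data" |
  awk -F'[,-]' '{volume[$1] += $8} END {for (year in volume) print year, volume[year]}' |
  sort)
echo "$totals"
# 2015 100
# 2016 500
```

If the input has a header row, it ends up as one extra key (for example Date) with a sum of 0, so it may be worth skipping it with NR > 1.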

Embedding in shell scripts

Same as above example in arrays, but in a script:

#!/bin/bash

cat "$@" | awk -F'[,-]' '

{volume[$1] += $8}

END {
  for(year in volume) {
    print year, volume[year]
  }
}
'

Explanation:

  • normal bash script
  • cat "$@" passes through the named files, or the piped input when no file is given
  • the awk program runs from the opening ' to the closing ', so it can span multiple lines
  • this can make the code more readable
  • a pipe can be added after the closing quote (') if the output should go to another program
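
A sketch of the script in use, both with a filename and with piped input. The script name sum_by_year.sh, the data file and its rows are hypothetical, and a sort has been added after the closing quote to show the final point:

```shell
dir=$(mktemp -d)

# Write the script to a temporary file (hypothetical name).
cat > "$dir/sum_by_year.sh" <<'EOF'
#!/bin/bash
cat "$@" | awk -F'[,-]' '
{volume[$1] += $8}
END {
  for (year in volume) {
    print year, volume[year]
  }
}
' | sort
EOF

# Made-up input rows.
printf '%s\n' '2015-12-31,1,2,0,1,100,1' '2016-03-24,1,2,0,1,200,1' > "$dir/data.csv"

from_file=$(bash "$dir/sum_by_year.sh" "$dir/data.csv")        # filename argument
from_pipe=$(cat "$dir/data.csv" | bash "$dir/sum_by_year.sh")  # piped input

echo "$from_file"
# 2015 100
# 2016 200
rm -rf "$dir"
```

Both calls give the same result, because cat "$@" with no arguments simply copies standard input.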

See also: