Awk is a command-line tool for slicing text.
Always wrap the awk program in single quotes, so the shell does not expand $0, $1 and so on before awk sees them.
$0 means the current line.
Example Data
Date Open High Low Close Volume Adj Close
2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
2016-03-23 99.75 100.389999 98.809998 99.589996 8292300 99.589996
2016-03-22 100.480003 101.519997 99.199997 99.839996 9039500 99.839996
2016-03-21 101.150002 102.099998 99.50 101.059998 9562900 101.059998
2016-03-18 100.50 102.410004 100.010002 101.120003 15437300 101.120003
Read in data not whitespace separated
By default awk splits fields on runs of spaces and tabs. Another separator can be specified with the -F flag:
cat mydata.csv | awk -F, '{print $5}' # split columns on commas
Can also specify a regex for -F:
cat mydata.csv | awk -F'[,-]' '{print $3, "--", $0}'
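A minimal sketch of the two -F forms above, using two rows of the example data written as CSV in place of mydata.csv (the variable names are only for illustration):

```shell
# Two rows from the example data, written as CSV (stand-in for mydata.csv).
csv='2016-03-24,98.639999,98.849998,97.07,98.360001,10646900,98.360001
2016-03-23,99.75,100.389999,98.809998,99.589996,8292300,99.589996'

# Split on commas only: field 5 is the Close column.
close_col=$(printf '%s\n' "$csv" | awk -F, '{print $5}')
echo "$close_col"

# Split on commas or hyphens: the date becomes three fields, so $3 is the day.
day_col=$(printf '%s\n' "$csv" | awk -F'[,-]' '{print $3}')
echo "$day_col"
```

Note that splitting the date shifts every later column by two, so Close moves from $5 to $7.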
Print 2nd Column
cat mydata.tsv | awk '{print $2}'
Print 1st, 6th, 5th Column
Commas between values will insert spaces:
cat mydata.tsv | awk '{print $1, $6, $5}'
Print CSV or another format
A comma (or any other separator) inside quotes is printed literally:
cat mydata.tsv | awk '{print $1 "," $6}'
Proper way to specify output separator in AWK is to use OFS:
cat mydata.tsv | awk 'BEGIN {OFS=","} {print $1, $6}'
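A small sketch comparing the two approaches on one row of the example data. The last command is an extra, not from the notes above: assigning a field to itself ($1 = $1) is a common awk idiom that makes awk rebuild $0 with OFS, so the whole line is converted.

```shell
# One whitespace-separated row from the example data (stand-in for mydata.tsv).
row='2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001'

# Concatenation with a literal comma.
joined=$(printf '%s\n' "$row" | awk '{print $1 "," $6}')

# Same output via OFS: each comma in the print statement becomes OFS.
with_ofs=$(printf '%s\n' "$row" | awk 'BEGIN {OFS=","} {print $1, $6}')

# Touching a field rebuilds $0 using OFS, converting the whole line.
whole_line=$(printf '%s\n' "$row" | awk 'BEGIN {OFS=","} {$1 = $1; print}')

echo "$joined"
echo "$with_ofs"
echo "$whole_line"
```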
Match Text
Can be used as an alternative to grep.
echo -e "Test\nMatch" | awk '/Test/'
Regex Match Text
Can be used as an alternative to grep.
echo -e "2015-04-05\n2016-07-08\n2015-12-31" | awk '/^2015-/'
Regex Match Text on a column
Only matches lines where the given column matches the regex:
cat mydata.tsv | awk '$1 ~ /^2015-/'
Everything but matched Text
Can be used as an alternative to grep -v.
echo -e "Test\nMatch" | awk '! /Test/'
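The matching forms above can be compared side by side. This is a sketch only: the three dates come from the regex example, while the two-column rows are made up to show the difference between matching the whole line and matching one column.

```shell
# The three dates from the regex example above.
dates=$(printf '2015-04-05\n2016-07-08\n2015-12-31')

# Lines that start with 2015- (like grep '^2015-').
in_2015=$(printf '%s\n' "$dates" | awk '/^2015-/')

# Lines that do not (like grep -v '^2015-').
not_2015=$(printf '%s\n' "$dates" | awk '!/^2015-/')

# Illustrative two-column input: the date sits in column 2, so a whole-line
# anchor finds nothing while a column match does.
rows=$(printf 'a 2015-04-05\nb 2016-07-08')
by_line=$(printf '%s\n' "$rows" | awk '/^2015-/')
by_column=$(printf '%s\n' "$rows" | awk '$2 ~ /^2015-/')

echo "$in_2015"
echo "$not_2015"
echo "$by_column"
```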
Comparing Values
$2 == 124.47 # equality
$2 != 124.47 # inequality
$2 > 124.47 # greater than
$2 >= 124.47 # greater than or equal
$2 < 124.47 # smaller than
$2 <= 124.47 # smaller than or equal
$2 ~ /^10.$/ # regex match
$2 !~ /^10.$/ # regex negated match -- this one might be new
Logical Operators
$1 ~ /^2015/ && $6 > 20000000 # and -- high volume in 2015
$6 < 1000000 || $6 > 20000000 # or -- low or high volume
! /^2015/ # not -- not in 2015
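A sketch applying a few of these conditions to the example data (without the header row). The thresholds are picked to suit these five rows and are not meaningful otherwise.

```shell
# The five rows of the example data: Date Open High Low Close Volume AdjClose.
data='2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
2016-03-23 99.75 100.389999 98.809998 99.589996 8292300 99.589996
2016-03-22 100.480003 101.519997 99.199997 99.839996 9039500 99.839996
2016-03-21 101.150002 102.099998 99.50 101.059998 9562900 101.059998
2016-03-18 100.50 102.410004 100.010002 101.120003 15437300 101.120003'

# and: Open above 100 with volume above 10 million.
busy=$(printf '%s\n' "$data" | awk '$2 > 100 && $6 > 10000000 {print $1, $6}')

# or: volume below 9 million or above 15 million.
extremes=$(printf '%s\n' "$data" | awk '$6 < 9000000 || $6 > 15000000 {print $1}')

# negated regex on a column: Open does not start with "10".
below_100=$(printf '%s\n' "$data" | awk '$2 !~ /^10/ {print $1}')

echo "$busy"
echo "$extremes"
echo "$below_100"
```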
Built-in variables
For one file:
- NR: number of records (lines) processed since AWK started
- NF: the number of fields (columns) on the current line
On multiple files:
- FNR: see NR above, but resets to 1 when it hits a new file
- FILENAME: the name of the current file being processed
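To see the four variables together, here is a sketch that writes two small throwaway files (one.txt and two.txt are made-up names) and prints FILENAME, NR, FNR and NF for every line:

```shell
# Two small temporary files with made-up content.
tmpdir=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmpdir/one.txt"
printf 'f\n' > "$tmpdir/two.txt"

# Run from inside the directory so FILENAME prints as the bare file name.
report=$(cd "$tmpdir" && awk '{print FILENAME, NR, FNR, NF}' one.txt two.txt)
echo "$report"

rm -r "$tmpdir"
```

NR keeps counting across both files (1, 2, 3), while FNR restarts at 1 for two.txt.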
User defined variables
No need to declare variables in AWK; they are created on first use.
Print a running match count before each result:
cat mydata.tsv | awk '/^2015-/ {count++; print count, $0}'
What do variables initialize to?
awk 'BEGIN {print x + 2}' # => 2
awk 'BEGIN {x = x + 2; print x}' # => 2
awk 'BEGIN {print x}' # => <blank> -- empty string, really
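A short sketch of why both results hold: an unset variable behaves as 0 in a numeric context and as the empty string in a string context. The pattern keyword must be uppercase BEGIN, since awk is case sensitive.

```shell
# x is never assigned; each test looks at it in a different context.
uninit=$(awk 'BEGIN {
    if (x == 0)  print "numeric zero"
    if (x == "") print "empty string"
    print length(x)
}')
echo "$uninit"
```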
Initialise variables on the command line (with -v):
cat mydata.tsv | awk -v col=6 '{print $col}'
Also can do same for OFS (what separator to print out with):
cat mydata.tsv | awk -v OFS=, '{print $1, $6}'
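A sketch combining both uses of -v, passing a shell variable into awk. The shell variable name col is arbitrary; inside the single-quoted program, $col is awk's field reference, not a shell expansion.

```shell
# One row from the example data.
row='2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001'

# Shell variable holding the column number to extract.
col=6

# Pass the column number and the output separator in with -v.
picked=$(printf '%s\n' "$row" | awk -v col="$col" -v OFS=, '{print $1, $col}')
echo "$picked"
```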
Special Patterns - Print at Start & End
If you want to print a header or footer, you can use BEGIN or END.
BEGIN is triggered before processing any line.
END is triggered after all lines are processed.
Example:
cat mydata.tsv | awk 'BEGIN {print "REPORT for XXXX"} {print}'
cat mydata.tsv | awk '{print} END {print NR}'
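A sketch that uses BEGIN and END in one program: a header first, then a footer with the row count and average close. It uses three rows of the example data and assumes the input is not empty, since the END block divides by NR.

```shell
# Three rows from the example data.
data='2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
2016-03-23 99.75 100.389999 98.809998 99.589996 8292300 99.589996
2016-03-22 100.480003 101.519997 99.199997 99.839996 9039500 99.839996'

report=$(printf '%s\n' "$data" | awk '
BEGIN {print "Date Close"}                # header, before any input is read
{print $1, $5; sum += $5}                 # body, once per line
END {printf "rows=%d avg_close=%.2f\n", NR, sum / NR}   # footer
')
echo "$report"
```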
Multiple Conditions
cat mydata.tsv | awk '/^2016-03-24/ {print} $4 == 96.43 {print}' # Will print both lines that match
# Can also do the same this way:
cat mydata.tsv | awk '/^2016-03-24/; $4 == 96.43'
If a line matches both conditions it is printed twice. Use the next keyword to stop after the first matching rule:
cat mydata.tsv | awk '/^2016-03-24/ {print; next} $4 == 97.07 {print}'
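A sketch of the effect on two rows of the example data. The 2016-03-24 row matches both the date pattern and the Low value, so it shows the difference:

```shell
# Two rows from the example data; the first one has Low ($4) equal to 97.07.
data='2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
2016-03-23 99.75 100.389999 98.809998 99.589996 8292300 99.589996'

# Without next: the first row satisfies both rules and is printed twice.
twice=$(printf '%s\n' "$data" | awk '/^2016-03-24/ {print $1} $4 == 97.07 {print $1}')

# With next: after the first rule fires, the remaining rules are skipped.
once=$(printf '%s\n' "$data" | awk '/^2016-03-24/ {print $1; next} $4 == 97.07 {print $1}')

echo "$twice"
echo "$once"
```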
Arrays
To sum volume by year:
cat mydata.csv | awk -F'[,-]' '{volume[$1] += $8} END { for(year in volume) print year, volume[year]}'
Explanation:
- Split on commas or hyphens
- Accumulate volume ($8), keyed by year ($1)
- The volume array (a dictionary) is created automatically on first use
- At the END, print each year and its volume sum
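A worked sketch on the example data written as CSV. It adds two things that are assumptions on top of the one-liner: NR > 1 skips the header row (otherwise "Date" would show up as a key), and a second array, days, counts rows per year. The order of for (key in array) is unspecified in awk, so the output is piped through sort.

```shell
# The example data as CSV, header included.
csv='Date,Open,High,Low,Close,Volume,Adj Close
2016-03-24,98.639999,98.849998,97.07,98.360001,10646900,98.360001
2016-03-23,99.75,100.389999,98.809998,99.589996,8292300,99.589996
2016-03-22,100.480003,101.519997,99.199997,99.839996,9039500,99.839996
2016-03-21,101.150002,102.099998,99.50,101.059998,9562900,101.059998
2016-03-18,100.50,102.410004,100.010002,101.120003,15437300,101.120003'

totals=$(printf '%s\n' "$csv" | awk -F'[,-]' '
NR > 1 {                 # skip the header row
    volume[$1] += $8     # $1 is the year, $8 the volume once the date is split
    days[$1]++           # count rows per year in a second array
}
END {
    for (year in volume) print year, days[year], volume[year]
}' | sort)
echo "$totals"
```

All five sample rows fall in 2016, so there is a single output line with the row count and the total volume.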
Embedding in shell scripts
Same as above example in arrays, but in a script:
#!/bin/bash
cat "$@" | awk -F'[,-]' '
{volume[$1] += $8}
END {
for(year in volume) {
print year, volume[year]
}
}
'
Explanation:
- normal bash script
- cat "$@" passes through the files named as arguments, or the piped input when no file is given
- the awk program runs from the opening ' to the closing ', so it can span multiple lines
- can make code more readable
- can add a pipe after the last quote (') if needed for the next program
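A variation on the script, sketched as a shell function so it can be tried in one go. The function name sum_volume_by_year is made up. It passes "$@" straight to awk instead of going through cat, which works because awk reads standard input when no file is named, and it adds a pipe to sort after the closing quote.

```shell
# Hypothetical helper: sum_volume_by_year [file...]
sum_volume_by_year() {
    awk -F'[,-]' '
        {volume[$1] += $8}
        END {
            for (year in volume) {
                print year, volume[year]
            }
        }
    ' "$@" | sort
}

# Two rows from the example data, as CSV without a header.
rows='2016-03-24,98.639999,98.849998,97.07,98.360001,10646900,98.360001
2016-03-23,99.75,100.389999,98.809998,99.589996,8292300,99.589996'

# Piped input: no arguments, so awk reads standard input.
piped=$(printf '%s\n' "$rows" | sum_volume_by_year)

# File input: the file name is handed to awk through "$@".
tmpfile=$(mktemp)
printf '%s\n' "$rows" > "$tmpfile"
from_file=$(sum_volume_by_year "$tmpfile")
rm -f "$tmpfile"

echo "$piped"
echo "$from_file"
```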
See also: