What AWK Is and When to Reach for It
AWK is a pattern-scanning and text-processing language built into every Linux distribution. It reads input line by line, splits each line into fields, and lets you match patterns and run actions on those fields. The name comes from its three creators: Aho, Weinberger, and Kernighan.
You already know grep and sed. Grep finds lines that match a pattern. Sed edits a stream of text with substitution rules. AWK goes further. It understands columns, does arithmetic, supports variables and arrays, and can produce formatted reports. When you need to pull the third column from a log file, sum up disk usage numbers, or reformat CSV data, AWK is the right tool.
A simple way to think about it: use grep to find, sed to replace, and AWK to extract and compute.
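One task run through all three tools makes the division of labor concrete. The sample lines below stand in for /etc/passwd, so the commands are runnable anywhere:

```shell
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin'

# find: grep prints the whole matching line
printf '%s\n' "$sample" | grep '/bin/bash'

# replace: sed rewrites matching lines (here: keep only the text before the first colon)
printf '%s\n' "$sample" | sed -n '\|/bin/bash|s|:.*||p'

# extract and compute: awk tests one field and prints another
printf '%s\n' "$sample" | awk -F: '$7 == "/bin/bash" { print $1 }'
```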
AWK Syntax Basics
The Pattern-Action Model
Every AWK program follows this structure:
awk 'pattern { action }' input_file
If you omit the pattern, the action runs on every line. If you omit the action, AWK prints the entire matching line. Here is a minimal example:
# Print every line (no pattern, default action)
awk '{ print }' /etc/hostname
# Print lines containing "root" (pattern, default action)
awk '/root/' /etc/passwd
Fields and Separators
AWK splits each input line into fields. By default, the separator is whitespace (spaces and tabs). The fields are numbered starting at $1. $0 holds the entire line.
# Print the first and third fields from a space-delimited file
echo "alice 29 engineering" | awk '{ print $1, $3 }'
# Output: alice engineering
To change the field separator, use the -F flag. Colon-separated files like /etc/passwd need -F:.
# Print usernames and shells from /etc/passwd
awk -F: '{ print $1, $7 }' /etc/passwd
Built-in Variables
AWK provides several variables that track state as it processes input:
| Variable | Meaning |
|---|---|
| NR | Current record (line) number |
| NF | Number of fields in the current record |
| FS | Field separator (default: whitespace) |
| OFS | Output field separator (default: space) |
| RS | Record separator (default: newline) |
| FILENAME | Name of the current input file |
# Print line numbers alongside content
awk '{ print NR, $0 }' /etc/hostname
# Print only lines with more than 5 fields
awk 'NF > 5' /var/log/syslog | head -3
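Of the separators in the table, OFS is the one that trips people up: it is only inserted between the comma-separated arguments of print. Setting FS and OFS together in a BEGIN block shows the pair in action:

```shell
# Read colon-separated input, emit pipe-separated output.
# FS could also be set with -F:, but BEGIN keeps both separators in one place.
echo 'alice:29:engineering' |
  awk 'BEGIN { FS = ":"; OFS = " | " } { print $1, $2, $3 }'
# Output: alice | 29 | engineering
```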
Practical Examples with Real Output
Extracting Columns from /etc/passwd
The /etc/passwd file uses colons as delimiters. Each line has seven fields: username, password placeholder, UID, GID, comment, home directory, and shell.
# List all users with UID >= 1000 (regular users)
awk -F: '$3 >= 1000 { print $1, $3, $6 }' /etc/passwd
# Example output:
# nobody 65534 /nonexistent
# max 1000 /home/max
# deploy 1001 /home/deploy
# Find accounts with /bin/bash as their shell
awk -F: '$7 == "/bin/bash" { print $1 }' /etc/passwd
# Example output:
# root
# max
Filtering Log Files
System logs contain timestamps, hostnames, service names, and messages. AWK can isolate specific fields or filter by patterns.
# Show only ERROR lines from an application log, with timestamp and message
awk '/ERROR/ { print $1, $2, $NF }' /var/log/app/server.log
# Example output:
# 2026-02-27 14:33:01 timeout
# 2026-02-27 14:33:18 refused
# Count how many times each HTTP status code appears in an nginx access log
awk '{ print $9 }' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -5
# Example output:
# 14523 200
# 2841 304
# 987 404
# 142 301
# 38 500
You can also do the counting entirely in AWK, keeping sort only for the final ranking:
awk '{ count[$9]++ } END { for (code in count) print count[code], code }' /var/log/nginx/access.log | sort -rn | head -5
Summing Numbers in a Column
AWK handles arithmetic naturally. This is where it pulls ahead of grep and sed.
# Sum the sizes of all files listed by ls -l
ls -l /var/log/*.log | awk '{ total += $5 } END { print "Total bytes:", total }'
# Example output:
# Total bytes: 4829104
# Average response time from a CSV log (time is in column 4)
awk -F, '{ sum += $4; n++ } END { print "Average:", sum/n, "ms" }' response_times.csv
# Example output:
# Average: 142.37 ms
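The same accumulate-then-report pattern extends to extremes. This sketch feeds inline sample rows in the shape assumed above (response time in column 4); point it at your real CSV instead:

```shell
# Track the maximum alongside the sum; NR in END holds the total row count.
# Assumes non-negative times (max starts at 0 for an uninitialized variable).
printf 'a,b,c,120.5\na,b,c,98.2\na,b,c,210.7\n' |
  awk -F, '{ sum += $4; if ($4 > max) max = $4 }
           END { printf "max %.1f ms, avg %.1f ms over %d rows\n", max, sum/NR, NR }'
# Output: max 210.7 ms, avg 143.1 ms over 3 rows
```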
Reformatting Output
The printf function in AWK gives you precise control over output formatting.
# Format /etc/passwd as a neat table
awk -F: '{ printf "%-15s %-6s %s\n", $1, $3, $6 }' /etc/passwd | head -5
# Example output:
# root 0 /root
# daemon 1 /usr/sbin
# bin 2 /bin
# sys 3 /dev
# sync 4 /bin
# Convert df's 1K-block counts to human-readable MB
df | awk 'NR > 1 { printf "%-20s %8.1f MB\n", $1, $3/1024 }'
# Example output:
# /dev/sda1 15234.8 MB
# /dev/sda2 8891.2 MB
# tmpfs 512.0 MB
AWK One-Liners Sysadmins Actually Use
These are commands you can paste directly into a terminal session. Each one solves a specific operational problem.
Disk Usage Analysis
# Show filesystems over 80% usage
df -h | awk 'NR > 1 { gsub(/%/, "", $5); if ($5+0 > 80) print $6, $5"%"}'
# Example output:
# / 84%
# /var 91%
# Find the 10 largest directories under /var
du -sh /var/*/ 2>/dev/null | sort -rh | head -10 | awk '{ printf "%-10s %s\n", $1, $2 }'
Process Monitoring
# List processes consuming more than 1% CPU
ps aux | awk '$3 > 1.0 { printf "%-10s %-8s %5.1f%% %s\n", $1, $2, $3, $11 }'
# Example output:
# root 1842 3.2% /usr/bin/dockerd
# www 2901 1.8% php-fpm:
# Count processes per user
ps aux | awk 'NR > 1 { count[$1]++ } END { for (user in count) printf "%-15s %d\n", user, count[user] }' | sort -k2 -rn
# Example output:
# root 142
# www 38
# max 7
Log Analysis
# Top 10 IP addresses hitting your web server
awk '{ print $1 }' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
# Extract requests that took longer than 2 seconds (assuming response time is the last field)
awk '$NF > 2.0 { print $4, $7, $NF "s" }' /var/log/nginx/access.log | tail -20
# Show failed SSH login attempts with username and source IP
awk '/Failed password/ { print $(NF-5), $(NF-3) }' /var/log/auth.log | sort | uniq -c | sort -rn | head -10
# Example output:
# 47 admin 203.0.113.42
# 23 root 198.51.100.7
# 11 test 192.0.2.15
AWK Scripts: Multi-line Programs with BEGIN and END
One-liners are fine for quick tasks. When the logic grows, write an AWK script file. AWK programs have three sections:
- BEGIN runs once before any input is read. Use it to set up variables, print headers, or configure separators.
- The main body runs once for every input line that matches the pattern.
- END runs once after all input is processed. Use it for summaries, totals, and final output.
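A minimal program exercising all three sections, runnable against seq:

```shell
# BEGIN prints a header, the body accumulates, END reports.
seq 1 5 | awk 'BEGIN { print "summing stdin" } { s += $1 } END { print "sum:", s }'
# Output:
# summing stdin
# sum: 15
```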
# Save this as disk_report.awk
BEGIN {
printf "%-25s %10s %10s %6s\n", "FILESYSTEM", "USED", "AVAIL", "PCT"
printf "%-25s %10s %10s %6s\n", "---------", "----", "-----", "---"
}
NR > 1 && /^\// {
used = $3 / 1048576
avail = $4 / 1048576
gsub(/%/, "", $5)
pct = $5 + 0
printf "%-25s %8.1f GB %8.1f GB %4d%%\n", $1, used, avail, pct
total_used += used
total_avail += avail
}
END {
printf "\n%-25s %8.1f GB %8.1f GB\n", "TOTAL", total_used, total_avail
}
Run it with:
df | awk -f disk_report.awk
# Example output:
# FILESYSTEM USED AVAIL PCT
# --------- ---- ----- ---
# /dev/sda1 14.9 GB 23.4 GB 39%
# /dev/sda2 8.7 GB 1.1 GB 89%
#
# TOTAL 23.6 GB 24.5 GB
Here is another script that generates a summary report from an Apache/nginx access log:
# Save as access_report.awk
BEGIN {
print "=== Web Access Report ==="
print ""
}
{
ip[$1]++
status[$9]++
total++
}
END {
print "Total requests:", total
print ""
print "Top 5 IPs:"
PROCINFO["sorted_in"] = "@val_num_desc"
n = 0
for (addr in ip) {
if (++n > 5) break
printf " %-18s %d requests\n", addr, ip[addr]
}
print ""
print "Status code breakdown:"
for (code in status) {
printf " %s: %d (%.1f%%)\n", code, status[code], (status[code]/total)*100
}
}
# Run against your access log
awk -f access_report.awk /var/log/nginx/access.log
# Example output:
# === Web Access Report ===
#
# Total requests: 18492
#
# Top 5 IPs:
# 192.168.1.50 4201 requests
# 10.0.0.12 2893 requests
# 203.0.113.42 1547 requests
# 198.51.100.7 891 requests
# 172.16.0.5 743 requests
#
# Status code breakdown:
# 200: 14523 (78.5%)
# 304: 2841 (15.4%)
# 404: 987 (5.3%)
# 500: 38 (0.2%)
Note: the PROCINFO["sorted_in"] variable is available in GNU AWK (gawk), which is the default on most Linux distributions. If you are on a system with a different AWK implementation, pipe the output through sort -rn instead.
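A portable rewrite of the Top 5 loop, for systems where awk is mawk or BusyBox awk: print every count and let sort and head do the ranking. Sample lines stand in for the access log here:

```shell
# Count per-IP in AWK, rank outside it -- no gawk-only PROCINFO needed.
printf '198.51.100.7 -\n192.0.2.1 -\n198.51.100.7 -\n' |
  awk '{ ip[$1]++ } END { for (a in ip) print ip[a], a }' |
  sort -rn | head -5
# Output:
# 2 198.51.100.7
# 1 192.0.2.1
```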
AWK vs sed vs grep: When to Use Which
All three tools process text line by line, but they solve different problems. Picking the right one saves time and avoids overengineered solutions.
| Task | Best Tool | Why |
|---|---|---|
| Find lines matching a pattern | grep | Fastest for simple pattern matching. Supports regex and recursive search. |
| Find and replace text in a stream | sed | Built for substitutions. Edits files in place with the -i flag. |
| Extract specific columns | awk | Understands fields natively. No need for cut or tr workarounds. |
| Arithmetic on column values | awk | Built-in math. Grep and sed have no arithmetic support. |
| Delete or insert lines by position | sed | Addressing by line number is simpler in sed. |
| Generate reports from structured data | awk | Variables, arrays, and printf make AWK a reporting language. |
| Quick count of matching lines | grep -c | Simpler and faster than awk for a single count. |
In practice, you often combine them in a pipeline. Grep narrows the input, AWK extracts and computes, and sed handles last-mile formatting. For example:
# Find failed logins, extract IPs, count occurrences
grep "Failed password" /var/log/auth.log | awk '{ print $(NF-3) }' | sort | uniq -c | sort -rn
If you catch yourself writing long chains of cut, tr, and sed to parse columns, that is a sign AWK would handle the job in a single command.
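A concrete case: grabbing the third whitespace-separated column. The cut route needs the spacing normalized first; AWK does not care:

```shell
line='  alice   29   engineering'

# The workaround chain: squeeze repeated spaces, drop the leading one, then cut.
echo "$line" | tr -s ' ' | sed 's/^ //' | cut -d' ' -f3

# The AWK version: default field splitting already ignores extra whitespace.
echo "$line" | awk '{ print $3 }'
# Both print: engineering
```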
Summary
AWK is a column-aware text processing language that fills the gap between simple pattern matching (grep) and stream editing (sed). It handles field extraction, arithmetic, conditional logic, and formatted output in a single tool. The pattern-action model is straightforward once you internalize it: for each line, check the pattern and run the action.
For daily sysadmin work, a handful of AWK one-liners covers disk monitoring, process analysis, and log parsing. When the logic outgrows a one-liner, move it into a script file with BEGIN and END blocks. AWK scripts are broadly portable, but the default implementation varies: most distributions ship GNU AWK (gawk), while Debian and Ubuntu default to mawk, so stick to POSIX features unless you know gawk is installed.
The examples in this guide are starting points. Modify the field numbers, patterns, and output formats to match your actual log files and data sources. The best way to get comfortable with AWK is to use it on real data during real tasks, not contrived exercises.
