AWK on Linux: Text Processing and Data Extraction Guide

Maximilian B.

What AWK Is and When to Reach for It

AWK is a pattern-scanning and text-processing language built into every Linux distribution. It reads input line by line, splits each line into fields, and lets you match patterns and run actions on those fields. The name comes from its three creators: Aho, Weinberger, and Kernighan.

You already know grep and sed. Grep finds lines that match a pattern. Sed edits a stream of text with substitution rules. AWK goes further. It understands columns, does arithmetic, supports variables and arrays, and can produce formatted reports. When you need to pull the third column from a log file, sum up disk usage numbers, or reformat CSV data, AWK is the right tool.

A simple way to think about it: use grep to find, sed to replace, and AWK to extract and compute.
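To make that division of labor concrete, here is the same task handled by all three tools. The file path and data are made up for illustration:

```shell
# Build a small sample file: name, age, team
printf 'alice 29 eng\nbob 34 ops\ncarol 41 eng\n' > /tmp/staff.txt

# grep finds: which lines mention the eng team?
grep 'eng$' /tmp/staff.txt

# sed replaces: expand the team abbreviation
sed 's/eng$/engineering/' /tmp/staff.txt

# awk extracts and computes: names plus the average age
awk '{ print $1; sum += $2 } END { print "average age:", sum/NR }' /tmp/staff.txt
# average age: 34.6667
```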

AWK Syntax Basics

The Pattern-Action Model

Every AWK program follows this structure:

awk 'pattern { action }' input_file

If you omit the pattern, the action runs on every line. If you omit the action, AWK prints the entire matching line. Here is a minimal example:

# Print every line (no pattern, default action)
awk '{ print }' /etc/hostname

# Print lines containing "root" (pattern, default action)
awk '/root/' /etc/passwd

Fields and Separators

AWK splits each input line into fields. By default, the separator is whitespace (spaces and tabs). The fields are numbered starting at $1. $0 holds the entire line.

# Print the first and third fields from a space-delimited file
echo "alice 29 engineering" | awk '{ print $1, $3 }'
# Output: alice engineering

To change the field separator, use the -F flag. Colon-separated files like /etc/passwd need -F:.

# Print usernames and shells from /etc/passwd
awk -F: '{ print $1, $7 }' /etc/passwd

Built-in Variables

AWK provides several variables that track state as it processes input:

Variable   Meaning
NR         Current record (line) number
NF         Number of fields in the current record
FS         Field separator (default: whitespace)
OFS        Output field separator (default: space)
RS         Record separator (default: newline)
FILENAME   Name of the current input file

# Print line numbers alongside content
awk '{ print NR, $0 }' /etc/hostname

# Print only lines with more than 5 fields
awk 'NF > 5' /var/log/syslog | head -3
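FS and OFS can also be set inside a BEGIN block instead of on the command line. This sketch reads colon-separated input and emits tab-separated output:

```shell
# Equivalent to -F:, but set programmatically, with tab-separated output
echo "root:x:0:0:root:/root:/bin/bash" |
  awk 'BEGIN { FS = ":"; OFS = "\t" } { print $1, $3, $7 }'
# Output: root, 0, and /bin/bash joined by tabs
```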

Practical Examples with Real Output

Extracting Columns from /etc/passwd

The /etc/passwd file uses colons as delimiters. Each line has seven fields: username, password placeholder, UID, GID, comment, home directory, and shell.

# List all users with UID >= 1000 (regular users)
awk -F: '$3 >= 1000 { print $1, $3, $6 }' /etc/passwd

# Example output:
# nobody 65534 /nonexistent
# max 1000 /home/max
# deploy 1001 /home/deploy

# Find accounts with /bin/bash as their shell
awk -F: '$7 == "/bin/bash" { print $1 }' /etc/passwd

# Example output:
# root
# max

Filtering Log Files

System logs contain timestamps, hostnames, service names, and messages. AWK can isolate specific fields or filter by patterns.

# Show only ERROR lines from an application log, with timestamp and message
awk '/ERROR/ { print $1, $2, $NF }' /var/log/app/server.log

# Example output:
# 2026-02-27 14:33:01 timeout
# 2026-02-27 14:33:18 refused

# Count how many times each HTTP status code appears in an nginx access log
awk '{ print $9 }' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -5

# Example output:
#   14523 200
#    2841 304
#     987 404
#     142 301
#      38 500

You can also do the counting entirely in AWK without piping to sort and uniq:

awk '{ count[$9]++ } END { for (code in count) print count[code], code }' /var/log/nginx/access.log | sort -rn | head -5

Summing Numbers in a Column

AWK handles arithmetic naturally. This is where it pulls ahead of grep and sed.

# Sum the sizes of all files listed by ls -l
ls -l /var/log/*.log | awk '{ total += $5 } END { print "Total bytes:", total }'

# Example output:
# Total bytes: 4829104

# Average response time from a CSV log (time is in column 4)
awk -F, '{ sum += $4; n++ } END { print "Average:", sum/n, "ms" }' response_times.csv

# Example output:
# Average: 142.37 ms
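Minimum and maximum fall out of the same single pass. This sketch assumes the same hypothetical response_times.csv layout with the time in column 4:

```shell
# Min, max, and average in one pass over the file
awk -F, 'NR == 1 { min = max = $4 }
         { sum += $4; n++
           if ($4 < min) min = $4
           if ($4 > max) max = $4 }
         END { printf "min %s  max %s  avg %.2f ms\n", min, max, sum/n }' response_times.csv
```

Note that both rules fire on the first line: one seeds min and max, the other accumulates the sum.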

Reformatting Output

The printf function in AWK gives you precise control over output formatting.

# Format /etc/passwd as a neat table
awk -F: '{ printf "%-15s %-6s %s\n", $1, $3, $6 }' /etc/passwd | head -5

# Example output:
# root            0      /root
# daemon          1      /usr/sbin
# bin             2      /bin
# sys             3      /dev
# sync            4      /bin

# Show used space from df in MB (df reports 1K blocks, so divide by 1024)
df | awk 'NR > 1 { printf "%-20s %8.1f MB\n", $1, $3/1024 }'

# Example output:
# /dev/sda1              15234.8 MB
# /dev/sda2               8891.2 MB
# tmpfs                    512.0 MB

AWK One-Liners Sysadmins Actually Use

These are commands you can paste directly into a terminal session. Each one solves a specific operational problem.

Disk Usage Analysis

# Show filesystems over 80% usage
df -h | awk 'NR > 1 { gsub(/%/, "", $5); if ($5+0 > 80) print $6, $5"%"}'

# Example output:
# / 84%
# /var 91%

# Find the 10 largest directories under /var
du -sh /var/*/ 2>/dev/null | sort -rh | head -10 | awk '{ printf "%-10s %s\n", $1, $2 }'
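Disks can also run out of inodes while plenty of bytes remain free. The same threshold pattern works against df -i (column positions assume GNU df's default output):

```shell
# Show filesystems over 80% inode usage
df -i | awk 'NR > 1 { gsub(/%/, "", $5); if ($5+0 > 80) print $6, $5"% inodes used" }'
```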

Process Monitoring

# List processes consuming more than 1% CPU
ps aux | awk '$3 > 1.0 { printf "%-10s %-8s %5.1f%% %s\n", $1, $2, $3, $11 }'

# Example output:
# root       1842    3.2% /usr/bin/dockerd
# www        2901    1.8% php-fpm:

# Count processes per user
ps aux | awk 'NR > 1 { count[$1]++ } END { for (user in count) printf "%-15s %d\n", user, count[user] }' | sort -k2 -rn

# Example output:
# root            142
# www              38
# max               7
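The same per-user array pattern works for memory. ps aux reports resident set size (RSS) in kilobytes in column 6, so dividing by 1024 gives megabytes:

```shell
# Total resident memory (RSS) per user, largest first
ps aux | awk 'NR > 1 { rss[$1] += $6 }
              END { for (u in rss) printf "%-15s %8.1f MB\n", u, rss[u]/1024 }' | sort -k2 -rn
```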

Log Analysis

# Top 10 IP addresses hitting your web server
awk '{ print $1 }' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10

# Extract requests that took longer than 2 seconds (assuming response time is the last field)
awk '{ if ($NF > 2.0) print $4, $7, $NF"s" }' /var/log/nginx/access.log | tail -20

# Show failed SSH login attempts with username and source IP
awk '/Failed password/ { print $(NF-5), $(NF-3) }' /var/log/auth.log | sort | uniq -c | sort -rn | head -10

# Example output:
#      47 admin 203.0.113.42
#      23 root 198.51.100.7
#      11 test 192.0.2.15

AWK Scripts: Multi-line Programs with BEGIN and END

One-liners are fine for quick tasks. When the logic grows, write an AWK script file. AWK programs have three sections: a BEGIN block that runs before any input is read, pattern-action rules that run on each line, and an END block that runs after the last line:

# Save this as disk_report.awk
BEGIN {
    printf "%-25s %10s %10s %6s\n", "FILESYSTEM", "USED", "AVAIL", "PCT"
    printf "%-25s %10s %10s %6s\n", "---------", "----", "-----", "---"
}
NR > 1 && /^\// {
    used = $3 / 1048576
    avail = $4 / 1048576
    gsub(/%/, "", $5)
    pct = $5 + 0
    printf "%-25s %8.1f GB %8.1f GB %4d%%\n", $1, used, avail, pct
    total_used += used
    total_avail += avail
}
END {
    printf "\n%-25s %8.1f GB %8.1f GB\n", "TOTAL", total_used, total_avail
}

Run it with:

df | awk -f disk_report.awk

# Example output:
# FILESYSTEM                    USED      AVAIL    PCT
# ---------                     ----      -----    ---
# /dev/sda1                    14.9 GB    23.4 GB   39%
# /dev/sda2                     8.7 GB     1.1 GB   89%
#
# TOTAL                        23.6 GB    24.5 GB

Here is another script that generates a summary report from an Apache/nginx access log:

# Save as access_report.awk
BEGIN {
    print "=== Web Access Report ==="
    print ""
}
{
    ip[$1]++
    status[$9]++
    total++
}
END {
    print "Total requests:", total
    print ""
    print "Top 5 IPs:"
    PROCINFO["sorted_in"] = "@val_num_desc"
    n = 0
    for (addr in ip) {
        if (++n > 5) break
        printf "  %-18s %d requests\n", addr, ip[addr]
    }
    print ""
    print "Status code breakdown:"
    for (code in status) {
        printf "  %s: %d (%.1f%%)\n", code, status[code], (status[code]/total)*100
    }
}

# Run against your access log
awk -f access_report.awk /var/log/nginx/access.log

# Example output:
# === Web Access Report ===
#
# Total requests: 18492
#
# Top 5 IPs:
#   192.168.1.50       4201 requests
#   10.0.0.12          2893 requests
#   203.0.113.42       1547 requests
#   198.51.100.7        891 requests
#   172.16.0.5          743 requests
#
# Status code breakdown:
#   200: 14523 (78.5%)
#   304: 2841 (15.4%)
#   404: 987 (5.3%)
#   500: 38 (0.2%)

Note: the PROCINFO["sorted_in"] variable is specific to GNU AWK (gawk). Many distributions ship gawk as the default awk, but Debian and Ubuntu default to mawk, which does not support it. If your awk lacks the feature, pipe the output through sort -rn instead.
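The portable version of the top-5 loop drops PROCINFO entirely and lets sort and head do the ranking, which works in any POSIX awk:

```shell
# Top 5 IPs without any gawk-specific features
awk '{ ip[$1]++ } END { for (addr in ip) print ip[addr], addr }' /var/log/nginx/access.log |
  sort -rn | head -5
```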

AWK vs sed vs grep: When to Use Which

All three tools process text line by line, but they solve different problems. Picking the right one saves time and avoids overengineered solutions.

Task                                    Best Tool   Why
Find lines matching a pattern           grep        Fastest for simple pattern matching; supports regex and recursive search.
Find and replace text in a stream       sed         Built for substitutions; edits files in place with -i.
Extract specific columns                awk         Understands fields natively; no cut or tr workarounds needed.
Arithmetic on column values             awk         Built-in math; grep and sed have none.
Delete or insert lines by position      sed         Addressing by line number is simpler in sed.
Generate reports from structured data   awk         Variables, arrays, and printf make AWK a reporting language.
Quick count of matching lines           grep -c     Simpler and faster than awk for a single count.

In practice, you often combine them in a pipeline. Grep narrows the input, AWK extracts and computes, and sed handles last-mile formatting. For example:

# Find failed logins, extract source IPs, count occurrences
grep "Failed password" /var/log/auth.log | awk '{ print $(NF-3) }' | sort | uniq -c | sort -rn

If you catch yourself writing long chains of cut, tr, and sed to parse columns, that is a sign AWK would handle the job in a single command.
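As a small illustration of that point, here is a cut-and-tr pipeline next to the single awk command that replaces it, with a filter thrown in for free:

```shell
# cut + tr: pull username and shell from /etc/passwd
cut -d: -f1,7 /etc/passwd | tr ':' ' '

# awk: same extraction in one command, filtered to login shells
awk -F: '$7 ~ /sh$/ { print $1, $7 }' /etc/passwd
```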

Summary

AWK is a column-aware text processing language that fills the gap between simple pattern matching (grep) and stream editing (sed). It handles field extraction, arithmetic, conditional logic, and formatted output in a single tool. The pattern-action model is straightforward once you internalize it: for each line, check the pattern and run the action.

For daily sysadmin work, a handful of AWK one-liners covers disk monitoring, process analysis, and log parsing. When the logic outgrows a one-liner, move it into a script file with BEGIN and END blocks. Scripts that stick to POSIX AWK features are portable across Linux distributions; save gawk-only extras like PROCINFO["sorted_in"] for systems you know run GNU AWK.

The examples in this guide are starting points. Modify the field numbers, patterns, and output formats to match your actual log files and data sources. The best way to get comfortable with AWK is to use it on real data during real tasks, not contrived exercises.
