Text processing is daily work for Linux technicians. You read logs, check CSV exports, count repeated values, and clean input before a script uses it. The core tools for this job are cut, sort, uniq, tr, and wc. They are small commands, but small mistakes can cause wrong reports, missed alerts, or slow batch jobs. This guide explains what each command does and how to chain them safely.
What each command does and where it fits
These commands are part of GNU coreutils on most Linux systems.
cut: extract selected byte positions or delimiter-separated fields.
sort: order lines by text, numbers, or keys.
uniq: remove adjacent duplicate lines, or count them.
tr: translate or delete characters in a stream.
wc: count lines, words, bytes, or characters.
The important detail is that uniq only compares neighboring lines. If duplicates are far apart, run sort first. That one detail explains many "wrong count" tickets.
# Build a small sample file
cat > /tmp/users.txt <<'EOF'
alex
sam
alex
maria
sam
sam
EOF
# Wrong for total duplicate count: uniq sees only adjacent duplicates
uniq -c /tmp/users.txt
# Correct: sort first, then count
sort /tmp/users.txt | uniq -c | sort -nr
For beginners: think in stages. Extract, normalize, order, count. For operators: this staged approach makes pipelines easy to review during incidents.
Extracting fields with cut without corrupting data
cut is fast and simple when data has a stable delimiter. System account data in /etc/passwd is a good example because fields are colon-separated.
# username, UID, shell from /etc/passwd
cut -d: -f1,3,7 /etc/passwd | head
# only usernames for accounts with UID >= 1000 (pair with awk for filter)
awk -F: '$3 >= 1000' /etc/passwd | cut -d: -f1
Common mistake: using cut -d' ' on files with variable spaces or tabs. cut does not treat repeated whitespace as one separator. You can end up with shifted columns and wrong values.
# Logins report with inconsistent spacing
cat > /tmp/login-report.txt <<'EOF'
user=alex   status=ok host=web01
user=sam  status=fail host=web02
EOF
# Fragile parsing with cut on a single-space delimiter
cut -d' ' -f2 /tmp/login-report.txt
# Better: normalize spaces first, then extract
tr -s ' ' < /tmp/login-report.txt | cut -d' ' -f2
Production consequence: a fragile parser can label failed logins as successful if fields shift. If data format is not strict, use awk or a structured parser instead of forcing cut.
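As a minimal sketch of the awk alternative recommended above, the following selects the status field by name rather than by position (the file path and sample data reuse the earlier example):

```shell
# Recreate the sample report with deliberately uneven spacing
cat > /tmp/login-report.txt <<'EOF'
user=alex   status=ok host=web01
user=sam  status=fail host=web02
EOF
# awk splits on any run of blanks by default, so shifted columns cannot occur;
# pick the field that starts with "status=" instead of trusting a position
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^status=/) print substr($i, 8) }' /tmp/login-report.txt
```

This keeps working even if fields are reordered or extra columns appear, which is exactly the failure mode that shifts cut output.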
Cleaning and normalizing text with tr before counting
Real data often contains uppercase and lowercase differences, Windows carriage returns, or punctuation noise. If you count before cleanup, totals are misleading.
# Example list copied from different systems
cat > /tmp/hosts.txt <<'EOF'
WEB-01
web-01
web-02
web-02
web_03
EOF
# Normalize case, remove CR, map underscore to dash
tr '[:upper:]' '[:lower:]' < /tmp/hosts.txt | \
tr -d '\r' | \
tr '_' '-' > /tmp/hosts.normalized.txt
cat /tmp/hosts.normalized.txt
tr is byte-oriented and predictable. It is excellent for simple cleanup steps before analysis jobs. It is not a full text parser, so avoid using it for multi-character patterns that need context.
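One behavior worth demonstrating: tr's two arguments are character sets, not strings, so it substitutes position by position rather than matching a sequence.

```shell
# 'ab' -> 'xy' means a->x and b->y individually, not the string "ab" -> "xy"
printf 'bad\n' | tr 'ab' 'xy'   # prints "yxd"
```

If you need a true string replacement, reach for a tool that matches patterns with context instead.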
Counting frequencies with sort, uniq, and wc
Once data is normalized, frequency analysis is straightforward. A standard pattern is sort | uniq -c | sort -nr.
# Top repeated values
sort /tmp/hosts.normalized.txt | uniq -c | sort -nr
# Number of unique values
sort /tmp/hosts.normalized.txt | uniq | wc -l
# Total lines, words, bytes in one command
wc -l -w -c /tmp/hosts.normalized.txt
If you need only duplicate lines, use uniq -d. If you need only single-occurrence lines, use uniq -u. Keep in mind both still require sorted input for full-file correctness.
# Duplicates only
sort /tmp/hosts.normalized.txt | uniq -d
# Values that appear exactly once
sort /tmp/hosts.normalized.txt | uniq -u
Production consequence: teams often build allowlists or blocklists from counts. If sorting is skipped or locale behavior changes unexpectedly, the generated list can be wrong and affect firewall or access-control automation.
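As a hedged sketch of the safeguard this implies (file names and addresses are illustrative), sort -c can act as the loud failure: it exits non-zero when its input is not ordered under the active collation, so automation stops instead of shipping a bad list.

```shell
# Build a small candidate blocklist (synthetic addresses)
printf '203.0.113.9\n203.0.113.2\n203.0.113.9\n' > /tmp/blocklist.raw
# Deduplicate under an explicit, stable collation
LC_ALL=C sort -u /tmp/blocklist.raw > /tmp/blocklist.txt
# sort -c exits non-zero if the file is out of order; && gates the next step
LC_ALL=C sort -c /tmp/blocklist.txt && echo "blocklist ordering verified"
```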
Practical pipeline: summarize failed SSH users from logs
The following pipeline is realistic for first-pass incident triage. It extracts usernames from failed SSH attempts and counts top offenders.
# Debian/Ubuntu often store auth events in /var/log/auth.log
grep -E 'Failed password for' /var/log/auth.log | \
cut -d' ' -f9 | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq -c | sort -nr | head
# Fedora/RHEL often use /var/log/secure
grep -E 'Failed password for' /var/log/secure | \
cut -d' ' -f9 | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq -c | sort -nr | head
Check one raw line first before trusting field -f9, because log formats can differ by SSH version and PAM settings. In automation, add a validation step so a format change fails loudly instead of silently producing bad data.
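A minimal sketch of such a validation step, using a synthetic log line (real field positions vary by sshd and PAM configuration, which is exactly why the check exists):

```shell
# Synthetic auth.log-style line, for illustration only
line='Jan 10 03:12:44 web01 sshd[912]: Failed password for root from 203.0.113.7 port 4242 ssh2'
field=$(printf '%s\n' "$line" | cut -d' ' -f9)
# Reject empty or non-username-looking fields so a gross format shift fails
# loudly; stricter checks could also reject known non-usernames like "invalid"
case "$field" in
  ''|*[!A-Za-z0-9._-]*) echo "unexpected field: '$field'" >&2; exit 1 ;;
  *) echo "field ok: $field" ;;
esac
```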
Performance and compatibility notes for current distributions
The commands in this article work on Debian 13.3, Ubuntu 24.04.3 LTS, Ubuntu 25.10, Fedora 43, and RHEL 10.1. They are also compatible with RHEL 9.7 for normal shell workflows. Package names are consistent because these tools come from coreutils and related base packages.
- For predictable sorting in scripts, set LC_ALL=C. Locale-aware sorting can reorder characters differently across environments.
- Large sort jobs may spill to disk. Use sort -S to set memory and -T for temp directory placement.
- If you parse logs from mixed OS sources, strip \r with tr -d '\r' before cut or wc.
- On minimal containers, verify coreutils options. BusyBox variants may not support every GNU option used in production scripts.
# Stable collation and controlled sort memory for big files
LC_ALL=C sort -S 50% -T /var/tmp /var/log/app/events.log | uniq -c | sort -nr | head
For operators, these details matter during high-volume jobs. A default sort on a huge file can fill /tmp or run slowly under memory pressure. Planning temp space and locale gives repeatable output.
Summary
cut, sort, uniq, tr, and wc are enough for many day-to-day text workflows. Use cut only when delimiters are stable, clean input with tr, sort before uniq, and verify totals with wc. In production, accuracy comes from small checks: inspect sample lines, normalize data, and keep locale behavior explicit. These habits transfer cleanly across Debian 13.3, Ubuntu 24.04.3 LTS and 25.10, Fedora 43, RHEL 10.1, and RHEL 9.7.