Text processing is daily work for Linux technicians. You read logs, check CSV exports, count repeated values, and clean input before a script uses it. The core tools for this job are cut, sort, uniq, tr, and wc. They are small commands, but small mistakes can cause wrong reports, missed alerts, or slow batch jobs. This guide explains what each command does and how to chain them safely.
What each command does and where it fits
These commands are part of GNU coreutils on most Linux systems.
cut: extract selected byte positions or delimiter-separated fields.
sort: order lines by text, numbers, or keys.
uniq: remove adjacent duplicate lines, or count them.
tr: translate or delete characters in a stream.
wc: count lines, words, bytes, or characters.
The important detail is that uniq only compares neighboring lines. If duplicates are far apart, run sort first. That one detail explains many "wrong count" tickets.
# Build a small sample file
cat > /tmp/users.txt <<'EOF'
alex
sam
alex
maria
sam
sam
EOF
# Wrong for total duplicate count: uniq sees only adjacent duplicates
uniq -c /tmp/users.txt
# Correct: sort first, then count
sort /tmp/users.txt | uniq -c | sort -nr
For beginners: think in stages. Extract, normalize, order, count. For operators: this staged approach makes pipelines easy to review during incidents.
Extracting fields with cut without corrupting data
cut is fast and simple when data has a stable delimiter. System account data in /etc/passwd is a good example because fields are colon-separated.
# username, UID, shell from /etc/passwd
cut -d: -f1,3,7 /etc/passwd | head
# only usernames for accounts with UID >= 1000 (pair with awk for filter)
awk -F: '$3 >= 1000' /etc/passwd | cut -d: -f1
Common mistake: using cut -d' ' on files with variable spaces or tabs. cut does not treat repeated whitespace as one separator. You can end up with shifted columns and wrong values.
# Logins report with inconsistent spacing
cat > /tmp/login-report.txt <<'EOF'
user=alex   status=ok host=web01
user=sam  status=fail host=web02
EOF
# Fragile parsing with cut on a single-space delimiter
cut -d' ' -f2 /tmp/login-report.txt
# Better: normalize spaces first, then extract
tr -s ' ' < /tmp/login-report.txt | cut -d' ' -f2
Production consequence: a fragile parser can label failed logins as successful if fields shift. If data format is not strict, use awk or a structured parser instead of forcing cut.
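As a minimal sketch of the awk alternative recommended above, the following selects the status field by name rather than by position (the file path and sample data reuse the earlier example):

```shell
# Recreate the sample report with deliberately uneven spacing
cat > /tmp/login-report.txt <<'EOF'
user=alex   status=ok host=web01
user=sam  status=fail host=web02
EOF
# awk splits on any run of blanks by default, so shifted columns cannot occur;
# pick the field that starts with "status=" instead of trusting a position
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^status=/) print substr($i, 8) }' /tmp/login-report.txt
```

This keeps working even if fields are reordered or extra columns appear, which is exactly the failure mode that shifts cut output.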
Cleaning and normalizing text with tr before counting
Real data often contains uppercase and lowercase differences, Windows carriage returns, or punctuation noise. If you count before cleanup, totals are misleading.
# Example list copied from different systems
cat > /tmp/hosts.txt <<'EOF'
WEB-01
web-01
web-02
web-02
web_03
EOF
# Normalize case, remove CR, map underscore to dash
tr '[:upper:]' '[:lower:]' < /tmp/hosts.txt | \
tr -d '\r' | \
tr '_' '-' > /tmp/hosts.normalized.txt
cat /tmp/hosts.normalized.txt
tr is byte-oriented and predictable. It is excellent for simple cleanup steps before analysis jobs. It is not a full text parser, so avoid using it for multi-character patterns that need context.
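One behavior worth demonstrating: tr's two arguments are character sets, not strings, so it substitutes position by position rather than matching a sequence.

```shell
# 'ab' -> 'xy' means a->x and b->y individually, not the string "ab" -> "xy"
printf 'bad\n' | tr 'ab' 'xy'   # prints "yxd"
```

If you need a true string replacement, reach for a tool that matches patterns with context instead.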
Counting frequencies with sort, uniq, and wc
Once data is normalized, frequency analysis is straightforward. A standard pattern is sort | uniq -c | sort -nr.
# Top repeated values
sort /tmp/hosts.normalized.txt | uniq -c | sort -nr
# Number of unique values
sort /tmp/hosts.normalized.txt | uniq | wc -l
# Total lines, words, bytes in one command
wc -l -w -c /tmp/hosts.normalized.txt
If you need only duplicate lines, use uniq -d. If you need only single-occurrence lines, use uniq -u. Keep in mind both still require sorted input for full-file correctness.
# Duplicates only
sort /tmp/hosts.normalized.txt | uniq -d
# Values that appear exactly once
sort /tmp/hosts.normalized.txt | uniq -u
Production consequence: teams often build allowlists or blocklists from counts. If sorting is skipped or locale behavior changes unexpectedly, the generated list can be wrong and affect firewall or access-control automation.
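As a hedged sketch of the safeguard this implies (file names and addresses are illustrative), sort -c can act as the loud failure: it exits non-zero when its input is not ordered under the active collation, so automation stops instead of shipping a bad list.

```shell
# Build a small candidate blocklist (synthetic addresses)
printf '203.0.113.9\n203.0.113.2\n203.0.113.9\n' > /tmp/blocklist.raw
# Deduplicate under an explicit, stable collation
LC_ALL=C sort -u /tmp/blocklist.raw > /tmp/blocklist.txt
# sort -c exits non-zero if the file is out of order; && gates the next step
LC_ALL=C sort -c /tmp/blocklist.txt && echo "blocklist ordering verified"
```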
Practical pipeline: summarize failed SSH users from logs
The following pipeline is realistic for first-pass incident triage. It extracts usernames from failed SSH attempts and counts top offenders.
# Debian/Ubuntu often store auth events in /var/log/auth.log
grep -E 'Failed password for' /var/log/auth.log | \
cut -d' ' -f9 | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq -c | sort -nr | head
# Fedora/RHEL often use /var/log/secure
grep -E 'Failed password for' /var/log/secure | \
cut -d' ' -f9 | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq -c | sort -nr | head
Check one raw line first before trusting field -f9, because log formats can differ by SSH version and PAM settings. In automation, add a validation step so a format change fails loudly instead of silently producing bad data.
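A minimal sketch of such a validation step, using a synthetic log line (real field positions vary by sshd and PAM configuration, which is exactly why the check exists):

```shell
# Synthetic auth.log-style line, for illustration only
line='Jan 10 03:12:44 web01 sshd[912]: Failed password for root from 203.0.113.7 port 4242 ssh2'
field=$(printf '%s\n' "$line" | cut -d' ' -f9)
# Reject empty or non-username-looking fields so a gross format shift fails
# loudly; stricter checks could also reject known non-usernames like "invalid"
case "$field" in
  ''|*[!A-Za-z0-9._-]*) echo "unexpected field: '$field'" >&2; exit 1 ;;
  *) echo "field ok: $field" ;;
esac
```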
Performance and compatibility notes for current distributions
The commands in this article work on Debian 13.3, Ubuntu 24.04.3 LTS, Ubuntu 25.10, Fedora 43, and RHEL 10.1. They are also compatible with RHEL 9.7 for normal shell workflows. Package names are consistent because these tools come from coreutils and related base packages.
- For predictable sorting in scripts, set LC_ALL=C. Locale-aware sorting can reorder characters differently across environments.
- Large sort jobs may spill to disk. Use sort -S to set memory and -T for temp directory placement.
- If you parse logs from mixed OS sources, strip \r with tr -d '\r' before cut or wc.
- On minimal containers, verify coreutils options. BusyBox variants may not support every GNU option used in production scripts.
# Stable collation and controlled sort memory for big files
LC_ALL=C sort -S 50% -T /var/tmp /var/log/app/events.log | uniq -c | sort -nr | head
For operators, these details matter during high-volume jobs. A default sort on a huge file can fill /tmp or run slowly under memory pressure. Planning temp space and locale gives repeatable output.
Summary
cut, sort, uniq, tr, and wc are enough for many day-to-day text workflows. Use cut only when delimiters are stable, clean input with tr, sort before uniq, and verify totals with wc. In production, accuracy comes from small checks: inspect sample lines, normalize data, and keep locale behavior explicit. These habits transfer cleanly across Debian 13.3, Ubuntu 24.04.3 LTS and 25.10, Fedora 43, RHEL 10.1, and RHEL 9.7.