Shell scripts fail in quiet and expensive ways when error handling is missing. A script can copy half a backup, skip one command, and still exit with status 0 if the wrong pattern is used. For entry-level technicians, the goal is simple: when something breaks, the script should stop early, print a useful message, and leave the system in a clean state.
This guide focuses on Bash because it is available by default on Debian 13.3, Ubuntu 24.04.3 LTS, Ubuntu 25.10, Fedora 43, RHEL 10.1, and RHEL 9.7. The details below are practical patterns you can move into daily operations.
## why error handling matters in shell scripts
In production, shell scripts often run as cron jobs, deployment hooks, backup tasks, and health checks. A bad script usually does one of these:
- Continues after a failed command and writes incomplete data.
- Hides failures because output goes to `/dev/null`.
- Runs two copies at once and creates race conditions.
- Deletes temporary files from the wrong path when a variable is empty.
For beginners, this means longer troubleshooting time and unclear logs. For operators, this means downtime, corrupted files, and noisy on-call alerts. Good error handling does not make scripts fancy. It makes them predictable.
## start with safe defaults
Use a strict Bash header in most operational scripts:
```bash
#!/usr/bin/env bash
set -Eeuo pipefail
IFS=$'\n\t'
trap 'rc=$?; echo "ERROR line $LINENO: $BASH_COMMAND (exit $rc)" >&2' ERR

backup_src="/srv/app/data"
backup_dst="/srv/backups/data.tar.gz"

tar -czf "$backup_dst" "$backup_src"
echo "Backup completed: $backup_dst"
```
What each option does:

- `-e`: stop on unhandled command errors.
- `-u`: fail when using an unset variable.
- `pipefail`: if a pipeline fails anywhere, the whole pipeline fails.
- `-E` with `trap ... ERR`: keeps the error trap active in functions and subshells.
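The effect of `pipefail` is easy to demonstrate from the command line:

```shell
# Without pipefail, a pipeline's exit status is the status of the LAST
# command, so an earlier failure is silently ignored.
bash -c 'false | true; echo "exit=$?"'                    # prints exit=0

# With pipefail, any failing stage makes the whole pipeline fail.
bash -c 'set -o pipefail; false | true; echo "exit=$?"'   # prints exit=1
```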
Do not assume `set -e` catches every case. A command that fails inside an `if` test, on the left side of `&&` or `||`, or in some subshell contexts does not abort the script. Test the script with real failure scenarios.
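These exceptions can be seen in a short sketch (the `check` function is a hypothetical helper, not part of any of the scripts above):

```shell
#!/usr/bin/env bash
set -e

check() { false; return 0; }   # hypothetical helper with an internal failure

# Inside an 'if' condition, set -e is suspended: the 'false' above
# does not abort the script, and check still returns 0.
if check; then
  echo "check passed"
fi

# A failure on the left of || is also exempt from set -e.
false || echo "handled explicitly"

echo "script reached the end despite two failed commands"
```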
Also, quote variables unless you need word splitting. Unquoted variables are a common source of data loss when paths contain spaces or wildcards.
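A quick illustration of the word-splitting risk, using a hypothetical path with a space (assuming the default IFS, not the strict-header IFS above):

```shell
path="backup dir/report.txt"   # hypothetical path containing a space

set -- $path                   # unquoted: word splitting yields 2 arguments
echo "$#"                      # prints 2

set -- "$path"                 # quoted: the path stays one argument
echo "$#"                      # prints 1
```

With `rm` or `cp`, those two stray arguments are how files in the wrong path get touched.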
## handle expected failures yourself
Some failures are expected, such as temporary network errors. Handle them explicitly instead of hoping strict mode solves everything.
```bash
#!/usr/bin/env bash
set -Eeuo pipefail

retry() {
    local attempts=$1
    local delay=$2
    shift 2
    local n=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "Command failed after $attempts attempts: $*" >&2
            return 1
        fi
        echo "Attempt $n failed. Retrying in $delay seconds..." >&2
        sleep "$delay"
        n=$((n + 1))
    done
}

url="https://repo.example.internal/health"
retry 4 3 curl --fail --silent --show-error "$url" > /tmp/repo-health.json
```
This pattern is useful for package mirror checks and internal API calls. It avoids false alarms during short outages but still fails hard after the retry limit.
When you must allow one command to fail, do it in a visible way:
```bash
if ! rm -f /tmp/old-report.lock; then
    echo "Could not remove stale lock file" >&2
fi
```
That is safer than adding `|| true` everywhere, which can hide real bugs.
## debugging methods that work under pressure
When a script fails in production, use layered debugging. Start with syntax checks, then trace execution.
```bash
# 1) Syntax check only (no execution)
bash -n deploy.sh

# 2) Full command trace
bash -x deploy.sh 2> /tmp/deploy.trace

# 3) Add line-number tracing inside the script
export PS4='+${BASH_SOURCE}:${LINENO}:${FUNCNAME[0]}: '
exec 3> /tmp/script-xtrace.log
export BASH_XTRACEFD=3
set -x
# sensitive commands should be outside traced blocks
set +x
```
`bash -n` catches missing `fi`, `done`, or quoting errors fast. `bash -x` shows exactly which command failed and with which expanded values. In incident response, this can reduce recovery time from hours to minutes.
Install and run ShellCheck in CI when possible. It catches many quoting, variable, and control-flow mistakes before deployment.
## cleanup and lock files prevent repeat incidents
Two production safeguards are often skipped: cleanup traps and file locks. They prevent stale temp files and duplicate concurrent runs.
```bash
#!/usr/bin/env bash
set -Eeuo pipefail

lock_file="/var/lock/inventory-sync.lock"
exec 9>"$lock_file"
if ! flock -n 9; then
    echo "inventory-sync is already running" >&2
    exit 1
fi

tmp_dir="$(mktemp -d /tmp/inventory-sync.XXXXXX)"
cleanup() {
    rm -rf "$tmp_dir"
}
trap cleanup EXIT

# Simulated workload
cp /srv/inventory/current.csv "$tmp_dir/"
awk -F, 'NR > 1 {print $1","$3}' "$tmp_dir/current.csv" > "$tmp_dir/summary.csv"
cp "$tmp_dir/summary.csv" /srv/inventory/summary.csv
```
If this script crashes, the `EXIT` trap still removes temporary files. If cron starts a second copy, `flock -n` exits early instead of creating write conflicts.
## compatibility notes for Debian, Ubuntu, Fedora, and RHEL
| Distribution | Error-handling and debugging notes |
|---|---|
| Debian 13.3 | `/bin/sh` points to dash. If your script needs `pipefail`, arrays, or `[[ ]]`, use a Bash shebang explicitly. |
| Ubuntu 24.04.3 LTS / 25.10 | Same dash behavior for `/bin/sh`. Cron scripts copied from Bash tutorials fail if they are started with `sh script.sh` by mistake. |
| Fedora 43 | `/bin/sh` is Bash in POSIX mode. Bash scripts are fine with `#!/usr/bin/env bash`. ShellCheck is available through `dnf install ShellCheck`. |
| RHEL 10.1 / RHEL 9.7 | Behavior is close between 10.1 and 9.7 for Bash operational scripts. Validate repo availability for ShellCheck in your environment; many teams provide it through internal mirrors. |
The key compatibility rule is simple: if you wrote Bash, run Bash. Do not rely on /bin/sh behavior being the same everywhere.
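If you still need a runtime sanity check, a subshell probe is one hedged way to do it; note that recent dash versions do accept `set -o pipefail`, so a pass means the option exists, not that every Bash feature is available:

```shell
# Probe whether the current shell accepts 'set -o pipefail'.
# Bash always does; /bin/sh (dash on Debian/Ubuntu) may not,
# depending on the dash version.
if (set -o pipefail) 2>/dev/null; then
  echo "pipefail supported by this shell"
else
  echo "pipefail NOT supported -- use a bash shebang" >&2
  exit 1
fi
```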
## summary
Reliable shell scripting is mostly disciplined basics: strict mode, explicit handling for expected failures, useful tracing, cleanup traps, and lock files. Beginners get faster troubleshooting because scripts fail loudly and at the right line. Operators get safer production runs with fewer partial updates and duplicate jobs. If you keep these patterns in every new script, debugging becomes routine instead of emergency work.