Disk problems usually show up in two ways: the server runs out of space, or the disk hardware starts failing. You need to monitor both. If you only watch free space, you can miss early hardware damage. If you only watch SMART health, you can still crash services when a filesystem hits 100%.
This guide uses three standard tools: df for filesystem capacity, du for directory usage, and smartctl for drive health. These commands are available on Debian 13.3, Ubuntu 24.04.3 LTS and 25.10, Fedora 43, and RHEL 10.1. The same workflow also works on RHEL 9.7.
What each tool tells you
df answers: "How full is each mounted filesystem right now?" It reads filesystem metadata, so it is fast and safe to run often.
du answers: "Which directories and files are using that space?" It walks the tree and can be slow on very large paths, but it gives exact ownership of space usage.
smartctl answers: "Is the physical disk reporting errors or wear?" It reads SMART/NVMe health data from the device controller. This is where you catch failing sectors, media errors, and wear-out trends.
Production consequence: if df shows high usage and du identifies a fast-growing log directory, you can reclaim space before an outage. If smartctl reports pending sectors, you can replace the disk before data corruption spreads.
Monitor capacity with df
Start with filesystem type and inode usage, not just "human readable" sizes.
# Filesystem size and type
sudo df -hT
# Inode usage (small-file workloads fail here first)
sudo df -hi
Example output pattern to focus on:
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda2 xfs 120G 98G 22G 82% /
/dev/sdb1 ext4 500G 451G 24G 95% /var
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sdb1 32M 30M 2M 94% /var
Read this as risk, not just numbers:
- Use% above 80% is warning territory for busy servers.
- Above 90% often causes unstable behavior: failed writes, package manager errors, database stalls, and log rotation problems.
- High inode usage can break systems even when GB space looks free.
A useful daily check on production nodes is:
sudo df -hT | awk 'NR==1 || $6+0 >= 80'
sudo df -hi | awk 'NR==1 || $5+0 >= 80'
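The awk filter above can be wrapped in a small reusable function for cron jobs or ad-hoc checks. A minimal sketch; flag_full is a hypothetical helper name, and the demo feeds captured sample output instead of running df:

```shell
# flag_full (hypothetical name): print df -hT rows whose Use% is at or
# above a threshold; the threshold is $1 and defaults to 80.
flag_full() {
  awk -v t="${1:-80}" 'NR==1 || $6+0 >= t'
}

# Demo on captured output; on a real host run: sudo df -hT | flag_full 80
flag_full 80 <<'EOF'
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda2 xfs 120G 98G 22G 82% /
/dev/sdb1 ext4 500G 451G 24G 95% /var
/dev/sdc1 ext4 100G 10G 90G 10% /data
EOF
```

Only the header and the two filesystems at or above the threshold are printed; the 10% /data row is dropped. The `$6+0` trick coerces "95%" to the number 95 before comparing.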
Use du to find where the space went
Once df shows a full filesystem, switch to du on that mount point. Add -x so the scan stays on one filesystem and does not cross into mounted volumes.
# Top-level usage under /var, one level deep
sudo du -xhd1 /var | sort -h
# Drill down into a suspicious directory
sudo du -xhd1 /var/log | sort -h
# Largest files in /var/log
sudo find /var/log -xdev -type f -printf '%s %p\n' | sort -n | tail -20
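The raw byte counts from find sort correctly but are hard to read at a glance. Coreutils numfmt can render the first field as human-readable sizes; the printf below stands in for real find output (the paths are made up for the demo):

```shell
# Render the first field (bytes) as human-readable IEC sizes.
# On a real host:
#   sudo find /var/log -xdev -type f -printf '%s %p\n' \
#     | sort -n | tail -20 | numfmt --to=iec --field=1
printf '%s\n' \
  '524288 /var/log/syslog.1' \
  '1048576 /var/log/syslog' \
  | sort -n | numfmt --to=iec --field=1
```

Sort on the numeric bytes first, then convert: sorting after conversion would order "512K" and "1.0M" lexically and give wrong results.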
If df says space is full but du totals look smaller than expected, check for deleted files still held open by running processes:
sudo lsof +L1
This is common with old log files after manual deletion. Restarting the service or rotating logs cleanly releases that space.
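The effect is easy to reproduce safely in a scratch directory: create a file, keep it open from a background process, delete it, and observe that the kernel still tracks the open descriptor. A minimal sketch (Linux /proc assumed):

```shell
# Safe demo of a deleted-but-open file.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1M count=5 status=none  # 5 MiB of data
tail -f "$tmp" >/dev/null 2>&1 &                     # holder keeps an fd open
holder=$!
sleep 1                                              # let tail open the file
rm "$tmp"                                            # name is gone, space is not
deleted=$(ls -l "/proc/$holder/fd" | grep -c deleted)
echo "fds pointing at deleted files: $deleted"
kill "$holder"
wait "$holder" 2>/dev/null || true
```

Until the holder exits (or truncates the file), du never sees those blocks because the directory entry is gone, yet df still counts them as used.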
Avoid quick fixes like deleting random files under /var/lib or database directories. That can break package metadata or destroy application data. Use service-specific cleanup methods.
Check hardware health with smartctl
Install smartmontools first:
# Debian 13.3, Ubuntu 24.04.3 LTS, Ubuntu 25.10
sudo apt update
sudo apt install smartmontools
# Fedora 43, RHEL 10.1, RHEL 9.7
sudo dnf install smartmontools
Then query device identity and health:
# SATA/SAS example
sudo smartctl -i /dev/sda
sudo smartctl -H /dev/sda
sudo smartctl -A /dev/sda
# NVMe example
sudo smartctl -a /dev/nvme0
Important fields and what they mean:
- SMART overall-health self-assessment test result: quick pass/fail, but do not stop here.
- Reallocated_Sector_Ct: non-zero means sectors were already remapped; a rising trend is bad.
- Current_Pending_Sector: sectors waiting for re-test/remap; non-zero is high risk.
- Offline_Uncorrectable: read errors that could not be corrected.
- NVMe media_errors and critical_warning: direct signals of controller/media trouble.
- NVMe percentage_used: wear indicator for SSD life consumption.
Run a self-test during a low-traffic window:
# Short test (usually a few minutes)
sudo smartctl -t short /dev/sda
# Later, read results
sudo smartctl -l selftest /dev/sda
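For scripting, smartctl also encodes problems in its exit status as a bitmask, documented in smartctl(8); this is more robust than grepping text output. A sketch that decodes the most important bits; decode_smart_rc is a hypothetical helper, and the call at the end simulates a failing-disk status rather than querying real hardware:

```shell
# Decode the smartctl(8) exit-status bitmask. Real use:
#   sudo smartctl -H /dev/sda; decode_smart_rc $?
decode_smart_rc() {
  local rc=$1
  if [ "$rc" -eq 0 ]; then
    echo "smartctl reported no problems"
    return 0
  fi
  if (( rc & 1 ));  then echo "bit 0: command line did not parse"; fi
  if (( rc & 2 ));  then echo "bit 1: device open failed"; fi
  if (( rc & 8 ));  then echo "bit 3: SMART status check returned DISK FAILING"; fi
  if (( rc & 16 )); then echo "bit 4: prefail attributes at or below threshold"; fi
  return 0
}

decode_smart_rc 8   # simulated failing-disk status for the demo
```

Because the status is a bitmask, several bits can be set at once, so the function prints every matching condition rather than stopping at the first.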
Production consequence: SMART values that slowly get worse can still pass simple health checks. Trend matters more than one snapshot.
Build a simple daily check
For entry-level operations, a small script and cron/systemd timer is enough to prevent many outages.
#!/usr/bin/env bash
set -euo pipefail
echo "=== $(date '+%F %T') ==="
df -hT | awk 'NR==1 || $6+0 >= 85'
df -hi | awk 'NR==1 || $5+0 >= 85'
# Adjust devices for your host
for dev in /dev/sda /dev/nvme0; do
if [ -b "$dev" ]; then
echo "--- SMART $dev ---"
smartctl -H "$dev" || true
# ATA attributes use underscore names; NVMe -A output uses the spaced names
smartctl -A "$dev" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Critical Warning|Media and Data Integrity Errors|Percentage Used' || true
fi
done
Save this as /usr/local/sbin/disk-check.sh, make it executable, and schedule it. Send output to central logging or mail. On RHEL 10.1 and RHEL 9.7, SELinux policy may require standard system locations and proper contexts if you integrate with custom service units.
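As one scheduling option, a cron drop-in file can run the script daily as root. A sketch; the file path, time, and mail alias are placeholders to adjust for your site:

```shell
# /etc/cron.d/disk-check (hypothetical file): run daily at 06:15 as root.
# Uncomment MAILTO to have cron mail the output to an operations alias.
# MAILTO=ops@example.com
15 6 * * * root /usr/local/sbin/disk-check.sh
```

Files under /etc/cron.d take a user field (root here) between the schedule and the command, unlike a personal crontab.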
Compatibility notes by distribution
- Debian 13.3 and Ubuntu 24.04.3 LTS/25.10: df and du come from coreutils; smartctl comes from the smartmontools package via APT.
- Fedora 43: same command syntax; install with dnf install smartmontools.
- RHEL 10.1 and RHEL 9.7: same workflow; ensure the required repositories are enabled for smartmontools.
- Cloud VMs: SMART may be hidden by virtualized storage. If smartctl cannot access device data, rely on provider disk-health telemetry plus guest-level df/du monitoring.
Conclusion
Use df to see pressure, du to locate the cause, and smartctl to catch hardware decline. This combination gives early warning for both capacity incidents and disk failure. For production reliability, monitor trends daily, set clear warning thresholds, and treat repeated SMART errors as replacement signals, not background noise.