
Disk health and capacity monitoring with df, du, and smartctl

Maximilian B.

Disk problems usually show up in two ways: the server runs out of space, or the disk hardware starts failing. You need to monitor both. If you only watch free space, you can miss early hardware damage. If you only watch SMART health, you can still crash services when a filesystem hits 100%.

This guide uses three standard tools: df for filesystem capacity, du for directory usage, and smartctl for drive health. These commands are available on Debian 13.3, Ubuntu 24.04.3 LTS and 25.10, Fedora 43, and RHEL 10.1. The same workflow also works on RHEL 9.7.

What each tool tells you

df answers: "How full is each mounted filesystem right now?" It reads filesystem metadata, so it is fast and safe to run often.

du answers: "Which directories and files are using that space?" It walks the directory tree, so it can be slow on very large paths, but it shows exactly which paths hold the space.

smartctl answers: "Is the physical disk reporting errors or wear?" It reads SMART/NVMe health data from the device controller. This is where you catch failing sectors, media errors, and wear-out trends.

Production consequence: if df shows high usage and du identifies a fast-growing log directory, you can free space before an outage. If smartctl reports pending sectors, you can replace the disk before more data becomes unreadable.

Monitor capacity with df

Start with filesystem type and inode usage, not just "human readable" sizes.

# Filesystem size and type
sudo df -hT

# Inode usage (small-file workloads fail here first)
sudo df -hi

Example output pattern to focus on:

Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/sda2      xfs   120G   98G   22G  82% /
/dev/sdb1      ext4  500G  451G   24G  95% /var

Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdb1         32M   30M   2M   94% /var

Read this as risk, not just numbers:

  • Use% above 80% is warning territory for busy servers.
  • Above 90% often causes unstable behavior: failed writes, package manager errors, database stalls, and log rotation problems.
  • High inode usage can break systems even when GB space looks free.

A useful daily check on production nodes is:

sudo df -hT | awk 'NR==1 || $6+0 >= 80'
sudo df -hi | awk 'NR==1 || $5+0 >= 80'
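
The same check can be wrapped in a small function that takes the threshold as an argument. This is a minimal sketch, not part of the original workflow: the function name filter_df_over is hypothetical, and it assumes POSIX-format input from df -P (or df -Pi), where column 5 is the percentage and column 6 is the mount point. Mount points that contain spaces would be truncated at the first space.

```shell
#!/usr/bin/env bash
# Print mount point and usage for every filesystem at or above a threshold.
# Reads `df -P` or `df -Pi` output on stdin; in both layouts column 5 is the
# percentage and column 6 is the mount point.
filter_df_over() {
  local threshold="$1"
  awk -v t="$threshold" '
    NR == 1 { next }                 # skip the header row
    {
      pct = $5
      sub(/%$/, "", pct)             # "82%" -> "82"; "-" stays "-" and counts as 0
      if (pct + 0 >= t) print $6, pct "%"
    }'
}
```

Typical use would be df -P | filter_df_over 80 for capacity and df -Pi | filter_df_over 80 for inodes. An empty result means nothing crossed the threshold.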

Use du to find where the space went

Once df shows a full filesystem, switch to du on that mount point. Add -x so the scan stays on one filesystem and does not cross into mounted volumes.

# Top-level usage under /var, one level deep
sudo du -xhd1 /var | sort -h

# Drill down into a suspicious directory
sudo du -xhd1 /var/log | sort -h

# Largest files in /var/log
sudo find /var/log -xdev -type f -printf '%s %p\n' | sort -n | tail -20
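
To see what is growing rather than only what is large, two du snapshots taken at different times can be compared. The sketch below is an illustration under assumptions: du_growth is a hypothetical helper, and each snapshot file is assumed to hold "SIZE_KB<TAB>PATH" lines as written by du -xk -d1.

```shell
#!/usr/bin/env bash
# Compare two du snapshots and list paths that grew, largest growth first.
# Each snapshot is assumed to be "SIZE_KB<TAB>PATH" lines from: du -xk -d1 DIR
du_growth() {
  local old_snapshot="$1" new_snapshot="$2"
  awk -F '\t' '
    FILENAME == ARGV[1] { old[$2] = $1; next }   # remember earlier sizes
    {
      delta = $1 - (($2 in old) ? old[$2] : 0)   # new paths count fully
      if (delta > 0) print delta "\t" $2
    }' "$old_snapshot" "$new_snapshot" | sort -rn
}
```

For example, save sudo du -xk -d1 /var to a dated file each day and run du_growth on yesterday's and today's files. The output is growth in KiB per path.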

If df says space is full but du totals look smaller than expected, check for deleted files still held open by running processes:

sudo lsof +L1

This is common with old log files after manual deletion. Restarting the service or rotating logs cleanly releases that space.
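
When many deleted files are held open, it helps to total them per process before deciding what to restart. The following is a rough sketch with a hypothetical function name, deleted_space_by_pid. It assumes the default lsof column order (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME) and treats SIZE/OFF as a size in bytes for regular files, so verify both against your lsof version.

```shell
#!/usr/bin/env bash
# Sum the space held by deleted-but-open regular files, per process.
# Reads `lsof +L1` output on stdin and assumes the default column order:
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
deleted_space_by_pid() {
  awk '
    NR == 1 { next }                       # skip the header row
    $5 == "REG" { held[$1 " " $2] += $7 }  # key: "COMMAND PID", value: bytes
    END { for (k in held) printf "%s %.0f\n", k, held[k] }
  ' | sort -k3,3nr
}
```

Run it as sudo lsof +L1 | deleted_space_by_pid. The process at the top of the list is holding the most unreleased space.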

Avoid quick fixes like deleting random files under /var/lib or database directories. That can break package metadata or destroy application data. Use service-specific cleanup methods.

Check hardware health with smartctl

Install smartmontools first:

# Debian 13.3, Ubuntu 24.04.3 LTS, Ubuntu 25.10
sudo apt update
sudo apt install smartmontools

# Fedora 43, RHEL 10.1, RHEL 9.7
sudo dnf install smartmontools

Then query device identity and health:

# SATA/SAS example
sudo smartctl -i /dev/sda
sudo smartctl -H /dev/sda
sudo smartctl -A /dev/sda

# NVMe example
sudo smartctl -a /dev/nvme0

Important fields and what they mean:

  • SMART overall-health self-assessment test result: quick pass/fail, but do not stop here.
  • Reallocated_Sector_Ct: non-zero means sectors were already remapped; rising trend is bad.
  • Current_Pending_Sector: sectors waiting for re-test/remap; non-zero is high risk.
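  • Offline_Uncorrectable: sectors that could not be read or corrected during offline scans; non-zero means data on those sectors is at risk.
  • NVMe media_errors and critical_warning (shown as "Media and Data Integrity Errors" and "Critical Warning" in smartctl's text output): direct signal of controller/media trouble.
  • NVMe percentage_used (shown as "Percentage Used" in text output): wear indicator for SSD life consumption.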
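
For SATA/SAS drives, the three sector counters above can be pulled out of the attribute table automatically. This is a minimal sketch with a hypothetical function name, smart_nonzero_attrs. It assumes the usual ATA table layout from smartctl -A, where column 2 is ATTRIBUTE_NAME and column 10 is RAW_VALUE, and it does not handle NVMe output.

```shell
#!/usr/bin/env bash
# Print ATA SMART attributes from the watch list whose raw value is non-zero.
# Reads `smartctl -A` output on stdin; in the attribute table, column 2 is
# ATTRIBUTE_NAME and column 10 is RAW_VALUE.
smart_nonzero_attrs() {
  awk '
    $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
    $10 + 0 > 0 { print $2 "=" $10 }'
}
```

Use it as sudo smartctl -A /dev/sda | smart_nonzero_attrs. No output means all three counters are zero, or that the drive does not report them under these names.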

Run a self-test during a low-traffic window:

# Short test (usually a few minutes)
sudo smartctl -t short /dev/sda

# The test runs in the background; read the results once it has finished
sudo smartctl -l selftest /dev/sda

Production consequence: SMART values that slowly get worse can still pass simple health checks. Trend matters more than one snapshot.
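
One way to watch the trend is to keep the previous day's counters and report only the ones that rose. The sketch below is illustrative: smart_increases is a hypothetical helper, and both input files are assumed to contain NAME=VALUE lines (for example Reallocated_Sector_Ct=8) that you have extracted from smartctl yourself.

```shell
#!/usr/bin/env bash
# Report SMART counters that increased between two saved snapshots.
# Both files are assumed to hold NAME=VALUE lines, one counter per line.
smart_increases() {
  local previous="$1" current="$2"
  awk -F '=' '
    FILENAME == ARGV[1] { prev[$1] = $2; next }   # load the earlier values
    ($1 in prev) && $2 + 0 > prev[$1] + 0 { print $1 ": " prev[$1] " -> " $2 }
  ' "$previous" "$current"
}
```

A drive can still pass smartctl -H while this reports Reallocated_Sector_Ct climbing week after week, which is the pattern to act on.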

Build a simple daily check

For entry-level operations, a small script run from cron or a systemd timer is enough to prevent many outages.

#!/usr/bin/env bash
set -euo pipefail

echo "=== $(date '+%F %T') ==="
df -hT | awk 'NR==1 || $6+0 >= 85'
df -hi | awk 'NR==1 || $5+0 >= 85'

# Adjust devices for your host
for dev in /dev/sda /dev/nvme0; do
  if [ -b "$dev" ]; then
    echo "--- SMART $dev ---"
    smartctl -H "$dev" || true
    # NVMe fields are matched by the labels used in smartctl's text output
    smartctl -A "$dev" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Media and Data Integrity Errors|Critical Warning|Percentage Used' || true
  fi
done

Save this as /usr/local/sbin/disk-check.sh, make it executable, and schedule it to run as root, because smartctl needs direct access to the devices. Send output to central logging or mail. On RHEL 10.1 and RHEL 9.7, where SELinux is enforcing by default, keep the script in a standard system location and make sure it has the correct file context if you run it from a custom service unit.
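
For the cron route, one possible schedule is a cron.d-style entry that runs the script every morning and forwards its output to syslog. This is a sketch under assumptions: the helper name write_disk_check_cron, the 06:15 run time, and the disk-check syslog tag are all illustrative choices, not requirements.

```shell
#!/usr/bin/env bash
# Write a cron.d-style entry that runs the check daily at 06:15 as root and
# forwards its output to syslog under the tag "disk-check".
write_disk_check_cron() {
  local target="$1"
  cat > "$target" <<'EOF'
# m h dom mon dow user command
15 6 * * * root /usr/local/sbin/disk-check.sh 2>&1 | logger -t disk-check
EOF
  chmod 0644 "$target"
}
```

Called as root with a target such as /etc/cron.d/disk-check (a hypothetical file name), this installs the schedule. Cron generally ignores files in that directory whose names contain dots, so avoid a .sh or .cron suffix.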

Compatibility notes by distribution

  • Debian 13.3 and Ubuntu 24.04.3 LTS/25.10: df and du come from coreutils; smartctl from smartmontools package via APT.
  • Fedora 43: same command syntax; package install is dnf install smartmontools.
  • RHEL 10.1 and RHEL 9.7: same workflow; ensure required repositories are enabled for smartmontools.
  • Cloud VMs: SMART may be hidden by virtualized storage. If smartctl cannot access device data, rely on provider disk-health telemetry plus guest-level df/du monitoring.

Conclusion

Use df to see pressure, du to locate the cause, and smartctl to catch hardware decline. This combination gives early warning for both capacity incidents and disk failure. For production reliability, monitor trends daily, set clear warning thresholds, and treat repeated SMART errors as replacement signals, not background noise.
