GPU Monitoring for AI Workloads on Linux: Tools, Dashboards, and Alerts

Maximilian B.

When you are running inference servers, fine-tuning jobs, or batch processing on GPU hardware, the question is never whether something will go wrong — it is when you will find out about it. A model silently falling back to CPU because of a driver glitch, a GPU overheating under sustained load, or VRAM slowly leaking until an out-of-memory crash kills your production inference server at 3 AM — these are the scenarios that a proper Linux GPU monitoring setup prevents. This guide covers every practical tool and method for monitoring NVIDIA and AMD GPUs on Linux, from quick terminal checks to full Prometheus and Grafana dashboards with automated alerting.

We will start with the tools you already have (nvidia-smi), move through the interactive monitors that make debugging enjoyable (nvtop, gpustat), and then build out the production monitoring stack that serious AI deployments need. If you are running GPUs in Docker containers, there is a dedicated section on monitoring containerized GPU workloads. And for the growing number of shops running AMD hardware, we cover rocm-smi and the ROCm monitoring ecosystem as well.

nvidia-smi: The GPU Swiss Army Knife

Every NVIDIA GPU installation comes with nvidia-smi, and most administrators only scratch the surface of what it can do. Beyond the basic status display, nvidia-smi supports structured output, continuous monitoring, process-level GPU tracking, and programmatic queries that make it the foundation for any monitoring pipeline.

Basic Usage and Output Interpretation

# Standard status display
nvidia-smi

# The output shows:
# - Driver version and CUDA version
# - Per-GPU: temperature, power draw, memory usage, GPU utilization
# - Running processes with PID and GPU memory usage

The default nvidia-smi output is designed for human consumption, but every field matters. The temperature column shows the GPU die temperature — sustained operation above 83C on consumer cards or 85C on datacenter cards indicates cooling problems. The power draw column should be compared against the power limit (shown in the header) — if the GPU is consistently at its power limit, it is throttling. Memory usage shows allocated VRAM, not working set — a process can allocate VRAM without actively using it.
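
The power-limit comparison lends itself to a quick script. Below is a sketch with the parsing isolated in a function so it can be exercised on captured output; the 97% threshold is an illustrative choice, not an NVIDIA-defined one:

```shell
#!/bin/bash
# flag_power_capped.sh — report GPUs drawing >= 97% of their power limit,
# a sign they are power-throttling under sustained load.

flag_power_capped() {
  # Expects CSV lines: index, power.draw, power.limit (nounits),
  # e.g. "0, 443.2, 450.0"
  awk -F', *' '$2 >= 0.97 * $3 {
    printf "GPU %s at power limit: %.1fW / %.1fW\n", $1, $2, $3
  }'
}

# Live usage:
# nvidia-smi --query-gpu=index,power.draw,power.limit \
#   --format=csv,noheader,nounits | flag_power_capped

# Demo on captured sample lines (GPU 0 is capped, GPU 1 is not):
printf '0, 443.2, 450.0\n1, 120.5, 450.0\n' | flag_power_capped
```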

Continuous Monitoring with Watch Mode

# Update every second (default is every 5 seconds if no interval specified)
nvidia-smi -l 1

# Using the dmon subcommand for a cleaner continuous output
nvidia-smi dmon -s pucvmet -d 1

# dmon -s flags (combine letters as needed):
# p = power usage and temperature
# u = GPU and memory utilization
# c = processor and memory clocks
# v = power and thermal violations (throttling)
# m = framebuffer memory usage
# e = ECC errors and PCIe replay errors
# t = PCIe Rx/Tx throughput
# -d 1 = update every 1 second

# For process-level monitoring
nvidia-smi pmon -s um -d 1
# Shows per-process GPU utilization and memory usage

Structured Queries for Scripting

This is where nvidia-smi becomes a monitoring data source rather than a human tool. The --query-gpu flag outputs structured CSV data that you can pipe into any monitoring system.

# Query specific metrics in CSV format
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw,power.limit --format=csv

# Output:
# name, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.used [MiB], memory.total [MiB], power.draw [W], power.limit [W]
# NVIDIA RTX 4090, 45, 0 %, 0 %, 512 MiB, 24564 MiB, 25.50 W, 450.00 W

# Remove headers and units for clean parsing
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw --format=csv,noheader,nounits

# Continuous structured output (updates every 2 seconds)
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw --format=csv -l 2

# Query running processes
nvidia-smi --query-compute-apps=pid,name,used_gpu_memory --format=csv

All Available Query Fields

# List every available query field
nvidia-smi --help-query-gpu

# Commonly useful fields beyond the basics:
# clocks.current.graphics     - Current GPU clock speed
# clocks.current.memory       - Current memory clock speed
# clocks.max.graphics         - Maximum GPU clock speed
# pstate                      - Performance state (P0=max, P8=idle)
# fan.speed                   - Fan speed percentage
# encoder.stats.sessionCount  - Active NVENC sessions
# decoder.stats.sessionCount  - Active NVDEC sessions
# ecc.errors.corrected.total  - Corrected ECC errors (datacenter GPUs)
# ecc.errors.uncorrected.total - Uncorrected ECC errors (data corruption risk)
# retired_pages.pending       - Pages pending retirement (bad memory)

nvtop: Interactive GPU Process Monitor

nvtop is to GPUs what htop is to CPUs — a full-screen, interactive terminal application that shows real-time GPU utilization with process details. It supports NVIDIA, AMD, Intel, and Apple GPUs, making it the one monitoring tool that works across all hardware vendors.

Installation

# Ubuntu / Debian
sudo apt install -y nvtop

# Fedora
sudo dnf install -y nvtop

# RHEL 9 (via EPEL)
sudo dnf install -y epel-release
sudo dnf install -y nvtop

# Arch Linux
sudo pacman -S nvtop

# From source (for the latest version)
sudo apt install -y cmake libncurses5-dev libdrm-dev libsystemd-dev
git clone https://github.com/Syllo/nvtop.git
cd nvtop && mkdir build && cd build
cmake .. -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=ON
make && sudo make install

Using nvtop Effectively

# Launch with default settings
nvtop

# Monitor only GPU 0 (useful for skipping an integrated GPU)
nvtop -s 0

# Keyboard shortcuts inside nvtop:
# F2 or s   = Setup (change display options)
# F9 or k   = Kill a process
# F6 or o   = Sort processes
# F10 or q  = Quit
# +/-        = Expand/collapse GPU sections
# e          = Toggle encoder/decoder view

nvtop provides rolling graphs of GPU utilization, memory usage, temperature, and fan speed — all updated in real time. The process list shows every GPU-using process with its PID, user, GPU percentage, memory usage, and encoder/decoder usage. For AI workloads, the memory graph is the most critical one to watch: a steadily climbing line that never decreases indicates a memory leak that will eventually crash your inference server.

gpustat: One-Line GPU Status

Sometimes you just want a quick, clean, one-line-per-GPU status report. gpustat is a small Python tool built on NVML that formats GPU status into a compact, color-coded display. It is perfect for shell prompts, tmux status bars, and quick checks.

# Install via pip
pip install gpustat

# Basic usage
gpustat

# Output example:
# [0] NVIDIA RTX 4090  | 45°C,  0 % |   512 / 24564 MB | user:python/12345(4096M)

# With full details
gpustat -cupP

# Flags:
# -c = show command name for each process
# -u = show user for each process
# -p = show PID for each process
# -P = show power draw

# Continuous monitoring (update every 2 seconds)
gpustat -cupP -i 2

# JSON output for programmatic consumption
gpustat --json

# Watch mode with timestamps
watch -n 1 gpustat -cupP

gpustat is particularly useful in multi-user environments. When several people share a GPU server, a quick gpustat shows who is using which GPU and how much VRAM they have allocated. Add it to your shell's MOTD or login script so users see GPU status every time they connect.
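
A minimal login hook along these lines will do it — the function name and placement are assumptions, and any profile script or MOTD fragment works the same way:

```shell
#!/bin/sh
# gpustat_motd.sh — print a one-line GPU summary at login.
# Drop into /etc/profile.d/ or your MOTD scripts (path is an assumption).

show_gpu_status() {
  if command -v gpustat >/dev/null 2>&1; then
    # Fall back to a message if gpustat is present but cannot see a GPU
    gpustat --no-color 2>/dev/null || echo "no GPU visible to gpustat"
  else
    # Stay quiet-but-informative on hosts without gpustat installed
    echo "gpustat not installed"
  fi
}

show_gpu_status
```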

Prometheus NVIDIA GPU Exporter

For production monitoring, you need metrics in a time-series database with alerting capability. Prometheus is the standard, and there are several GPU exporters available. The most mature is the DCGM (Data Center GPU Manager) exporter from NVIDIA, which provides detailed metrics beyond what nvidia-smi exposes.

# Run the DCGM exporter as a Docker container
docker run -d \
  --name dcgm-exporter \
  --gpus all \
  --cap-add SYS_ADMIN \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04

# Verify metrics are being exported
curl -s localhost:9400/metrics | head -50

# Key metrics exported:
# DCGM_FI_DEV_GPU_TEMP          - GPU temperature
# DCGM_FI_DEV_GPU_UTIL          - GPU utilization percentage
# DCGM_FI_DEV_MEM_COPY_UTIL     - Memory utilization percentage
# DCGM_FI_DEV_FB_USED           - Framebuffer (VRAM) used in MiB
# DCGM_FI_DEV_FB_FREE           - Framebuffer free in MiB
# DCGM_FI_DEV_POWER_USAGE       - Power draw in watts
# DCGM_FI_DEV_SM_CLOCK          - Streaming multiprocessor clock in MHz
# DCGM_FI_DEV_MEM_CLOCK         - Memory clock in MHz
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT - PCIe transmit bandwidth
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT - PCIe receive bandwidth
# DCGM_FI_DEV_XID_ERRORS        - XID error count (hardware errors)

A Lightweight Alternative: nvidia_gpu_exporter

# If you prefer a simpler exporter that wraps nvidia-smi
# Download the binary
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/latest/download/nvidia_gpu_exporter_linux_amd64.tar.gz
tar xzf nvidia_gpu_exporter_linux_amd64.tar.gz
sudo mv nvidia_gpu_exporter /usr/local/bin/

# Run as a systemd service
sudo tee /etc/systemd/system/nvidia-gpu-exporter.service << EOF
[Unit]
Description=NVIDIA GPU Exporter for Prometheus
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/nvidia_gpu_exporter --web.listen-address=:9835
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-gpu-exporter

# Verify
curl -s localhost:9835/metrics | grep gpu

Prometheus Configuration

# Add to /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'nvidia-dcgm'
    static_configs:
      - targets: ['gpu-server-01:9400']
        labels:
          instance: 'gpu-server-01'
      - targets: ['gpu-server-02:9400']
        labels:
          instance: 'gpu-server-02'
    scrape_interval: 15s
    scrape_timeout: 10s

# Reload Prometheus configuration
curl -X POST http://localhost:9090/-/reload

Grafana GPU Dashboard Setup

Metrics in Prometheus are useful, but a Grafana dashboard makes them actionable. Here is how to build a GPU monitoring dashboard from scratch.

Dashboard Panels to Include

A practical GPU monitoring dashboard should have these panels organized in rows:

Row 1: Overview — Stat panels showing total GPUs, average utilization, total VRAM used/available, and current power draw across all GPUs.

Row 2: Per-GPU Utilization — Time-series graph of GPU compute utilization per device. Use the query:

# Prometheus query for GPU utilization per device
DCGM_FI_DEV_GPU_UTIL{instance="$instance"}

# For average across all GPUs
avg(DCGM_FI_DEV_GPU_UTIL{instance="$instance"})

Row 3: Memory Usage — VRAM usage per GPU with thresholds marked. This is the panel you will look at most for AI workloads.

# VRAM usage percentage per GPU
DCGM_FI_DEV_FB_USED{instance="$instance"} / (DCGM_FI_DEV_FB_USED{instance="$instance"} + DCGM_FI_DEV_FB_FREE{instance="$instance"}) * 100

# Absolute VRAM usage in GB
DCGM_FI_DEV_FB_USED{instance="$instance"} / 1024

Row 4: Thermal and Power — Temperature and power draw graphs with warning/critical thresholds.

# Temperature with threshold lines
DCGM_FI_DEV_GPU_TEMP{instance="$instance"}
# Add threshold annotations at 80°C (warning) and 90°C (critical)

# Power usage vs limit
DCGM_FI_DEV_POWER_USAGE{instance="$instance"}
# Overlay with power limit as a constant line

Row 5: Clock Speeds and PCIe Bandwidth — Useful for detecting thermal throttling (clocks drop) and data transfer bottlenecks.

# SM clock speed — drops indicate throttling
DCGM_FI_DEV_SM_CLOCK{instance="$instance"}

# PCIe throughput — useful for multi-GPU training
# (DCGM reports these as instantaneous gauges in KB/s, so graph them directly)
DCGM_FI_DEV_PCIE_TX_THROUGHPUT{instance="$instance"}
DCGM_FI_DEV_PCIE_RX_THROUGHPUT{instance="$instance"}

Importing a Pre-Built Dashboard

# NVIDIA provides an official Grafana dashboard for DCGM
# Import dashboard ID 12239 in Grafana:
# 1. Go to Grafana → Dashboards → Import
# 2. Enter dashboard ID: 12239
# 3. Select your Prometheus data source
# 4. Click Import

# Or via the API: fetch the dashboard JSON from grafana.com, then import it
curl -s https://grafana.com/api/dashboards/12239/revisions/latest/download \
  -o dcgm-dashboard.json
curl -X POST http://admin:admin@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d "{\"dashboard\": $(cat dcgm-dashboard.json), \"overwrite\": true, \"inputs\": [{\"name\": \"DS_PROMETHEUS\", \"type\": \"datasource\", \"pluginId\": \"prometheus\", \"value\": \"Prometheus\"}]}"

Alerting on GPU Metrics

Dashboards tell you what happened. Alerts tell you when something is happening. Here are the alerting rules that every GPU deployment should have.

Prometheus Alerting Rules

# /etc/prometheus/rules/gpu_alerts.yml
groups:
  - name: gpu_alerts
    rules:
      # VRAM usage above 90% for 5 minutes
      - alert: GPUMemoryHigh
        expr: |
          DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} VRAM usage above 90%"
          description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} is using {{ $value | printf \"%.1f\" }}% VRAM"

      # GPU temperature above 85°C for 2 minutes
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} temperature critical"
          description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} is at {{ $value }}°C"

      # GPU utilization at 0% for 30 minutes (wasted resources)
      - alert: GPUIdle
        expr: DCGM_FI_DEV_GPU_UTIL == 0
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu }} has been idle for 30 minutes"

      # Power usage at limit for 10 minutes (throttling)
      - alert: GPUPowerThrottling
        # DCGM_FI_DEV_POWER_MGMT_LIMIT must be enabled in the exporter's metric config
        expr: |
          DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_POWER_MGMT_LIMIT * 100 > 95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} power throttling"

      # XID errors detected (hardware problems); DCGM_FI_DEV_XID_ERRORS reports
      # the most recent XID error code, so any nonzero value means an error occurred
      - alert: GPUXIDErrors
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} XID errors detected"
          description: "Hardware errors on GPU {{ $labels.gpu }} — check dmesg for details"

Monitoring GPUs in Docker Containers

When GPUs are used inside Docker containers, monitoring has an extra layer. The host can see the GPU utilization, but mapping that utilization to specific containers requires additional tooling.

# From the host, nvidia-smi shows container processes with their PIDs
nvidia-smi --query-compute-apps=pid,name,used_gpu_memory --format=csv

# Map PIDs to container names via the cgroup hierarchy
# (docker ps has no "pid" filter, but /proc/PID/cgroup contains the container ID)
nvidia-smi --query-compute-apps=pid --format=csv,noheader | while read -r pid; do
  cid=$(grep -oE '[0-9a-f]{64}' "/proc/$pid/cgroup" 2>/dev/null | head -1)
  container=$(docker inspect --format '{{.Name}}' "$cid" 2>/dev/null)
  echo "PID: $pid -> Container: ${container:-host-process}"
done

# Run nvidia-smi inside a specific container
docker exec ollama nvidia-smi

# DCGM exporter inside Docker automatically labels metrics with container info
# when running with the NVIDIA Container Toolkit

Container-Level GPU Metrics with cAdvisor

# cAdvisor with GPU support provides per-container GPU metrics
docker run -d \
  --name cadvisor \
  --gpus all \
  -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:latest

# GPU metrics per container are available at http://localhost:8080/metrics:
# container_accelerator_memory_total_bytes
# container_accelerator_memory_used_bytes
# container_accelerator_duty_cycle
# Note: the accelerator metrics were deprecated in recent cAdvisor releases,
# so you may need to pin an older image tag for them to appear.

AMD ROCm Monitoring with rocm-smi

AMD's equivalent to nvidia-smi is rocm-smi, which is installed as part of the ROCm stack. The interface is similar but the flag names differ.

# Basic GPU status
rocm-smi

# Detailed information
rocm-smi --showallinfo

# Temperature monitoring
rocm-smi --showtemp

# Power monitoring
rocm-smi --showpower

# Memory usage
rocm-smi --showmeminfo vram

# Continuous monitoring (update every second)
watch -n 1 rocm-smi

# CSV output for scripting
rocm-smi --showgpuclocks --showmeminfo vram --showtemp --csv

# JSON output
rocm-smi --showgpuclocks --showmeminfo vram --showtemp --json

# Set power limit (useful for thermal management)
rocm-smi --setpoweroverdrive 200  # Set to 200W

# Monitor specific GPU
rocm-smi -d 0 --showuse --showtemp --showpower

ROCm Prometheus Exporter

# The amd_smi_exporter provides Prometheus metrics for AMD GPUs
# Clone and build
git clone https://github.com/amd/amd_smi_exporter.git
cd amd_smi_exporter
go build -o amd_smi_exporter .
sudo mv amd_smi_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/amd-smi-exporter.service << EOF
[Unit]
Description=AMD SMI Exporter for Prometheus
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/amd_smi_exporter --listen-address=:9301
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now amd-smi-exporter

# Verify metrics
curl -s localhost:9301/metrics | grep amd

Custom Monitoring Scripts

Sometimes you need monitoring logic that off-the-shelf tools do not provide. Here are practical scripts for common GPU monitoring scenarios.

VRAM Leak Detection Script

#!/bin/bash
# gpu_leak_monitor.sh - Detect VRAM leaks by tracking memory growth
# Run with: ./gpu_leak_monitor.sh >> /var/log/gpu_leak_monitor.log

THRESHOLD_MB=500    # Alert if VRAM grows by this much
CHECK_INTERVAL=300  # Check every 5 minutes

declare -A PREV_MEM

while true; do
    TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S")

    while IFS=, read -r gpu_id mem_used; do
        gpu_id=$(echo "$gpu_id" | tr -d ' ')
        mem_used=$(echo "$mem_used" | tr -d ' ')

        if [[ -n "${PREV_MEM[$gpu_id]}" ]]; then
            DIFF=$((mem_used - PREV_MEM[$gpu_id]))
            if [[ $DIFF -gt $THRESHOLD_MB ]]; then
                echo "[$TIMESTAMP] WARNING: GPU $gpu_id VRAM grew by ${DIFF}MB (${PREV_MEM[$gpu_id]}MB -> ${mem_used}MB)"
                # Send alert (webhook, email, etc.)
                # curl -X POST "https://hooks.slack.com/..." -d "{\"text\":\"GPU $gpu_id VRAM leak: +${DIFF}MB\"}"
            fi
        fi

        PREV_MEM[$gpu_id]=$mem_used
    done < <(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits)

    sleep $CHECK_INTERVAL
done

GPU Health Check for Load Balancers

#!/bin/bash
# gpu_health_check.sh - Health check for load balancers (expose it via xinetd or a small HTTP wrapper)
# Exit 0 = healthy, Exit 1 = unhealthy

MAX_TEMP=85
MAX_VRAM_PERCENT=95
MAX_ERRORS=0

# Check temperature
TEMP=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | sort -n | tail -1)
if [[ $TEMP -gt $MAX_TEMP ]]; then
    echo "UNHEALTHY: GPU temperature ${TEMP}C exceeds ${MAX_TEMP}C"
    exit 1
fi

# Check VRAM usage
while IFS=, read -r used total; do
    used=$(echo "$used" | tr -d ' ')
    total=$(echo "$total" | tr -d ' ')
    percent=$((used * 100 / total))
    if [[ $percent -gt $MAX_VRAM_PERCENT ]]; then
        echo "UNHEALTHY: GPU VRAM at ${percent}%"
        exit 1
    fi
done < <(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits)

# Check for XID errors in the last 5 minutes
ERRORS=$(journalctl -k --since "5 minutes ago" 2>/dev/null | grep -c "NVRM: Xid")
if [[ $ERRORS -gt $MAX_ERRORS ]]; then
    echo "UNHEALTHY: $ERRORS XID errors detected"
    exit 1
fi

echo "HEALTHY: temp=${TEMP}C"
exit 0

Integration with Existing Monitoring Stacks

If you already run Nagios, Zabbix, Datadog, or another monitoring platform, you do not need to switch to Prometheus. Every major monitoring platform can consume GPU metrics.

Zabbix Integration

# Add custom Zabbix UserParameters for GPU monitoring
# /etc/zabbix/zabbix_agentd.d/gpu.conf

UserParameter=gpu.temp[*],nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i $1
UserParameter=gpu.util[*],nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -i $1
UserParameter=gpu.mem.used[*],nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i $1
UserParameter=gpu.mem.total[*],nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i $1
UserParameter=gpu.power[*],nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits -i $1
UserParameter=gpu.count,nvidia-smi --query-gpu=count --format=csv,noheader

# Restart Zabbix agent
sudo systemctl restart zabbix-agent

Telegraf (InfluxDB) Integration

# Add to /etc/telegraf/telegraf.d/nvidia.conf
[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"
  timeout = "5s"

# Restart Telegraf
sudo systemctl restart telegraf

# This exposes metrics like:
# nvidia_smi.temperature_gpu
# nvidia_smi.utilization_gpu
# nvidia_smi.memory_used
# nvidia_smi.power_draw

Monitoring Best Practices for AI Workloads

AI workloads have monitoring patterns that differ from traditional GPU compute (like rendering or video encoding). Here are the practices that matter most for inference and training operations.

Track VRAM usage as a percentage, not absolute. A model using 22 GB on a 24 GB GPU is in a very different situation than 22 GB on a 48 GB GPU. Set alerts on percentage thresholds, not absolute values.
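
If you want that percentage outside of Prometheus — in a cron job or a quick check — it can be computed straight from nvidia-smi. A sketch, with the arithmetic in a function so it can be run against captured sample output:

```shell
#!/bin/bash
# vram_percent.sh — report per-GPU VRAM usage as a percentage.

vram_percent() {
  # Expects CSV lines: index, memory.used, memory.total (nounits),
  # e.g. "0, 12282, 24564"
  awk -F', *' '{ printf "GPU %s: %.1f%% VRAM used\n", $1, ($2 / $3) * 100 }'
}

# Live usage:
# nvidia-smi --query-gpu=index,memory.used,memory.total \
#   --format=csv,noheader,nounits | vram_percent

# Demo on a captured sample line (12282 of 24564 MiB = 50.0%):
echo "0, 12282, 24564" | vram_percent
```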

Monitor tokens per second, not just GPU utilization. GPU utilization can be 100% while your inference server is actually performing poorly due to memory bandwidth bottlenecks. If you can instrument your inference server to export tokens/second as a Prometheus metric, that is the real performance indicator.

Watch for thermal throttling patterns. A GPU that cycles between 100% and 0% utilization every few minutes is likely hitting its thermal limit and throttling. The clock speed metrics reveal this — if SM clocks drop below their boost level during load, the GPU is throttling. Fix this with better cooling, higher fan speeds, or reduced power limits.
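
A rough shell check for this signature — the 90% clock floor and 50% utilization gate below are illustrative thresholds, not vendor values:

```shell
#!/bin/bash
# detect_throttle.sh — flag GPUs that are busy but clocked well below their
# maximum graphics clock, a typical thermal/power throttling signature.

detect_throttle() {
  # Expects CSV lines: index, clocks.current.graphics, clocks.max.graphics,
  # utilization.gpu (nounits), e.g. "0, 1650, 2520, 98"
  awk -F', *' '$4 > 50 && $2 < 0.90 * $3 {
    printf "GPU %s throttling: %d MHz (max %d MHz) at %d%% util\n", $1, $2, $3, $4
  }'
}

# Live usage:
# nvidia-smi --query-gpu=index,clocks.current.graphics,clocks.max.graphics,utilization.gpu \
#   --format=csv,noheader,nounits | detect_throttle

# Demo: GPU 0 is busy but down-clocked; GPU 1 is healthy
printf '0, 1650, 2520, 98\n1, 2505, 2520, 97\n' | detect_throttle
```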

Set up separate alerting for training and inference. Training jobs are expected to use 100% GPU for hours. Inference servers should have variable utilization that correlates with request load. An inference server stuck at 100% GPU utilization might have a stuck request queue rather than healthy load.

Frequently Asked Questions

Can I monitor GPU usage inside Docker containers from the host?

Yes. nvidia-smi on the host shows all GPU processes regardless of whether they run in containers or on bare metal. The PIDs shown are the host-namespace PIDs, so you can map them to container names using docker inspect. For container-level aggregation, use cAdvisor with GPU support or the DCGM exporter, both of which can label metrics with container identifiers. The key insight is that GPU devices are not namespaced the way CPU and memory are — the host always has full visibility into GPU utilization.

What is the performance overhead of GPU monitoring?

nvidia-smi queries add negligible overhead — each call takes about 20-50 milliseconds and does not interrupt GPU compute operations. The DCGM exporter is more efficient for continuous monitoring because it maintains a persistent connection to the DCGM daemon rather than spawning a new process for each query. Running nvidia-smi every second is fine for debugging, but for production monitoring with Prometheus, a 15-second scrape interval is sufficient and keeps the overhead effectively zero. nvtop and gpustat have slightly higher overhead because they refresh their terminal UI, but the GPU overhead is still negligible.

How do I monitor GPU memory leaks in long-running inference servers?

Track VRAM usage over time with Prometheus and set up a rate-of-change alert. If deriv(DCGM_FI_DEV_FB_USED[1h]) is consistently positive (memory growing) over several hours without a corresponding increase in load, you have a leak. Common causes include: PyTorch not releasing cached memory (try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), CUDA context accumulation when models are loaded and unloaded repeatedly, and framework-level bugs in batching code that accumulate tensors. Restarting the inference process periodically (via a cron job or Kubernetes liveness probe) is a practical mitigation while you track down the root cause.
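
If the inference server runs under systemd, the allocator setting mentioned above can be applied with a drop-in; `inference.service` here is a placeholder for your own unit name:

```
# /etc/systemd/system/inference.service.d/cuda-alloc.conf
[Service]
Environment=PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

After creating the drop-in, run sudo systemctl daemon-reload and restart the service for the environment variable to take effect.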

What should my GPU temperature alert thresholds be?

For NVIDIA consumer GPUs (RTX series), warning at 83C and critical at 90C. For NVIDIA datacenter GPUs (A100, H100, L40S), warning at 80C and critical at 85C — these cards have tighter thermal specifications. For AMD RDNA 3 consumer cards, the junction temperature can safely reach 110C by design, so use hotspot temperature rather than edge temperature and alert at 100C/110C. Always check your specific GPU's thermal specifications in the data sheet rather than using generic thresholds. Power throttling starts before thermal shutdown, so monitoring clock speeds alongside temperature gives earlier warning of cooling problems.

Can I use these monitoring tools with AMD GPUs?

nvtop supports AMD GPUs natively — it uses the DRM interface rather than vendor-specific libraries, so it works with any GPU that has a Linux kernel driver. rocm-smi provides AMD-specific monitoring equivalent to nvidia-smi. For Prometheus, the amd_smi_exporter provides similar metrics to the NVIDIA DCGM exporter. gpustat is NVIDIA-only. Grafana dashboards work with any Prometheus data source, so you can use the same dashboard infrastructure for AMD GPUs — you just need different queries matching the AMD exporter metric names. The monitoring concepts (VRAM, temperature, utilization, power) are identical across vendors; only the tool names and metric labels differ.
