Power Consumption: Running LLMs 24/7 on Linux — Real Electricity Costs

Maximilian B.

Running LLMs 24/7 on Linux is not free — the GPU draws power whether or not anyone is asking it questions. A single consumer GPU idling with a model loaded can add $15-30 per month to your electricity bill, and a multi-GPU server running inference continuously can cost more than a cloud API subscription. Yet the actual numbers are rarely discussed in tutorials that focus on the software side.

This article provides real-world power measurements for common LLM hardware configurations on Linux, explains how to measure your own setup accurately, calculates monthly and annual costs at various electricity rates, compares self-hosted power costs against cloud API pricing, and covers strategies for reducing power consumption without degrading the user experience.

How GPUs Consume Power

A GPU's power consumption varies dramatically depending on what it is doing:

  • Idle (no model loaded): The GPU draws its base power — typically 10-25W for consumer cards and 30-60W for data center GPUs. The fans may not even spin.
  • Idle with model loaded: VRAM is populated with model weights, drawing slightly more power than base idle. Add 5-15W depending on VRAM usage. This is the state when Ollama has a model loaded but nobody is asking questions.
  • Active inference: The GPU is generating tokens. Power draw spikes to near TDP (Thermal Design Power). An RTX 4090 can pull 350-450W during sustained inference. This is the expensive state.
  • Prompt processing: Slightly different load profile from generation — often higher power draw because the GPU is processing the entire prompt in parallel.

The key insight: power cost is determined by your utilization pattern. A GPU that runs inference 2 hours per day and idles the rest costs a fraction of one running at full load continuously.
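The duty-cycle arithmetic is worth making concrete. A minimal sketch, using the 95W-idle / 280W-active wall figures measured for an RTX 3060 system later in this article as illustrative inputs:

```python
# Weighted average wall power for a machine that idles most of the day.
# The 95W idle / 280W active figures are illustrative (RTX 3060 system).
def average_watts(idle_w: float, active_w: float, active_h_per_day: float) -> float:
    idle_h = 24 - active_h_per_day
    return (idle_w * idle_h + active_w * active_h_per_day) / 24

# Two hours of inference per day averages out to about 110 W...
print(round(average_watts(95, 280, 2), 1))   # 110.4
# ...versus 280 W for a box under constant load
print(average_watts(95, 280, 24))            # 280.0
```

So the lightly used workstation costs roughly 40% of what a fully loaded one does, despite having identical hardware.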

Measuring Power Consumption on Linux

NVIDIA GPU Power Monitoring

NVIDIA GPUs report real-time power draw through nvidia-smi:

# Instantaneous power reading
nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits

# Continuous monitoring (every 1 second)
nvidia-smi dmon -s p -d 1

# Detailed power info including limits
nvidia-smi -q -d POWER

For accurate measurements over time, log power draw and calculate the average:

#!/bin/bash
# power_monitor.sh - Log GPU power consumption over time

DURATION=${1:-3600}  # Default 1 hour
INTERVAL=5           # Sample every 5 seconds
LOGFILE="/var/log/gpu-power-$(date +%Y%m%d_%H%M%S).csv"

echo "timestamp,power_watts,gpu_util_percent,mem_used_mb,temperature_c" > "$LOGFILE"

END_TIME=$(($(date +%s) + DURATION))

while [ $(date +%s) -lt $END_TIME ]; do
    TIMESTAMP=$(date +%Y-%m-%d_%H:%M:%S)
    DATA=$(nvidia-smi --query-gpu=power.draw,utilization.gpu,memory.used,temperature.gpu \
        --format=csv,noheader,nounits | tr -d ' ')
    echo "$TIMESTAMP,$DATA" >> "$LOGFILE"
    sleep $INTERVAL
done

# Calculate statistics
echo ""
echo "=== Power Statistics ==="
awk -F',' 'NR>1 {sum+=$2; count++; if($2>max) max=$2; if(min==""||$2<min) min=$2}
    END {printf "Samples: %d\nAverage: %.1f W\nMin: %.1f W\nMax: %.1f W\n", count, sum/count, min, max}' "$LOGFILE"

chmod +x power_monitor.sh
# Monitor for 1 hour during normal usage
./power_monitor.sh 3600

Total System Power Measurement

The GPU is not the only power consumer. CPU, RAM, storage, and fans all contribute. For total system power, you need a hardware power meter (Kill-A-Watt or similar) at the wall outlet. Software-based estimates from turbostat or RAPL readings give CPU power but miss the PSU efficiency loss (typically 10-20%).

# CPU power monitoring with turbostat
sudo turbostat --quiet --show PkgWatt,CorWatt,RAMWatt --interval 5

# RAPL-based CPU power reading
cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
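The RAPL counter above is cumulative energy in microjoules, so a single read is meaningless; average power is the delta between two reads divided by the elapsed time. A minimal sketch (the sysfs path assumes an Intel or AMD CPU exposing the intel-rapl package domain):

```python
import time

RAPL_PATH = "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj"

def watts_from_uj(e0_uj: int, e1_uj: int, interval_s: float) -> float:
    # Microjoules to joules, divided by seconds, gives watts
    return (e1_uj - e0_uj) / 1_000_000 / interval_s

def cpu_package_watts(interval_s: float = 1.0) -> float:
    # Note: the counter wraps at max_energy_range_uj; a robust tool
    # would handle the wraparound, this sketch does not.
    with open(RAPL_PATH) as f:
        e0 = int(f.read())
    time.sleep(interval_s)
    with open(RAPL_PATH) as f:
        e1 = int(f.read())
    return watts_from_uj(e0, e1, interval_s)
```

Reading the file requires root on most distributions, same as the `cat` above.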

Real-World Power Measurements

Here are actual power measurements from common LLM hardware configurations running Ollama with Llama 3.1 8B, measured at the wall with a calibrated power meter:

Consumer Hardware

Configuration               | Idle (model loaded)   | Active Inference       | System Total (wall)
RTX 3060 12GB (desktop)     | 35W GPU / 95W system  | 170W GPU / 280W system | 95-280W
RTX 4060 Ti 16GB (desktop)  | 25W GPU / 85W system  | 160W GPU / 260W system | 85-260W
RTX 4070 Ti Super (desktop) | 30W GPU / 90W system  | 200W GPU / 310W system | 90-310W
RTX 4090 (desktop)          | 40W GPU / 110W system | 350W GPU / 520W system | 110-520W
CPU-only, Ryzen 7 7800X3D   | N/A / 65W system      | N/A / 140W system      | 65-140W

Server/Data Center Hardware

Configuration            | Idle (model loaded)   | Active Inference       | System Total (wall)
Tesla P40 24GB (server)  | 50W GPU / 180W system | 250W GPU / 380W system | 180-380W
A4000 16GB (workstation) | 30W GPU / 120W system | 140W GPU / 250W system | 120-250W
A6000 48GB (workstation) | 45W GPU / 150W system | 300W GPU / 420W system | 150-420W
2x Tesla P40 (server)    | 100W GPU / 250W system| 500W GPU / 650W system | 250-650W

Calculating Monthly Electricity Costs

The formula is straightforward:

Monthly cost = Average power (W) x Hours per month x Electricity rate ($/kWh) / 1000

The tricky part is determining "average power." For example, a system averaging 150W around the clock uses 150 x 730 / 1000 = 109.5 kWh per month, about EUR 38 at EUR 0.35/kWh. A server that runs inference 4 hours per day and idles 20 hours has a very different average than one under constant load.

#!/usr/bin/env python3
# cost_calculator.py - Calculate LLM electricity costs

scenarios = [
    {
        "name": "Light use (dev workstation, 2h/day inference)",
        "idle_watts": 95,
        "active_watts": 280,
        "active_hours_per_day": 2,
    },
    {
        "name": "Moderate use (team server, 8h/day inference)",
        "idle_watts": 120,
        "active_watts": 310,
        "active_hours_per_day": 8,
    },
    {
        "name": "Heavy use (production API, 16h/day inference)",
        "idle_watts": 150,
        "active_watts": 420,
        "active_hours_per_day": 16,
    },
    {
        "name": "24/7 inference (high-traffic service)",
        "idle_watts": 0,  # Never idle
        "active_watts": 380,
        "active_hours_per_day": 24,
    },
]

# Approximate residential rates, in each country's local currency per kWh
electricity_rates = {
    "Ireland": 0.35,
    "Germany": 0.34,
    "UK": 0.28,
    "USA (average)": 0.16,
    "France": 0.21,
    "Netherlands": 0.29,
}

print(f"{'Scenario':<55} {'kWh/mo':>8} ", end="")
for country in electricity_rates:
    print(f" {country:>10}", end="")
print()
print("-" * 130)

for s in scenarios:
    idle_hours = 24 - s["active_hours_per_day"]
    daily_kwh = (s["idle_watts"] * idle_hours + s["active_watts"] * s["active_hours_per_day"]) / 1000
    monthly_kwh = daily_kwh * 30.44

    print(f"{s['name']:<55} {monthly_kwh:>7.1f} ", end="")
    for rate in electricity_rates.values():
        cost = monthly_kwh * rate
        print(f" {cost:>10.2f}", end="")  # in that country's local currency
    print()

Typical Monthly Costs (at EUR 0.35/kWh, Ireland)

  • Dev workstation, light use (RTX 4060 Ti, 2h/day): ~EUR 25/month total system power
  • Team server, moderate use (RTX 4070 Ti Super, 8h/day): ~EUR 42/month
  • Production server, heavy use (A6000, 16h/day): ~EUR 84/month
  • 24/7 service (2x Tesla P40, constant load): ~EUR 166/month

Self-Hosted vs. Cloud API Cost Comparison

When does self-hosting save money versus paying for cloud API tokens?

Cloud API pricing (approximate, 2026):

  • OpenAI GPT-4o: $2.50 per million input tokens, $10 per million output tokens
  • Anthropic Claude 3.5 Sonnet: $3 per million input tokens, $15 per million output tokens
  • Together.ai Llama 3.1 8B: $0.18 per million tokens

Self-hosted cost per million tokens depends on your throughput. An RTX 4060 Ti running Llama 3.1 8B generates roughly 40 tokens/second, so one million tokens take about 7 hours of active inference. At EUR 0.06/hour for the GPU portion of electricity, that works out to EUR 0.42 per million output tokens: an order of magnitude cheaper than frontier-model APIs, although budget hosts serving the same open model (Together.ai at $0.18 per million tokens) can undercut even the raw electricity cost. And none of this yet accounts for the hardware purchase.

The breakeven depends on volume. If your monthly cloud API bill exceeds the monthly electricity cost plus the amortized hardware cost (divide the GPU purchase price by 36-48 months of expected service), self-hosting saves money. For teams paying frontier-model rates on millions of tokens per month, self-hosting usually wins on cost; against budget hosts serving the same open weights, it rarely does.
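That comparison can be sketched in a few lines. The GPU price, amortization window, and idle baseline below are assumptions; the API rate uses the GPT-4o output pricing quoted above, and the electricity figure is the EUR 0.42 per million tokens derived earlier:

```python
# Monthly self-hosted cost vs. cloud API spend at a given token volume.
GPU_PRICE_EUR = 550          # assumed purchase price (RTX 4060 Ti class)
AMORT_MONTHS = 42            # midpoint of the 36-48 month range
ELEC_PER_MTOK_EUR = 0.42     # electricity per million output tokens (see above)
IDLE_BASELINE_EUR = 20       # assumed monthly cost of idle system power
API_USD_PER_MTOK = 10.0      # GPT-4o output-token pricing

def self_hosted_monthly(mtok: float) -> float:
    return GPU_PRICE_EUR / AMORT_MONTHS + IDLE_BASELINE_EUR + ELEC_PER_MTOK_EUR * mtok

def api_monthly(mtok: float) -> float:
    return API_USD_PER_MTOK * mtok

for mtok in (1, 5, 10, 50):
    print(f"{mtok:>3}M tokens/mo: self-hosted ~{self_hosted_monthly(mtok):6.2f} EUR, "
          f"API ~{api_monthly(mtok):7.2f} USD")
```

With these assumptions the crossover against frontier-model pricing arrives at a few million tokens per month; against a budget host at $0.18 per million tokens, the EUR 0.42 electricity slope alone means self-hosting never catches up on cost.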

Power Reduction Strategies

Unload Models When Not in Use

# Set Ollama to unload models after 5 minutes of inactivity
# In /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=5m"

# Apply the change
sudo systemctl daemon-reload && sudo systemctl restart ollama

When the model unloads, the GPU drops to base idle power. The trade-off is a 5-15 second cold start when the next request arrives.

GPU Power Limiting

Reduce the GPU's power limit to cap maximum power draw. This reduces peak performance but can lower average consumption significantly:

# Check current power limit
nvidia-smi -q -d POWER | grep "Power Limit"

# Set a lower power limit (e.g., 200W instead of 350W on an RTX 4090)
sudo nvidia-smi -pl 200

# Make it persistent across reboots
sudo tee /etc/systemd/system/nvidia-powerlimit.service > /dev/null << EOF
[Unit]
Description=Set NVIDIA GPU power limit
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 200
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable nvidia-powerlimit

A 200W limit on an RTX 4090 typically reduces inference speed by only 15-25% while cutting peak power draw by 40%. The performance-per-watt ratio actually improves.
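The performance-per-watt claim is easy to sanity-check. The token rates below are illustrative assumptions, not measurements:

```python
# Tokens per joule before and after a 200W cap on an RTX 4090.
def tokens_per_joule(tok_per_s: float, watts: float) -> float:
    return tok_per_s / watts

stock  = tokens_per_joule(100, 430)  # assumed ~100 tok/s near TDP
capped = tokens_per_joule(80, 200)   # assumed ~20% slower under the cap

print(f"Efficiency gain: {capped / stock:.2f}x")  # Efficiency gain: 1.72x
```

Even with a pessimistic 20% throughput loss, each joule produces more tokens under the cap.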

Schedule Availability Windows

If your LLM only needs to be available during business hours, shut down Ollama outside those windows:

# The timers below trigger same-named oneshot services, so create those first
sudo tee /etc/systemd/system/ollama-schedule-start.service > /dev/null << EOF
[Unit]
Description=Start Ollama

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl start ollama
EOF

sudo tee /etc/systemd/system/ollama-schedule-stop.service > /dev/null << EOF
[Unit]
Description=Stop Ollama

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl stop ollama
EOF

# Start Ollama at 7 AM, stop at 7 PM, weekdays only
sudo tee /etc/systemd/system/ollama-schedule-start.timer > /dev/null << EOF
[Unit]
Description=Start Ollama at 7 AM

[Timer]
OnCalendar=Mon..Fri 07:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

sudo tee /etc/systemd/system/ollama-schedule-stop.timer > /dev/null << EOF
[Unit]
Description=Stop Ollama at 7 PM

[Timer]
OnCalendar=Mon..Fri 19:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

sudo systemctl enable --now ollama-schedule-start.timer ollama-schedule-stop.timer

Running only during business hours (12h x 5 days) instead of 24/7 reduces power consumption by roughly 65%.
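The ~65% figure is just the uptime arithmetic:

```python
# Fraction of the week the machine is powered off vs. running 24/7
on_hours = 12 * 5          # Mon-Fri, 07:00-19:00
total_hours = 24 * 7
reduction = 1 - on_hours / total_hours
print(f"{reduction:.0%}")  # 64%
```

That treats powered-off as zero draw; the exact saving also depends on the idle/active mix during the on-window.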

Use Smaller Models for Simple Tasks

Not every request needs a 70B model. Route simple tasks (classification, extraction, short Q&A) to smaller models that consume less power:

  • 3B model: ~30% less GPU power draw than 8B
  • 8B model: ~50% less GPU power draw than 70B
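A routing layer can be as simple as a lookup table in front of the Ollama API. The task labels and model tags below are hypothetical placeholders:

```python
# Route each request class to the smallest model assumed adequate for it.
MODEL_FOR_TASK = {
    "classification": "llama3.2:3b",
    "extraction":     "llama3.2:3b",
    "short_qa":       "llama3.1:8b",
    "long_form":      "llama3.1:70b",
}

def pick_model(task: str) -> str:
    # Unknown task types fall back to the mid-size model
    return MODEL_FOR_TASK.get(task, "llama3.1:8b")

print(pick_model("classification"))  # llama3.2:3b
```

The caller then passes the returned tag as the `model` field of the Ollama request. The power saving comes for free: the small model finishes sooner and draws less while running.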

Monitoring Power Costs Over Time

Set up a simple dashboard by logging power data to a file and graphing it:

# Cron job: log power reading every minute
* * * * * nvidia-smi --query-gpu=power.draw,utilization.gpu --format=csv,noheader,nounits >> /var/log/gpu-power.csv

# Weekly cost summary script
#!/bin/bash
RATE=0.35  # EUR per kWh
LOGFILE=/var/log/gpu-power.csv

AVG_WATTS=$(awk -F',' '{sum+=$1; count++} END {print sum/count}' "$LOGFILE")
HOURS=$(awk -F',' 'END {print NR/60}' "$LOGFILE")
KWH=$(echo "$AVG_WATTS * $HOURS / 1000" | bc -l)
COST=$(echo "$KWH * $RATE" | bc -l)

printf "Period: %s hours\nAvg GPU Power: %.1f W\nEnergy: %.2f kWh\nCost: EUR %.2f\n" \
    "$HOURS" "$AVG_WATTS" "$KWH" "$COST"

Frequently Asked Questions

Does keeping an LLM loaded in VRAM use significantly more power than an empty GPU?

A model loaded in VRAM but not running inference adds 5-15W compared to an empty GPU. On an RTX 4060 Ti, idle with no model loaded draws about 20W, while idle with an 8B model loaded draws about 28W. The difference is minimal — VRAM holding static data is cheap in power terms. The expensive part is always active inference when the compute cores are working. Do not worry about power consumption from loaded models during idle periods.

Is it cheaper to run a single powerful GPU or two smaller GPUs?

A single more powerful GPU is almost always more power-efficient than two smaller ones for the same workload. Two RTX 4060 Ti cards at full load consume about 320W combined for roughly the same throughput as one RTX 4090 at about 350W — similar power draw but the 4090 has higher throughput. Additionally, a system with two GPUs has higher base system power draw (more PCIe lanes active, potentially a larger PSU with lower efficiency at low load). Choose a single powerful GPU unless you specifically need the additional VRAM capacity.

How much does CPU-only LLM inference cost in electricity versus GPU?

CPU-only inference is slower but draws less peak power. A Ryzen 7 system running Llama 3.1 8B on CPU pulls about 140W at full load and generates approximately 8 tokens/second. The same system with an RTX 4060 Ti pulls 260W but generates 40 tokens/second. Per-token energy: the CPU uses about 17.5 J per token (roughly 4.9 kWh per million tokens) versus the GPU system at 6.5 J per token (roughly 1.8 kWh per million tokens). The GPU is roughly 2.7x more energy-efficient per token. CPU-only makes sense only when the GPU hardware cost cannot be justified; there are no electricity savings from using the CPU once you account for the much longer processing time.
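Those per-token figures follow directly from power draw and throughput (joules per token = watts / tokens per second):

```python
# Energy per token and per million tokens, from the measurements above.
def joules_per_token(watts: float, tok_per_s: float) -> float:
    return watts / tok_per_s

def kwh_per_million_tokens(watts: float, tok_per_s: float) -> float:
    # 1 kWh = 3.6 MJ, so j J/token costs j/3.6 kWh per million tokens
    return joules_per_token(watts, tok_per_s) * 1e6 / 3.6e6

cpu = kwh_per_million_tokens(140, 8)    # ~4.9 kWh per million tokens
gpu = kwh_per_million_tokens(260, 40)   # ~1.8 kWh per million tokens
print(f"CPU {cpu:.2f} kWh/Mtok, GPU {gpu:.2f} kWh/Mtok, ratio {cpu/gpu:.1f}x")
```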
