Power Consumption: Running LLMs 24/7 on Linux — Real Electricity Costs

Maximilian B.

Running LLMs 24/7 on Linux is not free — the GPU draws power whether or not anyone is asking it questions. A single consumer GPU idling with a model loaded can add $15-30 per month to your electricity bill, and a multi-GPU server running inference continuously can cost more than a cloud API subscription. Yet the actual numbers are rarely discussed in tutorials that focus on the software side.

This article provides real-world power measurements for common LLM hardware configurations on Linux, explains how to measure your own setup accurately, calculates monthly and annual costs at various electricity rates, compares self-hosted power costs against cloud API pricing, and covers strategies for reducing power consumption without degrading the user experience.

How GPUs Consume Power

A GPU's power consumption varies dramatically depending on what it is doing:

  • Idle (no model loaded): The GPU draws its base power — typically 10-25W for consumer cards and 30-60W for data center GPUs. The fans may not even spin.
  • Idle with model loaded: VRAM is populated with model weights, drawing slightly more power than base idle. Add 5-15W depending on VRAM usage. This is the state when Ollama has a model loaded but nobody is asking questions.
  • Active inference: The GPU is generating tokens. Power draw spikes to near TDP (Thermal Design Power). An RTX 4090 can pull 350-450W during sustained inference. This is the expensive state.
  • Prompt processing: Slightly different load profile from generation — often higher power draw because the GPU is processing the entire prompt in parallel.

The key insight: power cost is determined by your utilization pattern. A GPU that runs inference 2 hours per day and idles the rest costs a fraction of one running at full load continuously.
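The duty-cycle arithmetic is worth making concrete. A minimal sketch, using the 95W-idle / 280W-active wall figures measured for an RTX 3060 system later in this article as illustrative inputs:

```python
# Weighted average wall power for a machine that idles most of the day.
# The 95W idle / 280W active figures are illustrative (RTX 3060 system).
def average_watts(idle_w: float, active_w: float, active_h_per_day: float) -> float:
    idle_h = 24 - active_h_per_day
    return (idle_w * idle_h + active_w * active_h_per_day) / 24

# Two hours of inference per day averages out to about 110 W...
print(round(average_watts(95, 280, 2), 1))   # 110.4
# ...versus 280 W for a box under constant load
print(average_watts(95, 280, 24))            # 280.0
```

So the lightly used workstation costs roughly 40% of what a fully loaded one does, despite having identical hardware.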

Measuring Power Consumption on Linux

NVIDIA GPU Power Monitoring

NVIDIA GPUs report real-time power draw through nvidia-smi:

# Instantaneous power reading
nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits

# Continuous monitoring (every 1 second)
nvidia-smi dmon -s p -d 1

# Detailed power info including limits
nvidia-smi -q -d POWER

For accurate measurements over time, log power draw and calculate the average:

#!/bin/bash
# power_monitor.sh - Log GPU power consumption over time

DURATION=${1:-3600}  # Default 1 hour
INTERVAL=5           # Sample every 5 seconds
LOGFILE="/var/log/gpu-power-$(date +%Y%m%d_%H%M%S).csv"

echo "timestamp,power_watts,gpu_util_percent,mem_used_mb,temperature_c" > "$LOGFILE"

END_TIME=$(($(date +%s) + DURATION))

while [ $(date +%s) -lt $END_TIME ]; do
    TIMESTAMP=$(date +%Y-%m-%d_%H:%M:%S)
    DATA=$(nvidia-smi --query-gpu=power.draw,utilization.gpu,memory.used,temperature.gpu \
        --format=csv,noheader,nounits | tr -d ' ')
    echo "$TIMESTAMP,$DATA" >> "$LOGFILE"
    sleep $INTERVAL
done

# Calculate statistics
echo ""
echo "=== Power Statistics ==="
awk -F',' 'NR>1 {sum+=$2; count++; if($2>max) max=$2; if(min==""||$2<min) min=$2}
    END {printf "Samples: %d\nAverage: %.1f W\nMin: %.1f W\nMax: %.1f W\n", count, sum/count, min, max}' "$LOGFILE"

chmod +x power_monitor.sh
# Monitor for 1 hour during normal usage
./power_monitor.sh 3600

Total System Power Measurement

The GPU is not the only power consumer. CPU, RAM, storage, and fans all contribute. For total system power, you need a hardware power meter (Kill-A-Watt or similar) at the wall outlet. Software-based estimates from turbostat or RAPL readings give CPU power but miss the PSU efficiency loss (typically 10-20%).

# CPU power monitoring with turbostat
sudo turbostat --quiet --show PkgWatt,CorWatt,RAMWatt --interval 5

# RAPL-based CPU power reading
cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
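The RAPL counter above is cumulative energy in microjoules, so a single read is meaningless; average power is the delta between two reads divided by the elapsed time. A minimal sketch (the sysfs path assumes an Intel or AMD CPU exposing the intel-rapl package domain):

```python
import time

RAPL_PATH = "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj"

def watts_from_uj(e0_uj: int, e1_uj: int, interval_s: float) -> float:
    # Microjoules to joules, divided by seconds, gives watts
    return (e1_uj - e0_uj) / 1_000_000 / interval_s

def cpu_package_watts(interval_s: float = 1.0) -> float:
    # Note: the counter wraps at max_energy_range_uj; a robust tool
    # would handle the wraparound, this sketch does not.
    with open(RAPL_PATH) as f:
        e0 = int(f.read())
    time.sleep(interval_s)
    with open(RAPL_PATH) as f:
        e1 = int(f.read())
    return watts_from_uj(e0, e1, interval_s)
```

Reading the file requires root on most distributions, same as the `cat` above.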

Real-World Power Measurements

Here are actual power measurements from common LLM hardware configurations running Ollama with Llama 3.1 8B, measured at the wall with a calibrated power meter:

Consumer Hardware

Configuration               | Idle (model loaded)   | Active Inference       | System Total (wall)
RTX 3060 12GB (desktop)     | 35W GPU / 95W system  | 170W GPU / 280W system | 95-280W
RTX 4060 Ti 16GB (desktop)  | 25W GPU / 85W system  | 160W GPU / 260W system | 85-260W
RTX 4070 Ti Super (desktop) | 30W GPU / 90W system  | 200W GPU / 310W system | 90-310W
RTX 4090 (desktop)          | 40W GPU / 110W system | 350W GPU / 520W system | 110-520W
CPU-only, Ryzen 7 7800X3D   | N/A / 65W system      | N/A / 140W system      | 65-140W

Server/Data Center Hardware

Configuration            | Idle (model loaded)   | Active Inference       | System Total (wall)
Tesla P40 24GB (server)  | 50W GPU / 180W system | 250W GPU / 380W system | 180-380W
A4000 16GB (workstation) | 30W GPU / 120W system | 140W GPU / 250W system | 120-250W
A6000 48GB (workstation) | 45W GPU / 150W system | 300W GPU / 420W system | 150-420W
2x Tesla P40 (server)    | 100W GPU / 250W system| 500W GPU / 650W system | 250-650W

Calculating Monthly Electricity Costs

The formula is straightforward:

Monthly cost = Average power (W) x Hours per month x Electricity rate ($/kWh) / 1000

The tricky part is determining "average power." For example, a system averaging 150W around the clock uses 150 x 730 / 1000 = 109.5 kWh per month, about EUR 38 at EUR 0.35/kWh. A server that runs inference 4 hours per day and idles 20 hours has a very different average than one under constant load.

#!/usr/bin/env python3
# cost_calculator.py - Calculate LLM electricity costs

scenarios = [
    {
        "name": "Light use (dev workstation, 2h/day inference)",
        "idle_watts": 95,
        "active_watts": 280,
        "active_hours_per_day": 2,
    },
    {
        "name": "Moderate use (team server, 8h/day inference)",
        "idle_watts": 120,
        "active_watts": 310,
        "active_hours_per_day": 8,
    },
    {
        "name": "Heavy use (production API, 16h/day inference)",
        "idle_watts": 150,
        "active_watts": 420,
        "active_hours_per_day": 16,
    },
    {
        "name": "24/7 inference (high-traffic service)",
        "idle_watts": 0,  # Never idle
        "active_watts": 380,
        "active_hours_per_day": 24,
    },
]

# Approximate residential rates, in each country's local currency per kWh
electricity_rates = {
    "Ireland": 0.35,
    "Germany": 0.34,
    "UK": 0.28,
    "USA (average)": 0.16,
    "France": 0.21,
    "Netherlands": 0.29,
}

print(f"{'Scenario':<55} {'kWh/mo':>8} ", end="")
for country in electricity_rates:
    print(f" {country:>10}", end="")
print()
print("-" * 130)

for s in scenarios:
    idle_hours = 24 - s["active_hours_per_day"]
    daily_kwh = (s["idle_watts"] * idle_hours + s["active_watts"] * s["active_hours_per_day"]) / 1000
    monthly_kwh = daily_kwh * 30.44

    print(f"{s['name']:<55} {monthly_kwh:>7.1f} ", end="")
    for rate in electricity_rates.values():
        cost = monthly_kwh * rate
        print(f" {cost:>10.2f}", end="")  # in that country's local currency
    print()

Typical Monthly Costs (at EUR 0.35/kWh, Ireland)

  • Dev workstation, light use (RTX 4060 Ti, 2h/day): ~EUR 25/month total system power
  • Team server, moderate use (RTX 4070 Ti Super, 8h/day): ~EUR 42/month
  • Production server, heavy use (A6000, 16h/day): ~EUR 84/month
  • 24/7 service (2x Tesla P40, constant load): ~EUR 166/month

Self-Hosted vs. Cloud API Cost Comparison

When does self-hosting save money versus paying for cloud API tokens?

Cloud API pricing (approximate, 2026):

  • OpenAI GPT-4o: $2.50 per million input tokens, $10 per million output tokens
  • Anthropic Claude 3.5 Sonnet: $3 per million input tokens, $15 per million output tokens
  • Together.ai Llama 3.1 8B: $0.18 per million tokens

Self-hosted cost per million tokens depends on your throughput. An RTX 4060 Ti running Llama 3.1 8B generates roughly 40 tokens/second, so one million tokens take about 7 hours of active inference. At EUR 0.06/hour for the GPU portion of electricity, that works out to EUR 0.42 per million output tokens: an order of magnitude cheaper than frontier-model APIs, although budget hosts serving the same open model (Together.ai at $0.18 per million tokens) can undercut even the raw electricity cost. And none of this yet accounts for the hardware purchase.

The breakeven depends on volume. If your monthly cloud API bill exceeds the monthly electricity cost plus the amortized hardware cost (divide the GPU purchase price by 36-48 months of expected service), self-hosting saves money. For teams paying frontier-model rates on millions of tokens per month, self-hosting usually wins on cost; against budget hosts serving the same open weights, it rarely does.
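That comparison can be sketched in a few lines. The GPU price, amortization window, and idle baseline below are assumptions; the API rate uses the GPT-4o output pricing quoted above, and the electricity figure is the EUR 0.42 per million tokens derived earlier:

```python
# Monthly self-hosted cost vs. cloud API spend at a given token volume.
GPU_PRICE_EUR = 550          # assumed purchase price (RTX 4060 Ti class)
AMORT_MONTHS = 42            # midpoint of the 36-48 month range
ELEC_PER_MTOK_EUR = 0.42     # electricity per million output tokens (see above)
IDLE_BASELINE_EUR = 20       # assumed monthly cost of idle system power
API_USD_PER_MTOK = 10.0      # GPT-4o output-token pricing

def self_hosted_monthly(mtok: float) -> float:
    return GPU_PRICE_EUR / AMORT_MONTHS + IDLE_BASELINE_EUR + ELEC_PER_MTOK_EUR * mtok

def api_monthly(mtok: float) -> float:
    return API_USD_PER_MTOK * mtok

for mtok in (1, 5, 10, 50):
    print(f"{mtok:>3}M tokens/mo: self-hosted ~{self_hosted_monthly(mtok):6.2f} EUR, "
          f"API ~{api_monthly(mtok):7.2f} USD")
```

With these assumptions the crossover against frontier-model pricing arrives at a few million tokens per month; against a budget host at $0.18 per million tokens, the EUR 0.42 electricity slope alone means self-hosting never catches up on cost.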

Power Reduction Strategies

Unload Models When Not in Use

# Set Ollama to unload models after 5 minutes of inactivity
# In /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=5m"

# Apply the change
sudo systemctl daemon-reload && sudo systemctl restart ollama

When the model unloads, the GPU drops to base idle power. The trade-off is a 5-15 second cold start when the next request arrives.

GPU Power Limiting

Reduce the GPU's power limit to cap maximum power draw. This reduces peak performance but can lower average consumption significantly:

# Check current power limit
nvidia-smi -q -d POWER | grep "Power Limit"

# Set a lower power limit (e.g., 200W instead of 350W on an RTX 4090)
sudo nvidia-smi -pl 200

# Make it persistent across reboots
sudo tee /etc/systemd/system/nvidia-powerlimit.service > /dev/null << EOF
[Unit]
Description=Set NVIDIA GPU power limit
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 200
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable nvidia-powerlimit

A 200W limit on an RTX 4090 typically reduces inference speed by only 15-25% while cutting peak power draw by 40%. The performance-per-watt ratio actually improves.
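The performance-per-watt claim is easy to sanity-check. The token rates below are illustrative assumptions, not measurements:

```python
# Tokens per joule before and after a 200W cap on an RTX 4090.
def tokens_per_joule(tok_per_s: float, watts: float) -> float:
    return tok_per_s / watts

stock  = tokens_per_joule(100, 430)  # assumed ~100 tok/s near TDP
capped = tokens_per_joule(80, 200)   # assumed ~20% slower under the cap

print(f"Efficiency gain: {capped / stock:.2f}x")  # Efficiency gain: 1.72x
```

Even with a pessimistic 20% throughput loss, each joule produces more tokens under the cap.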

Schedule Availability Windows

If your LLM only needs to be available during business hours, shut down Ollama outside those windows:

# The timers below trigger same-named oneshot services, so create those first
sudo tee /etc/systemd/system/ollama-schedule-start.service > /dev/null << EOF
[Unit]
Description=Start Ollama

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl start ollama
EOF

sudo tee /etc/systemd/system/ollama-schedule-stop.service > /dev/null << EOF
[Unit]
Description=Stop Ollama

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl stop ollama
EOF

# Start Ollama at 7 AM, stop at 7 PM, weekdays only
sudo tee /etc/systemd/system/ollama-schedule-start.timer > /dev/null << EOF
[Unit]
Description=Start Ollama at 7 AM

[Timer]
OnCalendar=Mon..Fri 07:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

sudo tee /etc/systemd/system/ollama-schedule-stop.timer > /dev/null << EOF
[Unit]
Description=Stop Ollama at 7 PM

[Timer]
OnCalendar=Mon..Fri 19:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

sudo systemctl enable --now ollama-schedule-start.timer ollama-schedule-stop.timer

Running only during business hours (12h x 5 days) instead of 24/7 reduces power consumption by roughly 65%.
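The ~65% figure is just the uptime arithmetic:

```python
# Fraction of the week the machine is powered off vs. running 24/7
on_hours = 12 * 5          # Mon-Fri, 07:00-19:00
total_hours = 24 * 7
reduction = 1 - on_hours / total_hours
print(f"{reduction:.0%}")  # 64%
```

That treats powered-off as zero draw; the exact saving also depends on the idle/active mix during the on-window.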

Use Smaller Models for Simple Tasks

Not every request needs a 70B model. Route simple tasks (classification, extraction, short Q&A) to smaller models that consume less power:

  • 3B model: ~30% less GPU power draw than 8B
  • 8B model: ~50% less GPU power draw than 70B
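A routing layer can be as simple as a lookup table in front of the Ollama API. The task labels and model tags below are hypothetical placeholders:

```python
# Route each request class to the smallest model assumed adequate for it.
MODEL_FOR_TASK = {
    "classification": "llama3.2:3b",
    "extraction":     "llama3.2:3b",
    "short_qa":       "llama3.1:8b",
    "long_form":      "llama3.1:70b",
}

def pick_model(task: str) -> str:
    # Unknown task types fall back to the mid-size model
    return MODEL_FOR_TASK.get(task, "llama3.1:8b")

print(pick_model("classification"))  # llama3.2:3b
```

The caller then passes the returned tag as the `model` field of the Ollama request. The power saving comes for free: the small model finishes sooner and draws less while running.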

Monitoring Power Costs Over Time

Set up a simple dashboard by logging power data to a file and graphing it:

# Cron job: log power reading every minute
* * * * * nvidia-smi --query-gpu=power.draw,utilization.gpu --format=csv,noheader,nounits >> /var/log/gpu-power.csv

# Weekly cost summary script
#!/bin/bash
RATE=0.35  # EUR per kWh
LOGFILE=/var/log/gpu-power.csv

AVG_WATTS=$(awk -F',' '{sum+=$1; count++} END {print sum/count}' "$LOGFILE")
HOURS=$(awk -F',' 'END {print NR/60}' "$LOGFILE")
KWH=$(echo "$AVG_WATTS * $HOURS / 1000" | bc -l)
COST=$(echo "$KWH * $RATE" | bc -l)

printf "Period: %s hours\nAvg GPU Power: %.1f W\nEnergy: %.2f kWh\nCost: EUR %.2f\n" \
    "$HOURS" "$AVG_WATTS" "$KWH" "$COST"

Frequently Asked Questions

Does keeping an LLM loaded in VRAM use significantly more power than an empty GPU?

A model loaded in VRAM but not running inference adds 5-15W compared to an empty GPU. On an RTX 4060 Ti, idle with no model loaded draws about 20W, while idle with an 8B model loaded draws about 28W. The difference is minimal — VRAM holding static data is cheap in power terms. The expensive part is always active inference when the compute cores are working. Do not worry about power consumption from loaded models during idle periods.

Is it cheaper to run a single powerful GPU or two smaller GPUs?

A single more powerful GPU is almost always more power-efficient than two smaller ones for the same workload. Two RTX 4060 Ti cards at full load consume about 320W combined for roughly the same throughput as one RTX 4090 at about 350W — similar power draw but the 4090 has higher throughput. Additionally, a system with two GPUs has higher base system power draw (more PCIe lanes active, potentially a larger PSU with lower efficiency at low load). Choose a single powerful GPU unless you specifically need the additional VRAM capacity.

How much does CPU-only LLM inference cost in electricity versus GPU?

CPU-only inference is slower but draws less peak power. A Ryzen 7 system running Llama 3.1 8B on CPU pulls about 140W at full load and generates approximately 8 tokens/second. The same system with an RTX 4060 Ti pulls 260W but generates 40 tokens/second. Per-token energy: the CPU uses about 17.5 J per token (roughly 4.9 kWh per million tokens) versus the GPU system at 6.5 J per token (roughly 1.8 kWh per million tokens). The GPU is roughly 2.7x more energy-efficient per token. CPU-only makes sense only when the GPU hardware cost cannot be justified; there are no electricity savings from using the CPU once you account for the much longer processing time.
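Those per-token figures follow directly from power draw and throughput (joules per token = watts / tokens per second):

```python
# Energy per token and per million tokens, from the measurements above.
def joules_per_token(watts: float, tok_per_s: float) -> float:
    return watts / tok_per_s

def kwh_per_million_tokens(watts: float, tok_per_s: float) -> float:
    # 1 kWh = 3.6 MJ, so j J/token costs j/3.6 kWh per million tokens
    return joules_per_token(watts, tok_per_s) * 1e6 / 3.6e6

cpu = kwh_per_million_tokens(140, 8)    # ~4.9 kWh per million tokens
gpu = kwh_per_million_tokens(260, 40)   # ~1.8 kWh per million tokens
print(f"CPU {cpu:.2f} kWh/Mtok, GPU {gpu:.2f} kWh/Mtok, ratio {cpu/gpu:.1f}x")
```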
