The debate around Ollama vs llama.cpp is one of the most common questions in the local LLM community, and it is also one of the most misunderstood. Ollama is not an alternative inference engine — it is a wrapper built on top of llama.cpp. Every token Ollama generates passes through llama.cpp's core inference code. Understanding that relationship is the key to choosing between them, because the real question is not which engine is faster but whether the convenience layer Ollama adds is worth the trade-offs for your particular workflow on Linux.
This article breaks down the architecture, walks through building both from source with CUDA on Linux, measures the actual performance overhead, compares features side by side, and gives concrete guidance on when each tool is the right choice.
Ollama and llama.cpp: The Wrapper vs the Engine
llama.cpp is a C/C++ inference runtime created by Georgi Gerganov. It implements transformer inference from scratch, with hand-tuned SIMD kernels for CPU and tight CUDA, ROCm, Metal, and Vulkan integrations for GPU offload. It reads models in the GGUF format — a single-file binary format that bundles weights, tokenizer data, and model metadata together.
Ollama is a Go application that embeds llama.cpp as a library. When you run ollama run llama3, Ollama downloads a GGUF file from its registry, loads it through llama.cpp's C API, and exposes the result through a friendly CLI and an OpenAI-compatible REST endpoint. It also manages model storage, handles Modelfile-based customization (system prompts, temperature defaults, stop tokens), and provides a systemd service for always-on inference.
Think of the relationship like this: llama.cpp is the database engine, and Ollama is a managed database service built on that engine. The service adds convenience — automatic configuration, a model registry, an API layer — but it also constrains what you can do with the engine underneath.
What Ollama adds on top of llama.cpp
- Model registry and pull system — ollama pull handles downloading quantized models from a curated library, including size variants and quantization levels.
- Modelfile abstraction — A Dockerfile-like syntax for bundling a base model with a system prompt, parameter overrides, and adapter layers.
- Automatic GPU detection — Ollama detects NVIDIA or AMD GPUs and configures layer offloading without manual flags.
- Systemd integration — Installs as a service, restarts on failure, manages a single inference process.
- OpenAI-compatible API — /api/chat, /api/generate, and /v1/chat/completions endpoints that most LLM client libraries already support.
- Concurrent request queuing — Serializes requests to a loaded model so multiple clients can share one GPU without crashing.
What Ollama removes or hides
- Direct control over thread count, batch size, context rope scaling, and dozens of other inference parameters.
- Speculative decoding with a draft model.
- Grammar-constrained generation (GBNF grammars).
- LoRA adapter hot-swapping at runtime.
- Embedding-only mode with custom pooling strategies.
- Fine-grained control over KV cache quantization.
The features Ollama hides are not bugs — they are deliberate simplifications for the 90% use case. But if you fall into the other 10%, you need llama.cpp directly.
Install and Build Both on Linux
Installing Ollama
Ollama provides a one-line installer that works across most Linux distributions:
curl -fsSL https://ollama.com/install.sh | sh
This installs the binary to /usr/local/bin/ollama, creates a systemd service, and sets up a dedicated ollama user. Verify it is running:
systemctl status ollama
ollama --version
Pull a model and test inference:
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M "Explain Linux cgroups in two sentences."
If you have an NVIDIA GPU with drivers already installed, Ollama detects it automatically. No extra flags needed.
Building llama.cpp from source with CUDA
Building llama.cpp from source gives you access to every feature and the latest optimizations. Here is a complete build guide for Linux with CUDA support.
Prerequisites:
# Install build tools (Debian/Ubuntu)
sudo apt update
sudo apt install -y build-essential cmake git pkg-config
# Install CUDA toolkit (if not already present)
# Verify with: nvcc --version
# If missing, install from NVIDIA's repo:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit
For Fedora/RHEL-family systems:
# Install build tools (Fedora/RHEL)
sudo dnf groupinstall -y "Development Tools"
sudo dnf install -y cmake git
# CUDA toolkit via NVIDIA repo
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf install -y cuda-toolkit
Clone and build:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Create build directory
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
# Build with all available cores
cmake --build build --config Release -j$(nproc)
The key binaries land in build/bin/:
# List compiled binaries
ls build/bin/
# Key ones:
# llama-cli - Interactive CLI chat
# llama-server - HTTP API server
# llama-bench - Benchmarking tool
# llama-quantize - Model quantization
# llama-perplexity - Perplexity measurement
Verify CUDA offload is working:
./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello" -n 20 2>&1 | grep "CUDA"
The -ngl 99 flag means "offload all layers to GPU." You should see log lines confirming CUDA device detection and layer offloading.
Building for CPU only (AVX2/AVX-512)
If you do not have a GPU, llama.cpp's CPU backend is still remarkably fast thanks to SIMD optimizations:
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
The build system auto-detects your CPU's SIMD capabilities (SSE3, AVX, AVX2, AVX-512, AMX) and enables the appropriate kernels.
Getting a GGUF model
Both tools use GGUF models. You can download them from Hugging Face:
# Using huggingface-cli
pip install huggingface_hub
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models/
# Or using wget directly
wget -P models/ https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
Ollama stores its models in ~/.ollama/models/ (or /usr/share/ollama/.ollama/models/ when running as a service). You can also import a local GGUF file into Ollama:
# Create a Modelfile pointing to your GGUF
echo 'FROM ./models/llama-2-7b-chat.Q4_K_M.gguf' > Modelfile
ollama create my-local-model -f Modelfile
Performance: Does the Wrapper Add Overhead?
This is the question everyone asks, and the answer is nuanced. Ollama's Go layer handles HTTP parsing, request queuing, and model lifecycle management. The actual matrix multiplications and attention computations still happen inside llama.cpp's compiled C/C++/CUDA kernels. So the overhead is real but narrowly scoped: it appears in request handling latency, not in per-token generation speed.
Benchmark methodology
To measure this properly, you need to test the same model, same quantization, same hardware, and same prompt. Here is how to run a fair comparison:
# Test with llama.cpp directly (llama-bench)
./build/bin/llama-bench -m models/llama-3.1-8b-instruct-q4_K_M.gguf \
-ngl 99 -t 8 -n 512 -p 256
# Test with Ollama (use its API and measure)
# First, ensure the same model is loaded:
ollama pull llama3.1:8b-instruct-q4_K_M
# Benchmark with curl and measure timings:
time curl -s http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b-instruct-q4_K_M",
"prompt": "Write a 500-word essay about Linux kernel development.",
"stream": false
}' | python3 -c "import sys,json; d=json.load(sys.stdin); print(f\"Tokens: {d['eval_count']}, Duration: {d['eval_duration']/1e9:.2f}s, Speed: {d['eval_count']*1e9/d['eval_duration']:.1f} t/s\")"
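If you are scripting many comparison runs, the arithmetic in that one-liner is easier to maintain as a small helper. The sketch below follows Ollama's documented /api/generate response fields (eval_count, eval_duration, prompt_eval_count, prompt_eval_duration, load_duration — all durations in nanoseconds); the sample response is hardcoded for illustration rather than fetched from a live server:

```python
import json

def generate_stats(response_json: str) -> dict:
    """Throughput figures from an Ollama /api/generate response.

    All *_duration fields are reported in nanoseconds.
    """
    d = json.loads(response_json)
    eval_s = d["eval_duration"] / 1e9           # generation wall time
    prompt_s = d["prompt_eval_duration"] / 1e9  # prompt processing wall time
    return {
        "gen_tokens_per_s": d["eval_count"] / eval_s,
        "prompt_tokens_per_s": d["prompt_eval_count"] / prompt_s,
        "time_to_first_token_s": d["load_duration"] / 1e9 + prompt_s,
    }

# Hardcoded sample response for illustration (not from a live server)
sample = json.dumps({
    "eval_count": 512, "eval_duration": 4_790_000_000,
    "prompt_eval_count": 256, "prompt_eval_duration": 61_000_000,
    "load_duration": 1_000_000,
})
stats = generate_stats(sample)
print(f"{stats['gen_tokens_per_s']:.1f} t/s")  # → 106.9 t/s
```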
Benchmark results
The following table shows results from a controlled test using Llama 3.1 8B Instruct (Q4_K_M quantization) on a system with an AMD Ryzen 9 7950X and NVIDIA RTX 4090 (24 GB VRAM), running Ubuntu 24.04 with CUDA 12.6 and driver 560.x. All layers offloaded to GPU.
| Metric | llama.cpp (llama-bench) | Ollama API | Delta |
|---|---|---|---|
| Prompt processing (pp256) | 4,218 t/s | 4,195 t/s | -0.5% |
| Token generation (tg512) | 108.3 t/s | 106.9 t/s | -1.3% |
| Time to first token | 62 ms | 89 ms | +27 ms |
| Cold start (model load) | 1.8 s | 2.4 s | +0.6 s |
| VRAM usage | 5.2 GB | 5.3 GB | +100 MB |
| Concurrent 4-request throughput | N/A (single request) | 98.2 t/s per stream | — |
Key takeaways from the numbers:
- Sustained generation speed is nearly identical. The 1-2% difference is within run-to-run variance. Once tokens are flowing, you are running the same CUDA kernels either way.
- Time to first token is higher with Ollama because of HTTP parsing, JSON deserialization, and Go runtime overhead. The 27 ms difference is imperceptible for interactive chat but could matter in latency-sensitive pipelines processing thousands of short prompts.
- Cold start is slower with Ollama due to model registry lookup and metadata validation. Once the model is loaded into VRAM, subsequent requests skip this cost entirely.
- VRAM overhead is minimal. The extra ~100 MB comes from Ollama's Go process and its KV cache pre-allocation strategy.
The bottom line: for single-user interactive use, the performance difference between Ollama and raw llama.cpp is not meaningful. You will not feel it. The gap widens in high-concurrency API serving or latency-critical pipelines, but even then it is modest.
Feature Comparison Table
Beyond performance, features are where the two diverge significantly. This table covers the capabilities that matter most for Linux users running local inference:
| Feature | Ollama | llama.cpp |
|---|---|---|
| Install method | One-line script | Build from source |
| Model format | GGUF (via registry or import) | GGUF (direct file path) |
| Model registry | Built-in pull/push | Manual download from HF |
| GPU offload | Automatic detection | Manual -ngl flag |
| Multi-GPU | Automatic splitting | Manual --tensor-split |
| Speculative decoding | Not supported | Full support (--model-draft) |
| Grammar constraints (GBNF) | Not supported | Full support (--grammar-file) |
| LoRA adapters | Via Modelfile only (baked in) | Hot-swap at runtime (--lora) |
| Context size | Default 2048, configurable | Fully configurable (-c) |
| KV cache quantization | Not exposed | -ctk q8_0 -ctv q8_0 |
| Batch size control | Not exposed | -b and -ub flags |
| Embedding generation | Basic (/api/embeddings) | Full control, pooling options |
| REST API | Built-in, OpenAI-compatible | llama-server (OpenAI-compatible) |
| Concurrent requests | Built-in queuing | llama-server with -np slots |
| Systemd service | Auto-installed | Manual setup |
| Model quantization | Not supported (use pre-quantized) | llama-quantize tool |
| Perplexity testing | Not supported | llama-perplexity tool |
| Vision models (multimodal) | Supported (LLaVA, etc.) | Supported (llama-llava-cli) |
Advanced llama.cpp Features Ollama Doesn't Expose
Speculative decoding
Speculative decoding uses a smaller "draft" model to predict multiple tokens ahead, then verifies them with the full model in a single forward pass. When the draft model guesses correctly (which happens often for predictable text), you get 2-3x speedup on token generation.
# Use a small draft model to accelerate a large one
./build/bin/llama-cli \
-m models/llama-3.1-70b-q4_K_M.gguf \
--model-draft models/llama-3.1-8b-q4_K_M.gguf \
--draft-max 8 \
-ngl 99 \
-p "Explain the Linux boot process in detail." \
-n 1024
The --draft-max 8 flag tells the engine to speculatively generate up to 8 tokens before verification. This feature is especially valuable for large models (70B+) where each forward pass is expensive and the draft model's predictions frequently match.
Ollama has no way to configure this. You cannot specify a draft model or control the speculation depth.
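A back-of-envelope model shows where the 2-3x figure comes from. The sketch below is a deliberate simplification (real acceptance rates vary by position and text), not llama.cpp's actual scheduling logic:

```python
def speculative_speedup(accept_rate: float, draft_max: int, draft_cost: float) -> float:
    """Back-of-envelope speedup estimate for speculative decoding.

    accept_rate: chance each successive drafted token is accepted
    draft_max:   tokens drafted per verification pass (--draft-max)
    draft_cost:  cost of one draft forward pass relative to one target pass
    """
    # Expected tokens per target-model pass: the target always emits one
    # token itself, and drafted token k survives only if every earlier
    # drafted token was also accepted (a truncated geometric series).
    expected_tokens = 1 + sum(accept_rate ** k for k in range(1, draft_max + 1))
    cost_per_pass = 1 + draft_max * draft_cost  # one target pass + draft passes
    return expected_tokens / cost_per_pass

# 70B target with an 8B draft: a draft pass costs roughly 8/70 of a target pass
print(f"{speculative_speedup(0.8, 8, 8 / 70):.2f}x")  # → 2.26x
```

The model also shows the failure mode: with a poorly matched draft (low acceptance rate), speculation can be slower than plain decoding, which is why draft-model choice matters.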
Grammar-constrained generation (GBNF)
GBNF grammars force the model's output to conform to a formal grammar — valid JSON, SQL, YAML, or any structure you define. This is not prompt engineering; it is hard token-level constraint enforcement during sampling.
# Force output to valid JSON
./build/bin/llama-cli \
-m models/llama-3.1-8b-instruct-q4_K_M.gguf \
--grammar-file grammars/json.gbnf \
-ngl 99 \
-p "List 5 Linux distributions with name and year fields." \
-n 512
A sample GBNF grammar for JSON arrays:
# json_array.gbnf
root ::= "[" ws items ws "]"
items ::= item ("," ws item)*
item ::= "{" ws pair ("," ws pair)* ws "}"
pair ::= string ws ":" ws value
string ::= "\"" [a-zA-Z0-9_ ]+ "\""
value ::= string | number
number ::= [0-9]+
ws ::= [ \t\n]*
This guarantees structurally valid output regardless of the model's tendency to hallucinate malformed JSON. For building reliable LLM pipelines on Linux — data extraction, config generation, API response formatting — grammar constraints are essential. Ollama offers JSON mode as a simplified alternative, but it does not support arbitrary GBNF grammars.
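Because the grammar guarantees well-formed output, downstream parsing needs no defensive error handling. A quick illustration with a hand-written sample shaped the way the grammar above enforces:

```python
import json

# A sample completion shaped the way the grammar above would enforce
output = '[{"name": "Debian", "year": 1993}, {"name": "Arch", "year": 2002}]'

distros = json.loads(output)  # parses cleanly whenever the grammar held
assert all({"name", "year"} <= set(d) for d in distros)
print(distros[0]["name"])  # → Debian
```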
LoRA adapter hot-swapping
llama.cpp can load LoRA adapters at startup and — with llama-server — swap them between requests without reloading the base model:
# Load base model with a LoRA adapter
./build/bin/llama-cli \
-m models/llama-3.1-8b-instruct-q4_K_M.gguf \
--lora adapters/linux-sysadmin-lora.gguf \
--lora-scale 0.8 \
-ngl 99 \
-p "How do I configure SELinux in enforcing mode?" \
-n 256
With llama-server, you can even apply multiple LoRA adapters simultaneously with different scaling factors, blending domain-specific fine-tunes on the fly. This is critical for serving multiple specialized variants from one base model without multiplying VRAM costs.
Ollama supports LoRA only through Modelfiles — the adapter is baked into the model definition at creation time. You cannot swap adapters at runtime or adjust scaling factors without recreating the model.
KV cache quantization
For long-context workloads, the KV cache can consume more VRAM than the model weights themselves. llama.cpp lets you quantize the cache:
# Quantize KV cache to save VRAM on long contexts
./build/bin/llama-cli \
-m models/llama-3.1-8b-instruct-q4_K_M.gguf \
-c 32768 \
-ctk q8_0 \
-ctv q8_0 \
-ngl 99 \
-p "Summarize this long document..." \
-n 1024
Quantizing the KV cache from f16 to q8_0 roughly halves its memory footprint with minimal quality loss. Going to q4_0 saves even more but may degrade output on tasks requiring precise attention over long ranges. Ollama does not expose these parameters.
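The savings are easy to estimate from the model's attention geometry. The sketch below uses Llama 3.1 8B's published dimensions (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and the standard K-plus-V accounting, ignoring q8_0's small per-block overhead:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> int:
    """KV cache size: one K and one V vector per layer, KV head, and position."""
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem)

# Llama 3.1 8B attention geometry (from the model card)
f16 = kv_cache_bytes(32, 8, 128, 32768, 2)  # f16: 2 bytes per element
q8 = kv_cache_bytes(32, 8, 128, 32768, 1)   # q8_0: ~1 byte per element
print(f"f16: {f16 / 2**30:.1f} GiB, q8_0: {q8 / 2**30:.1f} GiB")  # → f16: 4.0 GiB, q8_0: 2.0 GiB
```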
Server Mode: llama-server vs Ollama API
Both tools can serve an HTTP API, but they differ in architecture and capability.
Ollama's server
Ollama runs as a systemd service listening on port 11434 by default. Its API is simple and well-documented:
# Generate completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b-instruct-q4_K_M",
"prompt": "What is systemd?",
"stream": true
}'
# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions -d '{
"model": "llama3.1:8b-instruct-q4_K_M",
"messages": [{"role": "user", "content": "What is systemd?"}]
}'
# List loaded models
curl http://localhost:11434/api/tags
Ollama handles model loading and unloading automatically. If you request a model that is not in memory, it loads it. Models are unloaded after a configurable idle timeout (default: 5 minutes). This is convenient but means the first request after idle incurs a cold-start penalty.
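The idle timeout is controlled by the OLLAMA_KEEP_ALIVE environment variable (or a keep_alive field on individual API requests). On a systemd install, one way to keep models resident is a drop-in override — the path below is systemd's standard override location; adjust it if your unit is named differently:

```ini
# /etc/systemd/system/ollama.service.d/keepalive.conf
[Service]
# -1 keeps loaded models in memory indefinitely; durations like "30m" also work
Environment="OLLAMA_KEEP_ALIVE=-1"
```

Apply it with sudo systemctl daemon-reload followed by sudo systemctl restart ollama.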
llama-server
llama-server is llama.cpp's built-in HTTP server. It gives you full control over every inference parameter:
# Start llama-server with full configuration
./build/bin/llama-server \
-m models/llama-3.1-8b-instruct-q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 8192 \
-b 2048 \
-ub 512 \
-np 4 \
-t 8 \
--metrics \
--api-key "your-secret-key"
Key flags explained:
- -np 4 — Number of parallel request slots. Each slot maintains its own KV cache, so 4 slots means 4 concurrent conversations.
- -b 2048 — Logical batch size for prompt processing.
- -ub 512 — Physical batch size (actual tokens processed per GPU kernel launch).
- --metrics — Expose a Prometheus-compatible /metrics endpoint for monitoring.
- --api-key — Require authentication for all requests.
llama-server also exposes an OpenAI-compatible API at /v1/chat/completions and /v1/completions, so most client libraries work without modification. But it also supports features Ollama does not:
# Request with grammar constraint via API
curl http://localhost:8080/v1/chat/completions -d '{
"messages": [{"role": "user", "content": "List 3 Linux distros as JSON."}],
"grammar": "root ::= \"{\" ... \"}\""
}'
# Request with specific sampling parameters
curl http://localhost:8080/v1/chat/completions -d '{
"messages": [{"role": "user", "content": "Explain cgroups."}],
"temperature": 0.1,
"top_k": 20,
"top_p": 0.9,
"min_p": 0.05,
"repeat_penalty": 1.1
}'
Creating a systemd service for llama-server
Since llama-server does not install a service automatically, here is how to create one:
sudo tee /etc/systemd/system/llama-server.service <<EOF
[Unit]
Description=llama.cpp Inference Server
After=network.target
[Service]
Type=simple
User=llama
Group=llama
ExecStart=/opt/llama.cpp/build/bin/llama-server \
-m /opt/models/llama-3.1-8b-instruct-q4_K_M.gguf \
--host 127.0.0.1 \
--port 8080 \
-ngl 99 \
-c 8192 \
-np 4 \
--api-key-file /etc/llama-server/api.key
Restart=on-failure
RestartSec=5
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
Prometheus monitoring with llama-server
The --metrics flag exposes a /metrics endpoint that Prometheus can scrape directly. Metrics include tokens per second, request queue depth, KV cache utilization, and slot occupancy — production-grade observability that Ollama does not offer natively.
# Test metrics endpoint
curl -s http://localhost:8080/metrics | head -20
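A scrape target is easy to sanity-check by parsing the exposition format directly. A minimal sketch — the metric names in the sample are illustrative, since exact names vary between llama-server versions (check your own /metrics output), and this naive parser ignores label sets:

```python
def parse_prometheus(text: str) -> dict:
    """Parse simple Prometheus text-format lines into {metric: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative sample in the exposition format, not real server output
sample = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
llamacpp:prompt_tokens_total 24576
llamacpp:tokens_predicted_total 8192
"""
m = parse_prometheus(sample)
print(m["llamacpp:prompt_tokens_total"])  # → 24576.0
```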
When to Use Ollama vs When to Use llama.cpp Directly
After all the benchmarks and feature comparisons, the practical decision usually comes down to three factors: your technical comfort level, your use case complexity, and whether you need features Ollama does not expose.
Use Ollama when:
- You want local inference running in under five minutes. The install-pull-run workflow is unmatched. No compilation, no flag tuning, no GGUF hunting.
- You are building a personal assistant or chatbot. The API is clean, well-documented, and compatible with Open WebUI, LibreChat, and most LLM front-ends.
- You are prototyping or evaluating models. Pulling and swapping models with ollama pull and ollama run is far faster than downloading GGUFs and writing launch scripts.
- You want a single always-on inference service. The systemd service with auto-load/unload is genuine convenience for a workstation or home server.
- You are not a C++ developer and do not want to manage build toolchains. This is perfectly valid. The wrapper exists precisely for this audience.
Use llama.cpp directly when:
- You need speculative decoding for large models. If you are running 70B+ parameters and want 2x generation speed, speculative decoding is the single biggest performance lever, and Ollama cannot do it.
- You need grammar-constrained output. Any production pipeline that requires guaranteed-valid JSON, SQL, or structured data needs GBNF support.
- You are serving multiple fine-tunes from one base model. LoRA hot-swapping saves enormous VRAM compared to running separate model instances.
- You need Prometheus-native monitoring. For production deployments with SLO requirements, llama-server's /metrics endpoint integrates directly into your existing observability stack.
- You need to tune batch sizes, KV cache quantization, or context rope scaling. Memory-constrained environments (e.g., running 32K context on 8 GB VRAM) require fine-grained parameter control.
- You are benchmarking or researching inference performance. llama-bench gives precise, reproducible measurements that Ollama's API layer does not expose.
- You want to quantize your own models. llama-quantize lets you create custom quantizations from full-precision GGUF or safetensors source files.
The hybrid approach
Many Linux users run both. Ollama handles the daily driver — a chat model accessible from a browser UI or terminal. llama.cpp handles specialized tasks: batch processing with grammar constraints, LoRA experimentation, or benchmarking new models before importing them into Ollama. The two tools do not conflict. You can run Ollama on port 11434 and llama-server on port 8080 simultaneously, even sharing the same GPU if VRAM allows.
# Run both simultaneously — Ollama on default port, llama-server on 8080
systemctl start ollama
./build/bin/llama-server -m models/specialized-model.gguf --port 8080 -ngl 30
Just be mindful of VRAM allocation. If Ollama has a model loaded and you start llama-server with full GPU offload, you will get an out-of-memory error. Use nvidia-smi to check available VRAM before launching the second process, or use partial offload (-ngl 20 instead of -ngl 99) to share GPU memory between them.
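If you want a starting point rather than trial and error, a rough heuristic is to divide the VRAM left after overhead by the model's per-layer size. The helper below is a hypothetical sketch that treats all layers as equal in size, which is only approximately true; treat the result as a first guess, not a guarantee:

```python
def suggest_ngl(free_vram_gib: float, model_gib: float, n_layers: int,
                overhead_gib: float = 1.0) -> int:
    """Rough -ngl suggestion: fit as many equally sized layers as the
    remaining VRAM allows, reserving overhead for the CUDA context
    and KV cache. Assumes uniform layer sizes (an approximation)."""
    usable = free_vram_gib - overhead_gib
    if usable <= 0:
        return 0
    per_layer = model_gib / n_layers
    return min(n_layers, int(usable / per_layer))

# Example: 8 GiB free, a 5.2 GiB model with 32 layers → all layers fit
print(suggest_ngl(8.0, 5.2, 32))  # → 32
```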
FAQ
Is Ollama just llama.cpp with a GUI?
Not exactly. Ollama is not a GUI — it is a CLI tool and HTTP API server written in Go that embeds llama.cpp as its inference backend. There is no graphical interface. What Ollama adds is a model registry (pull/push), Modelfile-based configuration, automatic GPU detection, and a systemd service. Think of it as a managed runtime layer on top of llama.cpp's raw inference engine. If you want an actual GUI, pair either tool with Open WebUI or SillyTavern.
Can I use my own GGUF models with Ollama?
Yes. Create a file called Modelfile with the line FROM /absolute/path/to/your-model.gguf, then run ollama create my-model -f Modelfile. Ollama copies the weights into its registry and makes the model available through ollama run my-model. You can also add system prompts, parameter defaults, and LoRA adapters in the Modelfile. Note that Ollama only supports GGUF format — you cannot load safetensors or PyTorch checkpoints directly.
Does llama.cpp support AMD GPUs?
Yes. llama.cpp supports AMD GPUs through the ROCm backend. Build with -DGGML_HIP=ON instead of -DGGML_CUDA=ON. You need ROCm 5.7+ installed. Performance varies by GPU generation — RDNA 3 cards (RX 7900 XTX) perform well, while older RDNA 2 cards have lower throughput. Ollama also supports AMD GPUs with ROCm, but its auto-detection is less reliable than for NVIDIA cards, and you may need to set HSA_OVERRIDE_GFX_VERSION for some models.
How much VRAM do I need to run a 7B/13B/70B model?
Approximate VRAM requirements for full GPU offload with Q4_K_M quantization: a 7B model needs roughly 4.5-5 GB, a 13B model needs about 8-9 GB, and a 70B model needs 38-42 GB. These figures apply equally to Ollama and llama.cpp since they use the same engine. You can use partial offload (fewer layers on GPU, rest on CPU) to run models that exceed your VRAM, at the cost of slower generation. Context length also matters — a 32K context adds 2-4 GB of KV cache depending on the model architecture.
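Those figures follow from simple arithmetic on parameter count and bits per weight. A sketch — the ~4.8 bits/weight average for Q4_K_M is an approximation (the exact quant mix varies per tensor), and runtime usage adds KV cache and compute buffers on top of the weights:

```python
def weight_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-VRAM size of quantized weights alone."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Q4_K_M averages roughly 4.8 bits/weight across tensor types (approximate)
for params in (7, 13, 70):
    print(f"{params}B @ Q4_K_M ≈ {weight_gib(params, 4.8):.1f} GiB weights")
```

The gap between these weight-only numbers and the totals quoted above is the KV cache plus per-backend compute buffers.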
Which is better for a production API: Ollama or llama-server?
For a serious production deployment with SLO requirements, llama-server is the better choice. It offers Prometheus metrics, configurable parallel request slots, API key authentication, precise control over batch sizes and memory allocation, and no automatic model unloading that could cause unexpected cold starts. Ollama is designed for developer experience, not production reliability. That said, if your "production" is an internal tool with a handful of users and you value simplicity, Ollama's built-in service management and auto-configuration can be the pragmatic choice. Know your actual requirements before over-engineering the solution.