Build a Self-Hosted RAG Pipeline on Linux: Chat with Your Documentation

Large language models are impressive at generating text, but they have a fundamental limitation: they only know what was in their training data. Ask an LLM about your internal documentation, your company's runbook procedures, or your infrastructure's specific configuration, and it will either admit ignorance or — worse — hallucinate a plausible-sounding answer that is completely wrong. Retrieval-Augmented Generation (RAG) solves this by giving the LLM access to your actual documents at query time. Instead of relying on memorized training data, the model retrieves relevant passages from your documentation and uses them as context for generating its response.

A self-hosted RAG pipeline keeps everything on your network. Your documentation never leaves your infrastructure. The embeddings (vector representations of your documents) are stored in a local database. The LLM runs on your own GPU. The result is an AI assistant that can answer questions about your specific environment — your Ansible playbooks, your architecture decisions, your incident postmortems, your internal wikis — with citations pointing to the source documents.

This guide builds the complete pipeline from scratch: document ingestion from multiple sources (Markdown, PDF, HTML, plain text), text chunking that preserves context, embedding generation using a local model, vector storage with ChromaDB, retrieval-augmented query processing with Ollama, and a command-line interface for interactive use, plus a systemd unit for running the pipeline as a service. Every component runs locally on Linux. For the foundational setup, see our complete Ollama installation guide. For GPU driver setup, see our NVIDIA driver and CUDA installation guide.

Architecture and Data Flow

# RAG Pipeline Architecture
#
# Document Ingestion:
#   Files (MD, PDF, TXT, HTML) --> Chunker --> Embedding Model --> ChromaDB
#
# Query Processing:
#   User Question --> Embedding Model --> ChromaDB (similarity search)
#                                              |
#                                     Top-K relevant chunks
#                                              |
#                                     Prompt Assembly:
#                                     "Context: [chunks]
#                                      Question: [user question]"
#                                              |
#                                        Ollama (LLM)
#                                              |
#                                     Answer with citations

Installing the Components

Ollama (LLM and Embedding Model)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama

# Pull a capable chat model
ollama pull llama3.1:8b

# Pull a dedicated embedding model
# nomic-embed-text is small (about 137M parameters, a ~274 MB download) and produces quality embeddings
ollama pull nomic-embed-text

# Verify both models are available
ollama list

ChromaDB (Vector Database)

# Create a virtual environment for the RAG pipeline
python3 -m venv /opt/rag-pipeline
source /opt/rag-pipeline/bin/activate

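# Install ChromaDB and the HTTP client used to talk to Ollama
pip install chromadb requests

# Install document processing libraries (used by the PDF and HTML loaders below)
pip install pymupdf beautifulsoup4

# Optional: not used by the code in this guide, but useful for extensions
# (local embedding models, .docx files, Markdown rendering, token counting)
pip install sentence-transformers python-docx markdown tiktoken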

# Verify ChromaDB works
python3 -c "import chromadb; client = chromadb.PersistentClient(path='/tmp/test_chroma'); print('ChromaDB OK')"

Document Ingestion Pipeline

The ingestion pipeline reads documents from various formats, splits them into chunks, generates embeddings, and stores everything in ChromaDB. The chunking strategy is critical — chunks that are too large dilute the relevance signal, while chunks that are too small lose context.

Document Loader

#!/usr/bin/env python3
"""document_loader.py — Load documents from various formats."""

from pathlib import Path
from typing import List, Dict
import json

def load_markdown(file_path: Path) -> str:
    """Load a Markdown file and return raw text."""
    return file_path.read_text(encoding="utf-8")

def load_pdf(file_path: Path) -> str:
    """Extract text from a PDF file using PyMuPDF."""
    import fitz  # pymupdf
    doc = fitz.open(str(file_path))
    text = ""
    for page in doc:
        text += page.get_text() + "\n"
    doc.close()
    return text

def load_html(file_path: Path) -> str:
    """Extract text from an HTML file."""
    from bs4 import BeautifulSoup
    html = file_path.read_text(encoding="utf-8")
    soup = BeautifulSoup(html, "html.parser")
    # Remove script, style, navigation, and footer elements
    for element in soup(["script", "style", "nav", "footer"]):
        element.decompose()
    return soup.get_text(separator="\n", strip=True)

def load_text(file_path: Path) -> str:
    """Load a plain text file."""
    return file_path.read_text(encoding="utf-8")

LOADERS = {
    ".md": load_markdown,
    ".markdown": load_markdown,
    ".pdf": load_pdf,
    ".html": load_html,
    ".htm": load_html,
    ".txt": load_text,
    ".text": load_text,
    ".rst": load_text,
    ".yaml": load_text,
    ".yml": load_text,
    ".conf": load_text,
    ".cfg": load_text,
    ".json": load_text,
}

def load_document(file_path: Path) -> Dict:
    """Load a document and return its content with metadata."""
    suffix = file_path.suffix.lower()
    loader = LOADERS.get(suffix)
    if loader is None:
        raise ValueError(f"Unsupported file type: {suffix}")

    content = loader(file_path)
    return {
        "content": content,
        "metadata": {
            "source": str(file_path),
            "filename": file_path.name,
            "filetype": suffix,
            "size_bytes": file_path.stat().st_size,
        }
    }

def load_directory(dir_path: Path, recursive: bool = True) -> List[Dict]:
    """Load all supported documents from a directory."""
    documents = []
    pattern = "**/*" if recursive else "*"
    for file_path in sorted(dir_path.glob(pattern)):
        if file_path.is_file() and file_path.suffix.lower() in LOADERS:
            try:
                doc = load_document(file_path)
                documents.append(doc)
                print(f"  Loaded: {file_path.name} ({len(doc['content'])} chars)")
            except Exception as e:
                print(f"  Error loading {file_path.name}: {e}")
    return documents

Text Chunking

#!/usr/bin/env python3
"""chunker.py — Split documents into overlapping chunks."""

from typing import List, Dict, Optional
import re

def chunk_text(
    text: str,
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    metadata: Optional[Dict] = None
) -> List[Dict]:
    """Split text into overlapping chunks, respecting sentence boundaries.

    chunk_size and chunk_overlap are measured in whitespace-separated
    words, not model tokens.
    """

    # Clean up whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r' {2,}', ' ', text)

    # Split into sentences (approximate)
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        sentence_length = len(sentence.split())

        if current_length + sentence_length > chunk_size and current_chunk:
            # Save current chunk
            chunk_text_joined = " ".join(current_chunk)
            chunk_metadata = dict(metadata) if metadata else {}
            chunk_metadata["chunk_index"] = len(chunks)
            chunk_metadata["chunk_size"] = current_length

            chunks.append({
                "text": chunk_text_joined,
                "metadata": chunk_metadata,
            })

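            # Keep trailing sentences as overlap for the next chunk.
            # Stop before index 0: carrying over the whole chunk (for example
            # after one very long sentence) would make later chunks repeat
            # it and grow without bound.
            overlap_words = 0
            overlap_start = len(current_chunk)
            for i in range(len(current_chunk) - 1, 0, -1):
                overlap_words += len(current_chunk[i].split())
                if overlap_words >= chunk_overlap:
                    overlap_start = i
                    break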

            current_chunk = current_chunk[overlap_start:]
            current_length = sum(len(s.split()) for s in current_chunk)

        current_chunk.append(sentence)
        current_length += sentence_length

    # Save the last chunk
    if current_chunk:
        chunk_text_joined = " ".join(current_chunk)
        chunk_metadata = dict(metadata) if metadata else {}
        chunk_metadata["chunk_index"] = len(chunks)
        chunk_metadata["chunk_size"] = current_length
        chunks.append({
            "text": chunk_text_joined,
            "metadata": chunk_metadata,
        })

    return chunks
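
The chunker above splits purely on sentence boundaries, so a chunk can straddle two unrelated sections. For Markdown sources, one possible refinement is to split on headings first and chunk each section separately. The sketch below is an illustration, not part of the original pipeline: split_markdown_sections and the "heading" key are hypothetical names, and it handles only ATX-style headings (lines starting with #).

```python
import re
from typing import Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


def split_markdown_sections(text: str) -> List[Dict]:
    """Split Markdown into sections, one per heading.

    Each section records its heading trail (e.g. "Backups > Retention"),
    which can be stored as chunk metadata or prefixed to the chunk text.
    """
    sections: List[Dict] = []
    trail: List[str] = []   # current heading path, outermost first
    body: List[str] = []    # lines collected for the current section
    in_fence = False        # True while inside a fenced code block

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append({"heading": " > ".join(trail), "text": content})
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with '#' inside code fences are comments, not headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del trail[level - 1:]        # drop headings at this level or deeper
            trail.append(match.group(2))
        else:
            body.append(line)
    flush()
    return sections
```

Each section's text could then be passed to chunk_text with the heading added to the metadata, so that retrieved chunks carry their section title along with the filename.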

Embedding Generation and Storage

Embeddings are vector representations of text that capture semantic meaning. Similar texts produce similar vectors, which is how the retrieval step finds relevant passages. Ollama can generate embeddings using dedicated embedding models.
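
To make "similar vectors" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up toy values (real embedding models produce hundreds of dimensions), and the function names are illustrative. The vector store in this guide is configured for cosine distance, which is 1 minus this similarity, which is why the query engine later converts a distance back with 1 - dist.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 0.0 for identical directions, larger for less similar."""
    return 1.0 - cosine_similarity(a, b)


# Toy "embeddings": two texts about backups and one about networking
backup_doc = [0.9, 0.1, 0.0]
backup_query = [0.8, 0.2, 0.1]
network_doc = [0.0, 0.2, 0.9]

print("query vs backup doc: ", cosine_similarity(backup_query, backup_doc))
print("query vs network doc:", cosine_similarity(backup_query, network_doc))
```

The query scores much higher against the backup document than against the networking one, and that ranking is all the retrieval step relies on.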

Embedding with Ollama

#!/usr/bin/env python3
"""embedder.py — Generate embeddings using Ollama."""

import requests
from typing import List

OLLAMA_API = "http://localhost:11434/api/embeddings"
EMBED_MODEL = "nomic-embed-text"

def get_embedding(text: str, model: str = EMBED_MODEL) -> List[float]:
    """Generate an embedding vector for a text using Ollama."""
    response = requests.post(OLLAMA_API, json={
        "model": model,
        "prompt": text,
    }, timeout=30)

    if response.status_code == 200:
        return response.json()["embedding"]
    raise RuntimeError(f"Embedding failed: {response.status_code}")

def get_embeddings_batch(texts: List[str], model: str = EMBED_MODEL) -> List[List[float]]:
    """Generate embeddings for multiple texts."""
    embeddings = []
    for i, text in enumerate(texts):
        if i % 50 == 0 and i > 0:
            print(f"  Embedded {i}/{len(texts)} chunks...")
        embeddings.append(get_embedding(text, model))
    return embeddings

ChromaDB Vector Store

#!/usr/bin/env python3
"""vector_store.py — ChromaDB vector storage and retrieval."""

import chromadb
from typing import List, Dict, Optional

class VectorStore:
    def __init__(self, persist_dir: str = "/var/lib/rag/chromadb"):
        self.client = chromadb.PersistentClient(path=persist_dir)

    def get_or_create_collection(self, name: str = "documents"):
        """Get or create a ChromaDB collection."""
        return self.client.get_or_create_collection(
            name=name,
            metadata={"hnsw:space": "cosine"}  # Cosine similarity
        )

    def add_documents(self, chunks: List[Dict], embeddings: List[List[float]],
                      collection_name: str = "documents"):
        """Add document chunks with embeddings to the vector store."""
        collection = self.get_or_create_collection(collection_name)

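        # Use random unique IDs. Count-based IDs such as chunk_<count + i>
        # collide with existing chunks once anything has been deleted.
        import uuid
        ids = [f"chunk_{uuid.uuid4().hex}" for _ in chunks]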
        documents = [chunk["text"] for chunk in chunks]
        metadatas = [chunk["metadata"] for chunk in chunks]

        collection.add(
            ids=ids,
            documents=documents,
            embeddings=embeddings,
            metadatas=metadatas,
        )
        print(f"  Added {len(chunks)} chunks to collection '{collection_name}'")
        print(f"  Total documents in collection: {collection.count()}")

    def query(self, query_embedding: List[float], n_results: int = 5,
              collection_name: str = "documents",
              where: Optional[Dict] = None) -> Dict:
        """Query the vector store for similar documents."""
        collection = self.get_or_create_collection(collection_name)
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results,
            where=where,
            include=["documents", "metadatas", "distances"],
        )
        return results

    def get_stats(self, collection_name: str = "documents") -> Dict:
        """Get collection statistics."""
        collection = self.get_or_create_collection(collection_name)
        return {
            "collection": collection_name,
            "document_count": collection.count(),
        }
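
The query method above accepts a where filter but the guide never uses it. As a rough sketch, the helper below builds a filter over the metadata fields the loader records (filetype and filename). build_where is a hypothetical helper, and it assumes a ChromaDB version that supports the $in, $eq, and $and operators.

```python
from typing import Dict, List, Optional


def build_where(filetypes: Optional[List[str]] = None,
                filename: Optional[str] = None) -> Optional[Dict]:
    """Build a ChromaDB metadata filter, or None when no filter is needed."""
    clauses: List[Dict] = []
    if filetypes:
        clauses.append({"filetype": {"$in": list(filetypes)}})
    if filename:
        clauses.append({"filename": {"$eq": filename}})

    if not clauses:
        return None           # no restriction: search the whole collection
    if len(clauses) == 1:
        return clauses[0]     # $and needs at least two expressions
    return {"$and": clauses}


# Restrict retrieval to Markdown and reStructuredText chunks
print(build_where(filetypes=[".md", ".rst"]))
```

The result can be passed directly as the where argument, for example store.query(question_embedding, where=build_where(filetypes=[".md"])).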

The RAG Query Engine

The query engine ties everything together: it takes a user question, finds relevant document chunks, builds a prompt with context, and sends it to the LLM for answer generation.

#!/usr/bin/env python3
"""rag_engine.py — Retrieval-Augmented Generation query engine."""

import requests
from typing import List, Dict, Optional
from embedder import get_embedding
from vector_store import VectorStore

OLLAMA_API = "http://localhost:11434/api/generate"
CHAT_MODEL = "llama3.1:8b"

class RAGEngine:
    def __init__(self, persist_dir: str = "/var/lib/rag/chromadb",
                 chat_model: str = CHAT_MODEL):
        self.store = VectorStore(persist_dir)
        self.chat_model = chat_model

    def query(self, question: str, n_context: int = 5,
              collection: str = "documents") -> Dict:
        """Answer a question using RAG."""

        # Step 1: Embed the question
        question_embedding = get_embedding(question)

        # Step 2: Retrieve relevant chunks
        results = self.store.query(
            query_embedding=question_embedding,
            n_results=n_context,
            collection_name=collection,
        )

        documents = results["documents"][0] if results["documents"] else []
        metadatas = results["metadatas"][0] if results["metadatas"] else []
        distances = results["distances"][0] if results["distances"] else []

        # Step 3: Build the context-aware prompt
        context_parts = []
        sources = []
        for i, (doc, meta, dist) in enumerate(zip(documents, metadatas, distances)):
            source = meta.get("source", "unknown")
            context_parts.append(f"[Source {i+1}: {meta.get('filename', 'unknown')}]\n{doc}")
            sources.append({
                "file": source,
                "filename": meta.get("filename", "unknown"),
                "chunk_index": meta.get("chunk_index", 0),
                "relevance": 1 - dist,  # Convert distance to similarity
            })

        context = "\n\n".join(context_parts)

        prompt = (
            "You are a helpful assistant that answers questions based on the "
            "provided documentation context. Use ONLY the information from the "
            "context below to answer. If the context does not contain enough "
            "information to answer the question, say so clearly. "
            "Always cite the source number [Source N] when using information "
            "from a specific document.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\n\n"
            "Answer:"
        )

        # Step 4: Generate the answer
        response = requests.post(OLLAMA_API, json={
            "model": self.chat_model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.3,
                "num_predict": 2048,
            }
        }, timeout=120)

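        # Surface HTTP errors (for example a missing model) instead of parsing an error body
        response.raise_for_status()
        answer = response.json().get("response", "Failed to generate response.")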

        return {
            "question": question,
            "answer": answer,
            "sources": sources,
            "context_chunks": len(documents),
        }
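
The engine above always passes the top n_context chunks to the model, even when some of them are only weakly related to the question. One possible refinement is to drop low-similarity chunks and cap the total context size before building the prompt. The sketch below operates on the result dictionary returned by VectorStore.query; select_context is a hypothetical helper, and the 0.3 similarity threshold and 1500-word budget are arbitrary starting points that would need tuning on your own documents.

```python
from typing import Dict, List


def select_context(results: Dict, min_similarity: float = 0.3,
                   max_words: int = 1500) -> List[Dict]:
    """Keep only sufficiently similar chunks that fit a word budget."""
    documents = (results.get("documents") or [[]])[0]
    metadatas = (results.get("metadatas") or [[]])[0]
    distances = (results.get("distances") or [[]])[0]

    selected: List[Dict] = []
    used_words = 0
    for doc, meta, dist in zip(documents, metadatas, distances):
        similarity = 1 - dist  # cosine distance back to similarity
        if similarity < min_similarity:
            continue
        words = len(doc.split())
        # Always keep the first qualifying chunk, then respect the budget
        if selected and used_words + words > max_words:
            break
        selected.append({"text": doc, "metadata": meta, "similarity": similarity})
        used_words += words
    return selected
```

If the function returns an empty list, the engine could skip generation and report that nothing relevant was found, rather than asking the model to answer from unrelated context.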

Complete Pipeline: Ingest and Query

#!/usr/bin/env python3
"""rag_cli.py — Command-line interface for the RAG pipeline."""

import argparse
import json
import sys
from pathlib import Path

from document_loader import load_directory, load_document
from chunker import chunk_text
from embedder import get_embeddings_batch
from vector_store import VectorStore
from rag_engine import RAGEngine

def cmd_ingest(args):
    """Ingest documents into the vector store."""
    source = Path(args.source)
    store = VectorStore(args.db_path)

    if source.is_dir():
        print(f"Loading documents from {source}...")
        documents = load_directory(source, recursive=args.recursive)
    elif source.is_file():
        print(f"Loading {source}...")
        documents = [load_document(source)]
    else:
        print(f"Error: {source} not found", file=sys.stderr)
        sys.exit(1)

    print(f"Loaded {len(documents)} documents")

    # Chunk all documents
    all_chunks = []
    for doc in documents:
        chunks = chunk_text(
            doc["content"],
            chunk_size=args.chunk_size,
            chunk_overlap=args.overlap,
            metadata=doc["metadata"],
        )
        all_chunks.extend(chunks)

    print(f"Created {len(all_chunks)} chunks")

    # Generate embeddings
    print("Generating embeddings...")
    texts = [c["text"] for c in all_chunks]
    embeddings = get_embeddings_batch(texts)

    # Store in ChromaDB
    store.add_documents(all_chunks, embeddings, args.collection)
    stats = store.get_stats(args.collection)
    print(f"Done. Collection now has {stats['document_count']} chunks.")

def cmd_query(args):
    """Query the RAG pipeline."""
    engine = RAGEngine(args.db_path, args.model)
    question = " ".join(args.question)

    result = engine.query(
        question=question,
        n_context=args.context_chunks,
        collection=args.collection,
    )

    print(f"\nQuestion: {result['question']}\n")
    print(f"Answer:\n{result['answer']}\n")
    print(f"Sources ({len(result['sources'])}):")
    for s in result["sources"]:
        print(f"  - {s['filename']} (chunk {s['chunk_index']}, "
              f"relevance: {s['relevance']:.2%})")

def cmd_stats(args):
    """Show vector store statistics."""
    store = VectorStore(args.db_path)
    stats = store.get_stats(args.collection)
    print(json.dumps(stats, indent=2))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Self-hosted RAG Pipeline")
    parser.add_argument("--db-path", default="/var/lib/rag/chromadb")
    parser.add_argument("--collection", default="documents")
    sub = parser.add_subparsers(dest="command", required=True)

    ing = sub.add_parser("ingest", help="Ingest documents")
    ing.add_argument("source", help="File or directory to ingest")
    ing.add_argument("--chunk-size", type=int, default=512)
    ing.add_argument("--overlap", type=int, default=50)
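    # BooleanOptionalAction (Python 3.9+) also adds --no-recursive;
    # store_true with default=True could never be switched off.
    ing.add_argument("--recursive", action=argparse.BooleanOptionalAction, default=True)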

    qry = sub.add_parser("query", help="Query the knowledge base")
    qry.add_argument("question", nargs="+")
    qry.add_argument("-n", "--context-chunks", type=int, default=5)
    qry.add_argument("-m", "--model", default="llama3.1:8b")

    st = sub.add_parser("stats", help="Show collection stats")

    args = parser.parse_args()
    {"ingest": cmd_ingest, "query": cmd_query, "stats": cmd_stats}[args.command](args)

Using the Pipeline

# Create the data directory
sudo mkdir -p /var/lib/rag/chromadb
sudo chown -R "$USER":"$(id -gn)" /var/lib/rag

# Ingest your documentation
python3 rag_cli.py ingest /path/to/your/documentation/ --recursive

# Ingest a specific file
python3 rag_cli.py ingest /path/to/runbook.md

# Query your knowledge base
python3 rag_cli.py query "How do I restart the production database cluster?"

python3 rag_cli.py query "What is our backup retention policy?"

python3 rag_cli.py query "What monitoring alerts exist for the payment service?"

# Check collection statistics
python3 rag_cli.py stats

Improving Retrieval Quality

The quality of RAG output depends primarily on retrieval quality — finding the right document chunks for each question. Several techniques improve retrieval beyond basic similarity search.

Hybrid Search: Combine Vector and Keyword Search

# ChromaDB supports filtering by metadata, but for true hybrid search,
# combine vector similarity with BM25 keyword matching.
# Requires: pip install rank-bm25

from rank_bm25 import BM25Okapi
from embedder import get_embedding

def hybrid_search(question, store, n_results=5):
    """Combine vector similarity with BM25 keyword search."""
    # Vector search
    question_embedding = get_embedding(question)
    vector_results = store.query(question_embedding, n_results=n_results * 2)

    # BM25 search on the same collection
    # (loads every chunk into memory, so this suits small collections)
    collection = store.get_or_create_collection()
    all_docs = collection.get(include=["documents"])
    if not all_docs["ids"]:
        return []

    tokenized_docs = [doc.lower().split() for doc in all_docs["documents"]]
    bm25 = BM25Okapi(tokenized_docs)
    bm25_scores = bm25.get_scores(question.lower().split())

    # Normalize by the top BM25 score; guard against an all-zero result
    max_bm25 = max(bm25_scores)
    if max_bm25 <= 0:
        max_bm25 = 1.0

    # Combine scores: 0.7 weight for vector similarity, 0.3 for BM25
    combined = {}
    for doc_id, dist in zip(vector_results["ids"][0], vector_results["distances"][0]):
        combined[doc_id] = 0.7 * (1 - dist)

    for doc_id, score in zip(all_docs["ids"], bm25_scores):
        combined[doc_id] = combined.get(doc_id, 0.0) + 0.3 * (score / max_bm25)

    # Return the top N results as (chunk_id, combined_score) pairs
    sorted_results = sorted(combined.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:n_results]
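
The weighted sum above depends on vector and BM25 scores being on comparable scales. A commonly used alternative, shown here as a sketch rather than as part of the original pipeline, is reciprocal rank fusion (RRF), which combines the two result lists using only each chunk's rank. The function name is illustrative, and k=60 is a widely used default, not a tuned value.

```python
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60, top_n: int = 5) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1,
    so chunks ranked highly by several retrievers rise to the top.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Chunk IDs ordered best-first by each retriever (made-up example data)
vector_ranking = ["c1", "c2", "c3"]
bm25_ranking = ["c2", "c4", "c1"]

for chunk_id, score in reciprocal_rank_fusion([vector_ranking, bm25_ranking]):
    print(chunk_id, round(score, 5))
```

Because only ranks are used, no score normalization is needed, at the cost of ignoring how far apart the underlying scores are.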

Query Rewriting for Better Retrieval

# Use the LLM to rewrite ambiguous queries before searching

import requests

OLLAMA_API = "http://localhost:11434/api/generate"

def rewrite_query(original_query, model="llama3.1:8b"):
    """Use the LLM to expand or clarify a search query."""
    prompt = (
        "Rewrite this search query to be more specific and include "
        "relevant technical terms. Output only the rewritten query, "
        "nothing else.\n\n"
        f"Original query: {original_query}\n"
        "Rewritten query:"
    )

    response = requests.post(OLLAMA_API, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.1, "num_predict": 100}
    }, timeout=60)
    response.raise_for_status()

    # Fall back to the original query if the model returns nothing usable
    rewritten = response.json().get("response", "").strip()
    return rewritten or original_query

# Example:
# "how to fix the database" might be rewritten to:
# "database troubleshooting procedures connection errors performance issues PostgreSQL MySQL"

Running as a Service

# systemd service for the RAG API
sudo tee /etc/systemd/system/rag-pipeline.service <<'EOF'
[Unit]
Description=RAG Pipeline API Service
After=network.target ollama.service

[Service]
Type=simple
User=rag
Group=rag
WorkingDirectory=/opt/rag-pipeline
ExecStart=/opt/rag-pipeline/bin/python3 rag_api.py
Restart=on-failure
RestartSec=5

Environment=RAG_DB_PATH=/var/lib/rag/chromadb
Environment=OLLAMA_API=http://localhost:11434

NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/rag
PrivateTmp=true

[Install]
WantedBy=multi-user.target
EOF

sudo useradd -r -s /usr/sbin/nologin rag
sudo chown -R rag:rag /var/lib/rag
sudo systemctl daemon-reload
sudo systemctl enable --now rag-pipeline
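
The unit above starts rag_api.py, which this guide does not list. The following is a minimal sketch of what such a file might look like, using only the Python standard library and the RAGEngine class from rag_engine.py. Everything in it is an assumption rather than the original implementation: the /query path, the JSON request shape, and the RAG_HOST and RAG_PORT variables are hypothetical. It reads RAG_DB_PATH from the unit's environment; the OLLAMA_API variable is not read, because the modules above hard-code their Ollama URLs.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable, Dict


def make_handler(answer_fn: Callable[[str], Dict]):
    """Build a request handler that answers POST /query via answer_fn."""

    class RAGHandler(BaseHTTPRequestHandler):
        def _send_json(self, status: int, payload: Dict) -> None:
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_POST(self) -> None:
            if self.path != "/query":
                self._send_json(404, {"error": "not found"})
                return
            try:
                length = int(self.headers.get("Content-Length", 0))
                question = json.loads(self.rfile.read(length))["question"]
            except (ValueError, KeyError, TypeError):
                self._send_json(400, {"error": "expected a JSON body with a 'question' field"})
                return
            self._send_json(200, answer_fn(question))

    return RAGHandler


def main() -> None:
    # Imported here so the handler above stays usable without the pipeline
    from rag_engine import RAGEngine

    engine = RAGEngine(os.environ.get("RAG_DB_PATH", "/var/lib/rag/chromadb"))
    # Hypothetical settings: RAG_HOST and RAG_PORT are not set in the unit above
    host = os.environ.get("RAG_HOST", "127.0.0.1")
    port = int(os.environ.get("RAG_PORT", "8000"))
    ThreadingHTTPServer((host, port), make_handler(engine.query)).serve_forever()


# When deploying this as rag_api.py, call main() at the bottom of the file.
```

With the service running, a request such as curl -X POST -d '{"question": "What is our backup retention policy?"}' http://127.0.0.1:8000/query would return the same dictionary that the CLI prints. The sketch binds to localhost only and has no authentication, so put it behind a reverse proxy before exposing it to other hosts.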

Figure: RAG pipeline architecture. The complete RAG pipeline, from document ingestion and chunking through embedding, retrieval, and augmented generation. Source: *An Illustrated Guide to AI Agents*

Figure: MemoryBank flow for persistent RAG knowledge. MemoryBank architecture showing how persistent knowledge is stored and retrieved in RAG systems. Source: *An Illustrated Guide to AI Agents*

The pipeline built in this guide follows the retrieval-augmented generation architecture that Grootendorst and Alammar describe in An Illustrated Guide to AI Agents. Their RAG pipeline diagram shows the complete flow: documents are chunked and embedded into a vector store, user queries are similarly embedded, relevant chunks are retrieved via similarity search, and the retrieved context is prepended to the prompt before generation. The MemoryBank architecture they describe extends this with persistent knowledge management across sessions. Running the entire pipeline locally, with Ollama for inference and a vector store such as ChromaDB (used here) or Qdrant, gives Linux administrators full control over their organization's knowledge retrieval system.

Frequently Asked Questions

How much documentation can the pipeline handle?

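As a rough guide, ChromaDB on a single server handles up to about 1 million chunks comfortably. With a chunk size of 512 (which the chunker above measures in words, not tokens), that represents roughly 500,000 pages of documentation, far more than most organizations have. Ingestion speed depends on the embedding model and the hardware: nomic-embed-text through Ollama processes about 50-100 chunks per second on a GPU. At roughly two chunks per page, a 10,000-page documentation set is about 20,000 chunks. At that rate the embedding step alone takes a few minutes; allow about 15-30 minutes for the full ingestion run, including file parsing and storage.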

Which embedding model should I use?

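For English documentation, nomic-embed-text through Ollama offers an excellent balance of quality and speed. It produces 768-dimensional vectors and runs quickly even on modest hardware. For multilingual documentation, choose a model trained on multiple languages, such as bge-m3 or snowflake-arctic-embed2; mxbai-embed-large and the original snowflake-arctic-embed are trained primarily on English text. The embedding model typically has a larger impact on retrieval quality than the chat model has on answer quality, so invest in the best embedding model your hardware can support. If you switch models, re-embed the whole collection, because vectors produced by different models are not comparable.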

How do I update documents after they change?

The simplest approach is to re-ingest changed documents. Delete the old chunks for the specific file using ChromaDB's metadata filtering (collection.delete(where={"source": "/path/to/file.md"})), then re-ingest the updated file. For large documentation sets that change frequently, implement incremental ingestion that tracks file modification times and only re-processes changed files.
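
Incremental ingestion can be sketched with a small manifest file that records what was ingested last time. The example below is one possible approach, not part of the original pipeline: it compares SHA-256 content hashes rather than modification times (a deliberate swap, since timestamps also change on copies and fresh checkouts), and find_changes, save_manifest, and the manifest location are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List, Sequence, Tuple


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changes(doc_dir: Path, manifest_path: Path,
                 suffixes: Sequence[str] = (".md", ".txt")
                 ) -> Tuple[List[Path], List[str], Dict[str, str]]:
    """Compare doc_dir against the manifest from the previous run.

    Returns (new or changed files, sources that disappeared, new manifest).
    """
    old: Dict[str, str] = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text(encoding="utf-8"))

    new: Dict[str, str] = {}
    for path in sorted(doc_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            new[str(path)] = file_digest(path)

    changed = [Path(src) for src, digest in new.items() if old.get(src) != digest]
    removed = [src for src in old if src not in new]
    return changed, removed, new


def save_manifest(manifest_path: Path, manifest: Dict[str, str]) -> None:
    """Write the manifest after a successful ingestion run."""
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

For every changed or removed source, delete its old chunks with the metadata filter shown above, re-ingest only the changed files, and save the manifest at the end. The source metadata stores the path exactly as it was passed to the loader, so use the same form of path (absolute or relative) on every run.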

Why does the RAG pipeline sometimes return irrelevant context?

Three common causes: chunk size is too large (dilutes relevance), the embedding model does not capture domain-specific terminology well, or the query is too vague. Start by reducing chunk size (try --chunk-size 256 instead of the default 512, then re-ingest). If that does not help, try query rewriting to expand abbreviated or ambiguous questions. Finally, consider whether your documents need preprocessing: tables, code blocks, and bullet lists sometimes embed poorly as raw text.

Can I use this with Open WebUI instead of the command line?

Yes. Open WebUI has built-in RAG support. You can upload documents through its web interface, and it handles chunking, embedding, and retrieval internally. It can be configured to use the Ollama embedding endpoint, and it stores vectors in its own database. The pipeline described in this article gives you more control over chunking strategies, embedding models, and retrieval algorithms, but Open WebUI provides a simpler path if you want RAG without writing code.
