Module 6 · RAG · Deep Dive

Building a RAG Pipeline

This is the lesson where DocChat becomes real. We assemble the whole loop — load a PDF, chunk it, embed it, store the vectors, then retrieve and ground an LLM answer in them.

BasicIntermediateBuild

Why this matters A RAG pipeline is the AI feature your capstone is judged on — and the thing an interviewer will ask you to whiteboard end to end. The good news: there is no magic. RAG is three plain stages glued together — ingest, retrieve, generate. Build it once with your own hands and you'll be able to explain every arrow on the diagram, which is exactly what "can you build a document Q&A feature?" really tests.
In this lesson
  1. The shape of a RAG pipeline
  2. Ingestion: PDF → chunks → vectors
  3. Retrieval: question → top-k chunks
  4. Generation: grounded prompt → answer
  5. Returning cited sources
  6. Wiring it into FastAPI
  7. Pitfalls & guardrails
  8. Build: DocChat's /ask end to end
  9. Check yourself

1 · The shape of it

RAG — Retrieval-Augmented Generation — answers a question by first fetching relevant text from your documents, then asking an LLM to answer using only that text. It splits cleanly into two phases that run at different times:

# INGESTION — runs once, when a document is uploaded
PDF → extract text → chunk → embed → store vectors in pgvector

# QUERY — runs every time a user asks a question
question → embed → top-k search → build prompt → call LLM → answer + sources

Everything below is just filling in those two lines with correct Python. Keep the diagram in your head — interviewers love when you narrate the flow before touching code.

PHP bridge: ingestion is a one-off batch job (think a cron import script); the query path is an ordinary request/response handler — the same split you already know from web work.

2 · Ingestion

The job of ingestion is to turn a messy PDF into rows in your chunks table: each row a short slice of text plus its embedding vector and some metadata. Four small steps.

step 1 Load the PDF & extract text

Use pypdf — the standard, current PDF library. Read every page and concatenate the text:

from pypdf import PdfReader

def extract_text(path: str) -> str:
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages)
Real PDFs are messy extract_text() returns None on image-only pages — hence the or "". Scanned documents have no text layer at all and need OCR (out of scope for DocChat v1). Always assume the text is imperfect.

step 2 Chunk it (with overlap)

You can't embed a whole book as one vector — embedding models have a token limit, and a giant chunk dilutes meaning. Split the text into overlapping windows of a few hundred words. Overlap matters: it stops a sentence that straddles a boundary from losing its context.

def chunk_text(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        window = words[start:start + size]
        chunks.append(" ".join(window))
        start += size - overlap   # step forward, but leave an overlap
    return chunks

Sizes are a tuning knob, not a law. ~500–1000 words with ~10–20% overlap is a sane 2026 starting point; you adjust once you can measure answer quality.

step 3 Embed each chunk

An embedding turns text into a list of floats (a vector) where similar meanings sit close together. You call an embedding API and get back one vector per chunk. Keep it provider-neutral — wrap the call so swapping providers is a one-line change:

# Provider-neutral wrapper. Returns a vector (list[float]) for a string.
def embed(text: str) -> list[float]:
    # e.g. OpenAI's text-embedding-3-small, or another embeddings API.
    resp = embeddings_client.create(model=EMBED_MODEL, input=text)
    return resp.data[0].embedding
One model, everywhere Whatever embedding model you ingest with, you must use the same one at query time — vectors from different models aren't comparable. Pin the model name in config and never change it without re-embedding everything.

step 4 Store vectors + metadata in pgvector

pgvector is a Postgres extension that adds a vector column type and similarity search — so your embeddings live right next to your relational data, no separate vector database to run. The chunks table:

schema.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id      bigserial PRIMARY KEY,
    doc_id  uuid NOT NULL,           -- which document this came from
    text    text NOT NULL,           -- the chunk's text (returned as a citation)
    embedding vector(1536)            -- vector dimension = your model's output size
);
def store_chunks(doc_id: str, chunks: list[str]) -> None:
    for chunk in chunks:
        vector = embed(chunk)
        db.execute(
            "INSERT INTO chunks (doc_id, text, embedding) VALUES (%s, %s, %s)",
            (doc_id, chunk, vector),
        )

That's the whole ingestion pipeline. Glue the four functions together and you turn a PDF into searchable knowledge:

def ingest(path: str, doc_id: str) -> None:
    text   = extract_text(path)
    chunks = chunk_text(text)
    store_chunks(doc_id, chunks)   # embeds + inserts each chunk

3 · Retrieval

Now the query path. A question comes in. Embed it with the same model, then ask pgvector for the chunks whose vectors are closest — this is the "retrieval" in RAG.

pgvector's <=> operator is cosine distance (smaller = more similar). ORDER BY ... LIMIT k gives you the top-k nearest chunks:

def retrieve(question: str, k: int = 5) -> list[dict]:
    q_vector = embed(question)                 # SAME embed() as ingestion
    rows = db.execute(
        """
        SELECT doc_id, text
        FROM chunks
        ORDER BY embedding <=> %s    -- cosine distance, nearest first
        LIMIT %s
        """,
        (q_vector, k),
    )
    return rows   # list of {doc_id, text}

Then assemble the retrieved chunks into one context string — this is the evidence you'll hand the LLM. Number them so you can cite them later:

def build_context(rows: list[dict]) -> str:
    return "\n\n".join(
        f"[{i}] {row['text']}" for i, row in enumerate(rows, 1)
    )
PHP bridge: this is just a SELECT ... ORDER BY ... LIMIT — the only new thing is the distance operator. If you can write a "find nearest" query, you can write retrieval.

4 · Generation

The final stage: hand the question and the retrieved context to an LLM, with a strict instruction — answer only from this context; if it's not there, say you don't know. This grounding instruction is what separates RAG from "an LLM guessing".

def build_prompt(question: str, context: str) -> str:
    return f"""You are DocChat. Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know based on these documents."
Cite the chunk numbers you used, like [1] or [3].

Context:
{context}

Question: {question}
Answer:"""

Then call a model. Keep it provider-neutral — the prompt above works with any modern LLM. Current 2026 options: Anthropic Claude (Opus 4.8 / Sonnet 4.6 / Haiku 4.5) via the Anthropic API, or OpenAI. Here's the concrete Anthropic example:

import anthropic
client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def generate(question: str, context: str) -> str:
    msg = client.messages.create(
        model="claude-opus-4-8",        # or claude-sonnet-4-6 for cheaper/faster
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(question, context)}],
    )
    return msg.content[0].text
Picking a model (2026, interview-aware) Default to Sonnet 4.6 for most DocChat traffic — strong quality, good price. Reach for Opus 4.8 when answers must be airtight, and Haiku 4.5 for high-volume or latency-sensitive paths. Being able to say "I'd default to Sonnet and escalate to Opus for hard queries" shows real judgment in an interview.

5 · Return the answer and its sources

A RAG feature that just returns prose is half-built. Return the cited source chunks too — that's what makes the answer trustworthy and lets users verify it. You already retrieved them; just pass them back alongside the answer:

def answer_question(question: str) -> dict:
    rows    = retrieve(question, k=5)
    context = build_context(rows)
    answer  = generate(question, context)
    return {
        "answer": answer,
        "sources": [{"doc_id": r["doc_id"], "text": r["text"]} for r in rows],
    }

That single function is RAG. Everything else is plumbing around it.

6 · Wiring it into FastAPI

Expose the query path as one endpoint: POST /ask takes a question, returns an answer plus sources. The Next.js frontend POSTs to it and renders the result.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    question: str

class Source(BaseModel):
    doc_id: str
    text: str

class AskResponse(BaseModel):
    answer: str
    sources: list[Source]

@app.post("/ask", response_model=AskResponse)
def ask(req: AskRequest) -> AskResponse:
    result = answer_question(req.question)
    return AskResponse(**result)
PHP bridge: a route that takes JSON in and returns JSON out — identical in spirit to a PHP controller action. Pydantic just validates the shapes for you, for free.

On the Next.js side, a form POSTs the question to /ask and renders answer with the sources listed under it — the user reads the answer and can click through to the exact chunks it came from.

7 · Pitfalls & guardrails

RAG fails in predictable ways. Knowing them — and the guardrail for each — is what separates "I followed a tutorial" from "I can ship this".

FailureWhat happensGuardrail
HallucinationModel answers from its own knowledge, not your docs.The "answer ONLY from context" instruction; refuse when context is empty.
Bad chunksRetrieval returns irrelevant text; the answer is built on noise.Tune chunk size/overlap; raise k; inspect what was retrieved.
Prompt injectionA document says "ignore your instructions" and the model obeys.Treat retrieved text as data, never as commands; keep instructions in the system role.
No evaluationYou ship blind and don't know if answers are grounded.Check: does the answer cite real, retrieved context? Log Q + retrieved chunks + answer.
.iv — "How do you reduce hallucination in RAG?" A crisp answer wins points: (1) instruct the model to answer only from the provided context and to say "I don't know" otherwise; (2) return cited source chunks so answers are verifiable; (3) improve retrieval quality (chunking, overlap, top-k) so the right context is actually present; (4) evaluate by checking whether answers are grounded in the retrieved text. Hallucination is usually a retrieval problem before it's a generation problem — say that.

8 · Build it

Your tangible win Implement DocChat's full POST /ask endpoint end to end — embed the question, run the top-k cosine query, build a grounded prompt, call the LLM, and return the answer plus cited sources. When you can POST a question and get back a grounded, cited answer, your capstone has its headline feature.

Assemble everything from this lesson into one file. (Assume documents are already ingested — that ran when they were uploaded.)

ask.py
import anthropic
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = anthropic.Anthropic()

# ---- retrieval ----
def retrieve(question: str, k: int = 5) -> list[dict]:
    q_vector = embed(question)
    return db.execute(
        "SELECT doc_id, text FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (q_vector, k),
    )

# ---- prompt ----
def build_prompt(question: str, rows: list[dict]) -> str:
    context = "\n\n".join(f"[{i}] {r['text']}" for i, r in enumerate(rows, 1))
    return f"""Answer using ONLY the context. If the answer is not there,
say "I don't know based on these documents." Cite chunk numbers like [1].

Context:
{context}

Question: {question}
Answer:"""

# ---- endpoint ----
class AskRequest(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    rows = retrieve(req.question)
    msg = llm.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(req.question, rows)}],
    )
    return {
        "answer": msg.content[0].text,
        "sources": [{"doc_id": r["doc_id"], "text": r["text"]} for r in rows],
    }

Run it, POST {"question": "..."} to /ask, and read the grounded answer with its sources. That is the entire feature an interviewer means when they ask "build a document Q&A".

9 · Check yourself

Answer from memory — narrating the pipeline out loud is exactly the interview skill you're building.

Recall quiz

What does the ingestion phase finally produce?

Why add overlap between adjacent chunks?

The query embedding must use which model?

What is the single biggest anti-hallucination rule?

What should POST /ask return to the frontend?

Primary source ⭐ pgvector RAG on managed Postgres (2026) — a current, end-to-end walkthrough of exactly the pgvector setup above. For the generation side, the Anthropic docs cover the Messages API, model IDs, and prompting used in this lesson.