Module 6 · RAG · Drills

Drills: Building RAG

You read the pipeline. Now build each stage with your own hands — chunk, embed-and-store, retrieve, prompt, endpoint — before you reveal a solution. Typing it is what makes you able to whiteboard it.

How to use this page Each drill rebuilds one stage of DocChat's RAG loop. Write it first — the embedding and DB calls can stay as pseudo (assume embed() and db.execute() exist); what matters is that the shape is right. Then click “Show solution” to compare. Tick each box as you go; progress is saved in this browser.

A · Ingestion stages Basic

Drill 1 chunking

Write chunk_text(text, size, overlap) that splits text into word windows of size words, stepping forward by size - overlap each time so adjacent chunks share context.

Show solution

def chunk_text(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap   # overlap keeps boundary context
    return chunks

The step is size - overlap, not size — that's the whole trick. Step by size and you get no overlap.

Drill 2 embed + store

Write store_chunks(doc_id, chunks) (pseudo) that embeds each chunk and inserts doc_id, the chunk text, and its embedding into the pgvector chunks table.

Show solution

def store_chunks(doc_id: str, chunks: list[str]) -> None:
    for chunk in chunks:
        vector = embed(chunk)               # pseudo: embedding API call
        db.execute(
            "INSERT INTO chunks (doc_id, text, embedding) VALUES (%s, %s, %s)",
            (doc_id, chunk, vector),
        )

Store the text too, not just the vector — you return it later as a citation. A vector alone is useless to a human reader.

B · The query path Intermediate

Drill 3 retrieve

Write retrieve(question, k) that embeds the question and returns the top-k nearest chunks by cosine distance. Remember the same embed() as ingestion.

Show solution

def retrieve(question: str, k: int = 5) -> list[dict]:
    q_vector = embed(question)              # SAME model as ingestion
    return db.execute(
        """
        SELECT doc_id, text
        FROM chunks
        ORDER BY embedding <=> %s    -- cosine distance, nearest first
        LIMIT %s
        """,
        (q_vector, k),
    )

<=> is pgvector's cosine-distance operator; smaller means more similar, so ORDER BY ... LIMIT k gives the nearest k.

Drill 4 grounded prompt

Write a prompt template that injects numbered context and instructs the model to answer only from it — refusing with "I don't know" when the answer is absent — and to cite chunk numbers.

Show solution

def build_prompt(question: str, rows: list[dict]) -> str:
    context = "\n\n".join(
        f"[{i}] {r['text']}" for i, r in enumerate(rows, 1)
    )
    return f"""Answer using ONLY the context below.
If the answer is not in the context, say
"I don't know based on these documents."
Cite the chunk numbers you used, like [1] or [3].

Context:
{context}

Question: {question}
Answer:"""

The refusal clause is the load-bearing line — without it the model fills gaps from its own knowledge, which is hallucination.

Drill 5 /ask sketch

Sketch the FastAPI POST /ask endpoint: take a question, retrieve, build the prompt, call the LLM (provider-neutral — Anthropic shown), return answer + sources.

Show solution

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    rows = retrieve(req.question)
    msg = llm.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": build_prompt(req.question, rows)}],
    )
    return {
        "answer": msg.content[0].text,
        "sources": [{"doc_id": r["doc_id"], "text": r["text"]}
                    for r in rows],
    }

Always return sources alongside answer — that's what makes the answer verifiable and the feature trustworthy.

C · Build challenge Build

Mini-project Assemble the complete ingestion + ask flow for DocChat in one file: ingest(path, doc_id) that turns a PDF into stored chunks, and POST /ask that answers a question grounded in them with cited sources. This is the headline feature of your capstone.

Build · DocChat end to end

Assume embed() and db.execute() exist. Write ingestion and the ask endpoint together.

Show solution

import anthropic
from pypdf import PdfReader
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = anthropic.Anthropic()

# ---------- INGESTION (runs on upload) ----------
def chunk_text(text, size=800, overlap=150):
    words, chunks, start = text.split(), [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap
    return chunks

def ingest(path: str, doc_id: str) -> None:
    text = "\n".join(p.extract_text() or "" for p in PdfReader(path).pages)
    for chunk in chunk_text(text):
        db.execute(
            "INSERT INTO chunks (doc_id, text, embedding) VALUES (%s, %s, %s)",
            (doc_id, chunk, embed(chunk)),
        )

# ---------- QUERY (runs on every question) ----------
def retrieve(question, k=5):
    return db.execute(
        "SELECT doc_id, text FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (embed(question), k),
    )

def build_prompt(question, rows):
    context = "\n\n".join(f"[{i}] {r['text']}" for i, r in enumerate(rows, 1))
    return f"""Answer using ONLY the context. If it is not there, say
"I don't know based on these documents." Cite chunks like [1].

Context:
{context}

Question: {question}
Answer:"""

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    rows = retrieve(req.question)
    msg = llm.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(req.question, rows)}],
    )
    return {
        "answer": msg.content[0].text,
        "sources": [{"doc_id": r["doc_id"], "text": r["text"]} for r in rows],
    }

Notice the two phases run at different times: ingest() on upload, /ask on every question. They meet only through the chunks table. That clean split is the whole architecture — be able to draw it.

D · Rapid recall Flashcards

Click a card to flip it. Say the answer out loud before you flip — that's the rep that builds storage strength.

The three steps of ingestion?

Extract text → chunk (with overlap) → embed → store vectors in pgvector.

click to flip

What goes into the generation prompt?

The user's question + the retrieved context chunks + the "answer only from context" instruction.

click to flip

Why "answer only from context"?

It stops the model filling gaps from its own knowledge — the core anti-hallucination guardrail.

click to flip

What is top-k retrieval?

Embed the question, return the k nearest chunks by cosine distance (ORDER BY embedding <=> q LIMIT k).

click to flip

Why cite source chunks?

Makes the answer verifiable and trustworthy — the user can check exactly where it came from.

click to flip

What is prompt injection here?

A document's text trying to override instructions. Treat retrieved text as data, never as commands.

click to flip

E · Self-check before moving on

Tick each only if you can do it without looking:

I can write a chunking function with overlap and explain why overlap matters
I can write the embed-and-store step into a pgvector chunks table
I can write the top-k cosine retrieval query
I can write a grounded prompt that refuses when context is missing
I built the full /ask endpoint returning answer + cited sources
I can name the main RAG pitfalls and a guardrail for each

Next All ticked? DocChat now has its AI brain. Next we make it shippable — package it, containerise it, and get it running the same everywhere: Module 7 — Git, Docker & Environments.