Module 6 · RAG · Drills
You read the pipeline. Now build each stage with your own hands — chunk, embed-and-store, retrieve, prompt, endpoint — before you reveal a solution. Typing it is what makes you able to whiteboard it.
Pseudocode is fine (assume embed() and db.execute() exist); what matters is that the shape is right. Then click “Show solution” to compare. Tick each box as you go; progress is saved in this browser.
Drill 1 chunking
Write chunk_text(text, size, overlap) that splits text into word windows of size words, stepping forward by size - overlap each time so adjacent chunks share context.
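def chunk_text(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    # Guard: if overlap >= size the step is zero or negative and the loop never ends.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap  # overlap keeps boundary context
    return chunks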
The step is size - overlap, not size — that's the whole trick. Step by size and you get no overlap.
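To see the step arithmetic on something you can read at a glance, here is a small sketch that runs the same windowing on ten one-letter "words". The values size=4 and overlap=2 are toy numbers picked for illustration, not the defaults above, and the overlap guard is an added safety check.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into word windows of `size` words, stepping by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap  # step of 2 here, so neighbours share 2 words
    return chunks


# Toy input: ten single-letter "words".
demo = chunk_text("a b c d e f g h i j", size=4, overlap=2)
for chunk in demo:
    print(chunk)
# a b c d
# c d e f
# e f g h
# g h i j
# i j
```

Each window starts two words after the previous one, so the last two words of one chunk are the first two of the next. Note the final window: "i j" is shorter than size and already fully covered by the chunk before it. That is harmless for retrieval, but you may choose to drop such a tail.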
Drill 2 embed + store
Write store_chunks(doc_id, chunks) (pseudo) that embeds each chunk and inserts doc_id, the chunk text, and its embedding into the pgvector chunks table.
def store_chunks(doc_id: str, chunks: list[str]) -> None:
    for chunk in chunks:
        vector = embed(chunk)  # pseudo: embedding API call
        db.execute(
            "INSERT INTO chunks (doc_id, text, embedding) VALUES (%s, %s, %s)",
            (doc_id, chunk, vector),
        )
Store the text too, not just the vector — you return it later as a citation. A vector alone is useless to a human reader.
Drill 3 retrieve
Write retrieve(question, k) that embeds the question and returns the top-k nearest chunks by cosine distance. Remember to use the same embed() as ingestion.
def retrieve(question: str, k: int = 5) -> list[dict]:
    q_vector = embed(question)  # SAME model as ingestion
    return db.execute(
        """
        SELECT doc_id, text
        FROM chunks
        ORDER BY embedding <=> %s  -- cosine distance, nearest first
        LIMIT %s
        """,
        (q_vector, k),
    )
<=> is pgvector's cosine-distance operator; smaller means more similar, so ORDER BY ... LIMIT k gives the nearest k.
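If the operator feels like magic, it helps to compute the same thing by hand once. The sketch below is an in-memory stand-in, not pgvector itself: cosine distance is 1 minus cosine similarity, and sorting by it then slicing mirrors ORDER BY ... LIMIT k. The helper names and the 2-D vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], rows: list[dict], k: int) -> list[dict]:
    """In-memory stand-in for ORDER BY embedding <=> query LIMIT k."""
    return sorted(rows, key=lambda r: cosine_distance(query, r["embedding"]))[:k]


# Hypothetical 2-D embeddings, invented for this example.
rows = [
    {"text": "refund policy", "embedding": [1.0, 0.0]},
    {"text": "shipping times", "embedding": [0.0, 1.0]},
    {"text": "return window", "embedding": [0.9, 0.1]},
]
nearest = top_k([1.0, 0.0], rows, k=2)
print([r["text"] for r in nearest])  # ['refund policy', 'return window']
```

The orthogonal vector ("shipping times") has distance 1 and falls outside the top 2, which is exactly the behaviour you rely on in the SQL.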
Drill 4 grounded prompt
Write a prompt template that injects numbered context and instructs the model to answer only from it — refusing with "I don't know" when the answer is absent — and to cite chunk numbers.
def build_prompt(question: str, rows: list[dict]) -> str:
    context = "\n\n".join(
        f"[{i}] {r['text']}" for i, r in enumerate(rows, 1)
    )
    return f"""Answer using ONLY the context below. If the answer is not in the
context, say "I don't know based on these documents."
Cite the chunk numbers you used, like [1] or [3].

Context:
{context}

Question: {question}
Answer:"""
The refusal clause is the load-bearing line — without it the model fills gaps from its own knowledge, which is hallucination.
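It is worth printing the rendered prompt once so you know exactly what the model sees. The sketch below rebuilds the same template with plain string concatenation and feeds it two sample rows; the row texts and the question are invented for illustration.

```python
def build_prompt(question: str, rows: list[dict]) -> str:
    # Number chunks from 1 so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i}] {r['text']}" for i, r in enumerate(rows, 1))
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        'context, say "I don\'t know based on these documents." '
        "Cite the chunk numbers you used, like [1] or [3].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up rows standing in for what retrieve() would return.
rows = [
    {"doc_id": "handbook", "text": "Refunds are issued within 14 days."},
    {"doc_id": "handbook", "text": "Shipping is free over 50 dollars."},
]
prompt = build_prompt("How long do refunds take?", rows)
print(prompt)
```

The numbers in the context block are the only link between the model's citations and your rows, so keep the enumeration order identical to the order you return as sources.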
Drill 5 /ask sketch
Sketch the FastAPI POST /ask endpoint: take a question, retrieve, build the prompt, call the LLM (provider-neutral — Anthropic shown), return answer + sources.
import anthropic
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = anthropic.Anthropic()  # reads the API key from the environment


class AskRequest(BaseModel):
    question: str


@app.post("/ask")
def ask(req: AskRequest) -> dict:
    rows = retrieve(req.question)  # Drill 3
    msg = llm.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(req.question, rows)}],  # Drill 4
    )
    return {
        "answer": msg.content[0].text,
        "sources": [{"doc_id": r["doc_id"], "text": r["text"]} for r in rows],
    }
Always return sources alongside answer — that's what makes the answer verifiable and the feature trustworthy.
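One optional refinement, sketched here as an assumption rather than part of the drill's solution: since the prompt asks for citations like [1], you can parse them out of the answer and flag which retrieved chunks the model actually used. The helper name cited_sources is hypothetical, and it only works if the model follows the [n] format.

```python
import re


def cited_sources(answer: str, rows: list[dict]) -> list[dict]:
    """Return only the rows whose 1-based chunk number appears as [n] in the answer."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [
        {"chunk": i, "doc_id": r["doc_id"], "text": r["text"]}
        for i, r in enumerate(rows, 1)
        if i in cited  # numbers outside 1..len(rows) are ignored
    ]


# Made-up rows and answer for illustration.
rows = [
    {"doc_id": "handbook", "text": "Refunds are issued within 14 days."},
    {"doc_id": "handbook", "text": "Support is open on weekdays."},
    {"doc_id": "pricing", "text": "Shipping is free over 50 dollars."},
]
answer = "Refunds take 14 days [1]. Shipping is free over 50 dollars [3]."
used = cited_sources(answer, rows)
print([s["chunk"] for s in used])  # [1, 3]
```

Returning every retrieved row is still the safe default; a filtered list like this is a convenience for the UI, not a replacement for the full sources.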
Wire the drills together into two pieces: ingest(path, doc_id), which turns a PDF into stored chunks, and POST /ask, which answers a question grounded in them with cited sources. This is the headline feature of your capstone.
Build · DocChat end to end
Assume embed() and db.execute() exist. Write ingestion and the ask endpoint together.
import anthropic
from pypdf import PdfReader
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = anthropic.Anthropic()


# ---------- INGESTION (runs on upload) ----------
def chunk_text(text, size=800, overlap=150):
    words, chunks, start = text.split(), [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap
    return chunks


def ingest(path: str, doc_id: str) -> None:
    text = "\n".join(p.extract_text() or "" for p in PdfReader(path).pages)
    for chunk in chunk_text(text):
        db.execute(
            "INSERT INTO chunks (doc_id, text, embedding) VALUES (%s, %s, %s)",
            (doc_id, chunk, embed(chunk)),
        )


# ---------- QUERY (runs on every question) ----------
def retrieve(question, k=5):
    return db.execute(
        "SELECT doc_id, text FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (embed(question), k),
    )


def build_prompt(question, rows):
    context = "\n\n".join(f"[{i}] {r['text']}" for i, r in enumerate(rows, 1))
    return f"""Answer using ONLY the context. If it is not there, say
"I don't know based on these documents." Cite chunks like [1].

Context:
{context}

Question: {question}
Answer:"""


class AskRequest(BaseModel):
    question: str


@app.post("/ask")
def ask(req: AskRequest) -> dict:
    rows = retrieve(req.question)
    msg = llm.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(req.question, rows)}],
    )
    return {
        "answer": msg.content[0].text,
        "sources": [{"doc_id": r["doc_id"], "text": r["text"]} for r in rows],
    }
Notice the two phases run at different times: ingest() on upload, /ask on every question. They meet only through the chunks table. That clean split is the whole architecture — be able to draw it.
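If you want to feel that split without a database or an API key, the sketch below swaps every external piece for a toy: a Python list plays the chunks table, and embed() is a word-count vector instead of a real embedding model. All of these stand-ins are assumptions made for illustration. What carries over is the shape: ingestion only writes to the table, the query side only reads from it.

```python
import math
from collections import Counter

CHUNKS: list[dict] = []  # stand-in for the pgvector chunks table


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system calls an embedding model."""
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return 1.0 - dot / norm if norm else 1.0


# ---------- INGESTION: only writes to the table ----------
def ingest(doc_id: str, chunks: list[str]) -> None:
    for chunk in chunks:
        CHUNKS.append({"doc_id": doc_id, "text": chunk, "embedding": embed(chunk)})


# ---------- QUERY: only reads from the table ----------
def retrieve(question: str, k: int = 1) -> list[dict]:
    q = embed(question)  # same embed() as ingestion
    return sorted(CHUNKS, key=lambda r: cosine_distance(q, r["embedding"]))[:k]


ingest("handbook", ["refunds are issued within 14 days", "shipping is free over 50 dollars"])
top = retrieve("when are refunds issued")
print(top[0]["doc_id"], "|", top[0]["text"])
```

Neither function calls the other. Replace the list with Postgres and the toy embed() with a real model and the diagram stays the same.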
Click a card to flip it. Say the answer out loud before you flip — that's the rep that builds storage strength.
k nearest chunks by cosine distance (ORDER BY embedding <=> q LIMIT k).
Tick each only if you can do it without looking:
chunks table
/ask endpoint returning answer + cited sources