Module 6 · RAG · Deep Dive
This is the lesson where DocChat becomes real. We assemble the whole loop — load a PDF, chunk it, embed it, store the vectors, then retrieve and ground an LLM answer in them.
BasicIntermediateBuild
RAG — Retrieval-Augmented Generation — answers a question by first fetching relevant text from your documents, then asking an LLM to answer using only that text. It splits cleanly into two phases that run at different times:
# INGESTION — runs once, when a document is uploaded PDF → extract text → chunk → embed → store vectors in pgvector # QUERY — runs every time a user asks a question question → embed → top-k search → build prompt → call LLM → answer + sources
Everything below is just filling in those two lines with correct Python. Keep the diagram in your head — interviewers love when you narrate the flow before touching code.
PHP bridge: ingestion is a one-off batch job (think a cron import script); the query path is an ordinary request/response handler — the same split you already know from web work.The job of ingestion is to turn a messy PDF into rows in your chunks table: each row a short slice of text plus its embedding vector and some metadata. Four small steps.
Use pypdf — the standard, current PDF library. Read every page and concatenate the text:
from pypdf import PdfReader def extract_text(path: str) -> str: reader = PdfReader(path) pages = [page.extract_text() or "" for page in reader.pages] return "\n".join(pages)
extract_text() returns None on image-only pages — hence the or "". Scanned documents have no text layer at all and need OCR (out of scope for DocChat v1). Always assume the text is imperfect.
You can't embed a whole book as one vector — embedding models have a token limit, and a giant chunk dilutes meaning. Split the text into overlapping windows of a few hundred words. Overlap matters: it stops a sentence that straddles a boundary from losing its context.
def chunk_text(text: str, size: int = 800, overlap: int = 150) -> list[str]: words = text.split() chunks = [] start = 0 while start < len(words): window = words[start:start + size] chunks.append(" ".join(window)) start += size - overlap # step forward, but leave an overlap return chunks
Sizes are a tuning knob, not a law. ~500–1000 words with ~10–20% overlap is a sane 2026 starting point; you adjust once you can measure answer quality.
An embedding turns text into a list of floats (a vector) where similar meanings sit close together. You call an embedding API and get back one vector per chunk. Keep it provider-neutral — wrap the call so swapping providers is a one-line change:
# Provider-neutral wrapper. Returns a vector (list[float]) for a string. def embed(text: str) -> list[float]: # e.g. OpenAI's text-embedding-3-small, or another embeddings API. resp = embeddings_client.create(model=EMBED_MODEL, input=text) return resp.data[0].embedding
pgvector is a Postgres extension that adds a vector column type and similarity search — so your embeddings live right next to your relational data, no separate vector database to run. The chunks table:
schema.sql
CREATE EXTENSION IF NOT EXISTS vector; CREATE TABLE chunks ( id bigserial PRIMARY KEY, doc_id uuid NOT NULL, -- which document this came from text text NOT NULL, -- the chunk's text (returned as a citation) embedding vector(1536) -- vector dimension = your model's output size );
def store_chunks(doc_id: str, chunks: list[str]) -> None: for chunk in chunks: vector = embed(chunk) db.execute( "INSERT INTO chunks (doc_id, text, embedding) VALUES (%s, %s, %s)", (doc_id, chunk, vector), )
That's the whole ingestion pipeline. Glue the four functions together and you turn a PDF into searchable knowledge:
def ingest(path: str, doc_id: str) -> None: text = extract_text(path) chunks = chunk_text(text) store_chunks(doc_id, chunks) # embeds + inserts each chunk
Now the query path. A question comes in. Embed it with the same model, then ask pgvector for the chunks whose vectors are closest — this is the "retrieval" in RAG.
pgvector's <=> operator is cosine distance (smaller = more similar). ORDER BY ... LIMIT k gives you the top-k nearest chunks:
def retrieve(question: str, k: int = 5) -> list[dict]: q_vector = embed(question) # SAME embed() as ingestion rows = db.execute( """ SELECT doc_id, text FROM chunks ORDER BY embedding <=> %s -- cosine distance, nearest first LIMIT %s """, (q_vector, k), ) return rows # list of {doc_id, text}
Then assemble the retrieved chunks into one context string — this is the evidence you'll hand the LLM. Number them so you can cite them later:
def build_context(rows: list[dict]) -> str: return "\n\n".join( f"[{i}] {row['text']}" for i, row in enumerate(rows, 1) )PHP bridge: this is just a
SELECT ... ORDER BY ... LIMIT — the only new thing is the distance operator. If you can write a "find nearest" query, you can write retrieval.
The final stage: hand the question and the retrieved context to an LLM, with a strict instruction — answer only from this context; if it's not there, say you don't know. This grounding instruction is what separates RAG from "an LLM guessing".
def build_prompt(question: str, context: str) -> str: return f"""You are DocChat. Answer the question using ONLY the context below. If the answer is not in the context, say "I don't know based on these documents." Cite the chunk numbers you used, like [1] or [3]. Context: {context} Question: {question} Answer:"""
Then call a model. Keep it provider-neutral — the prompt above works with any modern LLM. Current 2026 options: Anthropic Claude (Opus 4.8 / Sonnet 4.6 / Haiku 4.5) via the Anthropic API, or OpenAI. Here's the concrete Anthropic example:
import anthropic client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from the environment def generate(question: str, context: str) -> str: msg = client.messages.create( model="claude-opus-4-8", # or claude-sonnet-4-6 for cheaper/faster max_tokens=1024, messages=[{"role": "user", "content": build_prompt(question, context)}], ) return msg.content[0].text
A RAG feature that just returns prose is half-built. Return the cited source chunks too — that's what makes the answer trustworthy and lets users verify it. You already retrieved them; just pass them back alongside the answer:
def answer_question(question: str) -> dict: rows = retrieve(question, k=5) context = build_context(rows) answer = generate(question, context) return { "answer": answer, "sources": [{"doc_id": r["doc_id"], "text": r["text"]} for r in rows], }
That single function is RAG. Everything else is plumbing around it.
Expose the query path as one endpoint: POST /ask takes a question, returns an answer plus sources. The Next.js frontend POSTs to it and renders the result.
from fastapi import FastAPI from pydantic import BaseModel app = FastAPI() class AskRequest(BaseModel): question: str class Source(BaseModel): doc_id: str text: str class AskResponse(BaseModel): answer: str sources: list[Source] @app.post("/ask", response_model=AskResponse) def ask(req: AskRequest) -> AskResponse: result = answer_question(req.question) return AskResponse(**result)PHP bridge: a route that takes JSON in and returns JSON out — identical in spirit to a PHP controller action. Pydantic just validates the shapes for you, for free.
On the Next.js side, a form POSTs the question to /ask and renders answer with the sources listed under it — the user reads the answer and can click through to the exact chunks it came from.
RAG fails in predictable ways. Knowing them — and the guardrail for each — is what separates "I followed a tutorial" from "I can ship this".
| Failure | What happens | Guardrail |
|---|---|---|
| Hallucination | Model answers from its own knowledge, not your docs. | The "answer ONLY from context" instruction; refuse when context is empty. |
| Bad chunks | Retrieval returns irrelevant text; the answer is built on noise. | Tune chunk size/overlap; raise k; inspect what was retrieved. |
| Prompt injection | A document says "ignore your instructions" and the model obeys. | Treat retrieved text as data, never as commands; keep instructions in the system role. |
| No evaluation | You ship blind and don't know if answers are grounded. | Check: does the answer cite real, retrieved context? Log Q + retrieved chunks + answer. |
POST /ask endpoint end to end — embed the question, run the top-k cosine query, build a grounded prompt, call the LLM, and return the answer plus cited sources. When you can POST a question and get back a grounded, cited answer, your capstone has its headline feature.
Assemble everything from this lesson into one file. (Assume documents are already ingested — that ran when they were uploaded.)
ask.py
import anthropic from fastapi import FastAPI from pydantic import BaseModel app = FastAPI() llm = anthropic.Anthropic() # ---- retrieval ---- def retrieve(question: str, k: int = 5) -> list[dict]: q_vector = embed(question) return db.execute( "SELECT doc_id, text FROM chunks ORDER BY embedding <=> %s LIMIT %s", (q_vector, k), ) # ---- prompt ---- def build_prompt(question: str, rows: list[dict]) -> str: context = "\n\n".join(f"[{i}] {r['text']}" for i, r in enumerate(rows, 1)) return f"""Answer using ONLY the context. If the answer is not there, say "I don't know based on these documents." Cite chunk numbers like [1]. Context: {context} Question: {question} Answer:""" # ---- endpoint ---- class AskRequest(BaseModel): question: str @app.post("/ask") def ask(req: AskRequest) -> dict: rows = retrieve(req.question) msg = llm.messages.create( model="claude-sonnet-4-6", max_tokens=1024, messages=[{"role": "user", "content": build_prompt(req.question, rows)}], ) return { "answer": msg.content[0].text, "sources": [{"doc_id": r["doc_id"], "text": r["text"]} for r in rows], }
Run it, POST {"question": "..."} to /ask, and read the grounded answer with its sources. That is the entire feature an interviewer means when they ask "build a document Q&A".
Answer from memory — narrating the pipeline out loud is exactly the interview skill you're building.
What does the ingestion phase finally produce?
Why add overlap between adjacent chunks?
The query embedding must use which model?
What is the single biggest anti-hallucination rule?
What should POST /ask return to the frontend?