Module 6 · RAG · Deep Dive

RAG Foundations

The single feature that turns DocChat from a toy chatbot into a hireable portfolio piece — teaching an LLM to answer from your documents, not its training data.

Basic · Intermediate · Build

Why this matters: RAG is the headline feature of DocChat and the reason a UAE employer will look twice at your portfolio in 2026. Every company wants to chat with its own contracts, policies, and PDFs. Here's your edge: most candidates can name-drop RAG but freeze when asked to explain it end-to-end. After this lesson you'll walk an interviewer through all five steps (chunk, embed, store, retrieve, generate) and write the pgvector SQL on the whiteboard. That's the difference between "I used a RAG library once" and "I can build one."

In this lesson
  1. The problem RAG solves
  2. Embeddings & cosine similarity
  3. The 5-step RAG loop
  4. Chunking strategies
  5. pgvector setup & retrieval SQL
  6. Build: DocChat's chunks table
  7. Check yourself

1 · The problem RAG solves

An LLM like Claude is brilliant, but it only knows what it saw in training. It has never seen your company's employee handbook, last quarter's contract, or the PDF a user uploaded ten seconds ago. Ask it about those and it either says "I don't know" or — worse — confidently makes something up (a hallucination).

RAG — Retrieval-Augmented Generation — fixes this without retraining the model. The idea is simple: at question-time, retrieve the few relevant passages from your own documents and paste them into the prompt alongside the question. The LLM then answers from that supplied context instead of its memory.

# Without RAG — the model guesses
Q: "What is our refund window?"
A: "Typically 30 days."   # made up — it never saw YOUR policy

# With RAG — the model is handed the real passage first
context = "Refunds are accepted within 14 days of purchase..."
Q: "What is our refund window?"
A: "14 days, per your refund policy."   # grounded in YOUR data
Mental model: RAG is an open-book exam. The LLM is a smart student; you hand it the right page before it answers, so it stops guessing.
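
To make "paste them into the prompt" concrete, here is a minimal sketch of how the retrieved passages and the question could be combined into one prompt. The function name build_prompt and the exact instruction wording are assumptions for illustration, not DocChat's actual code.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded prompt: retrieved passages first, then the question."""
    # Number each passage so an answer can be traced back to its source.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "What is our refund window?",
    ["Refunds are accepted within 14 days of purchase..."],
)
print(prompt)
```

The instruction to admit "I don't know" is one common way to discourage guessing when retrieval comes back empty-handed; tune the wording for your own model.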

2 · Embeddings: text as vectors

To "retrieve the relevant passages" you need a way to measure meaning, not just matching words. That's what an embedding is: a function that turns a piece of text into a vector — a long list of numbers that captures what the text means.

embed("refund policy")  →  [0.021, -0.88, 0.13, ... ]   # e.g. 1536 numbers

The magic: text with similar meaning lands at nearby points in this number-space. "refund window" and "money-back period" use different words but produce vectors that sit close together. "refund window" and "office coffee machine" sit far apart.

The length of the list is the vector's number of dimensions. A common modern size is 1536 (OpenAI's text-embedding-3-small); other models use 768, 1024, or 3072. Every chunk you store and every question you ask must be embedded with the same model, so the numbers are comparable.

Cosine similarity (the intuition): To measure "how close" two vectors are, we look at the angle between them, not their length. A small angle (vectors pointing the same way) = very similar meaning; a 90° angle = unrelated. Cosine similarity scores this from 1 (identical direction) down to −1 (opposite). In SQL you'll use cosine distance, which is just 1 − similarity: smaller distance = closer meaning.
PHP bridge: think of similar_text() or levenshtein(), but instead of comparing letters it compares meaning. That's the leap vectors give you.
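
If the angle intuition feels abstract, a few lines of plain Python may help. This is a sketch using invented 3-number vectors (real embeddings have hundreds or thousands of dimensions), so the values are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Note: an all-zero vector has no direction and would divide by zero here.
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance = 1 - similarity; smaller means closer in meaning."""
    return 1.0 - cosine_similarity(a, b)


# Toy "embeddings", made up for this example.
refund_window = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
coffee_machine = [0.0, 0.1, 0.9]

print(cosine_distance(refund_window, money_back))      # small distance
print(cosine_distance(refund_window, coffee_machine))  # large distance
```

Because the score depends only on direction, doubling every number in a vector leaves its similarity to other vectors unchanged.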

3 · The 5-step RAG loop

Every RAG system, however fancy, is these five steps. The first three happen once when a document is uploaded (indexing); the last two happen every time a user asks a question (querying).

  1. CHUNK — split each document into bite-sized passages.
  2. EMBED — turn every chunk into a vector with an embedding model.
  3. STORE — save chunk text + its vector in the database (pgvector).
  4. RETRIEVE — embed the user's question, find the nearest chunks.
  5. GENERATE — hand those chunks + the question to the LLM for the answer.

Steps 1–3 build your searchable index. Steps 4–5 run live. Keep this list in your head — it's the exact answer to "walk me through how RAG works" in an interview.
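
The five steps above can be sketched end-to-end in pure Python. Everything here is a stand-in chosen so the example runs without any services: the sentence splitter, the word-counting "embedding" (which only matches shared words, not meaning), the in-memory list in place of pgvector, and a generate function that builds the prompt rather than calling an LLM.

```python
import math
import re


def tokenize(text):
    """Lowercase and split into word/number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(document):
    """Step 1, CHUNK: one passage per sentence (a deliberately crude splitter)."""
    return [s.strip() for s in document.split(".") if s.strip()]


def make_embedder(texts):
    """Step 2, EMBED: a toy stand-in that counts words from a fixed vocabulary.

    A real embedding model captures meaning; this one only captures shared words.
    """
    vocab = sorted({word for text in texts for word in tokenize(text)})

    def embed(text):
        words = tokenize(text)
        return [float(words.count(term)) for term in vocab]

    return embed


def cosine_distance(a, b):
    """1 - cosine similarity; vectors with no magnitude count as unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def retrieve(question, store, embed, k=2):
    """Step 4, RETRIEVE: embed the question, return the k nearest chunk texts."""
    query = embed(question)
    ranked = sorted(store, key=lambda row: cosine_distance(row[1], query))
    return [text for text, _ in ranked[:k]]


def generate(question, passages):
    """Step 5, GENERATE: build the prompt an LLM would receive (no model call)."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


document = (
    "Refunds are accepted within 14 days of purchase. "
    "Office coffee is free on every floor. "
    "Annual leave is 30 calendar days per year."
)

chunks = chunk(document)                  # 1. CHUNK
embed = make_embedder(chunks)             # 2. EMBED (toy model)
store = [(c, embed(c)) for c in chunks]   # 3. STORE (a list stands in for pgvector)

question = "Within how many days are refunds accepted?"
top = retrieve(question, store, embed)    # 4. RETRIEVE
prompt = generate(question, top)          # 5. GENERATE
print(prompt)
```

Notice the split the lesson describes: the first three calls run once per document, while retrieve and generate run once per question.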

4 · Chunking strategies

You can't embed a 50-page PDF as one vector — meaning would blur into mush, and you'd hand the LLM far too much text. So you chunk: split the document into passages, typically ~200–800 tokens each (a token ≈ ¾ of a word).

The trade-off is real and interviewers probe it:

Choice | Risk | When it hurts
Chunks too big | Noisy: one chunk covers many topics, so retrieval returns irrelevant text and dilutes the answer. | Dense legal docs.
Chunks too small | Lost context: a sentence retrieved without its surroundings can't be understood. | Step-by-step guides.

The fix for the small-chunk problem is overlap: let each chunk repeat the last ~10–15% of the previous one, so a sentence that straddles a boundary still appears whole somewhere.

# split 600-token chunks with 80 tokens of overlap
chunk_size    = 600
chunk_overlap = 80
# chunk A: tokens 0–600 · chunk B: tokens 520–1120 · ...
Rule of thumb (2026): start at ~500 tokens with ~50–80 overlap, then tune. There's no universal best — it depends on your documents, which is exactly the nuance that impresses interviewers.
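
Here is a minimal sliding-window sketch of that overlap scheme. It assumes the document has already been tokenized into a list; a list of integers stands in for real tokens, since actual token counts depend on your embedding model's tokenizer. The function name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=600, chunk_overlap=80):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = list(range(1300))  # stand-in for a 1300-token document
chunks = chunk_tokens(tokens)

for c in chunks:
    print(c[0], "to", c[-1], f"({len(c)} tokens)")
```

Each window starts 520 tokens after the previous one, so the second chunk begins at token 520, matching the comment in the example above, and the last chunk is simply shorter.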

5 · pgvector: storing & searching vectors

You already know Postgres. pgvector is an extension that teaches it a new vector column type and distance operators — so you store chunks and run similarity search in the database you already run, no separate vector DB needed. That's a deliberate, hireable choice for DocChat.

First, enable the extension (once per database):

CREATE EXTENSION IF NOT EXISTS vector;

Now create a table to hold each chunk, its source document, the text, and its embedding:

CREATE TABLE chunks (
    id          bigserial PRIMARY KEY,
    document_id bigint      NOT NULL,
    content     text        NOT NULL,
    embedding   vector(1536) NOT NULL
);

The vector(1536) column matches your embedding model's dimensions exactly. To search, pgvector gives you distance operators — the one you'll use most is <=>, cosine distance:

Operator | Meaning
<=> | Cosine distance (your default for meaning).
<-> | Euclidean (L2) distance.
<#> | Negative inner product.

The nearest-neighbour query — the heart of RETRIEVE — orders rows by distance to the question's vector and takes the top few. $1 is the embedded question:

SELECT id, content
FROM   chunks
ORDER BY embedding <=> $1
LIMIT  5;

That returns the 5 chunks closest in meaning to the question — exactly what you paste into the prompt for GENERATE.
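
From application code, the question vector has to reach Postgres as a parameter. One way, sketched below, is to send it in pgvector's text form ('[x,y,z]') and cast it with ::vector. The helper names are hypothetical, and the %s placeholders assume a psycopg-style driver; other drivers use $1, as in the SQL above.

```python
TOP_K_SQL = (
    "SELECT id, content "
    "FROM chunks "
    "ORDER BY embedding <=> %s::vector "
    "LIMIT %s"
)


def to_pgvector_literal(vector, dimensions=1536):
    """Format a list of numbers as pgvector's text input, e.g. '[0.5,-1.0,2.0]'."""
    if len(vector) != dimensions:
        # Must match the vector(1536) column, or Postgres rejects the comparison.
        raise ValueError(f"expected {dimensions} dimensions, got {len(vector)}")
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def build_top_k_query(question_vector, k=5, dimensions=1536):
    """Return (sql, params) ready to hand to a database driver."""
    return TOP_K_SQL, (to_pgvector_literal(question_vector, dimensions), k)


# A placeholder vector; in DocChat this would come from the embedding model.
sql, params = build_top_k_query([0.0] * 1536)
print(sql)
print(params[0][:20], "...", "k =", params[1])

# Assumed usage with psycopg 3 (not run here; dsn is a hypothetical connection string):
#   with psycopg.connect(dsn) as conn:
#       rows = conn.execute(sql, params).fetchall()
```

Passing the vector as a bound parameter rather than formatting it into the SQL string keeps the query safe and lets the database reuse its plan.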

Indexes for speed: Without an index, that ORDER BY compares the question against every row (an exact, brute-force search), which gets slower as the table grows. For production, add an approximate index, either ivfflat or the newer, higher-recall hnsw, so nearest-neighbour search stays fast at scale. One line, big difference; mention it in interviews.

6 · Build it

Your tangible win: Stand up pgvector for DocChat: enable the extension, create the chunks table with a vector(1536) column, and write the top-5 cosine-retrieval query. This is the literal storage layer of your capstone's RAG feature, the thing you'll demo.

Try it in psql or a migration first, then compare. A clean version:

001_chunks.sql
-- 1. enable pgvector (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. store one row per chunk, with its embedding
CREATE TABLE chunks (
    id          bigserial PRIMARY KEY,
    document_id bigint      NOT NULL,
    content     text        NOT NULL,
    embedding   vector(1536) NOT NULL
);

-- 3. speed up search once you have real data
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- 4. RETRIEVE: the 5 chunks nearest the question vector
--    ($1 is a bind parameter your application supplies at query time)
SELECT id, content
FROM   chunks
ORDER BY embedding <=> $1
LIMIT  5;

That's the entire retrieval backbone of DocChat in a dozen lines. Next lesson you'll wire CHUNK, EMBED, and GENERATE around it into a working pipeline.

7 · Check yourself

Answer from memory — retrieval is what moves this from "I read it" to "I can whiteboard it".

Recall quiz

What core problem does RAG actually solve?

What exactly is a text embedding?

What are the five RAG steps, in order?

Which pgvector operator gives cosine distance?

Why add overlap between neighbouring chunks?

Primary source: ⭐ pgvector & RAG on managed Postgres (2026), a current, practical walk-through of the exact setup above. The canonical reference for the extension itself is pgvector on GitHub; bookmark its README for operators and index tuning.