Module 3 · Postgres & Data · Deep Dive
CTEs, window functions, query plans, JSONB and transaction isolation — the layer that separates "I can write a SELECT" from "I own this database." This is the single biggest interview delta for a 2026 full-stack role.
IntermediateAdvancedBuild
date_trunc plus a running total. "Re-ingesting a document re-writes its chunks" — that's an UPSERT. "This endpoint is slow" — that's EXPLAIN ANALYZE. You already know MySQL from your PHP days; Postgres rewards you with sharper tools and stricter defaults. Learn these and you stop fearing the SQL round.
A Common Table Expression (CTE) is a named, temporary result set defined with WITH … AS (…) at the top of a query. Think of it as a local variable for a subquery: you name a step, then refer to that name below. It turns a deeply-nested mess into a top-to-bottom recipe.
-- "users who own more than 3 docs, with their newest doc's title" WITH doc_counts AS ( SELECT owner_id, COUNT(*) AS n FROM documents GROUP BY owner_id ) SELECT u.name, c.n FROM doc_counts c JOIN users u ON u.id = c.owner_id WHERE c.n > 3 ORDER BY c.n DESC;
You can chain several CTEs (comma-separated) so each step builds on the last — far easier to read and debug than one giant nested subquery. The query reads like prose: compute the counts, then join names, then filter.
PHP bridge: MySQL gained CTEs in 8.0, so the syntax is identical to what you may have seen. Postgres has had them for years and they're idiomatic — reach for a CTE the moment a subquery nests more than once.Add RECURSIVE and a CTE can reference itself — the standard way to walk hierarchies (a category tree, an org chart, a comment thread) without a loop in application code. It runs the base query once, then repeatedly feeds its own output back in until nothing new appears.
-- all descendants of category 1, with their depth WITH RECURSIVE subtree AS ( -- base case: the starting node SELECT id, name, parent_id, 0 AS depth FROM categories WHERE id = 1 UNION ALL -- recursive step: children of rows already found SELECT c.id, c.name, c.parent_id, s.depth + 1 FROM categories c JOIN subtree s ON c.parent_id = s.id ) SELECT * FROM subtree ORDER BY depth;Interview: "How do you query a tree stored as parent_id?" Say: a recursive CTE — a base case selecting the root,
UNION ALL, and a recursive term that joins the table to the CTE on parent_id = id. Postgres iterates until the recursive term returns no rows. The alternative (looping in app code, N queries deep) is what you're replacing.
This is the topic. A window function computes a value across a set of rows related to the current row — but, crucially, keeps every row in the output. That's the whole difference from GROUP BY, which collapses each group into a single summary row. A window lets you say "rank this row within its user's documents" while still returning the row itself.
The shape is always fn() OVER (PARTITION BY … ORDER BY …). PARTITION BY splits rows into groups (like GROUP BY but without collapsing); ORDER BY orders rows inside each partition so functions like ranking and running totals know the sequence.
-- number each user's docs, newest = 1 SELECT id, owner_id, title, created_at, ROW_NUMBER() OVER ( PARTITION BY owner_id ORDER BY created_at DESC ) AS rn FROM documents;
Now wrap it in a CTE and filter WHERE rn <= 3 to get the classic top-3 docs per user — the canonical window-function interview question:
WITH ranked AS ( SELECT id, owner_id, title, created_at, ROW_NUMBER() OVER (PARTITION BY owner_id ORDER BY created_at DESC) AS rn FROM documents ) SELECT owner_id, title, created_at FROM ranked WHERE rn <= 3;
The three ranking functions differ only on ties:
ROW_NUMBER() — always distinct: 1, 2, 3, 4. Ties broken arbitrarily.RANK() — ties share a rank, then it skips: 1, 1, 3, 4.DENSE_RANK() — ties share a rank, no gap: 1, 1, 2, 3.-- cumulative documents uploaded over time, per user SELECT owner_id, created_at, COUNT(*) OVER ( PARTITION BY owner_id ORDER BY created_at ) AS running_total FROM documents;
With an ORDER BY inside the window, an aggregate like SUM or COUNT becomes a running total — it accumulates row by row instead of returning one grand total. Drop the ORDER BY and you get the partition's overall total repeated on every row, handy for "this row's value vs the group total."
-- gap in days between each upload and the user's previous one SELECT owner_id, created_at, created_at - LAG(created_at) OVER ( PARTITION BY owner_id ORDER BY created_at ) AS since_previous FROM documents;
LAG(col) reads the column from the previous row in the window; LEAD(col) reads the next. Perfect for "change since last time" — week-over-week growth, time-between-events, price deltas — without a self-join.
"They both partition rows — when do you use which?" Answer crisply: GROUP BY collapses each group into one summary row, so you lose the individual rows. A window function keeps every row and adds a computed column alongside it.
So: "total docs per user" is GROUP BY (one row per user). "Each doc, labelled with its rank among that user's docs" or "each doc next to its user's running total" is a window function (every doc still returned). If the question is "show me the rows and something computed over their group," it's a window. The giveaway phrases: per-row, top-N per group, running total, compared to previous.
When a query is slow, EXPLAIN shows you how Postgres intends to run it — the query plan — without running it. EXPLAIN ANALYZE actually executes the query and reports real timings and real row counts. That distinction matters: EXPLAIN alone gives estimates; EXPLAIN ANALYZE gives the truth (so never run it on a destructive statement in production — it really does the work).
EXPLAIN ANALYZE SELECT * FROM documents WHERE owner_id = 42;
A plan is a tree of nodes, read inside-out. The lines you care about:
| You see | It means |
|---|---|
Seq Scan | Reading every row to find matches. Fine on tiny tables; a red flag on big ones with a selective filter — usually a missing index. |
Index Scan | Jumping straight to matching rows via an index. What you want for selective lookups. |
cost=0.00..18.5 | Estimated startup..total cost in arbitrary units. Lower is better; compare alternatives. |
rows=… | Estimated rows. With ANALYZE you also get actual rows — a big gap means stale stats (run ANALYZE the table). |
actual time=… | Real milliseconds for that node (only with ANALYZE). Find the node eating the time. |
The skill is scanning the tree for the most expensive node and asking "why is this a Seq Scan?" If the filter is selective (returns few rows) but Postgres scans the whole table, you're missing an index on that column.
Walk the interviewer through the exact loop, naming the tools:
EXPLAIN ANALYZE on the real query with real parameters — get the actual plan and timings, not a guess.Seq Scan over a large table, or a rows estimate wildly off from actual rows.WHERE/JOIN, add the right index (a B-tree on the filtered/joined column; a composite or partial index if the pattern calls for it).EXPLAIN ANALYZE and confirm it flipped to an Index Scan and the time dropped. Measure, don't assume.Bonus points: mention checking that statistics are fresh (ANALYZE the table), and that an index won't help if the filter isn't selective or the query returns most of the table.
The fundamentals lesson introduced the B-tree index. Three refinements come up constantly in interviews and real tuning.
CREATE INDEX idx_docs_owner_created ON documents (owner_id, created_at DESC);
A multi-column index is sorted by the first column, then the second within that. The leftmost-prefix rule: this index helps queries filtering on owner_id, or on owner_id AND created_at — but not a query filtering on created_at alone, because the second column is only ordered within each owner_id. Put the column you always filter by first. This index is also perfect for our top-N-per-user query: it filters by owner and pre-sorts by date.
-- only active users are ever looked up by email CREATE INDEX idx_users_active_email ON users (email) WHERE is_active;
A WHERE on the index definition means Postgres indexes only the matching rows — smaller, faster, and cheaper to maintain. Ideal when queries always carry the same condition (e.g. WHERE is_active, or WHERE deleted_at IS NULL).
B-trees are for scalar equality and ranges. For "does this JSONB contain that key?" or "does this document match these search terms?" you need a GIN (Generalized Inverted) index — it indexes the contents of composite values. We use one below for JSONB.
WHERE lower(email) = … can't use a plain index on email — index the expression instead); a leading % in LIKE '%x' defeats a B-tree; or the table is so small the planner doesn't bother. EXPLAIN tells you which.
You often need "insert this row, but if it already exists, update it instead." Doing that as a SELECT-then-INSERT-or-UPDATE in app code is racy under concurrency. Postgres does it atomically with INSERT … ON CONFLICT.
-- re-ingesting a doc: same (document_id, chunk_index) overwrites the chunk INSERT INTO chunks (document_id, chunk_index, content, embedding) VALUES (7, 0, 'Retrieval augmented…', '[…]') ON CONFLICT (document_id, chunk_index) DO UPDATE SET content = EXCLUDED.content, embedding = EXCLUDED.embedding RETURNING id;
The conflict target (document_id, chunk_index) must be a UNIQUE constraint or primary key. EXCLUDED is the magic word: it's the row you tried to insert, so EXCLUDED.content is the new value. Two variants:
ON CONFLICT … DO UPDATE SET … — overwrite the existing row (a true upsert).ON CONFLICT … DO NOTHING — silently skip the duplicate, no error. Great for idempotent inserts.RETURNING hands back columns from the affected row in the same round-trip — the new id, a created_at, whatever you need. It works on INSERT, UPDATE and DELETE.
INSERT … ON DUPLICATE KEY UPDATE is the closest cousin. Postgres' version is more explicit (you name the conflicting columns) and pairs with RETURNING — there's no separate last_insert_id() call.
Postgres can store JSON natively in a jsonb column — parsed, binary, indexable. It's perfect for semi-structured data that doesn't deserve its own columns: per-document metadata, flexible settings, API payloads you want to keep whole.
SELECT metadata -> 'source' -- returns JSON: "pdf" , metadata ->> 'source' -- returns text: pdf , metadata -> 'tags' ->> 0 -- first tag as text FROM documents WHERE metadata @> '{"source": "pdf"}'; -- "contains" this key/value
-> extracts a field/element as JSON (chainable).->> extracts it as text (use this when comparing to a string).@> is "contains" — true if the left JSONB contains the right. This is the workhorse filter, and it's indexable.-- make @> / key-existence lookups fast on a big table CREATE INDEX idx_docs_metadata ON documents USING GIN (metadata);
@> the same key every request, promote it to a column.
JSON type too, but Postgres' jsonb is more mature — richer operators (@>, ?, path queries) and proper GIN indexing. Prefer jsonb over plain json in Postgres: json stores raw text and re-parses every time.
Analytics queries live and die on date handling. The essentials:
SELECT now(); -- current timestamptz SELECT now() - INTERVAL '7 days'; -- a week ago SELECT date_trunc('day', created_at); -- floor to midnight SELECT AGE(now(), created_at); -- human interval, e.g. "3 mons 2 days"
date_trunc('day', …) is how you bucket rows for daily/weekly/monthly analytics — it chops a timestamp down to the start of the day (or 'hour', 'week', 'month'). The DocChat usage chart is one query:
-- documents uploaded per day, last 30 days SELECT date_trunc('day', created_at) AS day, COUNT(*) AS uploads FROM documents WHERE created_at >= now() - INTERVAL '30 days' GROUP BY day ORDER BY day;
timestamp stores a wall-clock with no timezone — ambiguous and a source of bugs. timestamptz stores an absolute instant (internally UTC) and converts to the session's timezone on the way out. For a UAE app whose users span timezones, timestamptz is the only sane choice. The "tz" costs nothing and saves you a class of bug.
DATETIME vs TIMESTAMP and stored UTC strings by hand, timestamptz does that bookkeeping for you correctly.
The fundamentals lesson covered ACID. Isolation — the "I" — is the part interviewers push on, because it governs what concurrent transactions see of each other.
MVCC (Multi-Version Concurrency Control) is how Postgres pulls it off: instead of locking rows for reads, every write creates a new version of the row, and each transaction sees a consistent snapshot of the data as of a point in time. The headline consequence: readers never block writers, and writers never block readers. That's why Postgres stays responsive under load where a lock-everything database would stall.
You choose how strict the snapshot is with the isolation level:
| Level | What it guarantees |
|---|---|
READ COMMITTED (default) | Each statement sees rows committed before it began. Two reads in the same transaction can differ if another commits between them. |
REPEATABLE READ | The whole transaction sees one frozen snapshot from its start — re-reading a row always returns the same value. Blocks non-repeatable reads. |
SERIALIZABLE | Strongest: the result is as if transactions ran one-at-a-time. Postgres may abort one with a serialization error you must retry. |
BEGIN ISOLATION LEVEL REPEATABLE READ; -- … reads here all see one consistent snapshot … COMMIT;
Pick READ COMMITTED for ordinary web requests (the default, fine 95% of the time). Step up to REPEATABLE READ or SERIALIZABLE for multi-step money/quota logic where a mid-transaction change would corrupt a decision.
deadlock detected error. The fix is consistent lock ordering: if every transaction always touches rows in the same order (e.g. always lower id first), the cycle can't form. Keep transactions short, and be ready to retry the aborted one.
READ COMMITTED is cheap and why you rarely need explicit read locks.
Four constructs that punch above their weight in real queries:
-- DISTINCT ON: the newest doc per user, one row each (Postgres-only, very handy) SELECT DISTINCT ON (owner_id) owner_id, title, created_at FROM documents ORDER BY owner_id, created_at DESC; -- COALESCE: first non-null wins — a default for missing values SELECT COALESCE(nickname, name, 'Anonymous') FROM users; -- CASE: inline if/else inside a SELECT SELECT title, CASE WHEN char_length(body) > 5000 THEN 'long' WHEN body IS NULL THEN 'empty' ELSE 'short' END AS size_bucket FROM documents;
And the set operators combine two result sets (each must have matching columns): UNION stacks rows and removes duplicates (UNION ALL keeps them, faster), while INTERSECT returns only rows present in both. DISTINCT ON is a Postgres specialty — a terser "first row per group" than a window function when you just want one row each.
ChallengeSQL
Given documents(id, owner_id, title, created_at), return each user's 3 most recent documents — the owner_id, title, and created_at, with the newest first within each user. Bonus: which index makes it fast, and which single line would you change to get the single newest doc per user instead?
Number each user's docs newest-first with ROW_NUMBER(), then keep ranks 1–3:
WITH ranked AS ( SELECT owner_id, title, created_at, ROW_NUMBER() OVER ( PARTITION BY owner_id ORDER BY created_at DESC ) AS rn FROM documents ) SELECT owner_id, title, created_at FROM ranked WHERE rn <= 3 ORDER BY owner_id, created_at DESC;
Index: CREATE INDEX ON documents (owner_id, created_at DESC); — it partitions by owner_id and pre-sorts by date, exactly what the window needs, so Postgres avoids a sort.
Single newest per user: change the filter to WHERE rn = 1 — or skip the CTE entirely with SELECT DISTINCT ON (owner_id) … ORDER BY owner_id, created_at DESC.
Tap a card to flip it. Quiz yourself until the answer arrives before the flip.
fn() OVER (PARTITION BY … ORDER BY …). PARTITION groups; ORDER sequences rows inside each group.INSERT … ON CONFLICT (cols) DO UPDATE SET c = EXCLUDED.c (or DO NOTHING). Target must be a unique/primary key.-> returns the field as JSON (chainable); ->> returns it as text. @> tests containment and is GIN-indexable.a or a AND b, but not b alone — it's sorted by a first, then b within each a.Answer from memory — retrieval is what moves this from "I read it" to "I know it".
Which keeps every row while adding a per-group rank?
What does EXPLAIN ANALYZE do that EXPLAIN won't?
In an UPSERT, what does EXCLUDED refer to?
Which JSONB operator tests for containment?
How do you reliably avoid most deadlocks?