Interview Bank · 2026

🐘 Postgres & SQL

Joins, indexes, transactions, ACID, and the ORM gotchas that decide whether your API is fast or falls over. Say each answer aloud before you reveal it.

Rapid-fire flashcards flip to check

Click a card. Answer first, in one breath, then reveal.

PRIMARY KEY vs FOREIGN KEY?

Primary key = unique row identifier in this table. Foreign key = a column pointing at another table's primary key (the link between tables).

click to flip

INNER vs LEFT JOIN?

Inner keeps only rows that match in both tables. Left keeps every left-table row, filling NULL where the right has no match.

click to flip

What does an index do?

Keeps a sorted lookup structure so the DB finds rows without scanning the whole table — like a book's index instead of reading every page.

click to flip

WHERE vs HAVING?

WHERE filters individual rows before grouping; HAVING filters after GROUP BY, on aggregates.

click to flip

What does GROUP BY do?

Collapses rows sharing a value into one row per group, so aggregates (COUNT, SUM, AVG) run per group.

click to flip

What is a transaction?

A bundle of statements that all succeed (COMMIT) or all undo (ROLLBACK) — treated as one indivisible unit.

click to flip

What is ACID?

Atomic, Consistent, Isolated, Durable — the guarantees a transaction gives you.

click to flip

UNIQUE vs PRIMARY KEY?

Both enforce uniqueness, but a table has one primary key (not null, the identity), while you can have many unique constraints, and they may allow a NULL.

click to flip

What is normalization?

Splitting data into related tables to remove redundancy, so each fact lives in exactly one place.

click to flip

What is an ORM?

Object-Relational Mapper — maps DB rows to objects/classes so you write Python instead of raw SQL (e.g. SQLAlchemy).

click to flip

Core questions

Walk me through the JOIN types.

INNER = only matching rows in both tables. LEFT (outer) = all left rows + matches, NULL where none. RIGHT = the mirror. FULL OUTER = all rows from both, NULL-padded. CROSS = every combination (cartesian product). Day to day it's mostly INNER and LEFT.

When do you add an index — and what does it cost?

Add one on columns you frequently filter, join, or sort on (e.g. a user_id foreign key, an email you look up by). The trade-off: indexes speed up reads but slow down writes — every INSERT/UPDATE/DELETE must also update the index — and they take disk space. So index hot read paths, not every column.

Explain transactions and ACID.

A transaction groups statements so they commit or roll back together. ACID: Atomic (all or nothing), Consistent (constraints always hold), Isolated (concurrent transactions don't see each other's half-done work), Durable (once committed, it survives a crash). Postgres gives you all four. Use a transaction whenever multiple writes must stay in sync (e.g. debit one account, credit another).

Normalization vs denormalization — when each?

Normalize to remove duplication so each fact lives once — fewer update anomalies, cleaner writes (the default for OLTP apps). Denormalize (deliberately duplicate / pre-join) when reads are the bottleneck and joins are too slow — common for reporting/analytics. Start normalized; denormalize only with a measured reason.

How do you find and fix a slow query?

Run EXPLAIN ANALYZE on it — it shows the real execution plan and timings. Look for a Seq Scan on a big table where you expected an index, bad row estimates, or expensive sorts. Usual fixes: add a missing index, rewrite the query, or update statistics (ANALYZE). Measure before and after — never guess.

What is connection pooling and why does it matter?

Opening a fresh Postgres connection is expensive, and the server caps how many it allows. A pool keeps a set of connections open and hands them out, then reclaims them — so requests reuse connections instead of paying the setup cost each time. In FastAPI you let SQLAlchemy manage the pool; at scale a server-side pooler like PgBouncer sits in front.

Raw SQL vs an ORM like SQLAlchemy — when each?

An ORM (SQLAlchemy) maps tables to Python classes: safer (parameterised, no injection), less boilerplate, easier to maintain. Drop to raw SQL for complex analytics, window functions, or hand-tuned performance the ORM expresses awkwardly. Pragmatic answer: ORM for 90% of CRUD, raw SQL for the gnarly 10% — and SQLAlchemy lets you do both.

How do aggregate functions work with GROUP BY and HAVING?

Aggregates (COUNT, SUM, AVG, MIN, MAX) collapse many rows into one value. GROUP BY splits rows into groups so the aggregate runs per group. WHERE filters rows before grouping; HAVING filters the groups after. Example: SELECT user_id, COUNT(*) FROM docs GROUP BY user_id HAVING COUNT(*) > 5;

Theory deep-cuts the "why"

Explain ACID, one letter at a time. theory

Atomicity: a transaction is all-or-nothing — partial failure rolls the whole thing back. Consistency: it moves the DB from one valid state to another, constraints intact. Isolation: concurrent transactions behave as if run one at a time (tuned by isolation level). Durability: once committed, it's written to disk (via the write-ahead log) and survives a crash.

What are isolation levels and the anomalies they prevent? theory

From weakest to strongest: Read Committed (Postgres default — you only see committed data, but a re-read can change), Repeatable Read (stable reads within the transaction; in Postgres this is snapshot isolation), Serializable (as if transactions ran one-by-one). They progressively block anomalies — dirty reads, non-repeatable reads, phantom reads. Stronger = safer but more conflicts/retries.

What is MVCC in Postgres? theory

Multi-Version Concurrency Control: instead of locking rows for reads, Postgres keeps multiple row versions. Each transaction sees a consistent snapshot, so readers never block writers and writers never block readers. The cost: old dead row versions pile up, which is why VACUUM (and autovacuum) exists — to reclaim them.

How does a B-tree index actually speed up a lookup? theory

It's a balanced sorted tree. Instead of scanning N rows linearly (O(N)), the DB walks the tree from the root, halving the search space at each level — roughly O(log N). Because it's sorted, it also accelerates range queries (BETWEEN, <, >) and ORDER BY. B-tree is Postgres's default index type.

When is an index NOT used — and OLTP vs OLAP? theory

The planner skips an index when a Seq Scan is cheaper — e.g. the query returns most of the table, the table is tiny, or a function/cast on the column hides it. OLTP (your app DB) = many small, fast transactions, normalized, index-heavy. OLAP (analytics/warehouse) = few huge aggregating queries over lots of rows, often denormalized. Postgres handles both; the workloads shape your indexing.

Tricky & gotchas where candidates trip

What is the N+1 query problem? tricky

You fetch N parents (1 query), then loop and lazily load each one's children — firing N more queries. 100 documents → 101 round-trips, and the page crawls. ORMs cause it via lazy loading and solve it via eager loading: in SQLAlchemy use selectinload() / joinedload() to pull children in one extra query. Classic ORM interview trap.

Why is SELECT * a smell in production code? tricky

It pulls every column (more I/O and network), breaks the moment someone adds/reorders columns, can drag huge TEXT/JSONB blobs you didn't need, and prevents index-only scans. Select the columns you actually use. Fine for quick ad-hoc exploration, not for code.

Why might a perfectly good index be ignored? tricky

Wrapping the column in a function or casting it: WHERE lower(email) = '…' or comparing a varchar column to an int can't use a plain index on the raw column. Fixes: a matching expression index (CREATE INDEX ON users (lower(email))), or fix the type mismatch so the column stays "bare" on the left.

What goes wrong with NULL in comparisons and COUNT? tricky

NULL means "unknown", so NULL = NULL is NULL, not TRUE — use IS NULL. x <> 5 silently drops rows where x is null. And COUNT(col) skips nulls while COUNT(*) counts every row — a common reason two counts disagree.

How does a LEFT JOIN quietly become an INNER JOIN? tricky

Put a condition on the right table in WHERE (e.g. WHERE r.status = 'x') and the unmatched left rows — whose right columns are NULL — get filtered out, so you've effectively made it an inner join. To keep them, move that condition into the ON clause instead, or allow r.status IS NULL.

What is a deadlock and how do you avoid it? tricky

Two transactions each hold a lock the other needs, so neither can proceed. Postgres detects it and kills one with a deadlock error. Avoid it by acquiring locks in a consistent order everywhere, keeping transactions short, and being ready to retry the loser. It surfaces under concurrent writes to the same rows.

What's new in 2026 say this and stand out

Why is Postgres "the one database for everything" now? 2026

The 2026 consensus is "just use Postgres." With extensions it covers cases that used to mean a second datastore — including vector search via pgvector for storing and similarity-searching embeddings (exactly what a RAG app needs), full-text search, geospatial (PostGIS), and queues. One database, fewer moving parts. Naming pgvector signals you know where AI meets the DB.

When would you reach for JSONB? 2026

JSONB stores semi-structured data in a binary, indexable form (GIN indexes, operators like ->, ->>, @>) — great for flexible metadata, document chunks' attributes, or fields whose shape varies. Rule of thumb: model stable, queried fields as real columns; use JSONB for the genuinely variable bits, not as an excuse to skip schema design.

What's modern SQLAlchemy 2.0 style? 2026

SQLAlchemy 2.0 standardised a unified, type-annotated API: declare models with Mapped[...] / mapped_column() and query with select() + session.execute() — not the legacy session.query(). It's async-friendly (pairs with async FastAPI) and plays well with type checkers. Mention that recent Postgres releases also keep improving vacuum, parallelism, and backup — though I'd verify the exact version specifics.

Memory hooks Index = a book's index. You find the page without reading every page — that's O(log N), not a full scan.
ACID = Atomic, Consistent, Isolated, Durable. All-or-nothing, valid, unbothered by neighbours, survives a crash.
N+1 = "1 to find them, N to fetch them." Eager-load to collapse it back to two queries.

Tie it to DocChat When asked "do you know SQL?", point at the build: "DocChat's Postgres holds users, documents, and chunk embeddings in pgvector; I index the foreign keys, wrap multi-step writes in transactions, and use SQLAlchemy 2.0 with eager loading to dodge N+1." Concrete beats abstract.