Interview Bank · 2026
Joins, indexes, transactions, ACID, and the ORM gotchas that decide whether your API is fast or falls over. Say each answer aloud before you reveal it.
Click a card. Answer first, in one breath, then reveal.
PRIMARY KEY vs FOREIGN KEY?INNER vs LEFT JOIN?NULL where the right has no match.WHERE vs HAVING?WHERE filters individual rows before grouping; HAVING filters after GROUP BY, on aggregates.GROUP BY do?COUNT, SUM, AVG) run per group.COMMIT) or all undo (ROLLBACK) — treated as one indivisible unit.UNIQUE vs PRIMARY KEY?NULL.INNER = only matching rows in both tables. LEFT (outer) = all left rows + matches, NULL where none. RIGHT = the mirror. FULL OUTER = all rows from both, NULL-padded. CROSS = every combination (cartesian product). Day to day it's mostly INNER and LEFT.
Add one on columns you frequently filter, join, or sort on (e.g. a user_id foreign key, an email you look up by). The trade-off: indexes speed up reads but slow down writes — every INSERT/UPDATE/DELETE must also update the index — and they take disk space. So index hot read paths, not every column.
A transaction groups statements so they commit or roll back together. ACID: Atomic (all or nothing), Consistent (constraints always hold), Isolated (concurrent transactions don't see each other's half-done work), Durable (once committed, it survives a crash). Postgres gives you all four. Use a transaction whenever multiple writes must stay in sync (e.g. debit one account, credit another).
Normalize to remove duplication so each fact lives once — fewer update anomalies, cleaner writes (the default for OLTP apps). Denormalize (deliberately duplicate / pre-join) when reads are the bottleneck and joins are too slow — common for reporting/analytics. Start normalized; denormalize only with a measured reason.
Run EXPLAIN ANALYZE on it — it shows the real execution plan and timings. Look for a Seq Scan on a big table where you expected an index, bad row estimates, or expensive sorts. Usual fixes: add a missing index, rewrite the query, or update statistics (ANALYZE). Measure before and after — never guess.
Opening a fresh Postgres connection is expensive, and the server caps how many it allows. A pool keeps a set of connections open and hands them out, then reclaims them — so requests reuse connections instead of paying the setup cost each time. In FastAPI you let SQLAlchemy manage the pool; at scale a server-side pooler like PgBouncer sits in front.
An ORM (SQLAlchemy) maps tables to Python classes: safer (parameterised, no injection), less boilerplate, easier to maintain. Drop to raw SQL for complex analytics, window functions, or hand-tuned performance the ORM expresses awkwardly. Pragmatic answer: ORM for 90% of CRUD, raw SQL for the gnarly 10% — and SQLAlchemy lets you do both.
Aggregates (COUNT, SUM, AVG, MIN, MAX) collapse many rows into one value. GROUP BY splits rows into groups so the aggregate runs per group. WHERE filters rows before grouping; HAVING filters the groups after. Example: SELECT user_id, COUNT(*) FROM docs GROUP BY user_id HAVING COUNT(*) > 5;
Atomicity: a transaction is all-or-nothing — partial failure rolls the whole thing back. Consistency: it moves the DB from one valid state to another, constraints intact. Isolation: concurrent transactions behave as if run one at a time (tuned by isolation level). Durability: once committed, it's written to disk (via the write-ahead log) and survives a crash.
From weakest to strongest: Read Committed (Postgres default — you only see committed data, but a re-read can change), Repeatable Read (stable reads within the transaction; in Postgres this is snapshot isolation), Serializable (as if transactions ran one-by-one). They progressively block anomalies — dirty reads, non-repeatable reads, phantom reads. Stronger = safer but more conflicts/retries.
Multi-Version Concurrency Control: instead of locking rows for reads, Postgres keeps multiple row versions. Each transaction sees a consistent snapshot, so readers never block writers and writers never block readers. The cost: old dead row versions pile up, which is why VACUUM (and autovacuum) exists — to reclaim them.
It's a balanced sorted tree. Instead of scanning N rows linearly (O(N)), the DB walks the tree from the root, halving the search space at each level — roughly O(log N). Because it's sorted, it also accelerates range queries (BETWEEN, <, >) and ORDER BY. B-tree is Postgres's default index type.
The planner skips an index when a Seq Scan is cheaper — e.g. the query returns most of the table, the table is tiny, or a function/cast on the column hides it. OLTP (your app DB) = many small, fast transactions, normalized, index-heavy. OLAP (analytics/warehouse) = few huge aggregating queries over lots of rows, often denormalized. Postgres handles both; the workloads shape your indexing.
You fetch N parents (1 query), then loop and lazily load each one's children — firing N more queries. 100 documents → 101 round-trips, and the page crawls. ORMs cause it via lazy loading and solve it via eager loading: in SQLAlchemy use selectinload() / joinedload() to pull children in one extra query. Classic ORM interview trap.
SELECT * a smell in production code? trickyIt pulls every column (more I/O and network), breaks the moment someone adds/reorders columns, can drag huge TEXT/JSONB blobs you didn't need, and prevents index-only scans. Select the columns you actually use. Fine for quick ad-hoc exploration, not for code.
Wrapping the column in a function or casting it: WHERE lower(email) = '…' or comparing a varchar column to an int can't use a plain index on the raw column. Fixes: a matching expression index (CREATE INDEX ON users (lower(email))), or fix the type mismatch so the column stays "bare" on the left.
NULL means "unknown", so NULL = NULL is NULL, not TRUE — use IS NULL. x <> 5 silently drops rows where x is null. And COUNT(col) skips nulls while COUNT(*) counts every row — a common reason two counts disagree.
Put a condition on the right table in WHERE (e.g. WHERE r.status = 'x') and the unmatched left rows — whose right columns are NULL — get filtered out, so you've effectively made it an inner join. To keep them, move that condition into the ON clause instead, or allow r.status IS NULL.
Two transactions each hold a lock the other needs, so neither can proceed. Postgres detects it and kills one with a deadlock error. Avoid it by acquiring locks in a consistent order everywhere, keeping transactions short, and being ready to retry the loser. It surfaces under concurrent writes to the same rows.
The 2026 consensus is "just use Postgres." With extensions it covers cases that used to mean a second datastore — including vector search via pgvector for storing and similarity-searching embeddings (exactly what a RAG app needs), full-text search, geospatial (PostGIS), and queues. One database, fewer moving parts. Naming pgvector signals you know where AI meets the DB.
JSONB stores semi-structured data in a binary, indexable form (GIN indexes, operators like ->, ->>, @>) — great for flexible metadata, document chunks' attributes, or fields whose shape varies. Rule of thumb: model stable, queried fields as real columns; use JSONB for the genuinely variable bits, not as an excuse to skip schema design.
SQLAlchemy 2.0 standardised a unified, type-annotated API: declare models with Mapped[...] / mapped_column() and query with select() + session.execute() — not the legacy session.query(). It's async-friendly (pairs with async FastAPI) and plays well with type checkers. Mention that recent Postgres releases also keep improving vacuum, parallelism, and backup — though I'd verify the exact version specifics.
O(log N), not a full scan.pgvector; I index the foreign keys, wrap multi-step writes in transactions, and use SQLAlchemy 2.0 with eager loading to dodge N+1." Concrete beats abstract.