Module 3 · Postgres & Data · Deep Dive
You already write SQL from your PHP/MySQL days — this lesson sharpens it, fixes the MySQL-isms, and grounds it in Postgres, the database your DocChat backend will actually run on.
BasicIntermediateBuild
A relational database stores data in tables. A table has named columns (each with a fixed type) and holds rows (one record each). A users table is exactly a spreadsheet of users: columns across the top, one user per row.
| Type | Example | Use for |
|---|---|---|
integer | 42 | Counts, IDs, whole numbers. |
text | 'intro.pdf' | Strings of any length (Postgres has no penalty). |
timestamptz | now() | A date + time with timezone. |
boolean | true / false | Yes/no flags like is_active. |
Two constraints define how rows relate:
id). No two rows share it, and it's never null.documents.owner_id that references users.id is what makes "this document belongs to that user" true and enforced.PRIMARY KEY, FOREIGN KEY. Postgres is very close to MySQL in SQL syntax; it's just the more rigorous, standards-compliant database and the production standard for modern Python apps. Most of your MySQL muscle memory transfers directly.
You define a table with CREATE TABLE. Each column gets a name, a type, and optional constraints that the database will enforce for you — forever, on every write.
CREATE TABLE users ( id serial PRIMARY KEY, email text NOT NULL UNIQUE, name text NOT NULL, is_active boolean NOT NULL DEFAULT true, created_at timestamptz NOT NULL DEFAULT now() );
What each constraint buys you:
serial PRIMARY KEY — an auto-incrementing integer id, unique per row. (Postgres' serial ≈ MySQL's AUTO_INCREMENT.)NOT NULL — the column can never be empty; an insert without it is rejected.UNIQUE — no two rows may share this value. Two users can't register the same email.DEFAULT — the value used when you don't supply one. New users are active and stamped with the current time automatically.NOT NULL or UNIQUE in the schema is a guarantee the database keeps no matter what code writes to it. Push rules into the schema; don't rely on application code to remember them.
Create, Read, Update, Delete — INSERT, SELECT, UPDATE, DELETE. These are 90% of what you write daily. We'll use a documents table that belongs to users.
-- pick columns, filter, sort, cap the count SELECT id, title, created_at FROM documents WHERE owner_id = 1 ORDER BY created_at DESC LIMIT 10;
Read it left to right: from documents, keep rows where the owner is user 1, order newest first, return at most 10. Use SELECT * to grab every column, but name columns explicitly in real code — it's faster and won't break when the table grows.
SELECT … WHERE … ORDER BY … LIMIT you ran through mysqli/PDO. Same clauses, same order.
INSERT INTO documents (owner_id, title, body) VALUES (1, 'Intro to RAG', 'Retrieval augmented…') RETURNING id;
You list the columns, then the matching values. Columns you skip take their DEFAULT. The Postgres-only RETURNING clause hands back the new row's id in the same round-trip — no separate "last insert id" call like in MySQL.
UPDATE documents SET title = 'Intro to RAG (v2)' WHERE id = 7;
UPDATE or DELETE without a WHERE clause hits every row in the table. Always write the WHERE first. Many teams run inside a transaction (next section) so a mistake can be rolled back.
DELETE FROM documents WHERE id = 7;
Data is split across tables to avoid duplication — users in one, their documents in another, linked by the owner_id foreign key. A JOIN stitches them back together in a query.
-- every document, with its owner's name SELECT d.title, u.name AS owner FROM documents d INNER JOIN users u ON u.id = d.owner_id;
The ON clause says how the two tables line up: a document's owner_id matches a user's id. The d and u are short table aliases.
The difference between the two joins you'll use most:
NULL where there's no match. Use it to ask "all users, even those with zero documents."-- every user, plus their documents — users with none still show, body NULL SELECT u.name, d.title FROM users u LEFT JOIN documents d ON d.owner_id = u.id;Interview: "Explain a LEFT JOIN." Say: it returns all rows from the left table and the matching rows from the right; where there's no match, the right-hand columns come back
NULL. The classic use is finding records with nothing on the other side — e.g. LEFT JOIN … WHERE d.id IS NULL lists users who have uploaded no documents. An INNER JOIN would silently drop them.
Aggregate functions collapse many rows into one number: COUNT, SUM, AVG, MIN, MAX. On their own they summarise the whole table:
SELECT COUNT(*) AS total_docs FROM documents;
GROUP BY splits the rows into buckets first, then aggregates each bucket. "How many documents does each user own?" is one query:
SELECT owner_id, COUNT(*) AS doc_count FROM documents GROUP BY owner_id HAVING COUNT(*) > 3 ORDER BY doc_count DESC;
WHERE filters individual rows before grouping; HAVING filters the groups after aggregating. "Only documents from 2026" is a WHERE; "only users with more than 3 documents" is a HAVING. Mixing them up is a common interview slip.
Without an index, finding rows by a column means the database reads every row to check — a "sequential scan." On a big table that's slow. An index is a sorted lookup structure (a B-tree) that lets Postgres jump straight to the matching rows, the way a book index sends you to a page instead of skimming the whole book.
CREATE INDEX idx_documents_owner ON documents (owner_id);
Now WHERE owner_id = 1 and joins on owner_id get dramatically faster as the table grows. When to add one: on columns you frequently filter, join, or sort by — foreign keys and lookup columns are prime candidates. Don't index everything: each index makes INSERT/UPDATE a little slower and uses disk, so add them where reads actually need them.
To see whether a query uses an index, prefix it with EXPLAIN — Postgres shows the plan, where "Index Scan" means it's using your index and "Seq Scan" means it isn't:
EXPLAIN SELECT * FROM documents WHERE owner_id = 1;Interview: "What's an index?" A separate sorted data structure (usually a B-tree) on one or more columns that lets the database find matching rows without scanning the whole table — turning a slow linear search into a fast lookup. The trade-off: faster reads, slightly slower writes and extra storage, so you index the columns you actually query on (especially foreign keys), not every column.
A transaction groups several statements into one all-or-nothing unit. You open it with BEGIN, and either COMMIT to save everything or ROLLBACK to discard everything as if it never happened.
BEGIN; INSERT INTO documents (owner_id, title) VALUES (1, 'Report'); UPDATE users SET doc_quota = doc_quota - 1 WHERE id = 1; COMMIT; -- both happen, or if anything fails, ROLLBACK and neither does
This is atomicity — the "A" in ACID — and it's why transactions matter: the insert and the quota decrement must both succeed or both fail, never half. If the server crashed between them without a transaction, you'd have a document with no quota charged. The full set:
beginTransaction() / commit() / rollBack() with PDO, this is the same idea — Postgres just gives you stronger guarantees by default.
psql is the official command-line client — your REPL for SQL. Create a database and connect:
# create a database from the shell createdb docchat # open an interactive session against it psql docchat docchat=# SELECT version(); docchat=# \dt -- list tables docchat=# \d users -- describe the users table docchat=# \q -- quit
Backslash commands (\dt, \d, \q) are psql's own helpers, not SQL. Prefer a GUI? pgAdmin ships with Postgres, and DBeaver is a free, popular cross-database client — both let you browse tables and run queries visually, like phpMyAdmin did for MySQL. In Week 7 we'll run Postgres in Docker so DocChat's database is reproducible on any machine.
users table and a documents table, linked by a foreign key so every document belongs to exactly one user. This is the literal foundation your capstone is built on — you'll add to it in every later lesson.
Open psql docchat and run these. Try writing them from memory first, then compare:
schema.sql
CREATE TABLE users ( id serial PRIMARY KEY, email text NOT NULL UNIQUE, name text NOT NULL, is_active boolean NOT NULL DEFAULT true, created_at timestamptz NOT NULL DEFAULT now() ); CREATE TABLE documents ( id serial PRIMARY KEY, owner_id integer NOT NULL REFERENCES users (id), title text NOT NULL, body text, created_at timestamptz NOT NULL DEFAULT now() ); -- foreign keys are filtered/joined constantly — index it CREATE INDEX idx_documents_owner ON documents (owner_id);
The REFERENCES users (id) on owner_id is the foreign key — Postgres now refuses any document whose owner_id doesn't match a real user, and refuses to delete a user who still owns documents. Seed a row and read it back:
INSERT INTO users (email, name) VALUES ('sam@docchat.ae', 'Sam') RETURNING id; INSERT INTO documents (owner_id, title) VALUES (1, 'My first doc'); SELECT u.name, d.title FROM users u JOIN documents d ON d.owner_id = u.id;
Answer from memory — retrieval is what moves this from "I read it" to "I know it".
What does a FOREIGN KEY actually enforce?
Which join keeps users who own no documents?
Which clause filters groups after aggregating rows?
Why add an index to a frequently-filtered column?
What does ROLLBACK do to a transaction's writes?