Module 1 — JOINS
Why JOINs exist
A real database almost never stores everything in one giant table. Instead, information is split across many small tables, each holding one kind of thing. This is called normalization — it avoids repeating the same data over and over.
dept_id (just a number, like "3"). Sheet 2 has a list of departments with their names, matched to that same number. Neither sheet alone tells you the full story — "employee John is in dept 3" is meaningless until you look up what dept 3 is called. A JOIN is the act of using VLOOKUP (Excel's version of a join) to stitch the two sheets together into one combined view.
What a JOIN actually is
A JOIN is a way to combine rows from two (or more) tables based on a related column between them. That related column is usually a foreign key in one table pointing to a primary key in another.
When to use it
- Whenever the answer to your question needs data that lives in more than one table.
- Whenever you want to enrich raw IDs (like
dept_id = 3) with human-readable data (like"Engineering"). - Whenever you need to check relationships — who reports to whom, which orders belong to which customer, etc.
When NOT to use it
- When all the data you need already lives in one table — joining unnecessarily just slows the query down.
- When you only need to check existence of a related row (does this customer have any order at all?) — often an
EXISTSsubquery is faster and clearer than a JOIN, because a JOIN can multiply rows while EXISTS just answers yes/no. We'll cover this in the Subqueries module.
The schema and data we'll use for this entire module
We'll reuse these two tables — employees and departments — for every join type, so you see how the exact same data behaves differently under each join. The data is deliberately messy: it has an employee with no department, a department with no employees, duplicate names, and a self-referencing manager column.
DDL
CREATE TABLE departments (
dept_id INT PRIMARY KEY,
dept_name VARCHAR(50) NOT NULL,
location VARCHAR(50)
);
CREATE TABLE employees (
emp_id INT PRIMARY KEY,
emp_name VARCHAR(50) NOT NULL,
dept_id INT, -- can be NULL: not every employee is assigned yet
manager_id INT, -- self-referencing FK to emp_id, can be NULL (top of hierarchy)
salary DECIMAL(10,2),
hire_date DATE,
FOREIGN KEY (dept_id) REFERENCES departments(dept_id),
FOREIGN KEY (manager_id) REFERENCES employees(emp_id)
);
INSERT statements
INSERT INTO departments (dept_id, dept_name, location) VALUES
(1, 'Engineering', 'Bangalore'),
(2, 'Sales', 'Mumbai'),
(3, 'Marketing', 'Delhi'),
(4, 'HR', 'Bangalore'),
(5, 'Finance', 'Pune'),
(6, 'Legal', 'Delhi'); -- Legal has ZERO employees on purpose
INSERT INTO employees (emp_id, emp_name, dept_id, manager_id, salary, hire_date) VALUES
(101, 'Arjun Mehta', 1, NULL, 185000, '2019-03-01'), -- CTO, no manager (top of tree)
(102, 'Priya Nair', 1, 101, 145000, '2020-01-15'),
(103, 'Rohit Sharma', 1, 101, 145000, '2020-06-10'), -- tie salary with Priya
(104, 'Sneha Kapoor', 1, 102, 98000, '2021-02-20'),
(105, 'Karan Verma', 2, NULL, 160000, '2018-11-05'), -- Sales head
(106, 'Ananya Iyer', 2, 105, 92000, '2021-07-01'),
(107, 'Vikram Singh', 2, 105, 92000, '2022-01-10'), -- tie salary with Ananya
(108, 'Rohit Sharma', 2, 105, 88000, '2022-03-15'), -- DUPLICATE NAME, different person
(109, 'Meera Pillai', 3, NULL, 110000, '2019-08-01'),
(110, 'Aditya Rao', 3, 109, 75000, '2023-01-05'),
(111, 'Divya Menon', 4, NULL, 105000, '2020-04-01'),
(112, 'Farhan Khan', 4, 111, 60000, '2023-05-20'),
(113, 'Ishita Bose', 5, NULL, 120000, '2019-09-15'),
(114, 'Nikhil Joshi', 5, 113, 70000, '2022-08-01'),
(115, 'Tanvi Desai', NULL, NULL, 55000, '2024-01-10'), -- NOT assigned to any dept yet
(116, 'Sameer Gupta', NULL, NULL, 52000, '2024-02-14'); -- another unassigned employee
| dept_id | dept_name | location |
|---|---|---|
| 1 | Engineering | Bangalore |
| 2 | Sales | Mumbai |
| 3 | Marketing | Delhi |
| 4 | HR | Bangalore |
| 5 | Finance | Pune |
| 6 | Legal | Delhi |
| emp_id | emp_name | dept_id | mgr_id | salary |
|---|---|---|---|---|
| 101 | Arjun Mehta | 1 | NULL | 185000 |
| 102 | Priya Nair | 1 | 101 | 145000 |
| 103 | Rohit Sharma | 1 | 101 | 145000 |
| 104 | Sneha Kapoor | 1 | 102 | 98000 |
| 105 | Karan Verma | 2 | NULL | 160000 |
| 106 | Ananya Iyer | 2 | 105 | 92000 |
| 107 | Vikram Singh | 2 | 105 | 92000 |
| 108 | Rohit Sharma | 2 | 105 | 88000 |
| 109 | Meera Pillai | 3 | NULL | 110000 |
| 110 | Aditya Rao | 3 | 109 | 75000 |
| 111 | Divya Menon | 4 | NULL | 105000 |
| 112 | Farhan Khan | 4 | 111 | 60000 |
| 113 | Ishita Bose | 5 | NULL | 120000 |
| 114 | Nikhil Joshi | 5 | 113 | 70000 |
| 115 | Tanvi Desai | NULL | NULL | 55000 |
| 116 | Sameer Gupta | NULL | NULL | 52000 |
INNER JOIN
What it is: returns only the rows where the join condition matches on both sides. If an employee has no matching department, or a department has no matching employee, those rows are dropped.
Why it exists: most of the time you only care about complete, matched pairs — "give me employees together with their department name" only makes sense for employees who actually have a department.
Easy example
Find employee name and department name, for employees who have a department.
SELECT e.emp_name, d.dept_name
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.dept_id;
Line by line:
FROM employees e— start reading from the employees table, alias iteso we can refer to it shorter.INNER JOIN departments d— bring in the departments table, aliasd.ON e.dept_id = d.dept_id— this is the matching rule: only keep pairs where the department id lines up.- Result: 14 rows (Tanvi and Sameer are dropped because their
dept_idis NULL and can't match anything; Legal is dropped because no employee points to dept 6).
NULL = NULL is true in SQL. It isn't — it's UNKNOWN. That is exactly why Tanvi and Sameer (dept_id = NULL) never match any row in departments, even though departments also never has a NULL dept_id to compare against. NULL never equals anything, not even another NULL.How the database executes this internally (conceptually)
Logically, the engine considers the Cartesian product first — every employee row paired with every department row (16 × 6 = 96 combinations) — and then keeps only the pairs where e.dept_id = d.dept_id is TRUE. In practice, no real database is dumb enough to actually build all 96 rows; the query optimizer picks a smarter physical strategy (nested loop, hash join, or merge join — covered in Level 3). But the logical result is always as if it did the full cross product first and filtered after.
LEFT JOIN (LEFT OUTER JOIN)
What it is: returns all rows from the left table, plus matching rows from the right table. If there's no match on the right, the right side's columns come back as NULL — but the left row is never dropped.
Why it exists: sometimes "no match" is itself important information. You want to know which employees don't have a department yet, not just hide them.
Easy example
SELECT e.emp_name, e.dept_id, d.dept_name
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.dept_id;
Result: all 16 employees appear. Tanvi and Sameer show dept_name = NULL because they have no matching department — but they're still in the output.
Medium example — finding "orphans"
The most powerful real use of LEFT JOIN: find rows in the left table that have no match at all.
SELECT e.emp_name
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.dept_id
WHERE d.dept_id IS NULL; -- keep only the rows where the join failed to find a match
Result: Tanvi Desai, Sameer Gupta. This "LEFT JOIN + WHERE right.key IS NULL" pattern is called an anti-join pattern and shows up constantly in production data-quality checks (e.g. "find orders with no matching customer").
WHERE clause (instead of the ON clause) silently turns your LEFT JOIN back into something that behaves like an INNER JOIN. Example: WHERE d.location = 'Bangalore' in the WHERE clause removes all rows where d.location is NULL — i.e. it throws away exactly the unmatched rows you were trying to keep. If you want to filter the right table without losing unmatched left rows, put the condition inside the ON clause instead: LEFT JOIN departments d ON e.dept_id = d.dept_id AND d.location = 'Bangalore'.RIGHT JOIN & FULL OUTER JOIN
RIGHT JOIN
What it is: the mirror image of LEFT JOIN — keeps all rows from the right table, and NULLs out the left side where there's no match.
SELECT e.emp_name, d.dept_name
FROM employees e
RIGHT JOIN departments d
ON e.dept_id = d.dept_id;
Result: Legal (dept 6) now appears with emp_name = NULL, because RIGHT JOIN keeps every department even if no employee belongs to it.
FROM departments d LEFT JOIN employees e ON ... is identical to the RIGHT JOIN above, and it's easier for a reader to scan top-to-bottom. Codebases standardize on LEFT JOIN for consistency.FULL OUTER JOIN
What it is: keeps everything from both sides. If a row matches, you get the combined row. If it doesn't match on either side, that side is padded with NULLs.
SELECT e.emp_name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d
ON e.dept_id = d.dept_id;
Result: 18 rows — all 16 employees (including Tanvi and Sameer with NULL dept_name) PLUS Legal appearing once with NULL emp_name.
FULL OUTER JOIN directly (as of standard MySQL). The workaround is to UNION a LEFT JOIN and a RIGHT JOIN: SELECT ... FROM a LEFT JOIN b ON ... UNION SELECT ... FROM a RIGHT JOIN b ON .... Interviewers sometimes ask you to know this MySQL limitation specifically.CROSS JOIN & SELF JOIN
CROSS JOIN
What it is: every row from table A paired with every row from table B. No ON condition at all. If A has m rows and B has n rows, the result has m × n rows.
SELECT e.emp_name, d.dept_name
FROM employees e
CROSS JOIN departments d;
-- 16 employees x 6 departments = 96 rows
When to use it: generating all combinations — e.g. every product × every size × every color, or every date in a calendar × every store (to build a report scaffold with zero-filled rows for days with no sales).
ON clause, or writing FROM employees, departments (old comma-join syntax) without a WHERE condition to relate them. This silently produces a Cartesian product — your row count explodes and numbers like SUM() get wildly inflated. This is one of the most common real production bugs.SELF JOIN
What it is: a table joined to itself, using two different aliases to treat it as if it were two tables. Used whenever a row references another row in the same table — like an employee referencing their manager, who is also an employee.
SELECT
e.emp_name AS employee,
m.emp_name AS manager
FROM employees e
LEFT JOIN employees m
ON e.manager_id = m.emp_id;
Line by line: we treat the employees table as two roles at once — e plays "the employee", m plays "the manager". We LEFT JOIN so that top-level people (like Arjun, whose manager_id is NULL) still show up, with manager = NULL.
parent_id pointing back into the same table. To print "child — parent" pairs, you join the table to itself.Multi-table joins
Real queries rarely join just two tables. You chain joins, left to right, and each one narrows or widens the working row set.
SELECT
e.emp_name,
d.dept_name,
m.emp_name AS manager_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id
LEFT JOIN employees m ON e.manager_id = m.emp_id
ORDER BY e.emp_id;
EXPLAIN.Hard example — three-way join with aggregation
Find, for each department, the average salary and the department's own manager count (how many distinct managers exist among its employees).
SELECT
d.dept_name,
ROUND(AVG(e.salary), 2) AS avg_salary,
COUNT(DISTINCT e.manager_id) AS distinct_managers
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name;
Notice we started FROM departments this time (not employees) so that Legal still shows up in the output, with avg_salary = NULL and distinct_managers = 0.
Non-equi joins
What it is: a join where the condition isn't a plain equality (=) — instead it uses <, >, BETWEEN, etc.
When to use it: salary-band lookups, date-range matching, geographic containment — anything where "belongs to a range" replaces "equals an id".
Setup: a salary bands table
CREATE TABLE salary_bands (
band_name VARCHAR(20),
min_salary DECIMAL(10,2),
max_salary DECIMAL(10,2)
);
INSERT INTO salary_bands VALUES
('Junior', 50000, 89999),
('Mid', 90000, 129999),
('Senior', 130000, 169999),
('Staff', 170000, 999999);
SELECT e.emp_name, e.salary, b.band_name
FROM employees e
JOIN salary_bands b
ON e.salary BETWEEN b.min_salary AND b.max_salary;
SEMI JOIN & ANTI JOIN
These aren't separate SQL keywords in standard SQL — they're patterns, usually written with EXISTS/NOT EXISTS or IN/NOT IN, or sometimes simulated with a LEFT JOIN.
SEMI JOIN — "does a match exist?"
What it is: return rows from table A where at least one match exists in table B — but never duplicate A's rows even if B has multiple matches, and never pull any columns from B.
-- Departments that have at least one employee
SELECT d.dept_name
FROM departments d
WHERE EXISTS (
SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
);
Compare this to a naive INNER JOIN version:
-- BAD for this purpose: returns Engineering 4 times (once per employee)
SELECT DISTINCT d.dept_name
FROM departments d
JOIN employees e ON d.dept_id = e.dept_id;
The INNER JOIN version needs a DISTINCT bolted on to fix the duplication it caused. The EXISTS version never had that problem, because it only checks "does a match exist", it never actually multiplies rows.
ANTI JOIN — "no match exists"
-- Departments with ZERO employees (Legal)
SELECT d.dept_name
FROM departments d
WHERE NOT EXISTS (
SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
);
NOT IN behaves dangerously if the subquery can return any NULL. WHERE dept_id NOT IN (SELECT dept_id FROM employees) — if even ONE row in the subquery has dept_id = NULL, the entire NOT IN returns an empty result set (not an error, just silently wrong / empty), because comparing anything to NULL gives UNKNOWN, and NOT IN requires every comparison to definitively succeed. In our data, employees.dept_id contains NULLs (Tanvi, Sameer) — so NOT IN on this column would silently return zero rows! NOT EXISTS has no such trap, which is why professionals default to NOT EXISTS over NOT IN whenever the column can contain NULLs.How joins execute internally
The database's query optimizer picks one of three physical strategies to actually run a logical join. Knowing these separates people who memorized syntax from people who understand databases.
1. Nested Loop Join
For every row in the outer (usually smaller) table, scan the inner table looking for matches. Simple, but O(n × m) in the worst case — unless the inner table has an index on the join column, which turns the inner scan into a fast lookup (O(n × log m) or better).
2. Hash Join
Build a hash table in memory from the smaller table, keyed on the join column. Then scan the bigger table once, and for each row, do an O(1) hash lookup into that hash table. Overall roughly O(n + m).
=) conditions — they can't be used for non-equi joins like BETWEEN, since a hash table only supports exact-key lookups.3. Merge Join (Sort-Merge Join)
If both tables are already sorted on the join key (or the optimizer sorts them first), the engine can walk both sorted lists with two pointers simultaneously, like merging two sorted decks of cards. Very efficient — O(n + m) after sorting, and works for non-equi range conditions too.
EXPLAIN (or EXPLAIN ANALYZE) on any slow join to see which strategy the planner actually picked, and why.Cardinality & the duplicate explosion trap
"Cardinality" of a join describes the relationship between matching rows: one-to-one, one-to-many, or many-to-many.
- One-to-one: each row on both sides matches exactly one row on the other side. Row count stays the same after the join.
- One-to-many: one row on the "one" side matches multiple rows on the "many" side. Example: one department matches multiple employees. The department's row gets repeated once per matching employee.
- Many-to-many: multiple rows on both sides can match each other. Row count can explode multiplicatively.
orders (one row per order) to order_items (multiple rows per order — one per product in the order), then try to SUM(orders.total_amount). Because the join repeats each order row once per item, you're now summing the same order total multiple times — your revenue number is inflated. The fix: aggregate order_items down to one row per order first (in a subquery or CTE), THEN join to orders — or aggregate carefully with SUM(DISTINCT ...) style logic (which has its own caveats), or restructure the query entirely. This single bug pattern is probably the #1 real-world SQL mistake in reporting pipelines.
Concrete demo with our data
Departments-to-employees is one-to-many. Watch what happens if we naively try to compute "total company salary" while joined to employees without pre-aggregating:
-- WRONG idea if departments itself had a "budget" column and we joined + summed budget
-- naive: SELECT SUM(d.budget) FROM departments d JOIN employees e ON d.dept_id = e.dept_id;
-- This would count each department's budget once PER EMPLOYEE in that department — inflated!
A. Join + Aggregation traps (the #1 production SQL bug)
You already met this idea in the cardinality section. Now let's make it concrete with the schema every data engineer eventually deals with: orders (one row per order) and order_items (many rows per order — one per line item).
DDL & sample data
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2) -- the order's true total, computed once at checkout
);
CREATE TABLE order_items (
item_id INT PRIMARY KEY,
order_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL(10,2)
);
INSERT INTO orders VALUES
(5001, 201, '2024-01-10', 2500.00),
(5002, 202, '2024-01-11', 1200.00),
(5003, 201, '2024-01-15', 800.00);
INSERT INTO order_items VALUES
(1, 5001, 'P1', 2, 1000.00), -- order 5001 has 3 line items
(2, 5001, 'P2', 1, 500.00),
(3, 5001, 'P3', 1, 500.00),
(4, 5002, 'P1', 1, 1200.00), -- order 5002 has 1 line item
(5, 5003, 'P4', 2, 400.00); -- order 5003 has 1 line item
| order_id | customer_id | total_amount |
|---|---|---|
| 5001 | 201 | 2500.00 |
| 5002 | 202 | 1200.00 |
| 5003 | 201 | 800.00 |
Real total revenue = 2500 + 1200 + 800 = 4500.00
Trap 1 — the naive join-then-sum
-- WRONG
SELECT SUM(o.total_amount) AS revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
-- Result: 8300.00 (should be 4500.00)
Why it's wrong: order 5001 has 3 line items, so the JOIN produces 3 output rows for it — and each of those 3 rows still carries total_amount = 2500.00. Summing that column now adds 2500 three times instead of once. Order 5001 alone contributes 2500 × 3 = 7500 to the wrong sum instead of 2500. This is the single most common "why is our dashboard revenue number wrong" bug in real companies.
Trap 2 — "just add DISTINCT" (still wrong, or fragile)
-- STILL WRONG, and fragile
SELECT SUM(DISTINCT o.total_amount) AS revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
-- Works here by accident, but breaks the moment two DIFFERENT orders share the same total_amount
If order 5002 and a hypothetical order 5004 both happened to total exactly 1200.00, SUM(DISTINCT ...) would collapse them into one and undercount. DISTINCT deduplicates by value, not by which row it came from — never use it as a duplication fix for aggregates.
Trap 3 — the correct fix: pre-aggregate before joining
-- CORRECT: don't even join to order_items if you only need the order total
SELECT SUM(o.total_amount) AS revenue
FROM orders o;
-- Result: 4500.00 ✓
If you actually need something from order_items (say, total quantity sold) alongside the order total, aggregate order_items down to one row per order first, in a subquery/CTE, then join — never join first and aggregate the "one" side's column after.
-- CORRECT: pre-aggregate the "many" side to order-grain BEFORE joining
SELECT
o.order_id,
o.total_amount,
item_totals.total_qty
FROM orders o
JOIN (
SELECT order_id, SUM(quantity) AS total_qty
FROM order_items
GROUP BY order_id
) item_totals ON o.order_id = item_totals.order_id;
Now the join is one-to-one (order to its own pre-aggregated summary), so o.total_amount is never duplicated.
Trap 4 — the same bug hiding inside AVG()
-- WRONG: looks innocent, still broken
SELECT o.customer_id, AVG(o.total_amount) AS avg_order_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.customer_id;
-- Customer 201's order 5001 (weight x3 due to 3 items) skews the average toward 2500
AVG isn't safe just because it "averages things out" — it's still averaging over duplicated rows, so orders with more line items get more influence on the average than orders with fewer line items. The fix is identical: pre-aggregate order_items to order-grain first, or simply compute the average directly from the un-joined orders table.
JOIN ... GROUP BY ... SUM/AVG/COUNT(some_column), ask: "does some_column belong to a table whose grain got finer because of this join?" If yes, pre-aggregate that table to the right grain first, in a subquery or CTE, before joining.B. Slowly Changing Dimension (SCD Type 2) joins
What it is: in a data warehouse, dimension attributes change over time (an employee's salary band changes, a product's price changes, a customer moves cities). SCD Type 2 keeps every historical version of a dimension row, each tagged with an effective_start and effective_end date, instead of overwriting the old value.
Why it exists: if you overwrote the dimension row in place (SCD Type 1), historical fact rows would silently be re-attributed to the current value — a sale made in 2022 at an old price would suddenly look like it used today's price when you re-run a report. SCD2 preserves "what was true at the time the fact happened."
DDL & sample data
CREATE TABLE dim_customer_scd2 (
customer_sk INT PRIMARY KEY, -- surrogate key: unique per VERSION of the customer
customer_id INT, -- natural/business key: same across all versions
customer_name VARCHAR(50),
city VARCHAR(50),
effective_start DATE,
effective_end DATE, -- '9999-12-31' means "currently active"
is_current BOOLEAN
);
INSERT INTO dim_customer_scd2 VALUES
(1, 301, 'Rahul Bansal', 'Delhi', '2022-01-01', '2023-06-30', FALSE), -- old version
(2, 301, 'Rahul Bansal', 'Bangalore', '2023-07-01', '9999-12-31', TRUE), -- current version
(3, 302, 'Neha Kulkarni','Pune', '2021-05-01', '9999-12-31', TRUE);
CREATE TABLE fact_sales (
sale_id INT PRIMARY KEY,
customer_id INT,
sale_date DATE,
amount DECIMAL(10,2)
);
INSERT INTO fact_sales VALUES
(9001, 301, '2023-02-14', 5000.00), -- happened while Rahul was in Delhi
(9002, 301, '2023-09-01', 3000.00), -- happened while Rahul was in Bangalore
(9003, 302, '2022-03-10', 1500.00);
The SCD2 join — matching a fact to the dimension version active on that date
SELECT
f.sale_id,
f.sale_date,
f.amount,
d.city AS city_at_time_of_sale
FROM fact_sales f
JOIN dim_customer_scd2 d
ON f.customer_id = d.customer_id
AND f.sale_date BETWEEN d.effective_start AND d.effective_end;
Result: sale 9001 (Feb 2023) correctly joins to Rahul's Delhi version; sale 9002 (Sep 2023) correctly joins to his Bangalore version — even though both facts share the same customer_id. This is a non-equi join (from Level 3) applied to real dimensional modeling.
customer_id (the natural key) without the date-range condition causes a fan-out: each sale matches every historical version of that customer, multiplying rows and corrupting any SUM/COUNT downstream — this is the SCD2 flavor of the aggregation trap from section A.customer_sk) when possible — resolved once at ETL load time via exactly this date-range join — rather than re-resolving the SCD2 range join on every query. It's faster and avoids every analyst having to remember the BETWEEN logic.C. Range joins with dates — price validity example
The same non-equi range-join pattern from salary bands and SCD2 shows up constantly for "what was the price/rate/rule in effect on date X" problems — a product_price_history table is the classic example.
CREATE TABLE product_price_history (
price_id INT PRIMARY KEY,
product_id VARCHAR(10),
price DECIMAL(10,2),
valid_from DATE,
valid_to DATE -- '9999-12-31' if still the current price
);
INSERT INTO product_price_history VALUES
(1, 'P1', 1000.00, '2023-01-01', '2023-12-31'),
(2, 'P1', 1100.00, '2024-01-01', '9999-12-31'), -- price increase
(3, 'P4', 380.00, '2023-06-01', '2024-02-28'),
(4, 'P4', 400.00, '2024-03-01', '9999-12-31');
-- Attach the price that was valid on each order_item's order date
SELECT
oi.item_id,
oi.product_id,
o.order_date,
pph.price AS price_at_order_time
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN product_price_history pph
ON oi.product_id = pph.product_id
AND o.order_date BETWEEN pph.valid_from AND pph.valid_to;
valid_from/valid_to ranges per key before relying on range joins against it.price_id to each order_item once) instead of re-running the range join on every downstream query.D. Many-to-many bridge tables
What it is: a many-to-many relationship (one student can take many courses, one course can have many students) can't be modeled with a single foreign key on either side. The fix is a third table — a bridge (a.k.a. junction/associative) table — holding one row per (student, course) pairing.
CREATE TABLE students (
student_id INT PRIMARY KEY,
student_name VARCHAR(50)
);
CREATE TABLE courses (
course_id INT PRIMARY KEY,
course_name VARCHAR(50)
);
CREATE TABLE student_courses ( -- the bridge table
student_id INT,
course_id INT,
enrolled_on DATE,
PRIMARY KEY (student_id, course_id),
FOREIGN KEY (student_id) REFERENCES students(student_id),
FOREIGN KEY (course_id) REFERENCES courses(course_id)
);
INSERT INTO students VALUES (1,'Aarav'), (2,'Diya'), (3,'Kabir');
INSERT INTO courses VALUES (10,'Databases'), (20,'Algorithms'), (30,'Statistics');
INSERT INTO student_courses VALUES
(1, 10, '2024-01-05'), -- Aarav takes Databases
(1, 20, '2024-01-05'), -- Aarav also takes Algorithms
(2, 10, '2024-01-06'), -- Diya takes Databases
(3, 30, '2024-01-07'); -- Kabir takes Statistics only
-- Notice: Algorithms(20) has only 1 student, Statistics(30) has only 1, nobody takes nothing (Kabir isn't in Algorithms)
Easy example — list each student with each of their course names
SELECT s.student_name, c.course_name
FROM students s
JOIN student_courses sc ON s.student_id = sc.student_id
JOIN courses c ON sc.course_id = c.course_id;
Two joins in a row: student → bridge → course. This double-hop is the defining shape of every bridge-table query.
Medium example — students enrolled in NEITHER Databases nor Algorithms
SELECT s.student_name
FROM students s
WHERE s.student_id NOT IN (
SELECT sc.student_id
FROM student_courses sc
JOIN courses c ON sc.course_id = c.course_id
WHERE c.course_name IN ('Databases', 'Algorithms')
);
-- Result: Kabir (he's only in Statistics)
Hard example — courses with zero students (need a LEFT JOIN through the bridge)
SELECT c.course_name
FROM courses c
LEFT JOIN student_courses sc ON c.course_id = sc.course_id
WHERE sc.student_id IS NULL;
-- With this data, every course has at least one student, so this returns empty —
-- add a course with no enrollments to see it trigger.
E. Join elimination (optimizer internals)
What it is: a query optimization where the database engine detects that a join in your query is unnecessary to produce the requested output, and silently removes it from the execution plan — without changing the result.
When it happens: most commonly when you join to a table only to filter or just to prove a row exists, but never actually select any column from it, AND the join is provably safe to skip — for example, joining on a foreign key that's guaranteed unique and non-null (so the join can't duplicate or drop rows).
-- You wrote this:
SELECT e.emp_name, e.salary
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;
-- If dept_id has a NOT NULL foreign key constraint referencing departments.dept_id (guaranteed to exist and be unique),
-- a smart optimizer can recognize that the join to `departments` can never filter out
-- or duplicate any row of `employees` -- since every non-null dept_id is guaranteed to match exactly one row --
-- so it eliminates the join entirely and just scans employees.
dept_id column isn't actually declared as a NOT NULL foreign key (even if it "happens" to always be valid), most optimizers won't risk eliminating the join — they can't prove it's safe. This is a good real-world argument for actually declaring your foreign key constraints instead of only enforcing them in application code.F. Broadcast joins (distributed warehouse concept)
What it is: in distributed engines like Apache Spark, Snowflake, or BigQuery, data is split across many worker nodes. A normal join between two large distributed tables requires shuffling — physically moving rows across the network so matching keys land on the same node. Shuffling is expensive.
The trick: if one side of the join is small (a dimension table that fits comfortably in memory — say, a few MB to a few hundred MB), the engine can instead copy ("broadcast") that entire small table to every worker node. Then each node does a local, in-memory join against its slice of the big fact table — no shuffling of the huge table required at all.
-- Conceptual Spark SQL hint (syntax varies by engine):
SELECT /*+ BROADCAST(d) */
f.sale_id, f.amount, d.category
FROM fact_sales f
JOIN dim_product d ON f.product_id = d.product_id;
-- dim_product is small (e.g. a few thousand rows) -> broadcast it to every node
-- fact_sales is huge (billions of rows) -> stays put, never shuffled
G. Skewed joins & salting
What it is: in a distributed shuffle join, rows are partitioned across worker nodes based on a hash of the join key. If one key value is wildly overrepresented (say, 90% of your fact_sales rows have customer_id = 999 — a bulk "guest checkout" account), every one of those rows hashes to the same worker node. That one node ends up doing 90% of the work while every other node sits idle — a classic data skew bottleneck.
Recognizing skew
-- Diagnostic: check for skew before blaming "the join is just slow"
SELECT customer_id, COUNT(*) AS row_count
FROM fact_sales
GROUP BY customer_id
ORDER BY row_count DESC
LIMIT 5;
-- If the #1 row dwarfs the rest (e.g. 90M rows vs a normal customer's 200 rows), that's skew.
The fix: salting
Salting artificially splits an overloaded key into several fake sub-keys, spreading its rows across multiple worker nodes, then joins against a correspondingly "exploded" copy of the small side, and finally removes the salt.
-- Step 1: add a random salt (0-9) to the skewed fact table's join key
SELECT
f.*,
CONCAT(f.customer_id, '_', CAST(FLOOR(RAND() * 10) AS INT)) AS salted_customer_id
FROM fact_sales f;
-- Step 2: explode the small dimension side so every real customer_id
-- exists once for each possible salt value (0-9), matching the scheme above
SELECT
d.*,
CONCAT(d.customer_id, '_', salt.n) AS salted_customer_id
FROM dim_customer d
CROSS JOIN (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7
UNION ALL SELECT 8 UNION ALL SELECT 9) salt;
-- Step 3: join on the salted key instead of the raw key.
-- The skewed customer's rows are now spread across 10 different salted keys,
-- landing on up to 10 different worker nodes instead of piling onto one.
How interviewers trick people on JOINs
- The NULL trap: they slip a NULL foreign key into the sample data (like Tanvi/Sameer here) and ask for "all employees and their department" — if you use INNER JOIN, you silently drop people, and most candidates don't even notice the row count is wrong.
- The WHERE-vs-ON trap: they ask you to LEFT JOIN and then filter on the right table in
WHERE, secretly turning it back into an INNER JOIN, and ask "why did dept X disappear?" - The duplicate row count trap: they ask "how many employees are there?" after a multi-table join and expect you to realize a one-to-many join inflated
COUNT(*)— the correct answer needsCOUNT(DISTINCT e.emp_id). - The self-join off-by-one trap: asking for "employee and manager's manager" (two levels up) forces you to self-join the same table twice with three aliases — many candidates fumble the aliasing.
- The NOT IN + NULL trap: covered above — a favorite "gotcha" question is simply "what's wrong with this NOT IN query?" where the subquery column contains NULLs.
- The implicit cross join trap: old-style comma joins with a forgotten WHERE clause. They show you a query with an unexpectedly huge row count and ask you to spot the bug.
3 Hard Interview Questions
H1.Find employees earning strictly more than their own manager. hard▶
emp_name, emp_salary, manager_name, manager_salary — for any employee whose salary beats their manager's.
This needs a self join. Compare each employee's salary against the salary of the row where emp_id = manager_id.
SELECT
e.emp_name AS employee,
e.salary AS employee_salary,
m.emp_name AS manager,
m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.emp_id
WHERE e.salary > m.salary;
With this dataset: no one currently beats their manager, so the result is empty — which is itself a valid, correct answer. (Try changing Sneha's salary to 150000 mentally — she'd then beat Priya.)
Same logic using a correlated subquery instead of a join:
SELECT e.emp_name, e.salary
FROM employees e
WHERE e.salary > (
SELECT m.salary FROM employees m WHERE m.emp_id = e.manager_id
);
The JOIN version is generally preferred over the correlated subquery at scale, because most optimizers turn the JOIN into a single efficient hash join, while the correlated subquery form can force a row-by-row nested loop unless the optimizer is smart enough to rewrite it (many are, but don't rely on it).
O(n) with an index on emp_id (self join via index lookup); O(n²) worst case with no index and a naive nested loop.
- Using LEFT JOIN instead of INNER JOIN here would incorrectly include top-level employees (manager_id IS NULL) with a NULL manager_salary, and the comparison
salary > NULLis UNKNOWN, so they'd silently drop out anyway — but it's sloppy to include them at all conceptually. - Forgetting the self-join needs two distinct aliases.
- "What if manager_id could point to someone in a different table?" (it can't here since it's self-referencing, but discuss FK design)
- "How would you find employees earning more than the AVERAGE of their department?" (bridges into subqueries/window functions)
H2.List every department together with its employee count — including departments with zero employees. hard▶
dept_name, employee_count — Legal should show 0, not be missing from the output.
Start FROM departments (the side you want guaranteed to fully appear), LEFT JOIN employees, and be careful which column you COUNT.
SELECT
d.dept_name,
COUNT(e.emp_id) AS employee_count
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name;
- Using
COUNT(*)instead ofCOUNT(e.emp_id)— this is the classic trap. For Legal, the LEFT JOIN produces exactly one output row with all-NULL employee columns.COUNT(*)counts that row as 1 (wrong! Legal has 0 employees).COUNT(e.emp_id)correctly counts 0, becauseCOUNT(column)ignores NULLs, ande.emp_idis NULL in that unmatched row. - Using INNER JOIN instead of LEFT JOIN — Legal disappears from the output entirely instead of showing 0.
- "Now also show average salary per department, safely handling the zero-employee case." (AVG over an empty group already returns NULL safely — no special handling needed, unlike SUM which returns NULL too but people sometimes expect 0.)
H3.Find pairs of employees in the same department who have the exact same salary (salary "twins"), without pairing an employee with themselves and without listing the same pair twice. hard▶
For our data: (Priya Nair, Rohit Sharma[103], 145000) and (Ananya Iyer, Vikram Singh, 92000).
Self join on dept_id AND salary, then use e1.emp_id < e2.emp_id to avoid self-pairing and duplicate reversed pairs in one shot.
SELECT
e1.emp_name AS employee_1,
e2.emp_name AS employee_2,
e1.dept_id,
e1.salary
FROM employees e1
JOIN employees e2
ON e1.dept_id = e2.dept_id
AND e1.salary = e2.salary
AND e1.emp_id < e2.emp_id;
- Using
e1.emp_id != e2.emp_idinstead of<— this avoids self-pairing but still produces BOTH (A,B) and (B,A), doubling the results. - Forgetting to also match on
dept_id— without it, you'd find salary twins across different departments too, which wasn't asked for (that's a different, also valid, question — always re-read the requirement).
- "Extend this to find triplets with the same salary." (Doesn't scale well with self-joins — this is where window functions with RANK/COUNT OVER PARTITION become the better tool — foreshadowing the Window Functions module.)
2 FAANG-Level Questions
F1.Deduplicate employees table: some rows are exact duplicates by (emp_name, dept_id, salary) except for a different emp_id. Delete the extras, keeping the lowest emp_id per group. faang▶
This is a classic production data-quality problem: a broken ETL job double-inserted rows. Note: in our actual dataset, Rohit Sharma appears twice but they are genuinely different people (different dept_id and salary) — so they are NOT duplicates, and this question is about the general pattern, imagine a hypothetical exact duplicate crept in.
Self-join the table to itself on the duplicate-defining columns, keep only pairs where the "other" row has a smaller id, and delete those.
DELETE e1 FROM employees e1
JOIN employees e2
ON e1.emp_name = e2.emp_name
AND e1.dept_id = e2.dept_id
AND e1.salary = e2.salary
AND e1.emp_id > e2.emp_id; -- delete the "later" duplicate, keep the smaller id
Using a window function (foreshadowing Module 6) — often cleaner and portable across databases that don't support multi-table DELETE syntax:
DELETE FROM employees
WHERE emp_id IN (
SELECT emp_id FROM (
SELECT
emp_id,
ROW_NUMBER() OVER (
PARTITION BY emp_name, dept_id, salary
ORDER BY emp_id
) AS rn
FROM employees
) ranked
WHERE rn > 1
);
For huge tables, doing this as a single DELETE can lock a lot of rows / generate huge undo logs. Production approach: write the deduplicated result to a new table (CREATE TABLE employees_clean AS SELECT ... WHERE rn = 1), validate row counts, then swap table names — much safer and often faster than in-place delete at scale.
Self-join approach: O(n²) without an index on the duplicate-key columns, O(n log n) with one. Window function approach: O(n log n) for the sort/partition.
- Using
<>instead of>in the self-join delete condition — this deletes BOTH copies of every duplicate pair instead of keeping one. - Not wrapping this in a transaction / not testing with a SELECT first in production — always convert your DELETE to a SELECT of the same rows and eyeball it before running the real delete.
- "What if the table has a billion rows and no useful index — how would you approach it differently?" (batch deletes, or rebuild via CTAS as above)
F2.For every employee, find their "skip-level" manager (their manager's manager) using only joins — no recursive CTE allowed. faang▶
This tests whether you understand that a self-join can be chained multiple times for fixed-depth hierarchy questions, and where that approach breaks down (motivating recursive CTEs later).
Self join employees to itself twice: once to get the direct manager, once more to get that manager's manager.
SELECT
e.emp_name AS employee,
m.emp_name AS manager,
gm.emp_name AS skip_level_manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id
LEFT JOIN employees gm ON m.manager_id = gm.emp_id;
Example result: Sneha Kapoor → Priya Nair → Arjun Mehta.
Same idea but only surfacing rows that actually have a skip-level manager:
SELECT e.emp_name, gm.emp_name AS skip_level_manager
FROM employees e
JOIN employees m ON e.manager_id = m.emp_id
JOIN employees gm ON m.manager_id = gm.emp_id;
This chained-self-join approach only works for a fixed number of hierarchy levels you hardcode (2, 3, 4 joins...). It cannot handle arbitrary-depth trees where you don't know the depth in advance. That's exactly the limitation that motivates Recursive CTEs (Module 4) — flag this explicitly in an interview, it shows depth of understanding.
O(n) per join level with an index on emp_id/manager_id; k chained joins for k levels of depth.
- Using INNER JOIN throughout when the question implies you should still show employees without a skip-level manager — pick LEFT JOIN or INNER JOIN deliberately based on the exact requirement, and state your assumption out loud.
- "What if the org chart is 10 levels deep and varies per employee?" — correct answer: "That's not solvable cleanly with fixed self-joins; I'd use a Recursive CTE."
1 Production Scenario
ETL enrichment join with late-arriving dimension data
You run a nightly ETL job that joins a fact table daily_sales (transaction-level, millions of rows/day) to a dimension table products to attach product category and brand for a reporting dashboard.
The real problem: occasionally a sale references a product_id that hasn't been inserted into products yet — the product dimension load runs slightly after the sales load ("late-arriving dimension"). An INNER JOIN silently drops those sales rows from the report entirely — revenue looks lower than it actually was, and nobody notices until someone reconciles totals against the source system.
Production-grade fix:
SELECT
s.sale_id,
s.sale_amount,
COALESCE(p.category, 'UNKNOWN') AS category,
COALESCE(p.brand, 'UNKNOWN') AS brand
FROM daily_sales s
LEFT JOIN products p ON s.product_id = p.product_id;
Use LEFT JOIN (never INNER JOIN) for fact-to-dimension enrichment in ETL, and default unmatched dimension attributes to an explicit 'UNKNOWN' placeholder rather than leaving NULL — this makes the gap visible in dashboards (a report slice literally labeled "UNKNOWN" prompts investigation) instead of silently disappearing. Teams typically also add a monitoring query that counts COUNT(*) WHERE category = 'UNKNOWN' daily and alerts if it spikes.
10 Practice Questions
Write the SQL yourself before expanding each answer. Use the employees / departments schema above.
P1.List every employee with their department name (only employees who have a department). easy▶
SELECT e.emp_name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;P2.List every employee, including those with no department (show NULL for dept_name). easy▶
SELECT e.emp_name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;P3.Find all departments that have zero employees. easy▶
SELECT d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL;P4.Find all employees with no department assigned. easy▶
SELECT e.emp_name
FROM employees e
WHERE e.dept_id IS NULL;
-- (a join isn't even required here, but with a LEFT JOIN + IS NULL pattern also works)P5.For each employee, show their direct manager's name (NULL if none). medium▶
SELECT e.emp_name, m.emp_name AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id;P6.Count how many direct reports each manager has (managers with 0 reports don't need to appear). medium▶
SELECT m.emp_name AS manager, COUNT(e.emp_id) AS direct_reports
FROM employees m
JOIN employees e ON e.manager_id = m.emp_id
GROUP BY m.emp_name;P7.List every possible (department, salary_band) combination using the salary_bands table from the Non-equi Joins section, even combos with no employees. medium▶
SELECT d.dept_name, b.band_name
FROM departments d
CROSS JOIN salary_bands b;P8.Find departments located in the same city as at least one other department (self join on departments). medium▶
SELECT DISTINCT d1.dept_name, d1.location
FROM departments d1
JOIN departments d2
ON d1.location = d2.location
AND d1.dept_id != d2.dept_id;
-- Engineering & HR share Bangalore; Marketing & Legal share DelhiP9.Find employees who earn more than every employee in the Marketing department (non-equi style thinking). hard▶
SELECT e.emp_name, e.salary
FROM employees e
WHERE e.salary > (
SELECT MAX(m.salary)
FROM employees m
WHERE m.dept_id = 3
);
-- Pure-join version is awkward here; this is a preview of why subqueries exist.P10.Show each department with its headcount and average salary, ordered by average salary descending, including empty departments as 0/NULL. hard▶
SELECT
d.dept_name,
COUNT(e.emp_id) AS headcount,
ROUND(AVG(e.salary),2) AS avg_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY avg_salary DESC;Before you say NEXT: make sure you can explain, out loud and without looking, the difference between INNER/LEFT/RIGHT/FULL, why NOT IN is dangerous with NULLs, why COUNT(*) vs COUNT(column) matters after a LEFT JOIN, and the difference between a hash join and a nested loop join. When you're ready, reply NEXT and we'll move to Module 2: Subqueries.