Module 1 — JOINS

Everything from "what is a join" to how a query planner physically executes one, and how FAANG interviewers trap people who only memorized syntax.

Level 1 · Foundations

Why JOINs exist

A real database almost never stores everything in one giant table. Instead, information is split across many small tables, each holding one kind of thing. This is called normalization — it avoids repeating the same data over and over.

Analogy Imagine two Excel sheets. Sheet 1 has a list of employees with a column called dept_id (just a number, like "3"). Sheet 2 has a list of departments with their names, matched to that same number. Neither sheet alone tells you the full story — "employee John is in dept 3" is meaningless until you look up what dept 3 is called. A JOIN is the act of using VLOOKUP (Excel's version of a join) to stitch the two sheets together into one combined view.

What a JOIN actually is

A JOIN is a way to combine rows from two (or more) tables based on a related column between them. That related column is usually a foreign key in one table pointing to a primary key in another.

When to use it

Whenever the answer to your question needs data that lives in more than one table.
Whenever you want to enrich raw IDs (like dept_id = 3) with human-readable data (like "Engineering").
Whenever you need to check relationships — who reports to whom, which orders belong to which customer, etc.

When NOT to use it

When all the data you need already lives in one table — joining unnecessarily just slows the query down.
When you only need to check existence of a related row (does this customer have any order at all?) — often an EXISTS subquery is faster and clearer than a JOIN, because a JOIN can multiply rows while EXISTS just answers yes/no. We'll cover this in the Subqueries module.

🧠

Mental model: think of a JOIN as the database building a giant temporary combined table in memory (conceptually — it doesn't always literally materialize it), where every row from table A is checked against every row from table B, and only pairs that match the join condition survive into the result. Everything you'll learn about join types is really just: "what happens to the rows that don't find a match?"

The schema and data we'll use for this entire module

We'll reuse these two tables — employees and departments — for every join type, so you see how the exact same data behaves differently under each join. The data is deliberately messy: it has an employee with no department, a department with no employees, duplicate names, and a self-referencing manager column.

DDL

CREATE TABLE departments (
    dept_id     INT PRIMARY KEY,
    dept_name   VARCHAR(50) NOT NULL,
    location    VARCHAR(50)
);

CREATE TABLE employees (
    emp_id      INT PRIMARY KEY,
    emp_name    VARCHAR(50) NOT NULL,
    dept_id     INT,              -- can be NULL: not every employee is assigned yet
    manager_id  INT,              -- self-referencing FK to emp_id, can be NULL (top of hierarchy)
    salary      DECIMAL(10,2),
    hire_date   DATE,
    FOREIGN KEY (dept_id) REFERENCES departments(dept_id),
    FOREIGN KEY (manager_id) REFERENCES employees(emp_id)
);

INSERT statements

INSERT INTO departments (dept_id, dept_name, location) VALUES
(1, 'Engineering', 'Bangalore'),
(2, 'Sales',       'Mumbai'),
(3, 'Marketing',   'Delhi'),
(4, 'HR',          'Bangalore'),
(5, 'Finance',     'Pune'),
(6, 'Legal',       'Delhi');   -- Legal has ZERO employees on purpose

INSERT INTO employees (emp_id, emp_name, dept_id, manager_id, salary, hire_date) VALUES
(101, 'Arjun Mehta',    1,    NULL, 185000, '2019-03-01'),  -- CTO, no manager (top of tree)
(102, 'Priya Nair',     1,    101,  145000, '2020-01-15'),
(103, 'Rohit Sharma',   1,    101,  145000, '2020-06-10'),  -- tie salary with Priya
(104, 'Sneha Kapoor',   1,    102,  98000,  '2021-02-20'),
(105, 'Karan Verma',    2,    NULL, 160000, '2018-11-05'),  -- Sales head
(106, 'Ananya Iyer',    2,    105,  92000,  '2021-07-01'),
(107, 'Vikram Singh',   2,    105,  92000,  '2022-01-10'),  -- tie salary with Ananya
(108, 'Rohit Sharma',   2,    105,  88000,  '2022-03-15'),  -- DUPLICATE NAME, different person
(109, 'Meera Pillai',   3,    NULL, 110000, '2019-08-01'),
(110, 'Aditya Rao',     3,    109,  75000,  '2023-01-05'),
(111, 'Divya Menon',    4,    NULL, 105000, '2020-04-01'),
(112, 'Farhan Khan',    4,    111,  60000,  '2023-05-20'),
(113, 'Ishita Bose',    5,    NULL, 120000, '2019-09-15'),
(114, 'Nikhil Joshi',   5,    113,  70000,  '2022-08-01'),
(115, 'Tanvi Desai',    NULL, NULL, 55000,  '2024-01-10'),  -- NOT assigned to any dept yet
(116, 'Sameer Gupta',   NULL, NULL, 52000,  '2024-02-14');  -- another unassigned employee

departments (6 rows)
dept_id	dept_name	location
1	Engineering	Bangalore
2	Sales	Mumbai
3	Marketing	Delhi
4	HR	Bangalore
5	Finance	Pune
6	Legal	Delhi

employees (16 rows, trimmed columns)
emp_id	emp_name	dept_id	mgr_id	salary
101	Arjun Mehta	1	NULL	185000
102	Priya Nair	1	101	145000
103	Rohit Sharma	1	101	145000
104	Sneha Kapoor	1	102	98000
105	Karan Verma	2	NULL	160000
106	Ananya Iyer	2	105	92000
107	Vikram Singh	2	105	92000
108	Rohit Sharma	2	105	88000
109	Meera Pillai	3	NULL	110000
110	Aditya Rao	3	109	75000
111	Divya Menon	4	NULL	105000
112	Farhan Khan	4	111	60000
113	Ishita Bose	5	NULL	120000
114	Nikhil Joshi	5	113	70000
115	Tanvi Desai	NULL	NULL	55000
116	Sameer Gupta	NULL	NULL	52000

NoticeThe edge cases baked in on purpose: Legal (dept 6) has no employees. Tanvi and Sameer have no department. Two pairs of employees tie on salary. Two different employees are both named "Rohit Sharma". Keep this in your head — every join type below will treat these edge cases differently.

Level 1 · Beginner

INNER JOIN

What it is: returns only the rows where the join condition matches on both sides. If an employee has no matching department, or a department has no matching employee, those rows are dropped.

Why it exists: most of the time you only care about complete, matched pairs — "give me employees together with their department name" only makes sense for employees who actually have a department.

AnalogyTwo friends trying to find matching board-game nights. INNER JOIN only lists the nights where both of them were free. If only one of them was free that night, it doesn't count.

Easy example

Find employee name and department name, for employees who have a department.

SELECT e.emp_name, d.dept_name
FROM employees e
INNER JOIN departments d
    ON e.dept_id = d.dept_id;

Line by line:

FROM employees e — start reading from the employees table, alias it e so we can refer to it shorter.
INNER JOIN departments d — bring in the departments table, alias d.
ON e.dept_id = d.dept_id — this is the matching rule: only keep pairs where the department id lines up.
Result: 14 rows (Tanvi and Sameer are dropped because their dept_id is NULL and can't match anything; Legal is dropped because no employee points to dept 6).

Common mistakeBeginners think NULL = NULL is true in SQL. It isn't — it's UNKNOWN. That is exactly why Tanvi and Sameer (dept_id = NULL) never match any row in departments, even though departments also never has a NULL dept_id to compare against. NULL never equals anything, not even another NULL.

How the database executes this internally (conceptually)

Logically, the engine considers the Cartesian product first — every employee row paired with every department row (16 × 6 = 96 combinations) — and then keeps only the pairs where e.dept_id = d.dept_id is TRUE. In practice, no real database is dumb enough to actually build all 96 rows; the query optimizer picks a smarter physical strategy (nested loop, hash join, or merge join — covered in Level 3). But the logical result is always as if it did the full cross product first and filtered after.

Level 1 · Beginner

LEFT JOIN (LEFT OUTER JOIN)

What it is: returns all rows from the left table, plus matching rows from the right table. If there's no match on the right, the right side's columns come back as NULL — but the left row is never dropped.

Why it exists: sometimes "no match" is itself important information. You want to know which employees don't have a department yet, not just hide them.

AnalogyA wedding guest list (left table) and an RSVP sheet (right table). LEFT JOIN prints every guest, and if they RSVP'd, shows their response — if not, it just shows blank instead of leaving them off the invite list entirely.

Easy example

SELECT e.emp_name, e.dept_id, d.dept_name
FROM employees e
LEFT JOIN departments d
    ON e.dept_id = d.dept_id;

Result: all 16 employees appear. Tanvi and Sameer show dept_name = NULL because they have no matching department — but they're still in the output.

Medium example — finding "orphans"

The most powerful real use of LEFT JOIN: find rows in the left table that have no match at all.

SELECT e.emp_name
FROM employees e
LEFT JOIN departments d
    ON e.dept_id = d.dept_id
WHERE d.dept_id IS NULL;   -- keep only the rows where the join failed to find a match

Result: Tanvi Desai, Sameer Gupta. This "LEFT JOIN + WHERE right.key IS NULL" pattern is called an anti-join pattern and shows up constantly in production data-quality checks (e.g. "find orders with no matching customer").

Common mistakeFiltering on the right table's column in the WHERE clause (instead of the ON clause) silently turns your LEFT JOIN back into something that behaves like an INNER JOIN. Example: WHERE d.location = 'Bangalore' in the WHERE clause removes all rows where d.location is NULL — i.e. it throws away exactly the unmatched rows you were trying to keep. If you want to filter the right table without losing unmatched left rows, put the condition inside the ON clause instead: LEFT JOIN departments d ON e.dept_id = d.dept_id AND d.location = 'Bangalore'.

🧠

Mental model: LEFT JOIN = INNER JOIN's matched rows, PLUS one extra row for every unmatched left row, with all the right table's columns padded as NULL.

Level 2 · Intermediate

RIGHT JOIN & FULL OUTER JOIN

RIGHT JOIN

What it is: the mirror image of LEFT JOIN — keeps all rows from the right table, and NULLs out the left side where there's no match.

SELECT e.emp_name, d.dept_name
FROM employees e
RIGHT JOIN departments d
    ON e.dept_id = d.dept_id;

Result: Legal (dept 6) now appears with emp_name = NULL, because RIGHT JOIN keeps every department even if no employee belongs to it.

Production tipMost SQL style guides (and most engineers) avoid RIGHT JOIN entirely. Anything you can write with RIGHT JOIN, you can write as a LEFT JOIN by swapping which table is listed first: FROM departments d LEFT JOIN employees e ON ... is identical to the RIGHT JOIN above, and it's easier for a reader to scan top-to-bottom. Codebases standardize on LEFT JOIN for consistency.

FULL OUTER JOIN

What it is: keeps everything from both sides. If a row matches, you get the combined row. If it doesn't match on either side, that side is padded with NULLs.

SELECT e.emp_name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d
    ON e.dept_id = d.dept_id;

Result: 18 rows — all 16 employees (including Tanvi and Sameer with NULL dept_name) PLUS Legal appearing once with NULL emp_name.

AnalogyFULL OUTER JOIN is like merging two contact lists from two phones. You want every contact from phone A and every contact from phone B — if someone's on both, merge into one entry; if they're only on one phone, still include them with the other phone's fields blank.

Common mistakeMySQL does not support FULL OUTER JOIN directly (as of standard MySQL). The workaround is to UNION a LEFT JOIN and a RIGHT JOIN:
SELECT ... FROM a LEFT JOIN b ON ... UNION SELECT ... FROM a RIGHT JOIN b ON .... Interviewers sometimes ask you to know this MySQL limitation specifically.

Level 2 · Intermediate

CROSS JOIN & SELF JOIN

CROSS JOIN

What it is: every row from table A paired with every row from table B. No ON condition at all. If A has m rows and B has n rows, the result has m × n rows.

SELECT e.emp_name, d.dept_name
FROM employees e
CROSS JOIN departments d;
-- 16 employees x 6 departments = 96 rows

AnalogyA restaurant combo menu generator: 4 mains × 3 sides = 12 possible combo meals. You want every combination, not a matched pair.

When to use it: generating all combinations — e.g. every product × every size × every color, or every date in a calendar × every store (to build a report scaffold with zero-filled rows for days with no sales).

Common mistakeThe classic "accidental cross join" bug: forgetting the ON clause, or writing FROM employees, departments (old comma-join syntax) without a WHERE condition to relate them. This silently produces a Cartesian product — your row count explodes and numbers like SUM() get wildly inflated. This is one of the most common real production bugs.

SELF JOIN

What it is: a table joined to itself, using two different aliases to treat it as if it were two tables. Used whenever a row references another row in the same table — like an employee referencing their manager, who is also an employee.

SELECT
    e.emp_name  AS employee,
    m.emp_name  AS manager
FROM employees e
LEFT JOIN employees m
    ON e.manager_id = m.emp_id;

Line by line: we treat the employees table as two roles at once — e plays "the employee", m plays "the manager". We LEFT JOIN so that top-level people (like Arjun, whose manager_id is NULL) still show up, with manager = NULL.

AnalogyA family tree drawn from a single "people" table, where every person has a parent_id pointing back into the same table. To print "child — parent" pairs, you join the table to itself.

Level 2 · Intermediate

Multi-table joins

Real queries rarely join just two tables. You chain joins, left to right, and each one narrows or widens the working row set.

SELECT
    e.emp_name,
    d.dept_name,
    m.emp_name AS manager_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id
LEFT JOIN employees m   ON e.manager_id = m.emp_id
ORDER BY e.emp_id;

Production tipJoin order in the SQL text doesn't have to match join execution order — the query optimizer reorders joins internally based on cost estimates (table sizes, indexes, filters). Writing joins in a logical, readable order (start from the "main" table, then attach lookups) is for humans, not the engine. But the choice of which table to put first can occasionally hint the optimizer, and in some engines (like older MySQL) join order in the text does affect the plan more than in others — always check with EXPLAIN.

Hard example — three-way join with aggregation

Find, for each department, the average salary and the department's own manager count (how many distinct managers exist among its employees).

SELECT
    d.dept_name,
    ROUND(AVG(e.salary), 2)      AS avg_salary,
    COUNT(DISTINCT e.manager_id) AS distinct_managers
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name;

Notice we started FROM departments this time (not employees) so that Legal still shows up in the output, with avg_salary = NULL and distinct_managers = 0.

Level 3 · Advanced

Non-equi joins

What it is: a join where the condition isn't a plain equality (=) — instead it uses <, >, BETWEEN, etc.

When to use it: salary-band lookups, date-range matching, geographic containment — anything where "belongs to a range" replaces "equals an id".

Setup: a salary bands table

CREATE TABLE salary_bands (
    band_name VARCHAR(20),
    min_salary DECIMAL(10,2),
    max_salary DECIMAL(10,2)
);

INSERT INTO salary_bands VALUES
('Junior',  50000,  89999),
('Mid',     90000,  129999),
('Senior',  130000, 169999),
('Staff',   170000, 999999);

SELECT e.emp_name, e.salary, b.band_name
FROM employees e
JOIN salary_bands b
    ON e.salary BETWEEN b.min_salary AND b.max_salary;

AnalogySorting exam scores into grade bands (A: 90-100, B: 80-89 ...). You're not matching an exact score to an exact grade row — you're matching a score that falls inside a range.

Common mistakeNon-equi joins often can't use a simple index lookup the way equality joins can, so on large tables they tend to force a nested-loop plan, which is slow. If you do this often at scale, consider pre-bucketing the range into a lookup key, or using specialized range-index structures.

Level 3 · Advanced

SEMI JOIN & ANTI JOIN

These aren't separate SQL keywords in standard SQL — they're patterns, usually written with EXISTS/NOT EXISTS or IN/NOT IN, or sometimes simulated with a LEFT JOIN.

SEMI JOIN — "does a match exist?"

What it is: return rows from table A where at least one match exists in table B — but never duplicate A's rows even if B has multiple matches, and never pull any columns from B.

-- Departments that have at least one employee
SELECT d.dept_name
FROM departments d
WHERE EXISTS (
    SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
);

Compare this to a naive INNER JOIN version:

-- BAD for this purpose: returns Engineering 4 times (once per employee)
SELECT DISTINCT d.dept_name
FROM departments d
JOIN employees e ON d.dept_id = e.dept_id;

The INNER JOIN version needs a DISTINCT bolted on to fix the duplication it caused. The EXISTS version never had that problem, because it only checks "does a match exist", it never actually multiplies rows.

ANTI JOIN — "no match exists"

-- Departments with ZERO employees (Legal)
SELECT d.dept_name
FROM departments d
WHERE NOT EXISTS (
    SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
);

Big interview trap: NOT IN vs NOT EXISTS NOT IN behaves dangerously if the subquery can return any NULL. WHERE dept_id NOT IN (SELECT dept_id FROM employees) — if even ONE row in the subquery has dept_id = NULL, the entire NOT IN returns an empty result set (not an error, just silently wrong / empty), because comparing anything to NULL gives UNKNOWN, and NOT IN requires every comparison to definitively succeed. In our data, employees.dept_id contains NULLs (Tanvi, Sameer) — so NOT IN on this column would silently return zero rows! NOT EXISTS has no such trap, which is why professionals default to NOT EXISTS over NOT IN whenever the column can contain NULLs.

Level 3 · Advanced

How joins execute internally

The database's query optimizer picks one of three physical strategies to actually run a logical join. Knowing these separates people who memorized syntax from people who understand databases.

1. Nested Loop Join

For every row in the outer (usually smaller) table, scan the inner table looking for matches. Simple, but O(n × m) in the worst case — unless the inner table has an index on the join column, which turns the inner scan into a fast lookup (O(n × log m) or better).

🔁

Best when one table is small, or a good index exists on the join key of the inner table.

2. Hash Join

Build a hash table in memory from the smaller table, keyed on the join column. Then scan the bigger table once, and for each row, do an O(1) hash lookup into that hash table. Overall roughly O(n + m).

🗂️

Best for large, unsorted tables with an equality join condition and no useful index — this is the default workhorse for big analytical joins in warehouses like Snowflake, BigQuery, Redshift.

NoteHash joins only work for equality (=) conditions — they can't be used for non-equi joins like BETWEEN, since a hash table only supports exact-key lookups.

3. Merge Join (Sort-Merge Join)

If both tables are already sorted on the join key (or the optimizer sorts them first), the engine can walk both sorted lists with two pointers simultaneously, like merging two sorted decks of cards. Very efficient — O(n + m) after sorting, and works for non-equi range conditions too.

🔀

Best when both inputs are already sorted (e.g. both have a clustered index / are pre-sorted on the join key) — common when joining on primary keys or clustering keys in columnar warehouses.

Production tipYou rarely choose the join algorithm yourself — but you influence it. Adding an index on frequently-joined foreign key columns often flips a slow nested loop into a fast indexed lookup. Run EXPLAIN (or EXPLAIN ANALYZE) on any slow join to see which strategy the planner actually picked, and why.

Level 3 · Advanced

Cardinality & the duplicate explosion trap

"Cardinality" of a join describes the relationship between matching rows: one-to-one, one-to-many, or many-to-many.

One-to-one: each row on both sides matches exactly one row on the other side. Row count stays the same after the join.
One-to-many: one row on the "one" side matches multiple rows on the "many" side. Example: one department matches multiple employees. The department's row gets repeated once per matching employee.
Many-to-many: multiple rows on both sides can match each other. Row count can explode multiplicatively.

The classic production bug: duplicate explosion Say you join orders (one row per order) to order_items (multiple rows per order — one per product in the order), then try to SUM(orders.total_amount). Because the join repeats each order row once per item, you're now summing the same order total multiple times — your revenue number is inflated. The fix: aggregate order_items down to one row per order first (in a subquery or CTE), THEN join to orders — or aggregate carefully with SUM(DISTINCT ...) style logic (which has its own caveats), or restructure the query entirely. This single bug pattern is probably the #1 real-world SQL mistake in reporting pipelines.

Concrete demo with our data

Departments-to-employees is one-to-many. Watch what happens if we naively try to compute "total company salary" while joined to employees without pre-aggregating:

-- WRONG idea if departments itself had a "budget" column and we joined + summed budget
-- naive: SELECT SUM(d.budget) FROM departments d JOIN employees e ON d.dept_id = e.dept_id;
-- This would count each department's budget once PER EMPLOYEE in that department — inflated!

Rule of thumbBefore joining and aggregating, always ask: "what is the grain (one row = one what?) of each table, and what's the grain of the join result?" If the grain changed (got finer) because of the join, any aggregate you compute on the coarser table's original columns is now wrong unless you deduplicate or pre-aggregate first.

Level 3.5 · Data Engineering

A. Join + Aggregation traps (the #1 production SQL bug)

You already met this idea in the cardinality section. Now let's make it concrete with the schema every data engineer eventually deals with: orders (one row per order) and order_items (many rows per order — one per line item).

DDL & sample data

CREATE TABLE orders (
    order_id     INT PRIMARY KEY,
    customer_id  INT,
    order_date   DATE,
    total_amount DECIMAL(10,2)   -- the order's true total, computed once at checkout
);

CREATE TABLE order_items (
    item_id     INT PRIMARY KEY,
    order_id    INT,
    product_id  INT,
    quantity    INT,
    unit_price  DECIMAL(10,2)
);

INSERT INTO orders VALUES
(5001, 201, '2024-01-10', 2500.00),
(5002, 202, '2024-01-11', 1200.00),
(5003, 201, '2024-01-15',  800.00);

INSERT INTO order_items VALUES
(1, 5001, 'P1', 2, 1000.00),   -- order 5001 has 3 line items
(2, 5001, 'P2', 1,  500.00),
(3, 5001, 'P3', 1,  500.00),
(4, 5002, 'P1', 1, 1200.00),   -- order 5002 has 1 line item
(5, 5003, 'P4', 2,  400.00);   -- order 5003 has 1 line item

orders (true totals)
order_id	customer_id	total_amount
5001	201	2500.00
5002	202	1200.00
5003	201	800.00

Real total revenue = 2500 + 1200 + 800 = 4500.00

Trap 1 — the naive join-then-sum

-- WRONG
SELECT SUM(o.total_amount) AS revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
-- Result: 8300.00  (should be 4500.00)

Why it's wrong: order 5001 has 3 line items, so the JOIN produces 3 output rows for it — and each of those 3 rows still carries total_amount = 2500.00. Summing that column now adds 2500 three times instead of once. Order 5001 alone contributes 2500 × 3 = 7500 to the wrong sum instead of 2500. This is the single most common "why is our dashboard revenue number wrong" bug in real companies.

💥

Mental model: any column that lives on the "one" side of a one-to-many join gets duplicated once per matching row on the "many" side. Summing a duplicated column always overcounts by exactly that duplication factor.

Trap 2 — "just add DISTINCT" (still wrong, or fragile)

-- STILL WRONG, and fragile
SELECT SUM(DISTINCT o.total_amount) AS revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
-- Works here by accident, but breaks the moment two DIFFERENT orders share the same total_amount

If order 5002 and a hypothetical order 5004 both happened to total exactly 1200.00, SUM(DISTINCT ...) would collapse them into one and undercount. DISTINCT deduplicates by value, not by which row it came from — never use it as a duplication fix for aggregates.

Trap 3 — the correct fix: pre-aggregate before joining

-- CORRECT: don't even join to order_items if you only need the order total
SELECT SUM(o.total_amount) AS revenue
FROM orders o;
-- Result: 4500.00 ✓

If you actually need something from order_items (say, total quantity sold) alongside the order total, aggregate order_items down to one row per order first, in a subquery/CTE, then join — never join first and aggregate the "one" side's column after.

-- CORRECT: pre-aggregate the "many" side to order-grain BEFORE joining
SELECT
    o.order_id,
    o.total_amount,
    item_totals.total_qty
FROM orders o
JOIN (
    SELECT order_id, SUM(quantity) AS total_qty
    FROM order_items
    GROUP BY order_id
) item_totals ON o.order_id = item_totals.order_id;

Now the join is one-to-one (order to its own pre-aggregated summary), so o.total_amount is never duplicated.

Trap 4 — the same bug hiding inside AVG()

-- WRONG: looks innocent, still broken
SELECT o.customer_id, AVG(o.total_amount) AS avg_order_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.customer_id;
-- Customer 201's order 5001 (weight x3 due to 3 items) skews the average toward 2500

AVG isn't safe just because it "averages things out" — it's still averaging over duplicated rows, so orders with more line items get more influence on the average than orders with fewer line items. The fix is identical: pre-aggregate order_items to order-grain first, or simply compute the average directly from the un-joined orders table.

Production checklistWhenever you write JOIN ... GROUP BY ... SUM/AVG/COUNT(some_column), ask: "does some_column belong to a table whose grain got finer because of this join?" If yes, pre-aggregate that table to the right grain first, in a subquery or CTE, before joining.

Level 3.5 · Data Engineering

B. Slowly Changing Dimension (SCD Type 2) joins

What it is: in a data warehouse, dimension attributes change over time (an employee's salary band changes, a product's price changes, a customer moves cities). SCD Type 2 keeps every historical version of a dimension row, each tagged with an effective_start and effective_end date, instead of overwriting the old value.

Why it exists: if you overwrote the dimension row in place (SCD Type 1), historical fact rows would silently be re-attributed to the current value — a sale made in 2022 at an old price would suddenly look like it used today's price when you re-run a report. SCD2 preserves "what was true at the time the fact happened."

DDL & sample data

CREATE TABLE dim_customer_scd2 (
    customer_sk     INT PRIMARY KEY,   -- surrogate key: unique per VERSION of the customer
    customer_id     INT,               -- natural/business key: same across all versions
    customer_name   VARCHAR(50),
    city            VARCHAR(50),
    effective_start DATE,
    effective_end   DATE,              -- '9999-12-31' means "currently active"
    is_current      BOOLEAN
);

INSERT INTO dim_customer_scd2 VALUES
(1, 301, 'Rahul Bansal', 'Delhi',     '2022-01-01', '2023-06-30', FALSE),  -- old version
(2, 301, 'Rahul Bansal', 'Bangalore', '2023-07-01', '9999-12-31', TRUE),   -- current version
(3, 302, 'Neha Kulkarni','Pune',      '2021-05-01', '9999-12-31', TRUE);

CREATE TABLE fact_sales (
    sale_id     INT PRIMARY KEY,
    customer_id INT,
    sale_date   DATE,
    amount      DECIMAL(10,2)
);

INSERT INTO fact_sales VALUES
(9001, 301, '2023-02-14', 5000.00),  -- happened while Rahul was in Delhi
(9002, 301, '2023-09-01', 3000.00),  -- happened while Rahul was in Bangalore
(9003, 302, '2022-03-10', 1500.00);

The SCD2 join — matching a fact to the dimension version active on that date

SELECT
    f.sale_id,
    f.sale_date,
    f.amount,
    d.city AS city_at_time_of_sale
FROM fact_sales f
JOIN dim_customer_scd2 d
    ON f.customer_id = d.customer_id
   AND f.sale_date BETWEEN d.effective_start AND d.effective_end;

Result: sale 9001 (Feb 2023) correctly joins to Rahul's Delhi version; sale 9002 (Sep 2023) correctly joins to his Bangalore version — even though both facts share the same customer_id. This is a non-equi join (from Level 3) applied to real dimensional modeling.

AnalogyA passport's stamped visa history. If customs looks up "where did this person live on March 3rd 2023", they check which visa entry's date range covers March 3rd — not just "what's their current address today."

Common mistakeJoining fact to dimension using only customer_id (the natural key) without the date-range condition causes a fan-out: each sale matches every historical version of that customer, multiplying rows and corrupting any SUM/COUNT downstream — this is the SCD2 flavor of the aggregation trap from section A.

Production tipAlways join fact tables to the dimension's surrogate key (customer_sk) when possible — resolved once at ETL load time via exactly this date-range join — rather than re-resolving the SCD2 range join on every query. It's faster and avoids every analyst having to remember the BETWEEN logic.

Level 3.5 · Data Engineering

C. Range joins with dates — price validity example

The same non-equi range-join pattern from salary bands and SCD2 shows up constantly for "what was the price/rate/rule in effect on date X" problems — a product_price_history table is the classic example.

CREATE TABLE product_price_history (
    price_id    INT PRIMARY KEY,
    product_id  VARCHAR(10),
    price       DECIMAL(10,2),
    valid_from  DATE,
    valid_to    DATE   -- '9999-12-31' if still the current price
);

INSERT INTO product_price_history VALUES
(1, 'P1', 1000.00, '2023-01-01', '2023-12-31'),
(2, 'P1', 1100.00, '2024-01-01', '9999-12-31'),   -- price increase
(3, 'P4',  380.00, '2023-06-01', '2024-02-28'),
(4, 'P4',  400.00, '2024-03-01', '9999-12-31');

-- Attach the price that was valid on each order_item's order date
SELECT
    oi.item_id,
    oi.product_id,
    o.order_date,
    pph.price AS price_at_order_time
FROM order_items oi
JOIN orders o              ON oi.order_id = o.order_id
JOIN product_price_history pph
    ON oi.product_id = pph.product_id
   AND o.order_date BETWEEN pph.valid_from AND pph.valid_to;

Common mistakeOverlapping date ranges in the source data (a data quality bug where two rows for the same product both claim to be valid on the same date) will cause this join to silently produce duplicate rows — one per overlapping match. Always validate a price/dimension history table has no overlapping valid_from/valid_to ranges per key before relying on range joins against it.

Optimization tipRange joins on dates can't use a simple equality index. If this join runs on huge fact tables daily, many warehouses benefit from pre-resolving the price at ETL time (attaching a resolved price_id to each order_item once) instead of re-running the range join on every downstream query.

Level 3.5 · Data Engineering

D. Many-to-many bridge tables

What it is: a many-to-many relationship (one student can take many courses, one course can have many students) can't be modeled with a single foreign key on either side. The fix is a third table — a bridge (a.k.a. junction/associative) table — holding one row per (student, course) pairing.

CREATE TABLE students (
    student_id   INT PRIMARY KEY,
    student_name VARCHAR(50)
);

CREATE TABLE courses (
    course_id   INT PRIMARY KEY,
    course_name VARCHAR(50)
);

CREATE TABLE student_courses (       -- the bridge table
    student_id INT,
    course_id  INT,
    enrolled_on DATE,
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (student_id) REFERENCES students(student_id),
    FOREIGN KEY (course_id)  REFERENCES courses(course_id)
);

INSERT INTO students VALUES (1,'Aarav'), (2,'Diya'), (3,'Kabir');
INSERT INTO courses  VALUES (10,'Databases'), (20,'Algorithms'), (30,'Statistics');

INSERT INTO student_courses VALUES
(1, 10, '2024-01-05'),   -- Aarav takes Databases
(1, 20, '2024-01-05'),   -- Aarav also takes Algorithms
(2, 10, '2024-01-06'),   -- Diya takes Databases
(3, 30, '2024-01-07');   -- Kabir takes Statistics only
-- Notice: Algorithms(20) has only 1 student, Statistics(30) has only 1, nobody takes nothing (Kabir isn't in Algorithms)

Easy example — list each student with each of their course names

SELECT s.student_name, c.course_name
FROM students s
JOIN student_courses sc ON s.student_id = sc.student_id
JOIN courses c          ON sc.course_id = c.course_id;

Two joins in a row: student → bridge → course. This double-hop is the defining shape of every bridge-table query.

Medium example — students enrolled in NEITHER Databases nor Algorithms

SELECT s.student_name
FROM students s
WHERE s.student_id NOT IN (
    SELECT sc.student_id
    FROM student_courses sc
    JOIN courses c ON sc.course_id = c.course_id
    WHERE c.course_name IN ('Databases', 'Algorithms')
);
-- Result: Kabir (he's only in Statistics)

Hard example — courses with zero students (need a LEFT JOIN through the bridge)

SELECT c.course_name
FROM courses c
LEFT JOIN student_courses sc ON c.course_id = sc.course_id
WHERE sc.student_id IS NULL;
-- With this data, every course has at least one student, so this returns empty — 
-- add a course with no enrollments to see it trigger.

Common mistakeJoining students directly to courses without going through the bridge table is impossible — there's no direct foreign key between them. Beginners sometimes try to "shortcut" a many-to-many relationship and get stuck; the bridge table is not optional, it's structurally required.

Level 3.5 · Data Engineering

E. Join elimination (optimizer internals)

What it is: a query optimization where the database engine detects that a join in your query is unnecessary to produce the requested output, and silently removes it from the execution plan — without changing the result.

When it happens: most commonly when you join to a table only to filter or just to prove a row exists, but never actually select any column from it, AND the join is provably safe to skip — for example, joining on a foreign key that's guaranteed unique and non-null (so the join can't duplicate or drop rows).

-- You wrote this:
SELECT e.emp_name, e.salary
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

-- If dept_id has a NOT NULL foreign key constraint referencing departments.dept_id (guaranteed to exist and be unique),
-- a smart optimizer can recognize that the join to `departments` can never filter out
-- or duplicate any row of `employees` -- since every non-null dept_id is guaranteed to match exactly one row --
-- so it eliminates the join entirely and just scans employees.

Why this matters in interviewsSnowflake and other modern warehouses advertise join elimination as part of their optimizer. Interviewers at data platform companies sometimes ask "if you join to a table but never use its columns, does the join still cost anything?" — the sharp answer is "not necessarily, if the optimizer can prove it's safe to eliminate; but you shouldn't rely on it — write only the joins you actually need."

CaveatJoin elimination requires provable referential integrity (real FK constraints, or metadata the optimizer trusts). If your dept_id column isn't actually declared as a NOT NULL foreign key (even if it "happens" to always be valid), most optimizers won't risk eliminating the join — they can't prove it's safe. This is a good real-world argument for actually declaring your foreign key constraints instead of only enforcing them in application code.

Level 3.5 · Data Engineering

F. Broadcast joins (distributed warehouse concept)

What it is: in distributed engines like Apache Spark, Snowflake, or BigQuery, data is split across many worker nodes. A normal join between two large distributed tables requires shuffling — physically moving rows across the network so matching keys land on the same node. Shuffling is expensive.

The trick: if one side of the join is small (a dimension table that fits comfortably in memory — say, a few MB to a few hundred MB), the engine can instead copy ("broadcast") that entire small table to every worker node. Then each node does a local, in-memory join against its slice of the big fact table — no shuffling of the huge table required at all.

AnalogyInstead of mailing every warehouse's inventory list to one central office to cross-reference (shuffle), you photocopy the short "product catalog" (small dimension) and mail a copy to every warehouse (broadcast) — each warehouse can now check its own inventory against the catalog locally, without moving the big inventory data anywhere.

-- Conceptual Spark SQL hint (syntax varies by engine):
SELECT /*+ BROADCAST(d) */
    f.sale_id, f.amount, d.category
FROM fact_sales f
JOIN dim_product d ON f.product_id = d.product_id;
-- dim_product is small (e.g. a few thousand rows) -> broadcast it to every node
-- fact_sales is huge (billions of rows) -> stays put, never shuffled

Interview goldThis is a favorite Spark/Snowflake data engineering interview question: "you have a 5 billion row fact table and a 500 row dimension table — how do you make this join fast?" The expected answer: broadcast the small dimension table, avoiding a shuffle join on the huge fact table. Most engines (Spark, Snowflake, BigQuery) auto-detect this based on size thresholds, but you can also hint it explicitly, as shown above.

Common mistakeForcing a broadcast join on a table that's actually too large for worker node memory can cause out-of-memory errors on every node simultaneously — broadcast joins are a deliberate trade-off for genuinely small dimension tables, not a universal performance switch.

Level 3.5 · Data Engineering

G. Skewed joins & salting

What it is: in a distributed shuffle join, rows are partitioned across worker nodes based on a hash of the join key. If one key value is wildly overrepresented (say, 90% of your fact_sales rows have customer_id = 999 — a bulk "guest checkout" account), every one of those rows hashes to the same worker node. That one node ends up doing 90% of the work while every other node sits idle — a classic data skew bottleneck.

AnalogyTen cashiers open at a supermarket, but a mistake routes 9 out of every 10 customers to Cashier #1's line. Nine cashiers stand idle while Cashier #1 drowns. Adding more cashiers doesn't help unless customers are actually spread across them.

Recognizing skew

-- Diagnostic: check for skew before blaming "the join is just slow"
SELECT customer_id, COUNT(*) AS row_count
FROM fact_sales
GROUP BY customer_id
ORDER BY row_count DESC
LIMIT 5;
-- If the #1 row dwarfs the rest (e.g. 90M rows vs a normal customer's 200 rows), that's skew.

The fix: salting

Salting artificially splits an overloaded key into several fake sub-keys, spreading its rows across multiple worker nodes, then joins against a correspondingly "exploded" copy of the small side, and finally removes the salt.

-- Step 1: add a random salt (0-9) to the skewed fact table's join key
SELECT
    f.*,
    CONCAT(f.customer_id, '_', CAST(FLOOR(RAND() * 10) AS INT)) AS salted_customer_id
FROM fact_sales f;

-- Step 2: explode the small dimension side so every real customer_id
-- exists once for each possible salt value (0-9), matching the scheme above
SELECT
    d.*,
    CONCAT(d.customer_id, '_', salt.n) AS salted_customer_id
FROM dim_customer d
CROSS JOIN (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
            UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7
            UNION ALL SELECT 8 UNION ALL SELECT 9) salt;

-- Step 3: join on the salted key instead of the raw key.
-- The skewed customer's rows are now spread across 10 different salted keys,
-- landing on up to 10 different worker nodes instead of piling onto one.

Production tipSalting adds real complexity (an extra CROSS JOIN "explosion" step, extra shuffle for the dimension side), so it's reserved for genuinely skewed joins, not applied by default. Most warehouses (Spark 3+, for instance) also offer automatic skew-join handling as an engine feature — check your engine's adaptive query execution settings before hand-rolling salting yourself.

Interview framingA strong answer distinguishes symptoms from cause: "the join is slow" is a symptom; "one key dominates the join column's distribution, overloading a single partition" is the diagnosis; "salt the skewed key, or enable the engine's adaptive skew handling" is the fix. Naming all three shows real production experience, not memorized syntax.

Level 4 · Hard Interview

How interviewers trick people on JOINs

The NULL trap: they slip a NULL foreign key into the sample data (like Tanvi/Sameer here) and ask for "all employees and their department" — if you use INNER JOIN, you silently drop people, and most candidates don't even notice the row count is wrong.
The WHERE-vs-ON trap: they ask you to LEFT JOIN and then filter on the right table in WHERE, secretly turning it back into an INNER JOIN, and ask "why did dept X disappear?"
The duplicate row count trap: they ask "how many employees are there?" after a multi-table join and expect you to realize a one-to-many join inflated COUNT(*) — the correct answer needs COUNT(DISTINCT e.emp_id).
The self-join off-by-one trap: asking for "employee and manager's manager" (two levels up) forces you to self-join the same table twice with three aliases — many candidates fumble the aliasing.
The NOT IN + NULL trap: covered above — a favorite "gotcha" question is simply "what's wrong with this NOT IN query?" where the subquery column contains NULLs.
The implicit cross join trap: old-style comma joins with a forgotten WHERE clause. They show you a query with an unexpectedly huge row count and ask you to spot the bug.

Level 4 · Hard Interview

3 Hard Interview Questions

H1.Find employees earning strictly more than their own manager. hard▶

Expected output

emp_name, emp_salary, manager_name, manager_salary — for any employee whose salary beats their manager's.

Hint

This needs a self join. Compare each employee's salary against the salary of the row where emp_id = manager_id.

Full solution

SELECT
    e.emp_name  AS employee,
    e.salary    AS employee_salary,
    m.emp_name  AS manager,
    m.salary    AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.emp_id
WHERE e.salary > m.salary;

With this dataset: no one currently beats their manager, so the result is empty — which is itself a valid, correct answer. (Try changing Sneha's salary to 150000 mentally — she'd then beat Priya.)

Alternative solution

Same logic using a correlated subquery instead of a join:

SELECT e.emp_name, e.salary
FROM employees e
WHERE e.salary > (
    SELECT m.salary FROM employees m WHERE m.emp_id = e.manager_id
);

Optimized solution

The JOIN version is generally preferred over the correlated subquery at scale, because most optimizers turn the JOIN into a single efficient hash join, while the correlated subquery form can force a row-by-row nested loop unless the optimizer is smart enough to rewrite it (many are, but don't rely on it).

Time complexity

O(n) with an index on emp_id (self join via index lookup); O(n²) worst case with no index and a naive nested loop.

Common mistakes

Using LEFT JOIN instead of INNER JOIN here would incorrectly include top-level employees (manager_id IS NULL) with a NULL manager_salary, and the comparison salary > NULL is UNKNOWN, so they'd silently drop out anyway — but it's sloppy to include them at all conceptually.
Forgetting the self-join needs two distinct aliases.

Interview follow-ups

"What if manager_id could point to someone in a different table?" (it can't here since it's self-referencing, but discuss FK design)
"How would you find employees earning more than the AVERAGE of their department?" (bridges into subqueries/window functions)

H2.List every department together with its employee count — including departments with zero employees. hard▶

Expected output

dept_name, employee_count — Legal should show 0, not be missing from the output.

Hint

Start FROM departments (the side you want guaranteed to fully appear), LEFT JOIN employees, and be careful which column you COUNT.

Full solution

SELECT
    d.dept_name,
    COUNT(e.emp_id) AS employee_count
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name;

Common mistakes

Using COUNT(*) instead of COUNT(e.emp_id) — this is the classic trap. For Legal, the LEFT JOIN produces exactly one output row with all-NULL employee columns. COUNT(*) counts that row as 1 (wrong! Legal has 0 employees). COUNT(e.emp_id) correctly counts 0, because COUNT(column) ignores NULLs, and e.emp_id is NULL in that unmatched row.
Using INNER JOIN instead of LEFT JOIN — Legal disappears from the output entirely instead of showing 0.

Interview follow-ups

"Now also show average salary per department, safely handling the zero-employee case." (AVG over an empty group already returns NULL safely — no special handling needed, unlike SUM which returns NULL too but people sometimes expect 0.)

H3.Find pairs of employees in the same department who have the exact same salary (salary "twins"), without pairing an employee with themselves and without listing the same pair twice. hard▶

Expected output

For our data: (Priya Nair, Rohit Sharma[103], 145000) and (Ananya Iyer, Vikram Singh, 92000).

Hint

Self join on dept_id AND salary, then use e1.emp_id < e2.emp_id to avoid self-pairing and duplicate reversed pairs in one shot.

Full solution

SELECT
    e1.emp_name AS employee_1,
    e2.emp_name AS employee_2,
    e1.dept_id,
    e1.salary
FROM employees e1
JOIN employees e2
    ON e1.dept_id = e2.dept_id
   AND e1.salary  = e2.salary
   AND e1.emp_id  < e2.emp_id;

Common mistakes

Using e1.emp_id != e2.emp_id instead of < — this avoids self-pairing but still produces BOTH (A,B) and (B,A), doubling the results.
Forgetting to also match on dept_id — without it, you'd find salary twins across different departments too, which wasn't asked for (that's a different, also valid, question — always re-read the requirement).

Interview follow-ups

"Extend this to find triplets with the same salary." (Doesn't scale well with self-joins — this is where window functions with RANK/COUNT OVER PARTITION become the better tool — foreshadowing the Window Functions module.)

Level 5 · FAANG

2 FAANG-Level Questions

F1.Deduplicate employees table: some rows are exact duplicates by (emp_name, dept_id, salary) except for a different emp_id. Delete the extras, keeping the lowest emp_id per group. faang▶

Problem statement

This is a classic production data-quality problem: a broken ETL job double-inserted rows. Note: in our actual dataset, Rohit Sharma appears twice but they are genuinely different people (different dept_id and salary) — so they are NOT duplicates, and this question is about the general pattern, imagine a hypothetical exact duplicate crept in.

Hint

Self-join the table to itself on the duplicate-defining columns, keep only pairs where the "other" row has a smaller id, and delete those.

Full solution

DELETE e1 FROM employees e1
JOIN employees e2
    ON e1.emp_name = e2.emp_name
   AND e1.dept_id  = e2.dept_id
   AND e1.salary   = e2.salary
   AND e1.emp_id  > e2.emp_id;   -- delete the "later" duplicate, keep the smaller id

Alternative solution

Using a window function (foreshadowing Module 6) — often cleaner and portable across databases that don't support multi-table DELETE syntax:

DELETE FROM employees
WHERE emp_id IN (
    SELECT emp_id FROM (
        SELECT
            emp_id,
            ROW_NUMBER() OVER (
                PARTITION BY emp_name, dept_id, salary
                ORDER BY emp_id
            ) AS rn
        FROM employees
    ) ranked
    WHERE rn > 1
);

Optimized solution

For huge tables, doing this as a single DELETE can lock a lot of rows / generate huge undo logs. Production approach: write the deduplicated result to a new table (CREATE TABLE employees_clean AS SELECT ... WHERE rn = 1), validate row counts, then swap table names — much safer and often faster than in-place delete at scale.

Time complexity

Self-join approach: O(n²) without an index on the duplicate-key columns, O(n log n) with one. Window function approach: O(n log n) for the sort/partition.

Common mistakes

Using <> instead of > in the self-join delete condition — this deletes BOTH copies of every duplicate pair instead of keeping one.
Not wrapping this in a transaction / not testing with a SELECT first in production — always convert your DELETE to a SELECT of the same rows and eyeball it before running the real delete.

Interview follow-ups

"What if the table has a billion rows and no useful index — how would you approach it differently?" (batch deletes, or rebuild via CTAS as above)

F2.For every employee, find their "skip-level" manager (their manager's manager) using only joins — no recursive CTE allowed. faang▶

Problem statement

This tests whether you understand that a self-join can be chained multiple times for fixed-depth hierarchy questions, and where that approach breaks down (motivating recursive CTEs later).

Hint

Self join employees to itself twice: once to get the direct manager, once more to get that manager's manager.

Full solution

SELECT
    e.emp_name          AS employee,
    m.emp_name          AS manager,
    gm.emp_name         AS skip_level_manager
FROM employees e
LEFT JOIN employees m  ON e.manager_id = m.emp_id
LEFT JOIN employees gm ON m.manager_id = gm.emp_id;

Example result: Sneha Kapoor → Priya Nair → Arjun Mehta.

Alternative solution

Same idea but only surfacing rows that actually have a skip-level manager:

SELECT e.emp_name, gm.emp_name AS skip_level_manager
FROM employees e
JOIN employees m  ON e.manager_id = m.emp_id
JOIN employees gm ON m.manager_id = gm.emp_id;

Optimized solution

This chained-self-join approach only works for a fixed number of hierarchy levels you hardcode (2, 3, 4 joins...). It cannot handle arbitrary-depth trees where you don't know the depth in advance. That's exactly the limitation that motivates Recursive CTEs (Module 4) — flag this explicitly in an interview, it shows depth of understanding.

Time complexity

O(n) per join level with an index on emp_id/manager_id; k chained joins for k levels of depth.

Common mistakes

Using INNER JOIN throughout when the question implies you should still show employees without a skip-level manager — pick LEFT JOIN or INNER JOIN deliberately based on the exact requirement, and state your assumption out loud.

Interview follow-ups

"What if the org chart is 10 levels deep and varies per employee?" — correct answer: "That's not solvable cleanly with fixed self-joins; I'd use a Recursive CTE."

Level 5 · FAANG

1 Production Scenario

ETL enrichment join with late-arriving dimension data

You run a nightly ETL job that joins a fact table daily_sales (transaction-level, millions of rows/day) to a dimension table products to attach product category and brand for a reporting dashboard.

The real problem: occasionally a sale references a product_id that hasn't been inserted into products yet — the product dimension load runs slightly after the sales load ("late-arriving dimension"). An INNER JOIN silently drops those sales rows from the report entirely — revenue looks lower than it actually was, and nobody notices until someone reconciles totals against the source system.

Production-grade fix:

SELECT
    s.sale_id,
    s.sale_amount,
    COALESCE(p.category, 'UNKNOWN') AS category,
    COALESCE(p.brand, 'UNKNOWN')    AS brand
FROM daily_sales s
LEFT JOIN products p ON s.product_id = p.product_id;

Use LEFT JOIN (never INNER JOIN) for fact-to-dimension enrichment in ETL, and default unmatched dimension attributes to an explicit 'UNKNOWN' placeholder rather than leaving NULL — this makes the gap visible in dashboards (a report slice literally labeled "UNKNOWN" prompts investigation) instead of silently disappearing. Teams typically also add a monitoring query that counts COUNT(*) WHERE category = 'UNKNOWN' daily and alerts if it spikes.

Practice

10 Practice Questions

Write the SQL yourself before expanding each answer. Use the employees / departments schema above.

P1.List every employee with their department name (only employees who have a department). easy▶

SELECT e.emp_name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

P2.List every employee, including those with no department (show NULL for dept_name). easy▶

SELECT e.emp_name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;

P3.Find all departments that have zero employees. easy▶

SELECT d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL;

P4.Find all employees with no department assigned. easy▶

SELECT e.emp_name
FROM employees e
WHERE e.dept_id IS NULL;
-- (a join isn't even required here, but with a LEFT JOIN + IS NULL pattern also works)

P5.For each employee, show their direct manager's name (NULL if none). medium▶

SELECT e.emp_name, m.emp_name AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id;

P6.Count how many direct reports each manager has (managers with 0 reports don't need to appear). medium▶

SELECT m.emp_name AS manager, COUNT(e.emp_id) AS direct_reports
FROM employees m
JOIN employees e ON e.manager_id = m.emp_id
GROUP BY m.emp_name;

P7.List every possible (department, salary_band) combination using the salary_bands table from the Non-equi Joins section, even combos with no employees. medium▶

SELECT d.dept_name, b.band_name
FROM departments d
CROSS JOIN salary_bands b;

P8.Find departments located in the same city as at least one other department (self join on departments). medium▶

SELECT DISTINCT d1.dept_name, d1.location
FROM departments d1
JOIN departments d2
    ON d1.location = d2.location
   AND d1.dept_id != d2.dept_id;
-- Engineering & HR share Bangalore; Marketing & Legal share Delhi

P9.Find employees who earn more than every employee in the Marketing department (non-equi style thinking). hard▶

SELECT e.emp_name, e.salary
FROM employees e
WHERE e.salary > (
    SELECT MAX(m.salary)
    FROM employees m
    WHERE m.dept_id = 3
);
-- Pure-join version is awkward here; this is a preview of why subqueries exist.

P10.Show each department with its headcount and average salary, ordered by average salary descending, including empty departments as 0/NULL. hard▶

SELECT
    d.dept_name,
    COUNT(e.emp_id)        AS headcount,
    ROUND(AVG(e.salary),2) AS avg_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY avg_salary DESC;

Before you say NEXT: make sure you can explain, out loud and without looking, the difference between INNER/LEFT/RIGHT/FULL, why NOT IN is dangerous with NULLs, why COUNT(*) vs COUNT(column) matters after a LEFT JOIN, and the difference between a hash join and a nested loop join. When you're ready, reply NEXT and we'll move to Module 2: Subqueries.