Module 2 — Subqueries

A query inside a query. The tool for "compare against something you have to compute first" — and the tool interviewers use to test whether you understand execution order, not just syntax.

Level 1 · Foundations

Why subqueries exist

Sometimes a question can't be answered by looking at raw table data directly — you first need to compute something (a max, an average, a list of ids, a yes/no check), and only then compare each row against that computed thing. A subquery is a SELECT statement nested inside another SQL statement, used to produce that intermediate value, list, or table on the fly.

Analogy Think of asking "who in my class scored above the class average?" You can't answer that by looking at one student at a time — you first have to compute the class average (one full pass over everyone's scores), and only then can you go back and check each student against that number. The subquery is the "first, compute the average" step; the outer query is the "now compare everyone against it" step.

What it is

A subquery is just a SELECT wrapped in parentheses, placed somewhere another SQL statement expects a value, a list, or a table. It can appear in WHERE, SELECT, FROM, or even HAVING.

When to use it

You need to compare a row against an aggregate (max, avg, count) computed from the same or a different table.
You need to check existence of a related row without pulling back any of its columns (data validation, filtering).
You need to build a small computed lookup table on the fly, used only within this one query.

When NOT to use it

When a plain JOIN says the same thing more simply and you actually need columns from the other table — joins are usually easier for a reader to follow than deeply nested subqueries.
When a correlated subquery would force a slow row-by-row loop (see Execution Flow below) and an equivalent JOIN or window function would run in a single pass.

🧠

Mental model: a non-correlated subquery runs once, completely independent of the outer query, and hands back a value/list/table that the outer query then uses. A correlated subquery is different — it re-runs once per row of the outer query, because it references a column from the outer row. That distinction is the single most important idea in this whole module — everything else builds on it.

Schema & sample data for this module

We reuse employees/departments from Module 1 (same data), and add a customers/orders pair for the data-quality/production examples.

DDL (recap from Module 1)

-- employees / departments: identical to Module 1 (16 employees, 6 departments, Legal has 0 employees,
-- Tanvi Desai & Sameer Gupta have NULL dept_id, Priya/Rohit(103) tie at 145000, Ananya/Vikram tie at 92000)

CREATE TABLE customers (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(50),
    country       VARCHAR(50)
);

CREATE TABLE orders (
    order_id     INT PRIMARY KEY,
    customer_id  INT,             -- can be a "dangling" id that doesn't exist in customers (data quality issue)
    order_date   DATE,
    total_amount DECIMAL(10,2)
);

INSERT INTO customers VALUES
(201, 'Rahul Bansal',   'India'),
(202, 'Grace Lin',      'Singapore'),
(203, 'Omar Haddad',    'UAE'),
(204, 'Laura Fischer',  'Germany');

INSERT INTO orders VALUES
(5001, 201, '2024-01-10', 2500.00),
(5002, 202, '2024-01-11', 1200.00),
(5003, 201, '2024-01-15',  800.00),
(5004, 203, '2024-02-01', 4300.00),
(5005, 999, '2024-02-05',  650.00);   -- customer_id 999 DOES NOT EXIST in customers -- orphaned order (data bug)

customers (4 rows)
customer_id	customer_name	country
201	Rahul Bansal	India
202	Grace Lin	Singapore
203	Omar Haddad	UAE
204	Laura Fischer	Germany

orders (5 rows — one orphaned)
order_id	customer_id	total_amount
5001	201	2500.00
5002	202	1200.00
5003	201	800.00
5004	203	4300.00
5005	999	650.00

NoticeOrder 5005 points to customer_id = 999, which doesn't exist anywhere in customers. Laura Fischer (204) has never placed an order. These two facts drive most of the subquery examples below — subqueries are the natural tool for "find the mismatch."

Level 1 · Beginner

Scalar subquery

What it is: a subquery that returns exactly one row and one column — a single value. It can be used anywhere a plain value could go: in WHERE, in SELECT, even inside arithmetic.

Why it exists: you often need to compare a row against a single computed number — the overall max, the overall average, a specific lookup value — without hardcoding that number.

AnalogyA scalar subquery is like a calculator you run once, write the single result on a sticky note, and then use that sticky note's number throughout the rest of your work.

Easy example

Find employees who earn more than the company-wide average salary.

SELECT emp_name, salary
FROM employees
WHERE salary > (
    SELECT AVG(salary) FROM employees
);

Line by line: the inner query SELECT AVG(salary) FROM employees runs first, completely on its own, and produces one number (roughly 100,687.5). The outer query then just does a plain comparison, as if you'd typed WHERE salary > 100687.5 directly.

Common mistakeIf a "scalar" subquery accidentally returns more than one row (say you forgot a WHERE filter, or used AVG without realizing a GROUP BY was implicitly needed), most databases throw a runtime error like "subquery returns more than one row" — it's not silently truncated to the first row. Always double check your subquery, run it standalone, and confirm it really returns one row before nesting it.

Scalar subquery in SELECT

SELECT
    emp_name,
    salary,
    (SELECT AVG(salary) FROM employees) AS company_avg,
    salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
FROM employees;

Production tipA scalar subquery placed in the SELECT list re-runs once per outer row in many engines unless the optimizer is smart enough to compute it once and cache it (it often is, for a non-correlated constant subquery like this one — but don't assume, check with EXPLAIN on large tables). A CTE computed once and cross-joined is a safer bet at scale — covered in Module 3.

Level 1 · Beginner

IN subquery

What it is: a subquery that returns a list of values (one column, many rows), used with IN to check membership.

AnalogyA guest list at the door. The subquery builds the guest list; IN checks "is this person's name on the list?"

Easy example

Find all employees who work in a department located in Bangalore.

SELECT emp_name
FROM employees
WHERE dept_id IN (
    SELECT dept_id FROM departments WHERE location = 'Bangalore'
);

The inner query returns (1, 4) — Engineering and HR. The outer query keeps any employee whose dept_id is in that list.

Medium example — customers who have placed at least one order

SELECT customer_name
FROM customers
WHERE customer_id IN (
    SELECT customer_id FROM orders
);
-- Rahul, Grace, Omar -- Laura is excluded, she has never ordered

Common mistakeIN silently de-duplicates nothing on its own — if the subquery returns the same value many times (e.g. customer_id appearing in multiple orders), IN still just checks membership once per outer row; it does NOT multiply your outer rows the way a JOIN would. This is actually a strength of IN over a naive JOIN + DISTINCT for pure "does it exist" questions.

Level 2 · Intermediate

Correlated subquery

What it is: a subquery that references a column from the outer query. Because the inner query's result depends on which outer row is currently being processed, it conceptually has to be re-evaluated once per outer row.

AnalogyA non-correlated subquery is like computing "the national average height" once and comparing every person to it. A correlated subquery is like asking, for every person, "what is the average height of just people in your own city?" — the answer depends on which person you're currently looking at, so you can't precompute one single number up front.

Easy example

Find employees who earn more than the average salary of their own department (not the company-wide average).

SELECT e.emp_name, e.dept_id, e.salary
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.dept_id = e.dept_id   -- <- this line correlates the subquery to the outer row
);

Line by line: for every outer row e, the inner query recomputes the average salary using only rows from e's own department (e2.dept_id = e.dept_id). Sneha Kapoor (Engineering, 98000) is compared against Engineering's average (~143,250), not the whole company's.

🔁

Mental model: mentally read a correlated subquery as a nested loop: "for each row in the outer table, run the entire inner query again, using this row's values wherever the inner query references the outer alias."

Common mistakeForgetting the correlation condition entirely (writing SELECT AVG(salary) FROM employees instead of filtering by dept_id = e.dept_id) silently turns your "compare to department average" query back into "compare to company average" — a very easy typo to make and very easy for an interviewer to plant.

Level 2 · Intermediate

NOT IN & the NULL trap

This is one of the most important gotchas in all of SQL, and it comes up constantly in interviews.

The problem, demonstrated

Goal: find customers who have never placed an order.

-- Looks reasonable...
SELECT customer_name
FROM customers
WHERE customer_id NOT IN (
    SELECT customer_id FROM orders
);
-- With OUR data (no NULLs in orders.customer_id), this correctly returns Laura Fischer.

Now watch what happens if just one row in the subquery has a NULL customer_id — which is extremely plausible in real data (an order placed as a guest checkout, with no customer linked yet):

INSERT INTO orders VALUES (5006, NULL, '2024-02-10', 300.00);  -- a guest checkout order

SELECT customer_name
FROM customers
WHERE customer_id NOT IN (
    SELECT customer_id FROM orders
);
-- Now returns ZERO ROWS. Not an error -- just silently, completely empty.

Why this happens

NOT IN (list) is logically equivalent to column != v1 AND column != v2 AND column != v3 ... for every value in the list. If any one value in that list is NULL, that one comparison becomes column != NULL, which evaluates to UNKNOWN — not TRUE, not FALSE. And TRUE AND TRUE AND UNKNOWN evaluates to UNKNOWN overall, which is treated as "not true", so the row gets filtered out. This happens for every row being tested, since every row's chain of ANDs includes that same poisoned NULL comparison — so the whole result set silently collapses to empty.

This is a favorite interview trapInterviewers love asking "what's wrong with this NOT IN query" or quietly inserting a NULL into sample data before asking you to write a NOT IN filter. The safe, professional default is always use NOT EXISTS instead of NOT IN whenever the subquery's column could ever contain a NULL — which, in practice, means almost always, unless the column has a NOT NULL constraint you've personally verified.

The safe fix

SELECT c.customer_name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);
-- Correctly returns Laura Fischer, regardless of any NULLs in orders.customer_id

NOT EXISTS never compares values directly against NULL the way NOT IN does — it just asks "does at least one matching row exist," and a NULL row in the subquery simply never matches anything, which is exactly the harmless behavior you want.

Alternative fix if you must use NOT INExplicitly filter out NULLs from the subquery: WHERE customer_id NOT IN (SELECT customer_id FROM orders WHERE customer_id IS NOT NULL). This works, but it's an easy line to forget — NOT EXISTS is simply safer by default.

Level 2 · Intermediate

ANY & ALL

What they are: operators that compare a value against every value returned by a subquery, using a comparison operator (=, >, <, etc.) combined with ANY or ALL.

> ANY (subquery) — true if the value is greater than at least one value returned (equivalent to > MIN(...)).
> ALL (subquery) — true if the value is greater than every value returned (equivalent to > MAX(...)).

Analogy"Taller than ANY of the basketball team" means taller than at least the shortest player. "Taller than ALL of the basketball team" means taller than even the tallest player.

Example — employees earning more than ALL Marketing employees

SELECT emp_name, salary
FROM employees
WHERE salary > ALL (
    SELECT salary FROM employees WHERE dept_id = 3   -- Marketing
)
AND dept_id != 3;   -- exclude Marketing itself from the comparison

This is functionally identical to comparing against MAX(salary) FROM employees WHERE dept_id = 3, but written with ALL. Most engineers actually prefer the MAX()/MIN() scalar subquery form shown in the Level 1 section — it's more widely understood at a glance than ANY/ALL, which some engineers rarely encounter.

Interview tipKnow that = ANY (subquery) is exactly equivalent to IN (subquery), and != ALL (subquery) is exactly equivalent to NOT IN (subquery) — including inheriting the exact same NULL trap discussed above. Interviewers sometimes ask you to name this equivalence directly.

Level 3 · Advanced

EXISTS & NOT EXISTS

What it is: a boolean test — TRUE if the subquery returns at least one row, FALSE if it returns none. EXISTS never cares what columns the subquery selects (writing SELECT 1 or SELECT * inside makes no difference) — only whether a row exists.

This is the "SEMI JOIN" pattern from Module 1, formalized.

Easy example

SELECT c.customer_name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);
-- Rahul, Grace, Omar (Laura excluded -- she has no matching order row)

🧠

Mental model: EXISTS is a correlated subquery almost by definition — it's checking, per outer row, "is there a matching row for this specific outer value." Good query engines short-circuit the inner scan the moment they find one matching row — they don't need to find every match, just confirm at least one exists, which is often faster than a full JOIN + DISTINCT for pure existence checks.

Production use: validations & data qualityEXISTS/NOT EXISTS are the standard tools for data quality checks: "find orders with no matching customer" (referential integrity check), "find products never sold" (dead inventory check), "find employees not in any project" (utilization check). They answer yes/no questions without the row-duplication risk of a JOIN.

Level 3 · Advanced

Execution flow internals: how correlated subqueries really run

Logically, a correlated subquery behaves like a nested loop: for every row the outer query considers, the database substitutes that row's values into the inner query and re-evaluates it from scratch.

SELECT e.emp_name
FROM employees e
WHERE EXISTS (
    SELECT 1 FROM employees m WHERE m.manager_id = e.emp_id
);
-- "find employees who manage at least one other employee"

Conceptual execution, row by row:

Take outer row: Arjun Mehta (emp_id=101). Run inner query with manager_id = 101 → finds Priya & Rohit → EXISTS is TRUE → keep Arjun.
Take outer row: Priya Nair (emp_id=102). Run inner query with manager_id = 102 → finds Sneha → EXISTS is TRUE → keep Priya.
Take outer row: Sneha Kapoor (emp_id=104). Run inner query with manager_id = 104 → finds nobody → EXISTS is FALSE → drop Sneha.
...continue for all 16 rows.

Performance issueNaively, that's O(n) outer rows × O(n) inner scan each = O(n²) in the worst case, with no index. In practice, real query optimizers are smarter than a literal nested loop — many rewrite correlated subqueries into semi-joins or hash-based plans internally when they can prove it's safe to do so. But you should never assume the optimizer will save you — always check EXPLAIN on a correlated subquery running against a large table, and add an index on the correlation column (manager_id here) if it's missing.

Rule of thumbA correlated subquery inside a WHERE EXISTS is usually fine, even on large tables, if the correlated column is indexed — the "re-run per row" becomes a cheap indexed lookup rather than a full table scan. Without that index, correlated subqueries are one of the most common causes of "this query used to be fast and now it's slow" as tables grow.

Level 3 · Advanced

Rewriting subqueries as joins

Many subqueries have an equivalent JOIN form. Knowing both, and knowing when to prefer which, is a strong interview signal.

IN subquery → INNER JOIN + DISTINCT

-- Subquery form
SELECT customer_name FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);

-- Equivalent JOIN form
SELECT DISTINCT c.customer_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

Both give the same result here. The JOIN form needs DISTINCT because a customer with multiple orders would otherwise be duplicated (Module 1's cardinality issue) — the subquery form never had that problem in the first place, since IN only checks membership.

EXISTS → semi-join (same idea)

-- EXISTS form (preferred for pure existence checks)
SELECT c.customer_name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);

-- JOIN + DISTINCT form (works, but needs the DISTINCT safety net)
SELECT DISTINCT c.customer_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

When to prefer which

Prefer EXISTS/IN when you only need a yes/no filter and don't need any columns from the "other" table — it avoids duplication entirely and is often clearer about intent.
Prefer a JOIN when you actually need to pull columns from the other table into your result — a subquery can't return those columns to the outer SELECT.
Prefer a correlated subquery for per-row comparisons against a per-group aggregate (like "above my own department's average") when a window function (Module 6) isn't available or feels heavier than needed for a one-off query.

Level 3 · Advanced

Subqueries in SELECT, FROM, and WHERE

Subquery in FROM (a "derived table" / inline view)

SELECT dept_id, avg_salary
FROM (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
) dept_avgs
WHERE avg_salary > 100000;

The inner query is treated as if it were a real table — you must give it an alias (dept_avgs here), and every database requires this alias even though you never reference it by name in a simple case like this.

Subquery in SELECT (scalar, per-row)

SELECT
    d.dept_name,
    (SELECT COUNT(*) FROM employees e WHERE e.dept_id = d.dept_id) AS headcount
FROM departments d;

This is a correlated scalar subquery run once per department row — functionally similar to a LEFT JOIN + GROUP BY, but sometimes easier to read for a single extra computed column. At scale, prefer the JOIN + GROUP BY form (Module 1) since it's usually a single pass instead of one subquery execution per outer row.

Common mistakeUsing a subquery in FROM without an alias is a syntax error in most databases (MySQL, Postgres both require it) — always alias derived tables, even if you never reference the alias again.

Level 4 · Hard Interview

How interviewers trick people on subqueries

The NOT IN + NULL trap — covered in depth above. This is probably the single most common SQL interview gotcha across all companies.
The forgotten correlation trap — they ask for "above department average" and quietly check whether you correlate the inner query by dept_id, or accidentally compute the company-wide average instead.
The "subquery returns multiple rows" trap — they ask you to use a scalar subquery (= (SELECT ...)) against data where the subquery could return more than one row, to see if you catch the runtime error before running it.
The EXISTS vs IN vs JOIN performance question — "which is fastest?" The honest answer is "it depends on the optimizer, data size, and indexes — but EXISTS is generally safest for a pure existence check, since it can short-circuit and never risks duplicate rows." Interviewers want to see you avoid a dogmatic one-line answer.
The double-negative trap — "find employees NOT in a department that has NO employees earning over 100000" nests two negations; many candidates fumble translating this into nested NOT EXISTS clauses correctly.

Level 4 · Hard Interview

3 Hard Interview Questions

H1.Find the department(s) with the highest average salary. hard▶

Expected output

dept_name, avg_salary — for whichever department(s) tie for the single highest average (handle ties correctly, don't just LIMIT 1).

Hint

Compute all department averages in a derived table or CTE, then compare each against the MAX of that same set.

Full solution

SELECT dept_name, avg_salary
FROM (
    SELECT d.dept_name, AVG(e.salary) AS avg_salary
    FROM departments d
    JOIN employees e ON d.dept_id = e.dept_id
    GROUP BY d.dept_name
) dept_avgs
WHERE avg_salary = (
    SELECT MAX(avg_salary)
    FROM (
        SELECT AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept_id
    ) all_avgs
);
-- Result: Engineering (~143,250)

Common mistakes

Using ORDER BY avg_salary DESC LIMIT 1 — this silently drops a genuine tie if two departments happen to share the exact highest average. Always prefer the = (SELECT MAX(...)) pattern when ties matter.
Forgetting to exclude Legal (0 employees, NULL average) — it's automatically excluded here since we used an INNER JOIN grouping, but worth stating out loud in an interview.

Interview follow-ups

"Now do this with a window function instead." (foreshadowing: RANK() OVER (ORDER BY avg_salary DESC), Module 6)

H2.Find customers whose total order amount is above the average total order amount across all customers. hard▶

Hint

This needs two levels of aggregation: per-customer totals first, then the average of those totals, then a filter. A single-level GROUP BY + HAVING can't reach "average of the per-group sums" directly — you need a derived table.

Full solution

SELECT customer_id, total_spent
FROM (
    SELECT customer_id, SUM(total_amount) AS total_spent
    FROM orders
    WHERE customer_id IS NOT NULL
    GROUP BY customer_id
) customer_totals
WHERE total_spent > (
    SELECT AVG(total_spent)
    FROM (
        SELECT customer_id, SUM(total_amount) AS total_spent
        FROM orders
        WHERE customer_id IS NOT NULL
        GROUP BY customer_id
    ) t
);

Common mistakes

Trying to write HAVING SUM(total_amount) > AVG(SUM(total_amount)) in one single GROUP BY — this is invalid in standard SQL; you cannot nest an aggregate directly inside another aggregate in the same grouping level. The derived-table (or CTE) two-step is required.
Forgetting to exclude the orphaned/NULL customer_id order (5005/5006 style rows) from both the inner sums and the outer average — an ungrounded NULL customer_id would either error out or silently create a bogus grouping bucket depending on the engine.

H3.Find employees who are managers, but ALL of their direct reports earn less than 90000. hard▶

Hint

This is a double condition: (1) must actually have at least one report (EXISTS), and (2) must have zero reports earning ≥ 90000 (NOT EXISTS with a flipped condition) — or use ALL directly.

Full solution — using NOT EXISTS (safe, recommended)

SELECT m.emp_name
FROM employees m
WHERE EXISTS (
    SELECT 1 FROM employees e WHERE e.manager_id = m.emp_id
)
AND NOT EXISTS (
    SELECT 1 FROM employees e WHERE e.manager_id = m.emp_id AND e.salary >= 90000
);

Alternative solution — using ALL

SELECT m.emp_name
FROM employees m
WHERE EXISTS (SELECT 1 FROM employees e WHERE e.manager_id = m.emp_id)
AND 90000 > ALL (
    SELECT e.salary FROM employees e WHERE e.manager_id = m.emp_id
);

Common mistakes

Skipping the first EXISTS check — without it, a manager with zero reports would trivially satisfy "NOT EXISTS a report earning ≥ 90000" (vacuously true, since there are no reports to violate the rule), incorrectly including individual contributors with no reports at all.

Interview follow-ups

"What does 'ALL of an empty set satisfies any condition' mean, and why is it dangerous here?" — a classic logic gotcha: ALL / vacuous truth over an empty set is TRUE by definition, which is exactly why the EXISTS guard is required.

Level 5 · FAANG

2 FAANG-Level Questions

F1.Second-highest distinct salary overall, using only a subquery (no LIMIT/OFFSET, no window functions). faang▶

Problem statement

This is one of the most repeated interview questions in existence — precisely because there are several ways to solve it, and the "wrong" ways all fail on ties or missing values in specific, revealing ways.

Hint

"Second highest" = the max salary strictly less than the overall max.

Full solution

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
-- Overall max is Arjun at 185000 -> second highest is 160000 (Karan Verma)

Alternative solution — correlated subquery counting how many distinct salaries are above each row

SELECT DISTINCT e1.salary
FROM employees e1
WHERE 1 = (
    SELECT COUNT(DISTINCT e2.salary)
    FROM employees e2
    WHERE e2.salary > e1.salary
);
-- "exactly 1 distinct salary is greater than mine" = I am the second highest

Optimized solution

For "Nth highest" generalized, a window function (DENSE_RANK() OVER (ORDER BY salary DESC), filtered to rank = N) is the cleanest and most maintainable production answer — see Module 6. The subquery forms above are the classic interview answer specifically because window functions weren't always available/allowed as a constraint in the question.

Time complexity

First solution: O(n) for each MAX scan, effectively O(n) overall with a single pass (or O(log n) with an index on salary). Second solution: O(n²) without an index, since it's a correlated subquery per row.

Common mistakes

Using ORDER BY salary DESC LIMIT 1 OFFSET 1 — works, but silently gives the wrong answer if there's a tie for first place (in our data, if the two highest earners tied, OFFSET 1 would return that same tied value, not the "true" second distinct value). The question explicitly says "distinct" salary for this reason.
Forgetting DISTINCT when the underlying data has salary ties elsewhere in the table (it doesn't affect this specific MAX-based solution, but does affect naive OFFSET-based ones).

Interview follow-ups

"Generalize this to the Nth highest salary." (Motivates window functions directly — this single question is the classic bridge from subqueries to Module 6.)

F2.Find all departments where every single employee earns above 90000 (a "fully senior" department). faang▶

Problem statement

Notice the trap: Legal has zero employees. Does "every employee earns above 90000" vacuously include Legal? This is a genuine judgment call interviewers use to test whether you think about edge cases, not just write syntax.

Hint

"Every employee is above X" = "NOT EXISTS an employee at or below X" — but decide deliberately whether to require at least one employee to exist first.

Full solution — excluding empty departments (the safer business interpretation)

SELECT d.dept_name
FROM departments d
WHERE EXISTS (
    SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
)
AND NOT EXISTS (
    SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id AND e.salary <= 90000
);
-- Result: none of our 5 non-empty departments qualify (each has at least one employee at or below 90000)

Alternative solution — including empty departments (the "vacuous truth" interpretation)

SELECT d.dept_name
FROM departments d
WHERE NOT EXISTS (
    SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id AND e.salary <= 90000
);
-- Result: Legal included, since it has no violating employees (vacuously true)

Common mistakes

Picking one interpretation without stating the ambiguity out loud. A strong interview answer explicitly says: "Do you want departments with zero employees to count as trivially satisfying this? I'll assume not, but here's the one-line change if you want the other interpretation" — and then shows both queries, exactly as above.

Interview follow-ups

"Now require it in SQL that treats vacuous truth as invalid by business rule — how would you encode 'must have at least 3 employees to qualify'?" (Add a HAVING COUNT(*) >= 3 in a GROUP BY version instead.)

Level 5 · FAANG

1 Production Scenario

Referential integrity check before a warehouse load

Before loading a new batch of orders into a data warehouse fact table, a standard pre-load data quality gate checks for orphaned foreign keys — orders referencing a customer_id that doesn't (yet) exist in the customer dimension. Loading them anyway would create the fan-out / "UNKNOWN" join gaps discussed in Module 1's production scenario.

-- Data quality gate: fail the load (or quarantine these rows) if any exist
SELECT o.order_id, o.customer_id
FROM orders o
WHERE NOT EXISTS (
    SELECT 1 FROM customers c WHERE c.customer_id = o.customer_id
);
-- Flags order 5005 (customer_id 999, doesn't exist in customers)

Production pipelines typically run this exact NOT EXISTS check as an automated assertion step (in dbt, this pattern is literally the built-in relationships test) and either: (a) halt the pipeline and alert, (b) route the offending rows to a quarantine table for manual review, or (c) load them anyway but flag them, matching the "UNKNOWN" category approach from Module 1 — the right choice depends on whether late-arriving dimension data is expected and normal, or a genuine data bug.

Why NOT EXISTS, not NOT IN, hereIf orders.customer_id can ever be NULL (guest checkouts, as seen earlier), a NOT IN version of this exact check would silently report zero problems even when real orphaned rows exist — the worst possible failure mode for a data quality gate: appearing to pass while hiding real issues.

Practice

10 Practice Questions

Write the SQL yourself before expanding each answer.

P1.Find employees who earn the maximum salary in the whole company. easy▶

SELECT emp_name, salary
FROM employees
WHERE salary = (SELECT MAX(salary) FROM employees);

P2.Find departments that have at least one employee earning over 150000. easy▶

SELECT dept_name
FROM departments d
WHERE EXISTS (
    SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id AND e.salary > 150000
);

P3.Find employees who are NOT anyone's manager. easy▶

SELECT emp_name
FROM employees e
WHERE NOT EXISTS (
    SELECT 1 FROM employees m WHERE m.manager_id = e.emp_id
);

P4.Find customers who have placed more than one order. medium▶

SELECT customer_id
FROM (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    WHERE customer_id IS NOT NULL
    GROUP BY customer_id
) t
WHERE order_count > 1;
-- Rahul Bansal (201) placed 2 orders

P5.Find employees earning more than their department's average, using a correlated subquery. medium▶

SELECT e.emp_name
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary) FROM employees e2 WHERE e2.dept_id = e.dept_id
);

P6.Find departments with zero employees, using NOT EXISTS. medium▶

SELECT dept_name
FROM departments d
WHERE NOT EXISTS (
    SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
);
-- Legal

P7.Find the employee(s) with the lowest salary in each department (one subquery per department comparison). medium▶

SELECT e.emp_name, e.dept_id, e.salary
FROM employees e
WHERE e.salary = (
    SELECT MIN(e2.salary) FROM employees e2 WHERE e2.dept_id = e.dept_id
);

P8.Explain, without running it, why WHERE dept_id NOT IN (SELECT dept_id FROM employees) could return zero departments even if Legal genuinely has no employees. hard▶

Because employees.dept_id contains NULLs (Tanvi Desai, Sameer Gupta) — the subquery's result list includes a NULL, which poisons every comparison in the implicit AND-chain that NOT IN performs, collapsing the whole result to empty. The safe rewrite is NOT EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id).

P9.Find employees whose salary is above the average salary of employees who report to the same manager as them (peer average, not department average). hard▶

SELECT e.emp_name, e.salary
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.manager_id = e.manager_id
);
-- Note: employees with manager_id = NULL (Arjun, Karan, Meera, Divya, Ishita) compare against
-- the average of OTHER NULL-manager employees -- since NULL = NULL is never true in the correlation,
-- double check this edge case: most engines treat "e2.manager_id = e.manager_id" as UNKNOWN when both are NULL,
-- so top-level employees would each compare against an EMPTY set (AVG of nothing = NULL), and salary > NULL is UNKNOWN --
-- meaning all NULL-manager employees are silently excluded. Flag this out loud as a known limitation.

P10.Find orders placed by customers from a country other than India. hard▶

SELECT order_id, total_amount
FROM orders
WHERE customer_id IN (
    SELECT customer_id FROM customers WHERE country != 'India'
);
-- Grace Lin's and Omar Haddad's orders (5002, 5004)
-- Note: order 5005 (customer_id 999) is correctly excluded, since 999 isn't in the customers list at all

Before you say NEXT: make sure you can explain, out loud and without looking, why NOT IN can silently return zero rows when NULLs are present, the difference between a correlated and non-correlated subquery, when to prefer EXISTS over a JOIN, and how to rewrite an IN subquery as an equivalent JOIN. When you're ready, reply NEXT and we'll move to Module 3: CTEs (Common Table Expressions).