Module 3 — CTEs (Common Table Expressions)
Why CTEs exist
In Module 2 you wrote queries with subqueries nested inside WHERE or SELECT. That works, but once you need three or four layers of "first compute this, then compute that from it," nested subqueries turn into an unreadable pyramid — brackets inside brackets inside brackets, read from the inside out.
A CTE (WITH name AS (...)) lets you give a subquery a name and write it before the query that uses it, top to bottom, like a recipe: "first do this, call it X. Then do that using X, call it Y. Finally, use Y here." It's the same execution idea as a subquery — but readable in the order a human thinks.
What a CTE actually is
A CTE is a named, temporary result set that exists only for the duration of one SQL statement. It is defined with the WITH keyword, given a name, and can then be referenced by that name later in the same statement — in the main query, or even by another CTE defined after it.
When to use it
- When a query needs multiple logical steps (filter → aggregate → rank → filter again) and nesting subqueries would hurt readability.
- When the same derived result is needed more than once in a query (referencing a CTE by name twice is cleaner than duplicating a subquery twice — though not always cheaper, see Internals below).
- When you want a query that reads top-to-bottom like a checklist, so teammates (and future you) can follow the logic without rewinding.
- When you need recursion — a
RECURSIVECTE is the *only* standard way to walk a hierarchy in plain SQL (full topic in Module 4).
When NOT to use it
- For a single, simple, one-line filter — a plain subquery or even a plain
WHEREclause is shorter and just as clear. - When you actually need the result reused across multiple separate statements — that's what a view or a real staging table is for, not a CTE (a CTE dies the moment the statement finishes).
- Blindly, for performance, assuming "CTE = faster/cached" — in most databases a non-recursive CTE is just a readability tool; the engine is free to inline it back into a subquery. Never assume a CTE is materialized unless you check (see Internals).
Schema & sample data for this module
We reuse employees / departments and customers / orders exactly as in Modules 1–2. We add one new table, order_items, so we have a genuine multi-step pipeline (orders → line items → product-level totals) to build chained CTEs on top of.
DDL (recap + new table)
-- employees / departments: identical to Module 1
-- customers / orders: identical to Module 2 (order 5005 has a dangling customer_id = 999)
CREATE TABLE order_items (
item_id INT PRIMARY KEY,
order_id INT, -- FK to orders
product VARCHAR(50),
quantity INT,
unit_price DECIMAL(10,2)
);
INSERT INTO order_items VALUES
(9001, 5001, 'Keyboard', 2, 1200.00),
(9002, 5001, 'Mouse', 1, 100.00),
(9003, 5002, 'Monitor', 1, 1200.00),
(9004, 5003, 'Mouse', 2, 400.00),
(9005, 5004, 'Laptop', 1, 4300.00),
(9006, 5005, 'Keyboard', 1, 650.00); -- belongs to the orphaned order 5005
| item_id | order_id | product | qty | unit_price |
|---|---|---|---|---|
| 9001 | 5001 | Keyboard | 2 | 1200.00 |
| 9002 | 5001 | Mouse | 1 | 100.00 |
| 9003 | 5002 | Monitor | 1 | 1200.00 |
| 9004 | 5003 | Mouse | 2 | 400.00 |
| 9005 | 5004 | Laptop | 1 | 4300.00 |
| 9006 | 5005 | Keyboard | 1 | 650.00 |
| order_id | customer_id | total_amount |
|---|---|---|
| 5001 | 201 | 2500.00 |
| 5002 | 202 | 1200.00 |
| 5003 | 201 | 800.00 |
| 5004 | 203 | 4300.00 |
| 5005 | 999 | 650.00 |
total_amount exactly — a sanity check you'll use in the ETL pipeline below. Order 5005's items reference the orphaned customer again on purpose, so the "clean → aggregate → report" pipeline has to deliberately decide what to do with it, just like real ETL.The basic WITH syntax
What it is: WITH cte_name AS ( SELECT ... ) SELECT ... FROM cte_name. The part inside the parentheses runs conceptually first (or is folded into the outer query — see Internals), and everywhere you write cte_name afterward, the database substitutes that result.
Easy example
Find employees who earn more than the company-wide average salary — same question as Module 2, rewritten with a CTE.
WITH avg_salary AS (
SELECT AVG(salary) AS avg_sal
FROM employees
)
SELECT e.emp_name, e.salary
FROM employees e, avg_salary a
WHERE e.salary > a.avg_sal;
Line by line: avg_salary is defined as a named result containing one row, one column (avg_sal). The main query then treats avg_salary exactly like a real table — here it's cross-joined (comma join) with employees, which is safe because avg_salary only ever has exactly one row, so it doesn't multiply anything.
Cleaner version using an explicit JOIN
WITH avg_salary AS (
SELECT AVG(salary) AS avg_sal FROM employees
)
SELECT e.emp_name, e.salary, a.avg_sal
FROM employees e
JOIN avg_salary a ON TRUE; -- always-true join condition since avg_salary has 1 row
WHERE salary > (SELECT AVG(salary) FROM employees)) — the CTE version above exists mainly to teach the syntax. CTEs start earning their keep once there's more than one step, next section.CTE vs subquery vs view
These three all wrap a query, but they solve different problems. Confusing them is a very common beginner (and interview) mistake.
| Feature | Subquery | CTE | View |
|---|---|---|---|
| Lifespan | One statement | One statement | Permanent (until dropped) |
| Named? | No (anonymous) | Yes | Yes |
| Reusable across queries? | No | No | Yes |
| Can reference itself (recursion)? | No | Yes (WITH RECURSIVE) | No (not directly) |
| Stored in the database? | No | No | Yes (definition only, unless materialized) |
| Best for | One-off inline check | Multi-step readable logic, this query only | Logic reused by many queries/reports |
Multiple independent CTEs
What it is: you can define more than one CTE in the same WITH clause, separated by commas. Each one is its own named result. They don't have to depend on each other — you can just define two unrelated helper queries and use both in the main query.
Medium example
For each department, show headcount and, separately, the department's total order revenue from customers who happen to share the department's location as their country name (a contrived but realistic-shaped "combine two unrelated aggregates" report).
WITH dept_counts AS (
SELECT dept_id, COUNT(*) AS headcount
FROM employees
GROUP BY dept_id
),
customer_revenue AS (
SELECT c.country, SUM(o.total_amount) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.country
)
SELECT d.dept_name, dc.headcount, cr.revenue
FROM departments d
LEFT JOIN dept_counts dc ON d.dept_id = dc.dept_id
LEFT JOIN customer_revenue cr ON d.location = cr.country; -- location vs country rarely match in real data; illustrative only
Line by line: dept_counts and customer_revenue are two completely independent CTEs — neither references the other. Both are simply available, by name, to the final SELECT, exactly as if they were two real tables.
WITH clause doesn't matter for correctness (unless one references another — next section) — but list them in the order a reader would find most natural, top to bottom.Chained CTEs
What it is: a CTE defined later in the WITH clause can reference an earlier CTE by name. This is where CTEs really shine — each step builds on the previous one, like a pipeline.
Medium example
Build a per-customer revenue report in three clear steps: (1) compute line-item totals, (2) roll them up to order totals, (3) roll those up to customer totals.
WITH item_totals AS (
SELECT
order_id,
SUM(quantity * unit_price) AS line_total
FROM order_items
GROUP BY order_id
),
order_totals AS (
SELECT
o.customer_id,
it.order_id,
it.line_total
FROM item_totals it
JOIN orders o ON o.order_id = it.order_id
),
customer_totals AS (
SELECT
customer_id,
SUM(line_total) AS total_revenue
FROM order_totals
GROUP BY customer_id
)
SELECT c.customer_name, ct.total_revenue
FROM customer_totals ct
JOIN customers c ON c.customer_id = ct.customer_id
ORDER BY ct.total_revenue DESC;
Line by line: item_totals only touches order_items. order_totals only references item_totals (not the raw table) plus orders, to attach a customer_id. customer_totals only references order_totals. Each step has exactly one job, and you could test each one individually just by running SELECT * FROM item_totals style checks while building the query — this is the single biggest practical advantage of chaining CTEs over nesting subqueries.
Modular SQL design
Chained CTEs aren't just a syntax trick — they're a design philosophy borrowed from software engineering: break one big, scary problem into small, named, individually-testable steps.
The discipline
- One CTE, one responsibility. If a CTE's
SELECTis doing two unrelated things (filtering AND aggregating AND ranking all at once), split it into two CTEs. - Name CTEs like variables, not like "step1/step2".
item_totals,order_totals,customer_totalstell the reader what's inside without opening the query. - Test each layer independently while building. Comment out everything after a given CTE, add a
SELECT * FROM that_cte LIMIT 20, and eyeball it before building the next layer on top. - Keep the final SELECT thin. By the time you reach the last
SELECT, most of the hard logic should already be done — the final step should read almost like plain English.
How CTEs execute internally
This is the single most-tested "gotcha" concept about CTEs in interviews, and it varies by database engine — so understanding the concept matters more than memorizing one engine's behavior.
Two ways an engine can treat a CTE
- Inlining (a.k.a. "merging"): the engine treats the CTE as if you'd pasted its SQL text directly into every place you referenced it — like a macro. If you reference the CTE three times, its query effectively runs three times. The optimizer can then push filters from the outer query down into the CTE's logic, which can actually make it faster than a naive materialized version.
- Materialization: the engine runs the CTE's query exactly once, stores the result in a temporary work area (memory or disk), and every reference afterward just reads from that stored result. Good when the CTE is expensive and referenced multiple times; bad when the CTE is huge and only a tiny filtered slice of it is actually needed by the outer query, because the filter can't be pushed down into it.
MATERIALIZED explicitly. Some other engines (older Postgres, SQL Server in some plans) tend to materialize automatically, especially when a CTE is referenced multiple times. Always check with EXPLAIN on your actual engine before assuming.Why this matters
-- If this CTE scans a 500-million-row fact table...
WITH big_scan AS (
SELECT * FROM order_items
)
-- ...and you only need 3 rows here, inlining lets the filter
-- push down into big_scan's scan. Materialization would force
-- a full scan of order_items FIRST, then filter afterward.
SELECT * FROM big_scan WHERE order_id = 5001;
WITH big_scan AS MATERIALIZED (...) or WITH big_scan AS NOT MATERIALIZED (...). Use MATERIALIZED when the CTE is genuinely reused multiple times and is expensive to recompute; use NOT MATERIALIZED when you want the optimizer free to push filters down.Scope, reuse & a recursion preview
- A CTE is only visible within the single statement it's attached to. You cannot reference it from a completely different query afterward — it's gone the moment that statement finishes.
- A CTE can be referenced multiple times within the same main query (e.g., once to compute a total, once to compute a rank) — this is exactly the situation where "does it inline or materialize" matters most, since each reference re-runs the logic unless it's materialized.
- By default, a CTE cannot reference itself —
WITH cte AS (SELECT ... FROM cte)is an error, unless you explicitly writeWITH RECURSIVE. That's a deliberate safety rail, and the entire subject of Module 4. - A later CTE can reference an earlier one (chaining, as above), but an earlier CTE can never reference a later one — the dependency graph only flows downward, top to bottom.
MATERIALIZED), it may run three separate times — each potentially expensive. If you genuinely need "compute once, use three times" guaranteed, materialize it explicitly or stage it into a real temp table.Staged ETL pipelines with CTEs
Let's fix the silent data-loss bug from the Chained CTEs section, the way a real data engineer would: make every stage's data-quality decision explicit and visible, instead of letting INNER JOINs quietly drop rows.
WITH cleaned_orders AS (
-- Stage 1: flag orphaned orders instead of silently dropping them
SELECT
o.order_id,
o.customer_id,
CASE WHEN c.customer_id IS NULL THEN TRUE ELSE FALSE END AS is_orphaned
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
),
item_totals AS (
-- Stage 2: line-item rollup, unrelated to the customer-quality question
SELECT order_id, SUM(quantity * unit_price) AS line_total
FROM order_items
GROUP BY order_id
),
enriched_orders AS (
-- Stage 3: combine stage 1 + stage 2
SELECT
co.order_id,
co.customer_id,
co.is_orphaned,
COALESCE(it.line_total, 0) AS line_total
FROM cleaned_orders co
LEFT JOIN item_totals it ON it.order_id = co.order_id
)
SELECT
order_id,
customer_id,
line_total,
CASE WHEN is_orphaned THEN 'REVIEW: missing customer' ELSE 'OK' END AS data_quality_flag
FROM enriched_orders
ORDER BY is_orphaned DESC;
Order 5005 now appears in the output with an explicit flag, instead of disappearing. This is the difference between a pipeline that fails loudly (good) and one that fails silently (dangerous) — the exact same lesson from the LEFT JOIN production example in Module 1, now applied across a multi-stage pipeline.
Deduplication with a CTE
A CTE combined with ROW_NUMBER() (full detail in Module 6, previewed here since it's such a common CTE use case) is the standard production pattern for removing duplicate rows.
WITH ranked AS (
SELECT
item_id, order_id, product, quantity, unit_price,
ROW_NUMBER() OVER (
PARTITION BY order_id, product -- "duplicate" means same order + same product
ORDER BY item_id -- keep the earliest inserted row
) AS rn
FROM order_items
)
SELECT item_id, order_id, product, quantity, unit_price
FROM ranked
WHERE rn = 1;
Why a CTE here specifically: ROW_NUMBER() cannot be referenced directly in the same SELECT's WHERE clause (window functions are computed after WHERE conceptually) — so you must compute it in one layer (the CTE) and filter it in the next. This is one of the few cases where a CTE isn't just "for readability," it's functionally required by SQL's rules of execution order.
WHERE ROW_NUMBER() OVER (...) = 1 directly — this is a syntax error in every major engine. Window functions can only be filtered on in an outer layer (CTE, subquery, or QUALIFY in engines that support it, like Snowflake/BigQuery).When CTEs hurt performance
- Re-computation on every reference. If a CTE is referenced 4 times in the outer query and the engine inlines it, you've effectively written the same expensive subquery 4 times.
- Blocking predicate pushdown when forced-materialized. If an engine (or an explicit
MATERIALIZEDhint) computes the full CTE before any outer filter is applied, a CTE that scans a huge table but is only needed for 10 rows can become a full-table-scan bottleneck. - Long chains hiding a bad early step. A mistake in CTE #1 of a 6-CTE chain silently propagates through every later stage — always spot-check the earliest CTEs first when a long pipeline gives wrong output.
- Over-normalizing a simple query into 5 CTEs "for style" — readability is the goal, not CTE count. If two steps are trivially simple and always used together, merging them can be clearer, not worse.
EXPLAIN (or EXPLAIN ANALYZE) and look specifically for whether each CTE shows as a separate materialized "CTE Scan" node or is fused into the main plan. That single check tells you exactly where to focus tuning effort.How interviewers trick you with CTEs
- They ask you to filter on a window function result directly in the same
SELECT("just addWHERE rn = 1") to see if you know you need a CTE/subquery layer for that. - They ask "does using a CTE make this query faster than a subquery?" — the honest, senior-level answer is "usually not by itself; it depends on inlining vs materialization for this specific engine," not a flat yes or no.
- They give you a query with the same CTE referenced 3 times and ask you to estimate cost — testing whether you assume "defined once = computed once."
- They ask you to write a CTE that references itself without saying the word "recursive" — watching whether you catch that it needs
WITH RECURSIVE(Module 4) rather than plainWITH. - They ask you to convert a deeply nested subquery into CTEs live on a whiteboard — really testing whether you can decompose a problem into clean, independent steps under pressure.
3 Hard Interview Questions
H1.Using CTEs, find the department with the highest average salary — and explain why a naive single CTE isn't quite enough if you also need the number itself for a later comparison. hard▶
One row: dept_name, avg_salary for the single highest-average department (Engineering, since 185000/145000/145000/98000 averages to 143,250).
Compute per-department averages in one CTE, then pick the max in the outer query — don't try to do both in one step.
WITH dept_avg AS (
SELECT d.dept_name, AVG(e.salary) AS avg_salary
FROM departments d
JOIN employees e ON e.dept_id = d.dept_id
GROUP BY d.dept_name
)
SELECT dept_name, avg_salary
FROM dept_avg
ORDER BY avg_salary DESC
LIMIT 1;
Using a second CTE to compute the max explicitly, useful when you need to keep comparing against that number later in a longer pipeline:
WITH dept_avg AS (
SELECT d.dept_name, AVG(e.salary) AS avg_salary
FROM departments d
JOIN employees e ON e.dept_id = d.dept_id
GROUP BY d.dept_name
),
max_avg AS (
SELECT MAX(avg_salary) AS top_avg FROM dept_avg
)
SELECT da.dept_name, da.avg_salary
FROM dept_avg da
JOIN max_avg ma ON da.avg_salary = ma.top_avg;
ORDER BY ... LIMIT 1 is simplest and fastest for "just the top one." The two-CTE MAX version is better when there could be a tie for first place and you want ALL of them returned, not just one arbitrary winner — LIMIT 1 silently picks only one row even if two departments tie exactly.
O(n) to compute the join + aggregate, O(d log d) to sort d departments for the LIMIT version; O(d) for the MAX-join version (no sort needed).
- Using
LIMIT 1when the real requirement is "all departments tied for first" — always ask if ties matter before picking this shortcut. - Forgetting that empty departments (Legal, 0 employees) simply don't appear here at all, since the JOIN drops them — worth calling out explicitly, mirroring the Module 1 lesson.
- "What if you need the top 3 departments, with ties fully expanded?" (foreshadows RANK/DENSE_RANK in Module 6.)
H2.Build a three-stage CTE pipeline that reports, per customer, their total revenue AND what percentage that customer contributes to total company revenue. hard▶
customer_name, total_revenue, pct_of_total — percentages summing to ~100% across all customers with orders.
Stage 1: per-customer totals. Stage 2: grand total (a single scalar-like CTE). Final SELECT: divide stage 1 by stage 2.
WITH customer_totals AS (
SELECT c.customer_name, SUM(o.total_amount) AS total_revenue
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_name
),
grand_total AS (
SELECT SUM(total_revenue) AS company_total FROM customer_totals
)
SELECT
ct.customer_name,
ct.total_revenue,
ROUND(100.0 * ct.total_revenue / gt.company_total, 2) AS pct_of_total
FROM customer_totals ct
CROSS JOIN grand_total gt
ORDER BY pct_of_total DESC;
- Computing
grand_totalstraight fromordersinstead of fromcustomer_totals— both give the same number here (since every order has a valid customer... except 5005), but building it FROM the earlier CTE keeps the pipeline consistent and is safer if a filter is later added tocustomer_totalsthat should also apply to the grand total. - Integer division: writing
100 * total_revenue / company_totalwithout the.0can silently truncate to 0 in some engines if both values are integer types — always force a decimal/float context explicitly.
- "Now do the same thing but per department instead of per customer, reusing the same pattern." (Tests whether they understood the pattern or just memorized the query.)
H3.You inherit a query with 6 chained CTEs and it returns the wrong result. Describe your debugging process, then simulate it on a 3-CTE version that has a bug. hard▶
A demonstrated debugging methodology, not just a fixed query.
Debug top-down: materialize and inspect the earliest CTE first with a standalone SELECT * FROM cte1 LIMIT 20, confirm it, then move to the next.
WITH item_totals AS (
SELECT order_id, SUM(quantity * unit_price) AS line_total
FROM order_items
GROUP BY order_id
),
order_with_customer AS (
SELECT it.order_id, o.customer_id, it.line_total
FROM item_totals it
JOIN orders o ON o.order_id = it.order_id -- BUG lives one layer up, but this looks fine in isolation
),
customer_totals AS (
SELECT customer_id, SUM(line_total) AS total_revenue
FROM order_with_customer
GROUP BY customer_id
)
SELECT c.customer_name, ct.total_revenue
FROM customer_totals ct
JOIN customers c ON c.customer_id = ct.customer_id;
-- Bug: order 5002 (Grace Lin) has no rows in order_items? No — check again: it does (9003).
-- The real bug: order_items has NO row for some future order_id that never got line items yet
-- (e.g. a new order 5006 with $900 total but items not loaded). item_totals silently drops it
-- because GROUP BY only emits rows for order_ids that exist in order_items, and item_totals
-- is INNER JOINed away entirely -- the order's own known total_amount is never used as a fallback.
- Run
SELECT * FROM item_totalsalone — compare row count againstSELECT COUNT(DISTINCT order_id) FROM orders. If they don't match, the bug is already at stage 1. - Confirm: any
order_idinordersmissing fromorder_itemsvanishes right here, before later stages even run. - Fix at the source: use
o.total_amountas a documented fallback, or LEFT JOIN item_totals onto orders instead of the reverse, so the base table (source of truth for "an order exists") drives the pipeline.
WITH item_totals AS (
SELECT order_id, SUM(quantity * unit_price) AS line_total
FROM order_items
GROUP BY order_id
),
order_with_customer AS (
SELECT
o.order_id,
o.customer_id,
COALESCE(it.line_total, o.total_amount) AS line_total -- fallback to header total
FROM orders o
LEFT JOIN item_totals it ON it.order_id = o.order_id
)
SELECT customer_id, SUM(line_total) AS total_revenue
FROM order_with_customer
GROUP BY customer_id;
- "How would you add an automated data-quality check that catches this class of bug before it reaches a dashboard?" (row-count reconciliation between stages, alerting on drift.)
2 FAANG-Level Questions
F1.Build a modular "top N per group" report — the single highest-value order per customer — using only CTEs (no window functions, since that's Module 6). faang▶
This tests whether you can replicate a "top N per group" pattern with only CTEs + aggregation, and whether you understand why this approach doesn't scale to "top 3 per group" cleanly (motivating window functions later).
Stage 1: max order value per customer. Stage 2: join back to orders to find the row(s) that match that max.
WITH max_per_customer AS (
SELECT customer_id, MAX(total_amount) AS max_amount
FROM orders
GROUP BY customer_id
)
SELECT o.customer_id, o.order_id, o.total_amount
FROM orders o
JOIN max_per_customer m
ON o.customer_id = m.customer_id
AND o.total_amount = m.max_amount;
Rahul Bansal (201) has two orders, 2500 and 800 — only 2500 (order 5001) survives.
This "join back to the max" pattern is fine for top-1, but extending it to top-3 per group requires a completely different technique (ranking + filtering), because there's no clean way to say "the 3rd highest" using only MAX/GROUP BY. ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total_amount DESC) inside a CTE, filtered to rn <= 3, is the production-grade generalization — Module 6.
O(n) to aggregate + O(n) to join back with an index on (customer_id, total_amount); without that index, the join-back is effectively O(n²).
- If two orders from the same customer tie exactly on
total_amount, this pattern returns BOTH — which is sometimes desired (all-time-highs), sometimes a bug (expecting exactly one row per customer). Always clarify tie-handling requirements explicitly with the interviewer.
- "Now do top-2 per customer." (Correct answer: "I'd switch to ROW_NUMBER/RANK — this MAX-based pattern doesn't generalize.")
F2.Design a CTE-based query that validates data quality across the whole orders pipeline in one shot: orphaned orders, orders with no line items, and orders where the line-item sum doesn't match the header total. faang▶
Real warehouse teams run exactly this kind of reconciliation query nightly. It tests whether you can compose several independent quality checks into one modular, extensible report.
One CTE per quality dimension, each producing a boolean flag, then UNION or combine flags in a final CTE so every order gets one row with all flags visible together.
WITH orphan_check AS (
SELECT o.order_id, (c.customer_id IS NULL) AS is_orphaned
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
),
item_check AS (
SELECT o.order_id, COUNT(oi.item_id) AS item_count, COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS computed_total
FROM orders o
LEFT JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_id
),
combined AS (
SELECT
o.order_id,
oc.is_orphaned,
ic.item_count = 0 AS has_no_items,
ABS(ic.computed_total - o.total_amount) > 0.01 AS total_mismatch
FROM orders o
JOIN orphan_check oc ON oc.order_id = o.order_id
JOIN item_check ic ON ic.order_id = o.order_id
)
SELECT *
FROM combined
WHERE is_orphaned OR has_no_items OR total_mismatch;
Order 5005 flags is_orphaned = true. In this dataset every order's items sum matches its header total exactly, so total_mismatch stays false for all — a realistic reminder that most rows in a QA report should be clean; the checks exist for the rare bad ones.
Each check-CTE here scans orders once independently; at huge scale, computing all checks in a single pass over a pre-joined base CTE (join orders/customers/order_items once, then derive every flag from that one result) avoids redundant scans of the same base tables.
O(n) per check with proper indexes on the join keys; the naive multi-scan version is O(k·n) for k checks, the single-pass version is O(n).
- Comparing floating point totals with
=instead of an epsilon tolerance (ABS(diff) > 0.01) — exact equality on computed decimals is fragile due to rounding. - Using INNER JOIN between
combined's sources when an order genuinely has zero items — must be LEFT JOIN or the "has no items" case disappears instead of being flagged.
- "How would you turn this into an automated daily alert?" (Schedule it, write the flagged row count to a monitoring table, alert on non-zero.)
1 Production Scenario
A dbt-style layered mart built from chained CTEs before it ever becomes separate models
You're asked for a "revenue by customer, with data-quality confidence" report for a dashboard. Rather than write one giant query, a senior data engineer sketches it first as a chain of CTEs, in this order, matching a typical staging → intermediate → mart layering:
WITH stg_orders AS ( -- staging: light cleanup only, one-to-one with source
SELECT order_id, customer_id, total_amount
FROM orders
),
stg_order_items AS (
SELECT order_id, quantity, unit_price
FROM order_items
),
int_order_totals AS ( -- intermediate: business logic, still order-grain
SELECT
o.order_id,
o.customer_id,
COALESCE(SUM(oi.quantity * oi.unit_price), o.total_amount) AS revenue,
(SELECT COUNT(*) FROM stg_order_items x WHERE x.order_id = o.order_id) = 0 AS missing_items
FROM stg_orders o
LEFT JOIN stg_order_items oi ON TRUE -- illustrative; real version joins on order_id
GROUP BY o.order_id, o.customer_id, o.total_amount
),
mart_customer_revenue AS ( -- mart: final grain the dashboard actually queries
SELECT customer_id, SUM(revenue) AS total_revenue, BOOL_OR(missing_items) AS has_data_gaps
FROM int_order_totals
GROUP BY customer_id
)
SELECT * FROM mart_customer_revenue;
Once this chain is proven correct (tested layer by layer, exactly as taught above), each CTE typically graduates into its own dbt model file — stg_orders.sql, int_order_totals.sql, mart_customer_revenue.sql — so each layer can be tested, scheduled, and reused independently across many downstream reports, not just this one query.
10 Practice Questions
Write the SQL yourself before expanding each answer. Use the employees / departments / customers / orders / order_items schema above.
P1.Using a CTE, list all employees earning more than the company-wide average salary. easy▶
WITH avg_sal AS (
SELECT AVG(salary) AS avg_salary FROM employees
)
SELECT e.emp_name, e.salary
FROM employees e, avg_sal a
WHERE e.salary > a.avg_salary;P2.Using a CTE, find the total order revenue per customer. easy▶
WITH revenue AS (
SELECT customer_id, SUM(total_amount) AS total_revenue
FROM orders
GROUP BY customer_id
)
SELECT c.customer_name, r.total_revenue
FROM customers c
JOIN revenue r ON r.customer_id = c.customer_id;P3.Using two independent CTEs, show department headcounts alongside the total number of orders in the whole system (a single repeated number next to every department). easy▶
WITH dept_counts AS (
SELECT dept_id, COUNT(*) AS headcount FROM employees GROUP BY dept_id
),
order_count AS (
SELECT COUNT(*) AS total_orders FROM orders
)
SELECT d.dept_name, dc.headcount, oc.total_orders
FROM departments d
LEFT JOIN dept_counts dc ON dc.dept_id = d.dept_id
CROSS JOIN order_count oc;P4.Chain two CTEs: first compute each order's line-item total, then find orders where that total differs from the order's stored total_amount. medium▶
WITH item_totals AS (
SELECT order_id, SUM(quantity * unit_price) AS line_total
FROM order_items
GROUP BY order_id
)
SELECT o.order_id, o.total_amount, it.line_total
FROM orders o
JOIN item_totals it ON it.order_id = o.order_id
WHERE o.total_amount <> it.line_total;
-- In our clean dataset this returns 0 rows -- confirming data integrity.P5.Using a CTE, list every department together with its average salary, including departments with zero employees (AVG should show NULL, not error). medium▶
WITH dept_avg AS (
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
)
SELECT d.dept_name, da.avg_salary
FROM departments d
LEFT JOIN dept_avg da ON da.dept_id = d.dept_id;P6.Using a CTE + ROW_NUMBER, find the single most expensive line item in each order. medium▶
WITH ranked AS (
SELECT
order_id, product, quantity, unit_price,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY unit_price DESC) AS rn
FROM order_items
)
SELECT order_id, product, unit_price
FROM ranked
WHERE rn = 1;P7.Using a three-stage chained CTE, report each customer's revenue and rank them from highest to lowest spender (rank computed with a self-join style count, no window functions yet). medium▶
WITH customer_revenue AS (
SELECT c.customer_id, c.customer_name, SUM(o.total_amount) AS total_revenue
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name
),
ranked AS (
SELECT
cr1.customer_name,
cr1.total_revenue,
(SELECT COUNT(*) FROM customer_revenue cr2 WHERE cr2.total_revenue > cr1.total_revenue) + 1 AS spend_rank
FROM customer_revenue cr1
)
SELECT * FROM ranked ORDER BY spend_rank;P8.Using a CTE, find customers who have placed orders but whose total revenue is below the average revenue across all customers who have ordered. hard▶
WITH customer_revenue AS (
SELECT customer_id, SUM(total_amount) AS total_revenue
FROM orders
GROUP BY customer_id
),
avg_revenue AS (
SELECT AVG(total_revenue) AS avg_rev FROM customer_revenue
)
SELECT c.customer_name, cr.total_revenue
FROM customer_revenue cr
JOIN customers c ON c.customer_id = cr.customer_id
CROSS JOIN avg_revenue a
WHERE cr.total_revenue < a.avg_rev;P9.Rewrite question P8 as pure nested subqueries with no CTE, then explain in one sentence which version you'd rather maintain. hard▶
SELECT c.customer_name, cr.total_revenue
FROM customers c
JOIN (
SELECT customer_id, SUM(total_amount) AS total_revenue
FROM orders
GROUP BY customer_id
) cr ON cr.customer_id = c.customer_id
WHERE cr.total_revenue < (
SELECT AVG(t.total_revenue) FROM (
SELECT SUM(total_amount) AS total_revenue
FROM orders
GROUP BY customer_id
) t
);
-- Note the average subquery duplicates the aggregation subquery entirely -- exactly
-- the readability/duplication cost that CTEs solve.P10.Using CTEs, build the data-quality style report: for every order, show whether it has a valid customer AND whether its line items exist, as two separate boolean columns in one row per order. hard▶
WITH customer_check AS (
SELECT o.order_id, (c.customer_id IS NOT NULL) AS has_valid_customer
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
),
item_check AS (
SELECT o.order_id, EXISTS (
SELECT 1 FROM order_items oi WHERE oi.order_id = o.order_id
) AS has_items
FROM orders o
)
SELECT cc.order_id, cc.has_valid_customer, ic.has_items
FROM customer_check cc
JOIN item_check ic ON ic.order_id = cc.order_id;Before you say NEXT: make sure you can explain, out loud and without looking, the difference between a CTE and a subquery, why "CTE" does not automatically mean "cached" or "faster," what inlining vs materialization means and which one your database defaults to, and why WHERE ROW_NUMBER() OVER (...) = 1 is illegal in the same SELECT. When you're ready, reply NEXT and we'll move to Module 4: Recursive CTEs.