Module 3 — CTEs (Common Table Expressions)

From "what is WITH" to building multi-stage ETL pipelines out of chained CTEs, and knowing exactly when the optimizer will and won't materialize one for you.
Level 1 · Foundations

Why CTEs exist

In Module 2 you wrote queries with subqueries nested inside WHERE or SELECT. That works, but once you need three or four layers of "first compute this, then compute that from it," nested subqueries turn into an unreadable pyramid — brackets inside brackets inside brackets, read from the inside out.

A CTE (WITH name AS (...)) lets you give a subquery a name and write it before the query that uses it, top to bottom, like a recipe: "first do this, call it X. Then do that using X, call it Y. Finally, use Y here." It's the same execution idea as a subquery — but readable in the order a human thinks.

Analogy A nested subquery is like a recipe written as one giant run-on sentence: "cook the sauce made from simmering the tomatoes that were chopped from the ones you washed after buying them." A CTE is the same recipe written as numbered steps: "1. Wash tomatoes. 2. Chop them. 3. Simmer them into sauce. 4. Cook with sauce." Same result, radically easier to follow — and easier to fix if step 2 is wrong.

What a CTE actually is

A CTE is a named, temporary result set that exists only for the duration of one SQL statement. It is defined with the WITH keyword, given a name, and can then be referenced by that name later in the same statement — in the main query, or even by another CTE defined after it.

When to use it

  • When a query needs multiple logical steps (filter → aggregate → rank → filter again) and nesting subqueries would hurt readability.
  • When the same derived result is needed more than once in a query (referencing a CTE by name twice is cleaner than duplicating a subquery twice — though not always cheaper, see Internals below).
  • When you want a query that reads top-to-bottom like a checklist, so teammates (and future you) can follow the logic without rewinding.
  • When you need recursion — a RECURSIVE CTE is the *only* standard way to walk a hierarchy in plain SQL (full topic in Module 4).

When NOT to use it

  • For a single, simple, one-line filter — a plain subquery or even a plain WHERE clause is shorter and just as clear.
  • When you actually need the result reused across multiple separate statements — that's what a view or a real staging table is for, not a CTE (a CTE dies the moment the statement finishes).
  • Blindly, for performance, assuming "CTE = faster/cached" — in most databases a non-recursive CTE is just a readability tool; the engine is free to inline it back into a subquery. Never assume a CTE is materialized unless you check (see Internals).
🧠
Mental model: a CTE is a label you stick on a subquery so you can talk about it by name instead of repeating it. Whether the database actually computes it once and stores it, or just pastes it back in wherever you use the name (like a macro), depends entirely on the engine and the specific query — you'll learn to tell the difference in Level 3.

Schema & sample data for this module

We reuse employees / departments and customers / orders exactly as in Modules 1–2. We add one new table, order_items, so we have a genuine multi-step pipeline (orders → line items → product-level totals) to build chained CTEs on top of.

DDL (recap + new table)

-- employees / departments: identical to Module 1
-- customers / orders: identical to Module 2 (order 5005 has a dangling customer_id = 999)

CREATE TABLE order_items (
    item_id     INT PRIMARY KEY,
    order_id    INT,             -- FK to orders
    product     VARCHAR(50),
    quantity    INT,
    unit_price  DECIMAL(10,2)
);

INSERT INTO order_items VALUES
(9001, 5001, 'Keyboard',  2,  1200.00),
(9002, 5001, 'Mouse',     1,   100.00),
(9003, 5002, 'Monitor',   1,  1200.00),
(9004, 5003, 'Mouse',     2,   400.00),
(9005, 5004, 'Laptop',    1,  4300.00),
(9006, 5005, 'Keyboard',  1,   650.00);  -- belongs to the orphaned order 5005
order_items (6 rows)
item_idorder_idproductqtyunit_price
90015001Keyboard21200.00
90025001Mouse1100.00
90035002Monitor11200.00
90045003Mouse2400.00
90055004Laptop14300.00
90065005Keyboard1650.00
orders (recap, 5 rows — 5005 orphaned)
order_idcustomer_idtotal_amount
50012012500.00
50022021200.00
5003201800.00
50042034300.00
5005999650.00
NoticeOrder_items row 9001+9002 = 2500.00, matching order 5001's total_amount exactly — a sanity check you'll use in the ETL pipeline below. Order 5005's items reference the orphaned customer again on purpose, so the "clean → aggregate → report" pipeline has to deliberately decide what to do with it, just like real ETL.
Level 1 · Beginner

The basic WITH syntax

What it is: WITH cte_name AS ( SELECT ... ) SELECT ... FROM cte_name. The part inside the parentheses runs conceptually first (or is folded into the outer query — see Internals), and everywhere you write cte_name afterward, the database substitutes that result.

AnalogyThink of it as defining a variable at the top of a script before using it, instead of inlining the whole expression every time you need it.

Easy example

Find employees who earn more than the company-wide average salary — same question as Module 2, rewritten with a CTE.

WITH avg_salary AS (
    SELECT AVG(salary) AS avg_sal
    FROM employees
)
SELECT e.emp_name, e.salary
FROM employees e, avg_salary a
WHERE e.salary > a.avg_sal;

Line by line: avg_salary is defined as a named result containing one row, one column (avg_sal). The main query then treats avg_salary exactly like a real table — here it's cross-joined (comma join) with employees, which is safe because avg_salary only ever has exactly one row, so it doesn't multiply anything.

Cleaner version using an explicit JOIN

WITH avg_salary AS (
    SELECT AVG(salary) AS avg_sal FROM employees
)
SELECT e.emp_name, e.salary, a.avg_sal
FROM employees e
JOIN avg_salary a ON TRUE;   -- always-true join condition since avg_salary has 1 row
TipFor a genuinely scalar, single-value CTE like this, it's often simplest to just keep it a plain subquery (WHERE salary > (SELECT AVG(salary) FROM employees)) — the CTE version above exists mainly to teach the syntax. CTEs start earning their keep once there's more than one step, next section.
Level 1 · Beginner

CTE vs subquery vs view

These three all wrap a query, but they solve different problems. Confusing them is a very common beginner (and interview) mistake.

Comparison
FeatureSubqueryCTEView
LifespanOne statementOne statementPermanent (until dropped)
Named?No (anonymous)YesYes
Reusable across queries?NoNoYes
Can reference itself (recursion)?NoYes (WITH RECURSIVE)No (not directly)
Stored in the database?NoNoYes (definition only, unless materialized)
Best forOne-off inline checkMulti-step readable logic, this query onlyLogic reused by many queries/reports
Common mistakeSaying "I'll just make it a CTE so it's faster next time I query it" — a CTE is not cached between separate statements. Every time you run the outer query, the CTE runs again from scratch. If you truly need a precomputed, reusable result, that's a materialized view or a real staging table, not a CTE.
Level 2 · Intermediate

Multiple independent CTEs

What it is: you can define more than one CTE in the same WITH clause, separated by commas. Each one is its own named result. They don't have to depend on each other — you can just define two unrelated helper queries and use both in the main query.

Medium example

For each department, show headcount and, separately, the department's total order revenue from customers who happen to share the department's location as their country name (a contrived but realistic-shaped "combine two unrelated aggregates" report).

WITH dept_counts AS (
    SELECT dept_id, COUNT(*) AS headcount
    FROM employees
    GROUP BY dept_id
),
customer_revenue AS (
    SELECT c.country, SUM(o.total_amount) AS revenue
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.country
)
SELECT d.dept_name, dc.headcount, cr.revenue
FROM departments d
LEFT JOIN dept_counts dc ON d.dept_id = dc.dept_id
LEFT JOIN customer_revenue cr ON d.location = cr.country;   -- location vs country rarely match in real data; illustrative only

Line by line: dept_counts and customer_revenue are two completely independent CTEs — neither references the other. Both are simply available, by name, to the final SELECT, exactly as if they were two real tables.

TipOrder of independent CTEs in the WITH clause doesn't matter for correctness (unless one references another — next section) — but list them in the order a reader would find most natural, top to bottom.
Level 2 · Intermediate

Chained CTEs

What it is: a CTE defined later in the WITH clause can reference an earlier CTE by name. This is where CTEs really shine — each step builds on the previous one, like a pipeline.

AnalogyAn assembly line: station 1 washes the part, station 2 (which only knows about "the washed part," not the raw part) paints it, station 3 (which only knows about "the painted part") packs it. Each station only needs to understand its own input, not the entire history.

Medium example

Build a per-customer revenue report in three clear steps: (1) compute line-item totals, (2) roll them up to order totals, (3) roll those up to customer totals.

WITH item_totals AS (
    SELECT
        order_id,
        SUM(quantity * unit_price) AS line_total
    FROM order_items
    GROUP BY order_id
),
order_totals AS (
    SELECT
        o.customer_id,
        it.order_id,
        it.line_total
    FROM item_totals it
    JOIN orders o ON o.order_id = it.order_id
),
customer_totals AS (
    SELECT
        customer_id,
        SUM(line_total) AS total_revenue
    FROM order_totals
    GROUP BY customer_id
)
SELECT c.customer_name, ct.total_revenue
FROM customer_totals ct
JOIN customers c ON c.customer_id = ct.customer_id
ORDER BY ct.total_revenue DESC;

Line by line: item_totals only touches order_items. order_totals only references item_totals (not the raw table) plus orders, to attach a customer_id. customer_totals only references order_totals. Each step has exactly one job, and you could test each one individually just by running SELECT * FROM item_totals style checks while building the query — this is the single biggest practical advantage of chaining CTEs over nesting subqueries.

NoticeCustomer 204 (Laura Fischer, no orders) and the orphaned order 5005 (customer_id 999, no matching customer) both silently vanish here, because every join above is an INNER JOIN. This is a classic real-world CTE-pipeline bug — each individual step looks correct, but data quietly leaks out at every join. We fix this explicitly in the ETL Pipeline section below.
Level 3 · Advanced

Modular SQL design

Chained CTEs aren't just a syntax trick — they're a design philosophy borrowed from software engineering: break one big, scary problem into small, named, individually-testable steps.

The discipline

  • One CTE, one responsibility. If a CTE's SELECT is doing two unrelated things (filtering AND aggregating AND ranking all at once), split it into two CTEs.
  • Name CTEs like variables, not like "step1/step2". item_totals, order_totals, customer_totals tell the reader what's inside without opening the query.
  • Test each layer independently while building. Comment out everything after a given CTE, add a SELECT * FROM that_cte LIMIT 20, and eyeball it before building the next layer on top.
  • Keep the final SELECT thin. By the time you reach the last SELECT, most of the hard logic should already be done — the final step should read almost like plain English.
Production tipThis is exactly how real ETL/dbt-style pipelines are structured: staging models → intermediate models → mart models. A long chain of CTEs inside one query is the "single query" version of that same layered philosophy — and in tools like dbt, each CTE often literally becomes its own model file later.
Level 3 · Advanced

How CTEs execute internally

This is the single most-tested "gotcha" concept about CTEs in interviews, and it varies by database engine — so understanding the concept matters more than memorizing one engine's behavior.

Two ways an engine can treat a CTE

  • Inlining (a.k.a. "merging"): the engine treats the CTE as if you'd pasted its SQL text directly into every place you referenced it — like a macro. If you reference the CTE three times, its query effectively runs three times. The optimizer can then push filters from the outer query down into the CTE's logic, which can actually make it faster than a naive materialized version.
  • Materialization: the engine runs the CTE's query exactly once, stores the result in a temporary work area (memory or disk), and every reference afterward just reads from that stored result. Good when the CTE is expensive and referenced multiple times; bad when the CTE is huge and only a tiny filtered slice of it is actually needed by the outer query, because the filter can't be pushed down into it.
🧠
Mental model: a non-recursive CTE is a hint about readability, not a performance guarantee. PostgreSQL (12+) inlines by default unless you reference the CTE more than once or add MATERIALIZED explicitly. Some other engines (older Postgres, SQL Server in some plans) tend to materialize automatically, especially when a CTE is referenced multiple times. Always check with EXPLAIN on your actual engine before assuming.

Why this matters

-- If this CTE scans a 500-million-row fact table...
WITH big_scan AS (
    SELECT * FROM order_items
)
-- ...and you only need 3 rows here, inlining lets the filter
-- push down into big_scan's scan. Materialization would force
-- a full scan of order_items FIRST, then filter afterward.
SELECT * FROM big_scan WHERE order_id = 5001;
Production tipIn PostgreSQL you can force either behavior explicitly: WITH big_scan AS MATERIALIZED (...) or WITH big_scan AS NOT MATERIALIZED (...). Use MATERIALIZED when the CTE is genuinely reused multiple times and is expensive to recompute; use NOT MATERIALIZED when you want the optimizer free to push filters down.
Level 3 · Advanced

Scope, reuse & a recursion preview

  • A CTE is only visible within the single statement it's attached to. You cannot reference it from a completely different query afterward — it's gone the moment that statement finishes.
  • A CTE can be referenced multiple times within the same main query (e.g., once to compute a total, once to compute a rank) — this is exactly the situation where "does it inline or materialize" matters most, since each reference re-runs the logic unless it's materialized.
  • By default, a CTE cannot reference itselfWITH cte AS (SELECT ... FROM cte) is an error, unless you explicitly write WITH RECURSIVE. That's a deliberate safety rail, and the entire subject of Module 4.
  • A later CTE can reference an earlier one (chaining, as above), but an earlier CTE can never reference a later one — the dependency graph only flows downward, top to bottom.
Common mistakeAssuming that because a CTE is defined once, it's computed once, even when referenced three times in the main query. Unless your engine materializes it (or you force MATERIALIZED), it may run three separate times — each potentially expensive. If you genuinely need "compute once, use three times" guaranteed, materialize it explicitly or stage it into a real temp table.
Level 3.5 · Data Engineering

Staged ETL pipelines with CTEs

Let's fix the silent data-loss bug from the Chained CTEs section, the way a real data engineer would: make every stage's data-quality decision explicit and visible, instead of letting INNER JOINs quietly drop rows.

WITH cleaned_orders AS (
    -- Stage 1: flag orphaned orders instead of silently dropping them
    SELECT
        o.order_id,
        o.customer_id,
        CASE WHEN c.customer_id IS NULL THEN TRUE ELSE FALSE END AS is_orphaned
    FROM orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
),
item_totals AS (
    -- Stage 2: line-item rollup, unrelated to the customer-quality question
    SELECT order_id, SUM(quantity * unit_price) AS line_total
    FROM order_items
    GROUP BY order_id
),
enriched_orders AS (
    -- Stage 3: combine stage 1 + stage 2
    SELECT
        co.order_id,
        co.customer_id,
        co.is_orphaned,
        COALESCE(it.line_total, 0) AS line_total
    FROM cleaned_orders co
    LEFT JOIN item_totals it ON it.order_id = co.order_id
)
SELECT
    order_id,
    customer_id,
    line_total,
    CASE WHEN is_orphaned THEN 'REVIEW: missing customer' ELSE 'OK' END AS data_quality_flag
FROM enriched_orders
ORDER BY is_orphaned DESC;

Order 5005 now appears in the output with an explicit flag, instead of disappearing. This is the difference between a pipeline that fails loudly (good) and one that fails silently (dangerous) — the exact same lesson from the LEFT JOIN production example in Module 1, now applied across a multi-stage pipeline.

Production tipIn real warehouses (dbt, Airflow-orchestrated SQL, Spark SQL), this exact pattern — staging → cleaning/flagging → enrichment → final mart — is usually one physical model per CTE stage. Writing it first as chained CTEs inside one query, then "graduating" each CTE into its own model once the logic is proven, is a very common and efficient workflow.
Level 3.5 · Data Engineering

Deduplication with a CTE

A CTE combined with ROW_NUMBER() (full detail in Module 6, previewed here since it's such a common CTE use case) is the standard production pattern for removing duplicate rows.

WITH ranked AS (
    SELECT
        item_id, order_id, product, quantity, unit_price,
        ROW_NUMBER() OVER (
            PARTITION BY order_id, product   -- "duplicate" means same order + same product
            ORDER BY item_id                 -- keep the earliest inserted row
        ) AS rn
    FROM order_items
)
SELECT item_id, order_id, product, quantity, unit_price
FROM ranked
WHERE rn = 1;

Why a CTE here specifically: ROW_NUMBER() cannot be referenced directly in the same SELECT's WHERE clause (window functions are computed after WHERE conceptually) — so you must compute it in one layer (the CTE) and filter it in the next. This is one of the few cases where a CTE isn't just "for readability," it's functionally required by SQL's rules of execution order.

Common mistakeWriting WHERE ROW_NUMBER() OVER (...) = 1 directly — this is a syntax error in every major engine. Window functions can only be filtered on in an outer layer (CTE, subquery, or QUALIFY in engines that support it, like Snowflake/BigQuery).
Level 3.5 · Data Engineering

When CTEs hurt performance

  • Re-computation on every reference. If a CTE is referenced 4 times in the outer query and the engine inlines it, you've effectively written the same expensive subquery 4 times.
  • Blocking predicate pushdown when forced-materialized. If an engine (or an explicit MATERIALIZED hint) computes the full CTE before any outer filter is applied, a CTE that scans a huge table but is only needed for 10 rows can become a full-table-scan bottleneck.
  • Long chains hiding a bad early step. A mistake in CTE #1 of a 6-CTE chain silently propagates through every later stage — always spot-check the earliest CTEs first when a long pipeline gives wrong output.
  • Over-normalizing a simple query into 5 CTEs "for style" — readability is the goal, not CTE count. If two steps are trivially simple and always used together, merging them can be clearer, not worse.
Optimization tipWhen a chained-CTE query is slow, run EXPLAIN (or EXPLAIN ANALYZE) and look specifically for whether each CTE shows as a separate materialized "CTE Scan" node or is fused into the main plan. That single check tells you exactly where to focus tuning effort.
Level 4 · Hard Interview

How interviewers trick you with CTEs

  • They ask you to filter on a window function result directly in the same SELECT ("just add WHERE rn = 1") to see if you know you need a CTE/subquery layer for that.
  • They ask "does using a CTE make this query faster than a subquery?" — the honest, senior-level answer is "usually not by itself; it depends on inlining vs materialization for this specific engine," not a flat yes or no.
  • They give you a query with the same CTE referenced 3 times and ask you to estimate cost — testing whether you assume "defined once = computed once."
  • They ask you to write a CTE that references itself without saying the word "recursive" — watching whether you catch that it needs WITH RECURSIVE (Module 4) rather than plain WITH.
  • They ask you to convert a deeply nested subquery into CTEs live on a whiteboard — really testing whether you can decompose a problem into clean, independent steps under pressure.
Level 4 · Hard Interview

3 Hard Interview Questions

H1.Using CTEs, find the department with the highest average salary — and explain why a naive single CTE isn't quite enough if you also need the number itself for a later comparison. hard
Expected output

One row: dept_name, avg_salary for the single highest-average department (Engineering, since 185000/145000/145000/98000 averages to 143,250).

Hint

Compute per-department averages in one CTE, then pick the max in the outer query — don't try to do both in one step.

Full solution
WITH dept_avg AS (
    SELECT d.dept_name, AVG(e.salary) AS avg_salary
    FROM departments d
    JOIN employees e ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
)
SELECT dept_name, avg_salary
FROM dept_avg
ORDER BY avg_salary DESC
LIMIT 1;
Alternative solution

Using a second CTE to compute the max explicitly, useful when you need to keep comparing against that number later in a longer pipeline:

WITH dept_avg AS (
    SELECT d.dept_name, AVG(e.salary) AS avg_salary
    FROM departments d
    JOIN employees e ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
),
max_avg AS (
    SELECT MAX(avg_salary) AS top_avg FROM dept_avg
)
SELECT da.dept_name, da.avg_salary
FROM dept_avg da
JOIN max_avg ma ON da.avg_salary = ma.top_avg;
Optimized solution

ORDER BY ... LIMIT 1 is simplest and fastest for "just the top one." The two-CTE MAX version is better when there could be a tie for first place and you want ALL of them returned, not just one arbitrary winner — LIMIT 1 silently picks only one row even if two departments tie exactly.

Time complexity

O(n) to compute the join + aggregate, O(d log d) to sort d departments for the LIMIT version; O(d) for the MAX-join version (no sort needed).

Common mistakes
  • Using LIMIT 1 when the real requirement is "all departments tied for first" — always ask if ties matter before picking this shortcut.
  • Forgetting that empty departments (Legal, 0 employees) simply don't appear here at all, since the JOIN drops them — worth calling out explicitly, mirroring the Module 1 lesson.
Interview follow-ups
  • "What if you need the top 3 departments, with ties fully expanded?" (foreshadows RANK/DENSE_RANK in Module 6.)
H2.Build a three-stage CTE pipeline that reports, per customer, their total revenue AND what percentage that customer contributes to total company revenue. hard
Expected output

customer_name, total_revenue, pct_of_total — percentages summing to ~100% across all customers with orders.

Hint

Stage 1: per-customer totals. Stage 2: grand total (a single scalar-like CTE). Final SELECT: divide stage 1 by stage 2.

Full solution
WITH customer_totals AS (
    SELECT c.customer_name, SUM(o.total_amount) AS total_revenue
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_name
),
grand_total AS (
    SELECT SUM(total_revenue) AS company_total FROM customer_totals
)
SELECT
    ct.customer_name,
    ct.total_revenue,
    ROUND(100.0 * ct.total_revenue / gt.company_total, 2) AS pct_of_total
FROM customer_totals ct
CROSS JOIN grand_total gt
ORDER BY pct_of_total DESC;
Common mistakes
  • Computing grand_total straight from orders instead of from customer_totals — both give the same number here (since every order has a valid customer... except 5005), but building it FROM the earlier CTE keeps the pipeline consistent and is safer if a filter is later added to customer_totals that should also apply to the grand total.
  • Integer division: writing 100 * total_revenue / company_total without the .0 can silently truncate to 0 in some engines if both values are integer types — always force a decimal/float context explicitly.
Interview follow-ups
  • "Now do the same thing but per department instead of per customer, reusing the same pattern." (Tests whether they understood the pattern or just memorized the query.)
H3.You inherit a query with 6 chained CTEs and it returns the wrong result. Describe your debugging process, then simulate it on a 3-CTE version that has a bug. hard
Expected output

A demonstrated debugging methodology, not just a fixed query.

Hint

Debug top-down: materialize and inspect the earliest CTE first with a standalone SELECT * FROM cte1 LIMIT 20, confirm it, then move to the next.

Buggy example
WITH item_totals AS (
    SELECT order_id, SUM(quantity * unit_price) AS line_total
    FROM order_items
    GROUP BY order_id
),
order_with_customer AS (
    SELECT it.order_id, o.customer_id, it.line_total
    FROM item_totals it
    JOIN orders o ON o.order_id = it.order_id   -- BUG lives one layer up, but this looks fine in isolation
),
customer_totals AS (
    SELECT customer_id, SUM(line_total) AS total_revenue
    FROM order_with_customer
    GROUP BY customer_id
)
SELECT c.customer_name, ct.total_revenue
FROM customer_totals ct
JOIN customers c ON c.customer_id = ct.customer_id;
-- Bug: order 5002 (Grace Lin) has no rows in order_items? No — check again: it does (9003).
-- The real bug: order_items has NO row for some future order_id that never got line items yet
-- (e.g. a new order 5006 with $900 total but items not loaded). item_totals silently drops it
-- because GROUP BY only emits rows for order_ids that exist in order_items, and item_totals
-- is INNER JOINed away entirely -- the order's own known total_amount is never used as a fallback.
Debugging process
  1. Run SELECT * FROM item_totals alone — compare row count against SELECT COUNT(DISTINCT order_id) FROM orders. If they don't match, the bug is already at stage 1.
  2. Confirm: any order_id in orders missing from order_items vanishes right here, before later stages even run.
  3. Fix at the source: use o.total_amount as a documented fallback, or LEFT JOIN item_totals onto orders instead of the reverse, so the base table (source of truth for "an order exists") drives the pipeline.
Fixed version
WITH item_totals AS (
    SELECT order_id, SUM(quantity * unit_price) AS line_total
    FROM order_items
    GROUP BY order_id
),
order_with_customer AS (
    SELECT
        o.order_id,
        o.customer_id,
        COALESCE(it.line_total, o.total_amount) AS line_total  -- fallback to header total
    FROM orders o
    LEFT JOIN item_totals it ON it.order_id = o.order_id
)
SELECT customer_id, SUM(line_total) AS total_revenue
FROM order_with_customer
GROUP BY customer_id;
Interview follow-ups
  • "How would you add an automated data-quality check that catches this class of bug before it reaches a dashboard?" (row-count reconciliation between stages, alerting on drift.)
Level 5 · FAANG

2 FAANG-Level Questions

F1.Build a modular "top N per group" report — the single highest-value order per customer — using only CTEs (no window functions, since that's Module 6). faang
Problem statement

This tests whether you can replicate a "top N per group" pattern with only CTEs + aggregation, and whether you understand why this approach doesn't scale to "top 3 per group" cleanly (motivating window functions later).

Hint

Stage 1: max order value per customer. Stage 2: join back to orders to find the row(s) that match that max.

Full solution
WITH max_per_customer AS (
    SELECT customer_id, MAX(total_amount) AS max_amount
    FROM orders
    GROUP BY customer_id
)
SELECT o.customer_id, o.order_id, o.total_amount
FROM orders o
JOIN max_per_customer m
    ON o.customer_id = m.customer_id
   AND o.total_amount = m.max_amount;

Rahul Bansal (201) has two orders, 2500 and 800 — only 2500 (order 5001) survives.

Optimized solution

This "join back to the max" pattern is fine for top-1, but extending it to top-3 per group requires a completely different technique (ranking + filtering), because there's no clean way to say "the 3rd highest" using only MAX/GROUP BY. ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total_amount DESC) inside a CTE, filtered to rn <= 3, is the production-grade generalization — Module 6.

Time complexity

O(n) to aggregate + O(n) to join back with an index on (customer_id, total_amount); without that index, the join-back is effectively O(n²).

Common mistakes
  • If two orders from the same customer tie exactly on total_amount, this pattern returns BOTH — which is sometimes desired (all-time-highs), sometimes a bug (expecting exactly one row per customer). Always clarify tie-handling requirements explicitly with the interviewer.
Interview follow-ups
  • "Now do top-2 per customer." (Correct answer: "I'd switch to ROW_NUMBER/RANK — this MAX-based pattern doesn't generalize.")
F2.Design a CTE-based query that validates data quality across the whole orders pipeline in one shot: orphaned orders, orders with no line items, and orders where the line-item sum doesn't match the header total. faang
Problem statement

Real warehouse teams run exactly this kind of reconciliation query nightly. It tests whether you can compose several independent quality checks into one modular, extensible report.

Hint

One CTE per quality dimension, each producing a boolean flag, then UNION or combine flags in a final CTE so every order gets one row with all flags visible together.

Full solution
WITH orphan_check AS (
    SELECT o.order_id, (c.customer_id IS NULL) AS is_orphaned
    FROM orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
),
item_check AS (
    SELECT o.order_id, COUNT(oi.item_id) AS item_count, COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS computed_total
    FROM orders o
    LEFT JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY o.order_id
),
combined AS (
    SELECT
        o.order_id,
        oc.is_orphaned,
        ic.item_count = 0 AS has_no_items,
        ABS(ic.computed_total - o.total_amount) > 0.01 AS total_mismatch
    FROM orders o
    JOIN orphan_check oc ON oc.order_id = o.order_id
    JOIN item_check ic ON ic.order_id = o.order_id
)
SELECT *
FROM combined
WHERE is_orphaned OR has_no_items OR total_mismatch;

Order 5005 flags is_orphaned = true. In this dataset every order's items sum matches its header total exactly, so total_mismatch stays false for all — a realistic reminder that most rows in a QA report should be clean; the checks exist for the rare bad ones.

Optimized solution

Each check-CTE here scans orders once independently; at huge scale, computing all checks in a single pass over a pre-joined base CTE (join orders/customers/order_items once, then derive every flag from that one result) avoids redundant scans of the same base tables.

Time complexity

O(n) per check with proper indexes on the join keys; the naive multi-scan version is O(k·n) for k checks, the single-pass version is O(n).

Common mistakes
  • Comparing floating point totals with = instead of an epsilon tolerance (ABS(diff) > 0.01) — exact equality on computed decimals is fragile due to rounding.
  • Using INNER JOIN between combined's sources when an order genuinely has zero items — must be LEFT JOIN or the "has no items" case disappears instead of being flagged.
Interview follow-ups
  • "How would you turn this into an automated daily alert?" (Schedule it, write the flagged row count to a monitoring table, alert on non-zero.)
Level 5 · FAANG

1 Production Scenario

A dbt-style layered mart built from chained CTEs before it ever becomes separate models

You're asked for a "revenue by customer, with data-quality confidence" report for a dashboard. Rather than write one giant query, a senior data engineer sketches it first as a chain of CTEs, in this order, matching a typical staging → intermediate → mart layering:

WITH stg_orders AS (            -- staging: light cleanup only, one-to-one with source
    SELECT order_id, customer_id, total_amount
    FROM orders
),
stg_order_items AS (
    SELECT order_id, quantity, unit_price
    FROM order_items
),
int_order_totals AS (           -- intermediate: business logic, still order-grain
    SELECT
        o.order_id,
        o.customer_id,
        COALESCE(SUM(oi.quantity * oi.unit_price), o.total_amount) AS revenue,
        (SELECT COUNT(*) FROM stg_order_items x WHERE x.order_id = o.order_id) = 0 AS missing_items
    FROM stg_orders o
    LEFT JOIN stg_order_items oi ON TRUE   -- illustrative; real version joins on order_id
    GROUP BY o.order_id, o.customer_id, o.total_amount
),
mart_customer_revenue AS (      -- mart: final grain the dashboard actually queries
    SELECT customer_id, SUM(revenue) AS total_revenue, BOOL_OR(missing_items) AS has_data_gaps
    FROM int_order_totals
    GROUP BY customer_id
)
SELECT * FROM mart_customer_revenue;

Once this chain is proven correct (tested layer by layer, exactly as taught above), each CTE typically graduates into its own dbt model file — stg_orders.sql, int_order_totals.sql, mart_customer_revenue.sql — so each layer can be tested, scheduled, and reused independently across many downstream reports, not just this one query.

Production tipThis "prototype as one query with chained CTEs, then split into models once proven" workflow is extremely common in real data engineering teams — it lets you iterate fast in a SQL editor before committing to the overhead of separate pipeline files.
Practice

10 Practice Questions

Write the SQL yourself before expanding each answer. Use the employees / departments / customers / orders / order_items schema above.

P1.Using a CTE, list all employees earning more than the company-wide average salary. easy
WITH avg_sal AS (
    SELECT AVG(salary) AS avg_salary FROM employees
)
SELECT e.emp_name, e.salary
FROM employees e, avg_sal a
WHERE e.salary > a.avg_salary;
P2.Using a CTE, find the total order revenue per customer. easy
WITH revenue AS (
    SELECT customer_id, SUM(total_amount) AS total_revenue
    FROM orders
    GROUP BY customer_id
)
SELECT c.customer_name, r.total_revenue
FROM customers c
JOIN revenue r ON r.customer_id = c.customer_id;
P3.Using two independent CTEs, show department headcounts alongside the total number of orders in the whole system (a single repeated number next to every department). easy
WITH dept_counts AS (
    SELECT dept_id, COUNT(*) AS headcount FROM employees GROUP BY dept_id
),
order_count AS (
    SELECT COUNT(*) AS total_orders FROM orders
)
SELECT d.dept_name, dc.headcount, oc.total_orders
FROM departments d
LEFT JOIN dept_counts dc ON dc.dept_id = d.dept_id
CROSS JOIN order_count oc;
P4.Chain two CTEs: first compute each order's line-item total, then find orders where that total differs from the order's stored total_amount. medium
WITH item_totals AS (
    SELECT order_id, SUM(quantity * unit_price) AS line_total
    FROM order_items
    GROUP BY order_id
)
SELECT o.order_id, o.total_amount, it.line_total
FROM orders o
JOIN item_totals it ON it.order_id = o.order_id
WHERE o.total_amount <> it.line_total;
-- In our clean dataset this returns 0 rows -- confirming data integrity.
P5.Using a CTE, list every department together with its average salary, including departments with zero employees (AVG should show NULL, not error). medium
WITH dept_avg AS (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
)
SELECT d.dept_name, da.avg_salary
FROM departments d
LEFT JOIN dept_avg da ON da.dept_id = d.dept_id;
P6.Using a CTE + ROW_NUMBER, find the single most expensive line item in each order. medium
WITH ranked AS (
    SELECT
        order_id, product, quantity, unit_price,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY unit_price DESC) AS rn
    FROM order_items
)
SELECT order_id, product, unit_price
FROM ranked
WHERE rn = 1;
P7.Using a three-stage chained CTE, report each customer's revenue and rank them from highest to lowest spender (rank computed with a self-join style count, no window functions yet). medium
WITH customer_revenue AS (
    SELECT c.customer_id, c.customer_name, SUM(o.total_amount) AS total_revenue
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.customer_name
),
ranked AS (
    SELECT
        cr1.customer_name,
        cr1.total_revenue,
        (SELECT COUNT(*) FROM customer_revenue cr2 WHERE cr2.total_revenue > cr1.total_revenue) + 1 AS spend_rank
    FROM customer_revenue cr1
)
SELECT * FROM ranked ORDER BY spend_rank;
P8.Using a CTE, find customers who have placed orders but whose total revenue is below the average revenue across all customers who have ordered. hard
WITH customer_revenue AS (
    SELECT customer_id, SUM(total_amount) AS total_revenue
    FROM orders
    GROUP BY customer_id
),
avg_revenue AS (
    SELECT AVG(total_revenue) AS avg_rev FROM customer_revenue
)
SELECT c.customer_name, cr.total_revenue
FROM customer_revenue cr
JOIN customers c ON c.customer_id = cr.customer_id
CROSS JOIN avg_revenue a
WHERE cr.total_revenue < a.avg_rev;
P9.Rewrite question P8 as pure nested subqueries with no CTE, then explain in one sentence which version you'd rather maintain. hard
SELECT c.customer_name, cr.total_revenue
FROM customers c
JOIN (
    SELECT customer_id, SUM(total_amount) AS total_revenue
    FROM orders
    GROUP BY customer_id
) cr ON cr.customer_id = c.customer_id
WHERE cr.total_revenue < (
    SELECT AVG(t.total_revenue) FROM (
        SELECT SUM(total_amount) AS total_revenue
        FROM orders
        GROUP BY customer_id
    ) t
);
-- Note the average subquery duplicates the aggregation subquery entirely -- exactly
-- the readability/duplication cost that CTEs solve.
P10.Using CTEs, build the data-quality style report: for every order, show whether it has a valid customer AND whether its line items exist, as two separate boolean columns in one row per order. hard
WITH customer_check AS (
    SELECT o.order_id, (c.customer_id IS NOT NULL) AS has_valid_customer
    FROM orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
),
item_check AS (
    SELECT o.order_id, EXISTS (
        SELECT 1 FROM order_items oi WHERE oi.order_id = o.order_id
    ) AS has_items
    FROM orders o
)
SELECT cc.order_id, cc.has_valid_customer, ic.has_items
FROM customer_check cc
JOIN item_check ic ON ic.order_id = cc.order_id;

Before you say NEXT: make sure you can explain, out loud and without looking, the difference between a CTE and a subquery, why "CTE" does not automatically mean "cached" or "faster," what inlining vs materialization means and which one your database defaults to, and why WHERE ROW_NUMBER() OVER (...) = 1 is illegal in the same SELECT. When you're ready, reply NEXT and we'll move to Module 4: Recursive CTEs.