SQL Joins and Aggregations Deep Dive: INNER, LEFT, CROSS, SELF, GROUP BY and HAVING

Joins and aggregations are the core of relational data querying. Most developers know the basics, but interviews and real analytical queries require a deeper understanding: when FULL OUTER JOIN is the right tool, how SELF JOIN works, what ROLLUP adds, and when to use HAVING vs WHERE.

INNER JOIN

Returns only rows that have matching values in both tables:

sql
-- Orders with their customer names
SELECT
    o.id          AS order_id,
    c.name        AS customer_name,
    o.total,
    o.created_at
FROM orders o
INNER JOIN customers c ON c.id = o.customer_id
WHERE o.total > 100
ORDER BY o.created_at DESC;

-- Customers excluded: those with no orders
-- Orders excluded: those with no matching customer (orphaned rows)

LEFT JOIN (LEFT OUTER JOIN)

Returns all rows from the left table, and matching rows from the right. Non-matching right rows become NULL:

sql
-- All customers, whether or not they have orders
SELECT
    c.name,
    COUNT(o.id)     AS order_count,
    SUM(o.total)    AS total_spent,
    MAX(o.created_at) AS last_order_date
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY total_spent DESC NULLS LAST;

-- Find customers who have NEVER placed an order
SELECT c.name, c.email
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.id IS NULL;  -- NULL on the right side = no match = no orders

RIGHT JOIN

Mirror of LEFT JOIN — returns all rows from the right table, matching rows from the left:

sql
-- All orders, even if the customer was deleted
SELECT o.id, o.total, c.name
FROM customers c
RIGHT JOIN orders o ON c.id = o.customer_id;

-- In practice, RIGHT JOIN is rarely used -- you can always rewrite as LEFT JOIN
-- by swapping the table order. LEFT JOIN is more readable.

FULL OUTER JOIN

Returns all rows from both tables. Non-matching rows on either side get NULLs:

sql
-- Find all mismatches: orders without customers AND customers without orders
SELECT
    c.id   AS customer_id,
    c.name AS customer_name,
    o.id   AS order_id,
    o.total
FROM customers c
FULL OUTER JOIN orders o ON c.id = o.customer_id
WHERE c.id IS NULL OR o.id IS NULL;
-- c.id IS NULL: orders with no matching customer (orphaned)
-- o.id IS NULL: customers with no orders

-- Reconcile two lists: find items in list A, list B, or both
SELECT
    a.product_id AS in_warehouse_a,
    b.product_id AS in_warehouse_b,
    COALESCE(a.quantity, 0) + COALESCE(b.quantity, 0) AS total_quantity
FROM warehouse_a a
FULL OUTER JOIN warehouse_b b ON a.product_id = b.product_id;

CROSS JOIN

Returns the Cartesian product — every row in the left table combined with every row in the right:

sql
-- Generate all possible size/color combinations
SELECT
    s.name AS size,
    c.name AS color,
    s.name || '-' || c.name AS sku_suffix
FROM sizes s
CROSS JOIN colors c;
-- 5 sizes × 8 colors = 40 rows

-- Generate a date range (using generate_series in PostgreSQL)
SELECT
    d::date AS report_date,
    EXTRACT(DOW FROM d) AS day_of_week
FROM generate_series(
    '2025-01-01'::date,
    '2025-12-31'::date,
    '1 day'::interval
) d;

-- Fill gaps in time-series data using CROSS JOIN
WITH dates AS (
    SELECT generate_series(
        date_trunc('month', NOW()) - interval '11 months',
        date_trunc('month', NOW()),
        interval '1 month'
    )::date AS month
),
users AS (
    SELECT DISTINCT user_id FROM events
)
SELECT
    u.user_id,
    d.month,
    COUNT(e.id) AS event_count
FROM users u
CROSS JOIN dates d
LEFT JOIN events e
    ON e.user_id = u.user_id
    AND date_trunc('month', e.created_at) = d.month
GROUP BY u.user_id, d.month
ORDER BY u.user_id, d.month;

SELF JOIN

A table joined to itself. Used for hierarchical data and comparing rows within the same table:

sql
-- Employee-manager hierarchy
SELECT
    e.name         AS employee,
    m.name         AS manager,
    e.department
FROM employees e
LEFT JOIN employees m ON m.id = e.manager_id;
-- e is the employee, m is their manager (same table, different alias)

-- Find employees earning more than their manager
SELECT
    e.name         AS employee,
    e.salary       AS employee_salary,
    m.name         AS manager,
    m.salary       AS manager_salary
FROM employees e
JOIN employees m ON m.id = e.manager_id
WHERE e.salary > m.salary;

-- Find duplicate email addresses
SELECT
    a.email,
    a.id AS id_1,
    b.id AS id_2
FROM users a
JOIN users b ON a.email = b.email AND a.id < b.id;
-- a.id < b.id prevents showing (1,2) and (2,1) -- each pair once

GROUP BY Deep Dive

sql
-- Basic aggregation
SELECT
    department,
    COUNT(*)                              AS headcount,
    AVG(salary)                           AS avg_salary,
    MIN(salary)                           AS min_salary,
    MAX(salary)                           AS max_salary,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
    SUM(salary)                           AS total_payroll
FROM employees
GROUP BY department
ORDER BY total_payroll DESC;

-- GROUP BY multiple columns
SELECT
    department,
    EXTRACT(YEAR FROM hire_date) AS hire_year,
    COUNT(*) AS hired
FROM employees
GROUP BY department, hire_year
ORDER BY department, hire_year;

-- Filter groups with HAVING (applied AFTER aggregation)
SELECT
    customer_id,
    COUNT(*)     AS order_count,
    SUM(total)   AS total_spent
FROM orders
WHERE created_at >= NOW() - INTERVAL '1 year'  -- filter rows BEFORE grouping
GROUP BY customer_id
HAVING COUNT(*) >= 5                             -- filter groups AFTER aggregation
   AND SUM(total) > 500
ORDER BY total_spent DESC;

WHERE vs HAVING

sql
-- WHERE filters individual rows before grouping
-- HAVING filters groups after aggregation

-- This is WRONG (cannot use aggregate in WHERE):
SELECT department, COUNT(*)
FROM employees
WHERE COUNT(*) > 10  -- ERROR: aggregate functions not allowed in WHERE
GROUP BY department;

-- Correct: use HAVING
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

-- You CAN use both:
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE salary > 50000          -- filter rows first (only high earners)
GROUP BY department
HAVING COUNT(*) > 5;          -- then filter groups (departments with 5+)

ROLLUP and CUBE

Generate subtotals and grand totals automatically:

sql
-- ROLLUP: subtotals for each level of hierarchy
SELECT
    COALESCE(department, 'All Departments') AS department,
    COALESCE(CAST(EXTRACT(YEAR FROM hire_date) AS TEXT), 'All Years') AS year,
    COUNT(*) AS headcount,
    SUM(salary) AS total_salary
FROM employees
GROUP BY ROLLUP(department, EXTRACT(YEAR FROM hire_date))
ORDER BY department, year;

-- Output includes:
-- (Engineering, 2023): subtotal for Engineering 2023
-- (Engineering, NULL): subtotal for all Engineering
-- (NULL, NULL): grand total for all departments

-- CUBE: all combinations of subtotals
GROUP BY CUBE(department, region, role)
-- Generates subtotals for every combination: all 8 combinations of 3 dimensions

Common Aggregation Patterns

Conditional aggregation (FILTER or CASE)

sql
-- Count by category in one query
SELECT
    COUNT(*) FILTER (WHERE status = 'completed') AS completed,
    COUNT(*) FILTER (WHERE status = 'pending')   AS pending,
    COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled,
    COUNT(*) AS total
FROM orders;

-- Equivalent with CASE WHEN (portable across databases)
SELECT
    COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed,
    COUNT(CASE WHEN status = 'pending'   THEN 1 END) AS pending,
    SUM(CASE WHEN status = 'completed' THEN total ELSE 0 END) AS completed_revenue
FROM orders;

-- Pivot: rows to columns
SELECT
    product_id,
    SUM(CASE WHEN EXTRACT(QUARTER FROM sale_date) = 1 THEN amount ELSE 0 END) AS q1,
    SUM(CASE WHEN EXTRACT(QUARTER FROM sale_date) = 2 THEN amount ELSE 0 END) AS q2,
    SUM(CASE WHEN EXTRACT(QUARTER FROM sale_date) = 3 THEN amount ELSE 0 END) AS q3,
    SUM(CASE WHEN EXTRACT(QUARTER FROM sale_date) = 4 THEN amount ELSE 0 END) AS q4
FROM sales
GROUP BY product_id;

Running totals

sql
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY order_date) AS cumulative_revenue
FROM daily_sales;

Common Interview Questions

Q: What is the difference between WHERE and HAVING?

WHERE filters individual rows before they are grouped — it cannot reference aggregate functions. HAVING filters groups after GROUP BY aggregation — it can reference aggregate functions. Use WHERE to reduce the rows entering aggregation (more efficient), and HAVING to filter the resulting groups.

Q: What does a CROSS JOIN produce and when would you use it?

A CROSS JOIN produces the Cartesian product — every row in the first table paired with every row in the second table. N rows × M rows = N×M results. Use it to generate all combinations (product SKUs from sizes and colors), to fill time-series gaps by joining a date spine with entity lists, or to generate test data.

Q: How does a SELF JOIN work?

A SELF JOIN joins a table to itself by aliasing it with two different names. It is used for hierarchical data (employees and their managers), for finding relationships between rows in the same table (pairs of users in the same city), or for duplicate detection. The join condition typically compares different columns or adds a constraint like a.id < b.id to avoid duplicates.

Practice SQL on Froquiz

Joins and aggregations are core SQL interview topics at every level. Test your SQL knowledge on Froquiz — from basic queries to advanced window functions and CTEs.

Summary

INNER JOIN: only matching rows from both tables
LEFT JOIN: all left rows, matching right rows (NULLs for no match) — use WHERE right.id IS NULL to find unmatched
FULL OUTER JOIN: all rows from both tables — unmatched rows get NULLs on the other side
CROSS JOIN: Cartesian product — every row × every row — use for combinations and date spines
SELF JOIN: table joined to itself with two aliases — use for hierarchies and pair comparisons
WHERE filters rows before grouping; HAVING filters groups after aggregation
ROLLUP generates hierarchical subtotals; CUBE generates all combinations
COUNT(*) FILTER (WHERE condition) is the cleanest conditional aggregation syntax