SQL: A Data Analyst’s Reference

This guide compiles the most recurrent SQL queries you’ll need in your first job as a data analyst. Master these patterns and you’ll make great progress with your analytical skills.

SQL is a programming language used for managing and manipulating data in databases. It’s used for creating, updating, querying, and deleting data—and is an essential tool for data analysis, data migration, and data integration.

Chapter 1: SQL Fundamentals

Before diving into queries, understand the four categories of SQL commands.

DDL - Data Definition Language

A subset of SQL for describing data and its relationships:

Command	Purpose
`CREATE`	Create objects in the database
`ALTER`	Alter the structure of the database
`DROP`	Delete objects (removes entire table and structure)
`TRUNCATE`	Remove data (rows) but preserve the structure
`RENAME`	Rename objects

DML - Data Manipulation Language

Used for adding, deleting, and modifying data:

Command	Purpose
`SELECT`	Retrieve data from the database
`INSERT`	Insert data into a table
`UPDATE`	Update existing data within a table
`DELETE`	Delete records from a table (space for records remains)

DCL - Data Control Language

Used to control access to data (authorization):

Command	Purpose
`GRANT`	Allow specified users to perform specified tasks
`REVOKE`	Cancel previously granted or denied permissions

TCL - Transaction Control Language

Command	Purpose
`COMMIT`	Permanently save any transaction to the database
`ROLLBACK`	Restore the database to last committed state

Chapter 2: Essential Query Patterns

Views

A view is a virtual table that is the result of a saved SQL SELECT statement. You can query them like a normal table.

CREATE VIEW your_view AS
SELECT * FROM your_table;

Filtering with WHERE

SELECT * FROM Clients
WHERE Name = 'Yosua';

SELECT * FROM Clients
WHERE Name = 'Yosua' AND Nick = 'Cerdo' OR Job = 'Dr.Yosua';

Aggregations with GROUP BY

COUNT(CustomerID) — Include records with values only
COUNT(*) — Considers null/missing values too

SELECT Region, COUNT(CustomerID)
FROM Clients
GROUP BY Region
ORDER BY COUNT(CustomerID) DESC;

HAVING vs WHERE

WHERE filters rows before aggregation
HAVING filters groups after aggregation

Use both for precise control over your data:

SELECT Region, COUNT(Boquitas), AVG(Cerdeza)
FROM Clients
WHERE Job LIKE 'Dr.%'
GROUP BY Region
HAVING COUNT(Boquitas) > 5
ORDER BY COUNT(Boquitas) DESC;

Unions

UNION combines results of two or more SELECT queries vertically (appending rows), while JOIN combines rows horizontally (adding columns).

Note: Each SELECT within a UNION must have the same number of columns with similar data types, in the same order.

SELECT CustomerName FROM Sales2019
UNION
SELECT CustomerName FROM Sales2020;

UNION eliminates duplicates (like SELECT DISTINCT)
UNION ALL keeps all rows (faster, no deduplication)

Joins

Joins combine rows from two or more tables horizontally based on a related column:

Join Type	Description
`INNER JOIN`	Returns matching rows from both tables
`LEFT JOIN`	All rows from left table + matching from right
`RIGHT JOIN`	All rows from right table + matching from left
`FULL OUTER JOIN`	All rows from both tables
`CROSS JOIN`	Cartesian product of both tables

Chapter 3: Advanced SQL Techniques

Window Functions

Window functions operate on a set of rows within a result set (a “window”). They’re more flexible than aggregate functions because they calculate over a specific range of rows.

Key Window Functions: RANK(), ROW_NUMBER(), LEAD(), LAG(), OVER(), PARTITION BY

Rolling Sum with SUM()

SELECT o.occurred_at,
       SUM(o.gloss_qty) OVER(ORDER BY o.occurred_at) AS running_gloss_orders
FROM demo.orders o;

Get First Element from Each Group with ROW_NUMBER()

WITH order_ranks AS (
    SELECT o.id,
           o.account_id,
           o.gloss_qty,
           o.gloss_amt_usd,
           ROW_NUMBER() OVER(PARTITION BY o.account_id 
                             ORDER BY o.occurred_at DESC) AS acct_order_rank
    FROM demo.orders o
)
SELECT *
FROM order_ranks
WHERE acct_order_rank = 1;

Look at Previous Rows with LAG()

SELECT o.id,
       o.occurred_at,
       o.gloss_qty,
       LAG(gloss_qty, 1) OVER(ORDER BY o.occurred_at) AS order_lag_1,
       LAG(gloss_qty, 2) OVER(ORDER BY o.occurred_at) AS order_lag_2,
       LAG(gloss_qty, 3) OVER(ORDER BY o.occurred_at) AS order_lag_3
FROM demo.orders o;

Ranking with RANK()

Get the oldest symbol per GICS sector:

WITH RankedSectors AS (
    SELECT 
        Symbol, 
        "GICS Sector", 
        Founded,
        RANK() OVER (PARTITION BY "GICS Sector" ORDER BY Founded ASC) AS Rank
    FROM df
    WHERE Founded IS NOT NULL
)
SELECT Symbol, "GICS Sector", Founded
FROM RankedSectors
WHERE Rank = 1;

Common Table Expressions (CTEs)

A CTE is a temporary result set defined with a WITH clause. It makes complex queries more readable.

WITH CategorySales AS (
    SELECT
        product_category,
        SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY product_category
)
SELECT *
FROM CategorySales
WHERE total_sales > 10000;

Benefits:

Breaks complex queries into manageable parts
Improves readability
Can be referenced multiple times in the main query

Subqueries

A subquery is a query nested inside another query:

SELECT employee_name, department, salary
FROM employees
WHERE department = 'Sales' 
  AND salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = 'Sales'
);

CTE vs Subquery: CTEs are named and defined at the start; subqueries are inline. CTEs are often more readable for complex logic.

Chapter 4: SQL Query Optimization

Understanding Indexes

An index is a data structure that stores a sorted copy of selected columns with references to original rows—enabling fast lookups.

Pros:

Faster data retrieval (reduces disk I/O)
Speeds up WHERE, ORDER BY, and JOIN operations
Enforces unique constraints

Cons:

Consumes additional storage
Slows down INSERT/UPDATE/DELETE operations
Requires periodic maintenance

Best Practices for Query Optimization

Practice	Reason
Use `SELECT <columns>` instead of `SELECT *`	Reduces data transfer
Use `WHERE` instead of `HAVING` when possible	Filters earlier in execution
Use indexes on frequently searched columns	Speeds up lookups
Avoid functions in WHERE clause	Prevents index usage
Use `UNION ALL` instead of `UNION` when duplicates are OK	Skips expensive DISTINCT
Avoid running queries in a loop	Batch operations are faster
Use `EXPLAIN` to view query execution plans	Identify bottlenecks
Simplify joins	Fewer joins = faster queries
Avoid cursors—use set-based operations	Cursors process row-by-row

UNION vs UNION ALL

UNION performs SELECT DISTINCT on results (slower)
UNION ALL keeps all rows including duplicates (faster)

Use UNION ALL when you know there are no duplicates or duplicates don’t matter.

Chapter 5: Data Modeling Concepts for SQL

Normalization vs Denormalization

Aspect	Normalization	Denormalization
Goal	Minimize redundancy	Improve read performance
Structure	Many related tables	Fewer, wider tables
Best for	WRITE operations (OLTP)	READ operations (OLAP)
Data integrity	High	Lower (redundancy)
Query complexity	More joins needed	Fewer joins

Long Tables vs Wide Tables

Long Tables (Normalized): Flexible, smaller file size, more joins for visualization
Wide Tables (Denormalized): Simpler queries, faster BI performance, data redundancy

Facts and Dimensions (Data Warehousing)

Concept	Description	Example
Fact Tables	Store quantitative measures	Sales revenue, quantity sold
Dimension Tables	Store descriptive attributes	Customer, Product, Date, Geography

Star Schema vs Snowflake Schema

Aspect	Star Schema ⭐	Snowflake Schema ❄️
Dimensions	Denormalized (flat)	Normalized (sub-dimensions)
Joins	Fewer	More
Query speed	Faster	Slower
Storage	More redundancy	Less redundancy
Complexity	Simple	Complex

Star schema is preferred for analytics due to its simpler queries and faster performance.

Primary Keys, Foreign Keys, and Indexes

Primary Key: Uniquely identifies each row in a table
Foreign Key: References a primary key in another table (creates relationships)
Index: Speeds up data retrieval on frequently queried columns

FAQ

SQL Style Guide

Follow a consistent style for readable SQL:

sqlstyle.guide

What’s the difference between DELETE, DROP, and TRUNCATE?

Command	Effect
`DELETE`	Removes rows, can use WHERE, space remains
`TRUNCATE`	Removes all rows, keeps structure, faster
`DROP`	Removes entire table including structure

When should I use a CTE vs a Subquery?

CTE: When you need to reference the result multiple times, or for complex multi-step logic
Subquery: For simple, one-off nested conditions

How do I know if I need an index?

Create indexes on columns that are:

Frequently used in WHERE clauses
Used in JOIN conditions
Used in ORDER BY or GROUP BY

Avoid over-indexing tables with heavy write operations.