SQL: A Data Analyst’s Reference
This guide compiles the most recurrent SQL queries you’ll need in your first job as a data analyst. Master these patterns and you’ll make great progress with your analytical skills.
SQL is a programming language used for managing and manipulating data in databases. It’s used for creating, updating, querying, and deleting data—and is an essential tool for data analysis, data migration, and data integration.
Chapter 1: SQL Fundamentals
Before diving into queries, understand the four categories of SQL commands.
DDL - Data Definition Language
A subset of SQL for describing data and its relationships:
| Command | Purpose |
|---|---|
CREATE | Create objects in the database |
ALTER | Alter the structure of the database |
DROP | Delete objects (removes entire table and structure) |
TRUNCATE | Remove data (rows) but preserve the structure |
RENAME | Rename objects |
DML - Data Manipulation Language
Used for adding, deleting, and modifying data:
| Command | Purpose |
|---|---|
SELECT | Retrieve data from the database |
INSERT | Insert data into a table |
UPDATE | Update existing data within a table |
DELETE | Delete records from a table (space for records remains) |
DCL - Data Control Language
Used to control access to data (authorization):
| Command | Purpose |
|---|---|
GRANT | Allow specified users to perform specified tasks |
REVOKE | Cancel previously granted or denied permissions |
TCL - Transaction Control Language
| Command | Purpose |
|---|---|
COMMIT | Permanently save any transaction to the database |
ROLLBACK | Restore the database to last committed state |
Chapter 2: Essential Query Patterns
Views
A view is a virtual table that is the result of a saved SQL SELECT statement. You can query them like a normal table.
CREATE VIEW your_view AS
SELECT * FROM your_table;
Filtering with WHERE
SELECT * FROM Clients
WHERE Name = 'Yosua';
SELECT * FROM Clients
WHERE Name = 'Yosua' AND Nick = 'Cerdo' OR Job = 'Dr.Yosua';
Aggregations with GROUP BY
COUNT(CustomerID)— Include records with values onlyCOUNT(*)— Considers null/missing values too
SELECT Region, COUNT(CustomerID)
FROM Clients
GROUP BY Region
ORDER BY COUNT(CustomerID) DESC;
HAVING vs WHERE
- WHERE filters rows before aggregation
- HAVING filters groups after aggregation
Use both for precise control over your data:
SELECT Region, COUNT(Boquitas), AVG(Cerdeza)
FROM Clients
WHERE Job LIKE 'Dr.%'
GROUP BY Region
HAVING COUNT(Boquitas) > 5
ORDER BY COUNT(Boquitas) DESC;
Unions
UNION combines results of two or more SELECT queries vertically (appending rows), while JOIN combines rows horizontally (adding columns).
Note: Each SELECT within a UNION must have the same number of columns with similar data types, in the same order.
SELECT CustomerName FROM Sales2019
UNION
SELECT CustomerName FROM Sales2020;
UNIONeliminates duplicates (likeSELECT DISTINCT)UNION ALLkeeps all rows (faster, no deduplication)
Joins
Joins combine rows from two or more tables horizontally based on a related column:
| Join Type | Description |
|---|---|
INNER JOIN | Returns matching rows from both tables |
LEFT JOIN | All rows from left table + matching from right |
RIGHT JOIN | All rows from right table + matching from left |
FULL OUTER JOIN | All rows from both tables |
CROSS JOIN | Cartesian product of both tables |
Chapter 3: Advanced SQL Techniques
Window Functions
Window functions operate on a set of rows within a result set (a “window”). They’re more flexible than aggregate functions because they calculate over a specific range of rows.
Key Window Functions: RANK(), ROW_NUMBER(), LEAD(), LAG(), OVER(), PARTITION BY
Rolling Sum with SUM()
SELECT o.occurred_at,
SUM(o.gloss_qty) OVER(ORDER BY o.occurred_at) AS running_gloss_orders
FROM demo.orders o;
Get First Element from Each Group with ROW_NUMBER()
WITH order_ranks AS (
SELECT o.id,
o.account_id,
o.gloss_qty,
o.gloss_amt_usd,
ROW_NUMBER() OVER(PARTITION BY o.account_id
ORDER BY o.occurred_at DESC) AS acct_order_rank
FROM demo.orders o
)
SELECT *
FROM order_ranks
WHERE acct_order_rank = 1;
Look at Previous Rows with LAG()
SELECT o.id,
o.occurred_at,
o.gloss_qty,
LAG(gloss_qty, 1) OVER(ORDER BY o.occurred_at) AS order_lag_1,
LAG(gloss_qty, 2) OVER(ORDER BY o.occurred_at) AS order_lag_2,
LAG(gloss_qty, 3) OVER(ORDER BY o.occurred_at) AS order_lag_3
FROM demo.orders o;
Ranking with RANK()
Get the oldest symbol per GICS sector:
WITH RankedSectors AS (
SELECT
Symbol,
"GICS Sector",
Founded,
RANK() OVER (PARTITION BY "GICS Sector" ORDER BY Founded ASC) AS Rank
FROM df
WHERE Founded IS NOT NULL
)
SELECT Symbol, "GICS Sector", Founded
FROM RankedSectors
WHERE Rank = 1;
Common Table Expressions (CTEs)
A CTE is a temporary result set defined with a WITH clause. It makes complex queries more readable.
WITH CategorySales AS (
SELECT
product_category,
SUM(sales_amount) AS total_sales
FROM sales
GROUP BY product_category
)
SELECT *
FROM CategorySales
WHERE total_sales > 10000;
Benefits:
- Breaks complex queries into manageable parts
- Improves readability
- Can be referenced multiple times in the main query
Subqueries
A subquery is a query nested inside another query:
SELECT employee_name, department, salary
FROM employees
WHERE department = 'Sales'
AND salary > (
SELECT AVG(salary)
FROM employees
WHERE department = 'Sales'
);
CTE vs Subquery: CTEs are named and defined at the start; subqueries are inline. CTEs are often more readable for complex logic.
Chapter 4: SQL Query Optimization
Understanding Indexes
An index is a data structure that stores a sorted copy of selected columns with references to original rows—enabling fast lookups.
Pros:
- Faster data retrieval (reduces disk I/O)
- Speeds up WHERE, ORDER BY, and JOIN operations
- Enforces unique constraints
Cons:
- Consumes additional storage
- Slows down INSERT/UPDATE/DELETE operations
- Requires periodic maintenance
Best Practices for Query Optimization
| Practice | Reason |
|---|---|
Use SELECT <columns> instead of SELECT * | Reduces data transfer |
Use WHERE instead of HAVING when possible | Filters earlier in execution |
| Use indexes on frequently searched columns | Speeds up lookups |
| Avoid functions in WHERE clause | Prevents index usage |
Use UNION ALL instead of UNION when duplicates are OK | Skips expensive DISTINCT |
| Avoid running queries in a loop | Batch operations are faster |
Use EXPLAIN to view query execution plans | Identify bottlenecks |
| Simplify joins | Fewer joins = faster queries |
| Avoid cursors—use set-based operations | Cursors process row-by-row |
UNION vs UNION ALL
- UNION performs
SELECT DISTINCTon results (slower) - UNION ALL keeps all rows including duplicates (faster)
Use
UNION ALLwhen you know there are no duplicates or duplicates don’t matter.
Chapter 5: Data Modeling Concepts for SQL
Normalization vs Denormalization
| Aspect | Normalization | Denormalization |
|---|---|---|
| Goal | Minimize redundancy | Improve read performance |
| Structure | Many related tables | Fewer, wider tables |
| Best for | WRITE operations (OLTP) | READ operations (OLAP) |
| Data integrity | High | Lower (redundancy) |
| Query complexity | More joins needed | Fewer joins |
Long Tables vs Wide Tables
- Long Tables (Normalized): Flexible, smaller file size, more joins for visualization
- Wide Tables (Denormalized): Simpler queries, faster BI performance, data redundancy
Facts and Dimensions (Data Warehousing)
| Concept | Description | Example |
|---|---|---|
| Fact Tables | Store quantitative measures | Sales revenue, quantity sold |
| Dimension Tables | Store descriptive attributes | Customer, Product, Date, Geography |
Star Schema vs Snowflake Schema
| Aspect | Star Schema ⭐ | Snowflake Schema ❄️ |
|---|---|---|
| Dimensions | Denormalized (flat) | Normalized (sub-dimensions) |
| Joins | Fewer | More |
| Query speed | Faster | Slower |
| Storage | More redundancy | Less redundancy |
| Complexity | Simple | Complex |
Star schema is preferred for analytics due to its simpler queries and faster performance.
Primary Keys, Foreign Keys, and Indexes
- Primary Key: Uniquely identifies each row in a table
- Foreign Key: References a primary key in another table (creates relationships)
- Index: Speeds up data retrieval on frequently queried columns
FAQ
SQL Style Guide
Follow a consistent style for readable SQL:
What’s the difference between DELETE, DROP, and TRUNCATE?
| Command | Effect |
|---|---|
DELETE | Removes rows, can use WHERE, space remains |
TRUNCATE | Removes all rows, keeps structure, faster |
DROP | Removes entire table including structure |
When should I use a CTE vs a Subquery?
- CTE: When you need to reference the result multiple times, or for complex multi-step logic
- Subquery: For simple, one-off nested conditions
How do I know if I need an index?
Create indexes on columns that are:
- Frequently used in
WHEREclauses - Used in
JOINconditions - Used in
ORDER BYorGROUP BY
Avoid over-indexing tables with heavy write operations.