JAlcocerTech E-books

SQL: A Data Analyst’s Reference

This guide compiles the most recurrent SQL queries you’ll need in your first job as a data analyst. Master these patterns and you’ll make great progress with your analytical skills.

SQL is a programming language used for managing and manipulating data in databases. It’s used for creating, updating, querying, and deleting data—and is an essential tool for data analysis, data migration, and data integration.


Chapter 1: SQL Fundamentals

Before diving into queries, understand the four categories of SQL commands.

DDL - Data Definition Language

A subset of SQL for describing data and its relationships:

CommandPurpose
CREATECreate objects in the database
ALTERAlter the structure of the database
DROPDelete objects (removes entire table and structure)
TRUNCATERemove data (rows) but preserve the structure
RENAMERename objects

DML - Data Manipulation Language

Used for adding, deleting, and modifying data:

CommandPurpose
SELECTRetrieve data from the database
INSERTInsert data into a table
UPDATEUpdate existing data within a table
DELETEDelete records from a table (space for records remains)

DCL - Data Control Language

Used to control access to data (authorization):

CommandPurpose
GRANTAllow specified users to perform specified tasks
REVOKECancel previously granted or denied permissions

TCL - Transaction Control Language

CommandPurpose
COMMITPermanently save any transaction to the database
ROLLBACKRestore the database to last committed state

Chapter 2: Essential Query Patterns

Views

A view is a virtual table that is the result of a saved SQL SELECT statement. You can query them like a normal table.

CREATE VIEW your_view AS
SELECT * FROM your_table;

Filtering with WHERE

SELECT * FROM Clients
WHERE Name = 'Yosua';
SELECT * FROM Clients
WHERE Name = 'Yosua' AND Nick = 'Cerdo' OR Job = 'Dr.Yosua';

Aggregations with GROUP BY

  • COUNT(CustomerID) — Include records with values only
  • COUNT(*) — Considers null/missing values too
SELECT Region, COUNT(CustomerID)
FROM Clients
GROUP BY Region
ORDER BY COUNT(CustomerID) DESC;

HAVING vs WHERE

  • WHERE filters rows before aggregation
  • HAVING filters groups after aggregation

Use both for precise control over your data:

SELECT Region, COUNT(Boquitas), AVG(Cerdeza)
FROM Clients
WHERE Job LIKE 'Dr.%'
GROUP BY Region
HAVING COUNT(Boquitas) > 5
ORDER BY COUNT(Boquitas) DESC;

Unions

UNION combines results of two or more SELECT queries vertically (appending rows), while JOIN combines rows horizontally (adding columns).

Note: Each SELECT within a UNION must have the same number of columns with similar data types, in the same order.

SELECT CustomerName FROM Sales2019
UNION
SELECT CustomerName FROM Sales2020;
  • UNION eliminates duplicates (like SELECT DISTINCT)
  • UNION ALL keeps all rows (faster, no deduplication)

Joins

Joins combine rows from two or more tables horizontally based on a related column:

Join TypeDescription
INNER JOINReturns matching rows from both tables
LEFT JOINAll rows from left table + matching from right
RIGHT JOINAll rows from right table + matching from left
FULL OUTER JOINAll rows from both tables
CROSS JOINCartesian product of both tables

Chapter 3: Advanced SQL Techniques

Window Functions

Window functions operate on a set of rows within a result set (a “window”). They’re more flexible than aggregate functions because they calculate over a specific range of rows.

Key Window Functions: RANK(), ROW_NUMBER(), LEAD(), LAG(), OVER(), PARTITION BY

Rolling Sum with SUM()

SELECT o.occurred_at,
       SUM(o.gloss_qty) OVER(ORDER BY o.occurred_at) AS running_gloss_orders
FROM demo.orders o;

Get First Element from Each Group with ROW_NUMBER()

WITH order_ranks AS (
    SELECT o.id,
           o.account_id,
           o.gloss_qty,
           o.gloss_amt_usd,
           ROW_NUMBER() OVER(PARTITION BY o.account_id 
                             ORDER BY o.occurred_at DESC) AS acct_order_rank
    FROM demo.orders o
)
SELECT *
FROM order_ranks
WHERE acct_order_rank = 1;

Look at Previous Rows with LAG()

SELECT o.id,
       o.occurred_at,
       o.gloss_qty,
       LAG(gloss_qty, 1) OVER(ORDER BY o.occurred_at) AS order_lag_1,
       LAG(gloss_qty, 2) OVER(ORDER BY o.occurred_at) AS order_lag_2,
       LAG(gloss_qty, 3) OVER(ORDER BY o.occurred_at) AS order_lag_3
FROM demo.orders o;

Ranking with RANK()

Get the oldest symbol per GICS sector:

WITH RankedSectors AS (
    SELECT 
        Symbol, 
        "GICS Sector", 
        Founded,
        RANK() OVER (PARTITION BY "GICS Sector" ORDER BY Founded ASC) AS Rank
    FROM df
    WHERE Founded IS NOT NULL
)
SELECT Symbol, "GICS Sector", Founded
FROM RankedSectors
WHERE Rank = 1;

Common Table Expressions (CTEs)

A CTE is a temporary result set defined with a WITH clause. It makes complex queries more readable.

WITH CategorySales AS (
    SELECT
        product_category,
        SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY product_category
)
SELECT *
FROM CategorySales
WHERE total_sales > 10000;

Benefits:

  • Breaks complex queries into manageable parts
  • Improves readability
  • Can be referenced multiple times in the main query

Subqueries

A subquery is a query nested inside another query:

SELECT employee_name, department, salary
FROM employees
WHERE department = 'Sales' 
  AND salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = 'Sales'
);

CTE vs Subquery: CTEs are named and defined at the start; subqueries are inline. CTEs are often more readable for complex logic.


Chapter 4: SQL Query Optimization

Understanding Indexes

An index is a data structure that stores a sorted copy of selected columns with references to original rows—enabling fast lookups.

Pros:

  • Faster data retrieval (reduces disk I/O)
  • Speeds up WHERE, ORDER BY, and JOIN operations
  • Enforces unique constraints

Cons:

  • Consumes additional storage
  • Slows down INSERT/UPDATE/DELETE operations
  • Requires periodic maintenance

Best Practices for Query Optimization

PracticeReason
Use SELECT <columns> instead of SELECT *Reduces data transfer
Use WHERE instead of HAVING when possibleFilters earlier in execution
Use indexes on frequently searched columnsSpeeds up lookups
Avoid functions in WHERE clausePrevents index usage
Use UNION ALL instead of UNION when duplicates are OKSkips expensive DISTINCT
Avoid running queries in a loopBatch operations are faster
Use EXPLAIN to view query execution plansIdentify bottlenecks
Simplify joinsFewer joins = faster queries
Avoid cursors—use set-based operationsCursors process row-by-row

UNION vs UNION ALL

  • UNION performs SELECT DISTINCT on results (slower)
  • UNION ALL keeps all rows including duplicates (faster)

Use UNION ALL when you know there are no duplicates or duplicates don’t matter.


Chapter 5: Data Modeling Concepts for SQL

Normalization vs Denormalization

AspectNormalizationDenormalization
GoalMinimize redundancyImprove read performance
StructureMany related tablesFewer, wider tables
Best forWRITE operations (OLTP)READ operations (OLAP)
Data integrityHighLower (redundancy)
Query complexityMore joins neededFewer joins

Long Tables vs Wide Tables

  • Long Tables (Normalized): Flexible, smaller file size, more joins for visualization
  • Wide Tables (Denormalized): Simpler queries, faster BI performance, data redundancy

Facts and Dimensions (Data Warehousing)

ConceptDescriptionExample
Fact TablesStore quantitative measuresSales revenue, quantity sold
Dimension TablesStore descriptive attributesCustomer, Product, Date, Geography

Star Schema vs Snowflake Schema

AspectStar Schema ⭐Snowflake Schema ❄️
DimensionsDenormalized (flat)Normalized (sub-dimensions)
JoinsFewerMore
Query speedFasterSlower
StorageMore redundancyLess redundancy
ComplexitySimpleComplex

Star schema is preferred for analytics due to its simpler queries and faster performance.

Primary Keys, Foreign Keys, and Indexes

  • Primary Key: Uniquely identifies each row in a table
  • Foreign Key: References a primary key in another table (creates relationships)
  • Index: Speeds up data retrieval on frequently queried columns

FAQ

SQL Style Guide

Follow a consistent style for readable SQL:

What’s the difference between DELETE, DROP, and TRUNCATE?

CommandEffect
DELETERemoves rows, can use WHERE, space remains
TRUNCATERemoves all rows, keeps structure, faster
DROPRemoves entire table including structure

When should I use a CTE vs a Subquery?

  • CTE: When you need to reference the result multiple times, or for complex multi-step logic
  • Subquery: For simple, one-off nested conditions

How do I know if I need an index?

Create indexes on columns that are:

  • Frequently used in WHERE clauses
  • Used in JOIN conditions
  • Used in ORDER BY or GROUP BY

Avoid over-indexing tables with heavy write operations.