Data Warehousing

Data warehousing is the process of collecting, storing, and managing large volumes of data from various sources to support business intelligence and decision-making.

What is a Data Warehouse?

A data warehouse is a large, centralized repository designed to store and analyze data from multiple sources:

Exists on top of several databases
Provides a layer for data analytics
Works with ETL processes
Outputs BI reports and dashboards

Examples: Microsoft Azure Synapse, Google BigQuery, Snowflake, Amazon Redshift

Operational vs Analytical Systems

System Type	Purpose	Focus	Examples
Operational (OLTP)	Day-to-day business operations	Efficiency, real-time processing	CRM, ERP, order management
Analytical (OLAP)	Analysis and decision-making	Historical context, patterns	Data warehouses, OLAP cubes

Data Warehouse Layers

DWH typically have three main layers:

1. Staging Layer

Temporary storage for raw data from source systems
Data cleansing, transformation, deduplication
No business logic applied yet

2. Integration Layer

Data transformed and stored in consistent format
Star or snowflake schema applied
Fact and dimension tables created

3. Presentation Layer

Business-ready data for reporting and analysis
Summary tables and aggregations
Data marts for specific departments

DWH Architecture Approaches

Inmon (Top-Down)

Build centralized, normalized Enterprise Data Warehouse (EDW)
Single source of truth for organization
Data marts created from EDW
Pros: Consistent, enterprise-wide view
Cons: Longer implementation, higher upfront cost

Kimball (Bottom-Up)

Build individual data marts based on business processes
Each mart uses dimensional model
Marts integrated later into larger warehouse
Pros: Faster delivery, iterative
Cons: Potential inconsistencies between marts

Stand-Alone Data Mart

Independent repository for specific business function
Faster and less expensive
Risk: Data silos and inconsistencies

Dimensional Modeling Concepts

Facts

Store quantitative data (measures)
Numerical values for analysis
Examples: sales amount, quantity, revenue

Dimensions

Store descriptive data (context)
Provide “who, what, where, when, how”
Examples: product, customer, date, geography

Granularity

The level of detail in fact tables:

Level	Description	Trade-offs
Fine	Individual transactions	More storage, flexible analysis
Coarse	Daily/weekly aggregates	Less storage, limited drill-down

Keys

Key Type	Description
Surrogate Key	System-generated unique ID (recommended for DWH)
Natural Key	Business identifier from source system

ETL vs ELT

Aspect	ETL	ELT
Process	Extract → Transform → Load	Extract → Load → Transform
Where transformation happens	ETL tool/staging area	Target data warehouse
Best for	On-premise, structured data	Cloud, big data
Scalability	Limited by ETL server	Leverages DWH compute power
Flexibility	Schema defined upfront	Schema-on-read possible

ETL’s Role in DWH

Moves data between staging, integration, and presentation layers
Ensures data quality and consistency
Manages data lineage and transformation logic

Slowly Changing Dimensions (SCD)

SCDs handle changes to dimension data over time:

Type	Strategy	History	Use Case
Type 0	Never update	Original only	Static reference data
Type 1	Overwrite	No history	When history doesn’t matter
Type 2	Add new row	Full history	Most common, full audit trail
Type 3	Add column	Limited history	Current + previous value only

Type 2 Example

When customer address changes:

CustomerKey	CustomerID	Address	StartDate	EndDate	IsCurrent
1001	C123	123 Main St	2020-01-01	2023-06-30	No
1002	C123	456 Oak Ave	2023-07-01	9999-12-31	Yes

Data Marts

A data mart is a subset of a data warehouse focused on a specific business function:

Aspect	Data Warehouse	Data Mart
Scope	Enterprise-wide	Department/function
Size	Large	Smaller
Users	All analysts	Specific team
Focus	Comprehensive	Targeted

Examples: Sales data mart, Marketing data mart, Finance data mart

Cloud Data Warehouses

Snowflake

Cloud-based data warehousing platform:

Architecture: Multi-cluster shared data (compute and storage separate)
Cloud-agnostic: Works on AWS, Azure, GCP
Features: Automatic optimization, semi-structured data support (JSON, Parquet)

Key Concepts:

Clustering: Physical data organization for query performance
Snowpipe: Real-time data ingestion
Streams: Capture real-time data changes
Table Types: Standard, External, Temporary, Transient

BigQuery

Google’s serverless data warehouse:

Architecture: Distributed columnar storage
GCP-specific: Native integration with Google Cloud
Pricing: Pay-per-query or flat-rate

Comparison

Feature	Snowflake	BigQuery
Deployment	Multi-cloud	GCP only
Architecture	Shared data	Distributed columnar
Pricing model	Compute + Storage	Query-based or flat
Separation	Full compute/storage	Storage, compute managed

Why Use a Data Warehouse?

Benefit	Description
Better decisions	Single source of truth for analytics
Data quality	Cleaned, consistent, validated data
Historical analysis	Trend analysis and forecasting
Performance	Offload analytics from operational systems
Self-service	Enable business users to query data

Key Takeaways

DWH sits on top of databases to enable analytics
Three layers: Staging → Integration → Presentation
Kimball (bottom-up) is faster; Inmon (top-down) is more comprehensive
ETL for on-premise, ELT for cloud (generally)
SCDs track dimension changes over time (Type 2 is most common)
Data marts are focused subsets for specific teams