Data Warehousing
Data warehousing is the process of collecting, storing, and managing large volumes of data from various sources to support business intelligence and decision-making.
What is a Data Warehouse?
A data warehouse is a large, centralized repository designed to store and analyze data from multiple sources:
- Exists on top of several databases
- Provides a layer for data analytics
- Works with ETL processes
- Outputs BI reports and dashboards
Examples: Microsoft Azure Synapse, Google BigQuery, Snowflake, Amazon Redshift
Operational vs Analytical Systems
| System Type | Purpose | Focus | Examples |
|---|---|---|---|
| Operational (OLTP) | Day-to-day business operations | Efficiency, real-time processing | CRM, ERP, order management |
| Analytical (OLAP) | Analysis and decision-making | Historical context, patterns | Data warehouses, OLAP cubes |
Data Warehouse Layers
DWH typically have three main layers:
1. Staging Layer
- Temporary storage for raw data from source systems
- Data cleansing, transformation, deduplication
- No business logic applied yet
2. Integration Layer
- Data transformed and stored in consistent format
- Star or snowflake schema applied
- Fact and dimension tables created
3. Presentation Layer
- Business-ready data for reporting and analysis
- Summary tables and aggregations
- Data marts for specific departments
DWH Architecture Approaches
Inmon (Top-Down)
- Build centralized, normalized Enterprise Data Warehouse (EDW)
- Single source of truth for organization
- Data marts created from EDW
- Pros: Consistent, enterprise-wide view
- Cons: Longer implementation, higher upfront cost
Kimball (Bottom-Up)
- Build individual data marts based on business processes
- Each mart uses dimensional model
- Marts integrated later into larger warehouse
- Pros: Faster delivery, iterative
- Cons: Potential inconsistencies between marts
Stand-Alone Data Mart
- Independent repository for specific business function
- Faster and less expensive
- Risk: Data silos and inconsistencies
Dimensional Modeling Concepts
Facts
- Store quantitative data (measures)
- Numerical values for analysis
- Examples: sales amount, quantity, revenue
Dimensions
- Store descriptive data (context)
- Provide “who, what, where, when, how”
- Examples: product, customer, date, geography
Granularity
The level of detail in fact tables:
| Level | Description | Trade-offs |
|---|---|---|
| Fine | Individual transactions | More storage, flexible analysis |
| Coarse | Daily/weekly aggregates | Less storage, limited drill-down |
Keys
| Key Type | Description |
|---|---|
| Surrogate Key | System-generated unique ID (recommended for DWH) |
| Natural Key | Business identifier from source system |
ETL vs ELT
| Aspect | ETL | ELT |
|---|---|---|
| Process | Extract → Transform → Load | Extract → Load → Transform |
| Where transformation happens | ETL tool/staging area | Target data warehouse |
| Best for | On-premise, structured data | Cloud, big data |
| Scalability | Limited by ETL server | Leverages DWH compute power |
| Flexibility | Schema defined upfront | Schema-on-read possible |
ETL’s Role in DWH
- Moves data between staging, integration, and presentation layers
- Ensures data quality and consistency
- Manages data lineage and transformation logic
Slowly Changing Dimensions (SCD)
SCDs handle changes to dimension data over time:
| Type | Strategy | History | Use Case |
|---|---|---|---|
| Type 0 | Never update | Original only | Static reference data |
| Type 1 | Overwrite | No history | When history doesn’t matter |
| Type 2 | Add new row | Full history | Most common, full audit trail |
| Type 3 | Add column | Limited history | Current + previous value only |
Type 2 Example
When customer address changes:
| CustomerKey | CustomerID | Address | StartDate | EndDate | IsCurrent |
|---|---|---|---|---|---|
| 1001 | C123 | 123 Main St | 2020-01-01 | 2023-06-30 | No |
| 1002 | C123 | 456 Oak Ave | 2023-07-01 | 9999-12-31 | Yes |
Data Marts
A data mart is a subset of a data warehouse focused on a specific business function:
| Aspect | Data Warehouse | Data Mart |
|---|---|---|
| Scope | Enterprise-wide | Department/function |
| Size | Large | Smaller |
| Users | All analysts | Specific team |
| Focus | Comprehensive | Targeted |
Examples: Sales data mart, Marketing data mart, Finance data mart
Cloud Data Warehouses
Snowflake
Cloud-based data warehousing platform:
- Architecture: Multi-cluster shared data (compute and storage separate)
- Cloud-agnostic: Works on AWS, Azure, GCP
- Features: Automatic optimization, semi-structured data support (JSON, Parquet)
Key Concepts:
- Clustering: Physical data organization for query performance
- Snowpipe: Real-time data ingestion
- Streams: Capture real-time data changes
- Table Types: Standard, External, Temporary, Transient
BigQuery
Google’s serverless data warehouse:
- Architecture: Distributed columnar storage
- GCP-specific: Native integration with Google Cloud
- Pricing: Pay-per-query or flat-rate
Comparison
| Feature | Snowflake | BigQuery |
|---|---|---|
| Deployment | Multi-cloud | GCP only |
| Architecture | Shared data | Distributed columnar |
| Pricing model | Compute + Storage | Query-based or flat |
| Separation | Full compute/storage | Storage, compute managed |
Why Use a Data Warehouse?
| Benefit | Description |
|---|---|
| Better decisions | Single source of truth for analytics |
| Data quality | Cleaned, consistent, validated data |
| Historical analysis | Trend analysis and forecasting |
| Performance | Offload analytics from operational systems |
| Self-service | Enable business users to query data |
Key Takeaways
- DWH sits on top of databases to enable analytics
- Three layers: Staging → Integration → Presentation
- Kimball (bottom-up) is faster; Inmon (top-down) is more comprehensive
- ETL for on-premise, ELT for cloud (generally)
- SCDs track dimension changes over time (Type 2 is most common)
- Data marts are focused subsets for specific teams