JAlcocerTech E-books

Data Warehousing

Data warehousing is the process of collecting, storing, and managing large volumes of data from various sources to support business intelligence and decision-making.


What is a Data Warehouse?

A data warehouse is a large, centralized repository designed to store and analyze data from multiple sources:

  • Exists on top of several databases
  • Provides a layer for data analytics
  • Works with ETL processes
  • Outputs BI reports and dashboards

Examples: Microsoft Azure Synapse, Google BigQuery, Snowflake, Amazon Redshift


Operational vs Analytical Systems

System TypePurposeFocusExamples
Operational (OLTP)Day-to-day business operationsEfficiency, real-time processingCRM, ERP, order management
Analytical (OLAP)Analysis and decision-makingHistorical context, patternsData warehouses, OLAP cubes

Data Warehouse Layers

DWH typically have three main layers:

1. Staging Layer

  • Temporary storage for raw data from source systems
  • Data cleansing, transformation, deduplication
  • No business logic applied yet

2. Integration Layer

  • Data transformed and stored in consistent format
  • Star or snowflake schema applied
  • Fact and dimension tables created

3. Presentation Layer

  • Business-ready data for reporting and analysis
  • Summary tables and aggregations
  • Data marts for specific departments

DWH Architecture Approaches

Inmon (Top-Down)

  • Build centralized, normalized Enterprise Data Warehouse (EDW)
  • Single source of truth for organization
  • Data marts created from EDW
  • Pros: Consistent, enterprise-wide view
  • Cons: Longer implementation, higher upfront cost

Kimball (Bottom-Up)

  • Build individual data marts based on business processes
  • Each mart uses dimensional model
  • Marts integrated later into larger warehouse
  • Pros: Faster delivery, iterative
  • Cons: Potential inconsistencies between marts

Stand-Alone Data Mart

  • Independent repository for specific business function
  • Faster and less expensive
  • Risk: Data silos and inconsistencies

Dimensional Modeling Concepts

Facts

  • Store quantitative data (measures)
  • Numerical values for analysis
  • Examples: sales amount, quantity, revenue

Dimensions

  • Store descriptive data (context)
  • Provide “who, what, where, when, how”
  • Examples: product, customer, date, geography

Granularity

The level of detail in fact tables:

LevelDescriptionTrade-offs
FineIndividual transactionsMore storage, flexible analysis
CoarseDaily/weekly aggregatesLess storage, limited drill-down

Keys

Key TypeDescription
Surrogate KeySystem-generated unique ID (recommended for DWH)
Natural KeyBusiness identifier from source system

ETL vs ELT

AspectETLELT
ProcessExtract → Transform → LoadExtract → Load → Transform
Where transformation happensETL tool/staging areaTarget data warehouse
Best forOn-premise, structured dataCloud, big data
ScalabilityLimited by ETL serverLeverages DWH compute power
FlexibilitySchema defined upfrontSchema-on-read possible

ETL’s Role in DWH

  • Moves data between staging, integration, and presentation layers
  • Ensures data quality and consistency
  • Manages data lineage and transformation logic

Slowly Changing Dimensions (SCD)

SCDs handle changes to dimension data over time:

TypeStrategyHistoryUse Case
Type 0Never updateOriginal onlyStatic reference data
Type 1OverwriteNo historyWhen history doesn’t matter
Type 2Add new rowFull historyMost common, full audit trail
Type 3Add columnLimited historyCurrent + previous value only

Type 2 Example

When customer address changes:

CustomerKeyCustomerIDAddressStartDateEndDateIsCurrent
1001C123123 Main St2020-01-012023-06-30No
1002C123456 Oak Ave2023-07-019999-12-31Yes

Data Marts

A data mart is a subset of a data warehouse focused on a specific business function:

AspectData WarehouseData Mart
ScopeEnterprise-wideDepartment/function
SizeLargeSmaller
UsersAll analystsSpecific team
FocusComprehensiveTargeted

Examples: Sales data mart, Marketing data mart, Finance data mart


Cloud Data Warehouses

Snowflake

Cloud-based data warehousing platform:

  • Architecture: Multi-cluster shared data (compute and storage separate)
  • Cloud-agnostic: Works on AWS, Azure, GCP
  • Features: Automatic optimization, semi-structured data support (JSON, Parquet)

Key Concepts:

  • Clustering: Physical data organization for query performance
  • Snowpipe: Real-time data ingestion
  • Streams: Capture real-time data changes
  • Table Types: Standard, External, Temporary, Transient

BigQuery

Google’s serverless data warehouse:

  • Architecture: Distributed columnar storage
  • GCP-specific: Native integration with Google Cloud
  • Pricing: Pay-per-query or flat-rate

Comparison

FeatureSnowflakeBigQuery
DeploymentMulti-cloudGCP only
ArchitectureShared dataDistributed columnar
Pricing modelCompute + StorageQuery-based or flat
SeparationFull compute/storageStorage, compute managed

Why Use a Data Warehouse?

BenefitDescription
Better decisionsSingle source of truth for analytics
Data qualityCleaned, consistent, validated data
Historical analysisTrend analysis and forecasting
PerformanceOffload analytics from operational systems
Self-serviceEnable business users to query data

Key Takeaways

  1. DWH sits on top of databases to enable analytics
  2. Three layers: Staging → Integration → Presentation
  3. Kimball (bottom-up) is faster; Inmon (top-down) is more comprehensive
  4. ETL for on-premise, ELT for cloud (generally)
  5. SCDs track dimension changes over time (Type 2 is most common)
  6. Data marts are focused subsets for specific teams