JAlcocerTech E-books

Data Lakes & Lakehouses

Modern data architectures have evolved beyond traditional data warehouses to handle diverse data types at massive scale.


Data Lakes

A data lake is a storage repository that holds vast amounts of raw data (structured, semi-structured, and unstructured) in its native format.

Key Characteristics

| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data format | Raw, any format | Structured, predefined schema |
| Schema | Schema-on-read | Schema-on-write |
| Data types | Structured + semi-structured + unstructured | Primarily structured |
| Processing | Process when needed | Process before loading |
| Use cases | Data science, ML, exploration | BI, reporting |
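
The difference between schema-on-read and schema-on-write can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample events, the SCHEMA mapping, and the helper names (conform, load_schema_on_write, read_schema_on_read) are all hypothetical and do not belong to any particular product.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
RAW_LINES = [
    '{"user": "ana", "amount": "12.50", "country": "ES"}',
    '{"user": "bob", "amount": 7}',
    '{"user": "eve", "amount": "n/a", "note": "refund pending"}',
]

# A schema expressed as column name -> converter function (an assumption of this sketch).
SCHEMA = {"user": str, "amount": float}


def conform(record, schema):
    """Cast a record to the schema; raise ValueError if it does not fit."""
    try:
        return {col: cast(record[col]) for col, cast in schema.items()}
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"record does not match schema: {record}") from exc


def load_schema_on_write(lines, schema):
    """Warehouse style: validate before storing, so bad rows never land in the table."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            table.append(conform(record, schema))
        except ValueError:
            rejected.append(record)
    return table, rejected


def read_schema_on_read(lines, schema):
    """Lake style: everything was stored raw; each reader applies its own schema."""
    rows = []
    for line in lines:
        try:
            rows.append(conform(json.loads(line), schema))
        except ValueError:
            continue  # this reader skips rows it cannot interpret
    return rows


table, rejected = load_schema_on_write(RAW_LINES, SCHEMA)
amounts = read_schema_on_read(RAW_LINES, SCHEMA)
# Because the lake kept the raw lines, a later reader can ask a different question.
notes = read_schema_on_read(RAW_LINES, {"user": str, "note": str})
```

With schema-on-write, the "note" field of the rejected record would have been lost at load time. With schema-on-read, it is still available to any reader that asks for it, at the price of every reader having to deal with malformed rows.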

Data Lake Storage

Cloud Object Storage:

  • Amazon S3
  • Azure Data Lake Storage
  • Google Cloud Storage

On-Premise:

  • Hadoop Distributed File System (HDFS)

Pros and Cons

| Pros | Cons |
|---|---|
| Store any data type | Can become a “data swamp” without governance |
| Cost-effective storage | Harder to query without preparation |
| Flexible for data science | No transactional guarantees |
| Scales horizontally | Schema management challenges |

Data Lakehouse

A data lakehouse combines the low-cost, flexible storage of a data lake with the transactional reliability and schema management of a data warehouse:

Data Lakehouse = Data Lake flexibility + Data Warehouse structure

Key Benefits

| Feature | Description |
|---|---|
| Unified platform | One platform for BI and data science |
| ACID transactions | Reliability of data warehouses on lake storage |
| Schema enforcement | Structure when you need it |
| Open formats | Parquet, Delta, Iceberg |
| Cost efficiency | Object storage is cheaper than DWH storage |

How It Works

  1. Storage Layer: Object storage (S3, ADLS, GCS)
  2. Table Format Layer: Delta Lake, Apache Iceberg, or Apache Hudi
  3. Query Engine: Spark, Trino, Databricks SQL
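
To make the three layers concrete, here is a toy sketch in plain Python. A temporary directory stands in for object storage, an append-only commit log stands in for the table format layer, and a read() method plays the role of the query engine. The ToyTable class, its file naming, and its log layout are assumptions made for illustration; this is not the actual Delta Lake, Iceberg, or Hudi protocol.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """Toy table format: data files plus an ordered commit log (illustrative only)."""

    def __init__(self, root):
        self.root = Path(root)              # storage layer: a plain directory
        self.log_dir = self.root / "_log"   # table format layer: the commit log
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def latest_version(self):
        versions = [int(p.stem) for p in self.log_dir.glob("*.json")]
        return max(versions, default=-1)

    def commit(self, rows, mode="append"):
        """Write a data file, then publish it by creating the next log entry."""
        version = self.latest_version() + 1
        data_file = f"part-{version:05d}.json"
        (self.root / data_file).write_text(json.dumps(rows))
        entry = {"mode": mode, "add": data_file}
        # Exclusive create ("x") fails if another writer already claimed this version.
        with open(self.log_dir / f"{version:05d}.json", "x") as fh:
            json.dump(entry, fh)
        return version

    def read(self, version=None):
        """Query layer: replay the log up to `version` to find the table's files."""
        if version is None:
            version = self.latest_version()
        files = []
        for v in range(version + 1):
            entry = json.loads((self.log_dir / f"{v:05d}.json").read_text())
            if entry["mode"] == "overwrite":
                files = []  # earlier files are no longer part of the table
            files.append(entry["add"])
        rows = []
        for name in files:
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


with tempfile.TemporaryDirectory() as tmp:
    table = ToyTable(tmp)
    table.commit([{"id": 1}])                     # version 0
    table.commit([{"id": 2}])                     # version 1
    table.commit([{"id": 9}], mode="overwrite")   # version 2
    latest = table.read()
    as_of_v1 = table.read(version=1)
```

The point of the sketch is that readers never list the data directory directly. They trust only the log, so a data file becomes visible in a single step when its log entry is created, and older versions stay readable as long as their files are kept.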

Delta Lake

Delta Lake is an open-source table format that adds a transactional layer on top of data lake storage:

Features

| Feature | Description |
|---|---|
| ACID Transactions | Reliable read/write operations |
| Schema Enforcement | Prevent bad data from entering |
| Schema Evolution | Add columns without rewriting data |
| Time Travel | Query historical versions of data |
| Unified Batch + Streaming | Same table for both workloads |
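
Schema enforcement and schema evolution are easier to reason about with a small model. The sketch below is plain Python and deliberately does not use the Delta Lake API: SchemaError, enforce, and evolve are hypothetical helpers, and the schema is just a dict of column name to Python type. In Delta Lake on Spark, additive schema changes are typically opted into with the mergeSchema write option.

```python
class SchemaError(ValueError):
    """Raised when a row or a schema change is incompatible with the table schema."""


def enforce(row, schema):
    """Schema enforcement: return the row shaped to the schema, or raise SchemaError."""
    unknown = set(row) - set(schema)
    if unknown:
        raise SchemaError(f"unexpected columns: {sorted(unknown)}")
    for column, expected in schema.items():
        value = row.get(column)
        if value is not None and not isinstance(value, expected):
            raise SchemaError(
                f"{column!r} expects {expected.__name__}, got {type(value).__name__}"
            )
    # Columns missing from the row are filled with None (a null).
    return {column: row.get(column) for column in schema}


def evolve(schema, new_columns):
    """Schema evolution: add columns; existing columns may not change type."""
    merged = dict(schema)
    for column, col_type in new_columns.items():
        if column in merged and merged[column] is not col_type:
            raise SchemaError(f"cannot change type of existing column {column!r}")
        merged[column] = col_type
    return merged


schema_v1 = {"id": int, "name": str}
stored_rows = [enforce({"id": 1, "name": "Ana"}, schema_v1)]

# A row with the wrong type is rejected instead of silently landing in the table.
try:
    enforce({"id": "2", "name": "Bob"}, schema_v1)
    bad_row_rejected = False
except SchemaError:
    bad_row_rejected = True

# Evolving the schema does not touch rows that were already stored.
schema_v2 = evolve(schema_v1, {"email": str})
stored_rows.append(enforce({"id": 3, "name": "Eve", "email": "eve@example.com"}, schema_v2))

# Old rows are simply read with a null in the new column.
rows_as_v2 = [enforce(row, schema_v2) for row in stored_rows]
```

The last line is the key idea behind "add columns without rewriting data": the old row is never rewritten, it is only interpreted under the newer schema at read time.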

Time Travel Example

# Read data as it was on a specific date.
# Assumes `spark` is an active SparkSession with Delta Lake configured.
df = (
    spark.read
    .format("delta")
    .option("timestampAsOf", "2025-12-29")
    .load("/path/to/my/table")
)

Compatible Tools

Delta Lake works with Apache Spark, Hive, Presto/Trino, and many BI tools.


Databricks

Databricks is a unified data analytics platform built on Apache Spark:

What It Provides

| Capability | Description |
|---|---|
| Managed Spark | No cluster management needed |
| Collaborative notebooks | Python, Scala, R, SQL in one place |
| Delta Lake | Native integration |
| MLflow | ML lifecycle management |
| Unity Catalog | Data governance |

Use Cases

  • Big data processing
  • Machine learning at scale
  • Real-time analytics
  • Data engineering pipelines

Databricks vs Lakehouse

Databricks is not the entire lakehouse architecture; it is a platform that can serve as one component of it:

  • Integrates with S3, Azure Data Lake Storage, and GCS for storage
  • Can connect to Snowflake, BigQuery, and other external systems
  • Provides the compute and processing layer

Cloud Lakehouse Options

Each major cloud provider offers services for building lakehouses:

Google Cloud

| Service | Role |
|---|---|
| Cloud Storage | Object storage foundation |
| BigQuery | Analytics and querying |
| Dataproc | Managed Spark/Hadoop |

Microsoft Azure

| Service | Role |
|---|---|
| Azure Data Lake Storage | Object storage |
| Azure Synapse Analytics | Analytics and DWH |
| Azure Databricks | Spark processing |

Amazon Web Services

| Service | Role |
|---|---|
| Amazon S3 | Object storage |
| Amazon Redshift | Data warehouse queries |
| AWS Glue | ETL and cataloging |
| Amazon EMR | Managed Spark/Hadoop |

Data Lake vs Warehouse vs Lakehouse

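| Feature | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Data types | Any | Structured | Any |
| Schema | On-read | On-write | Both |
| ACID | No | Yes | Yes |
| Cost | Low | High | Medium |
| Query performance | Variable | Fast | Good |
| Best for | Data science, ML | BI, reporting | Unified analytics |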

Medallion Architecture

A common pattern for organizing data in a lakehouse:

Three Layers

| Layer | Name | Purpose | Format |
|---|---|---|---|
| Bronze | Raw | Landing zone, original data | Any (often Avro) |
| Silver | Cleaned | Cleansed, validated, integrated | Parquet |
| Gold | Curated | Business-ready, aggregated | Parquet/Delta |

Data Flow

Sources → Bronze (Raw) → Silver (Clean) → Gold (Business) → BI/ML
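
The flow above can be sketched as a tiny in-memory pipeline. The order events, the cleaning rules (drop invalid amounts, de-duplicate by order_id, normalize country codes), and the function names to_silver and to_gold are all assumptions chosen for illustration. A real pipeline would read and write tables rather than Python lists.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as received (hypothetical order events).
bronze = [
    {"order_id": "1", "country": " es ", "amount": "10.00"},
    {"order_id": "2", "country": "FR", "amount": "5.50"},
    {"order_id": "2", "country": "FR", "amount": "5.50"},  # duplicate delivery
    {"order_id": "3", "country": "ES", "amount": "oops"},  # invalid amount
    {"order_id": "4", "country": "es", "amount": "2.25"},
]


def to_silver(rows):
    """Silver: typed, validated, de-duplicated records."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # invalid rows stay in bronze but do not reach silver
        order_id = int(row["order_id"])
        if order_id in seen:
            continue
        seen.add(order_id)
        clean.append({
            "order_id": order_id,
            "country": row["country"].strip().upper(),
            "amount": amount,
        })
    return clean


def to_gold(rows):
    """Gold: business-level aggregate, here revenue per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because bronze is never modified, the silver and gold layers can be rebuilt from scratch whenever a cleaning rule changes.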

Format Recommendations

| Layer | Format | Reason |
|---|---|---|
| Bronze | Avro | Fast writes, schema evolution |
| Silver/Gold | Parquet | Fast reads, columnar, compression |
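
The reasoning behind these recommendations comes down to row-oriented versus column-oriented layout. The toy sketch below uses plain Python lists rather than real Avro or Parquet files, and the sample records and function names are hypothetical. It only counts how many values an analytical query has to touch in each layout.

```python
records = [
    {"id": 1, "city": "Madrid", "temp": 31.0},
    {"id": 2, "city": "Oslo", "temp": 12.5},
    {"id": 3, "city": "Lima", "temp": 19.0},
]

# Row-oriented (Avro-like): each record is stored contiguously,
# so writing a new record is a single append.
row_layout = [(r["id"], r["city"], r["temp"]) for r in records]

# Column-oriented (Parquet-like): the values of each column are stored together.
column_layout = {
    "id": [r["id"] for r in records],
    "city": [r["city"] for r in records],
    "temp": [r["temp"] for r in records],
}


def mean_temp_from_rows(rows):
    """Every field of every record is scanned to reach the one field we need."""
    scanned, total = 0, 0.0
    for row in rows:
        scanned += len(row)
        total += row[2]
    return total / len(rows), scanned


def mean_temp_from_columns(columns):
    """Only the 'temp' column is scanned; the other columns are never touched."""
    temps = columns["temp"]
    return sum(temps) / len(temps), len(temps)


row_mean, row_values_scanned = mean_temp_from_rows(row_layout)
col_mean, col_values_scanned = mean_temp_from_columns(column_layout)
```

Both layouts give the same answer, but the columnar one reads a third of the values here. Values of the same type stored side by side also tend to compress better, which is the other reason columnar formats suit the read-heavy Silver and Gold layers.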

Object Storage: Blob, S3, and HDFS

When building a data lake, choosing the right storage system is critical. While they all store large amounts of unstructured data, they differ in origin and typical use cases.

Cloud-Based Object Storage

  • AWS S3 (Simple Storage Service): The de facto standard for cloud object storage. It uses “buckets” to store immutable objects.
  • Azure Blob Storage: Microsoft’s equivalent to S3. Data is stored as “blobs” in containers.
  • Google Cloud Storage (GCS): Google’s scalable object storage for the GCP ecosystem.
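
What "buckets" and "immutable objects" mean in practice can be modelled in a few lines. The ToyBucket class below is a hypothetical in-memory stand-in, not the API of S3, Azure Blob Storage, or GCS. It shows two assumptions that typically hold for object stores: the namespace is flat (folders are only key prefixes) and an object is replaced as a whole rather than edited in place.

```python
class ToyBucket:
    """In-memory model of a bucket: a flat mapping from key to bytes."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, key, data):
        # Writing a key always replaces the whole object.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # "Directories" are just a naming convention: listing filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


def append_line(bucket, key, line):
    """There is no in-place append: read the object, build a new one, replace it."""
    try:
        current = bucket.get(key)
    except KeyError:
        current = b""
    bucket.put(key, current + line)


bucket = ToyBucket("my-data-lake")  # hypothetical bucket name
bucket.put("raw/2025/01/events.json", b'{"id": 1}\n')
bucket.put("raw/2025/02/events.json", b'{"id": 2}\n')
bucket.put("curated/report.csv", b"id,total\n")

raw_keys = bucket.list("raw/")
append_line(bucket, "raw/2025/01/events.json", b'{"id": 3}\n')
january = bucket.get("raw/2025/01/events.json")
```

This is one reason lake pipelines tend to write many new files instead of updating existing ones, and why table formats track the set of current files in a separate log.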

On-Premise & Distributed Systems

  • HDFS (Hadoop Distributed File System): A distributed file system designed to run on commodity hardware. Unlike object storage, it keeps storage and compute on the same nodes so that jobs can exploit “data locality.”
  • MinIO: A high-performance, open-source object storage system that is S3-compatible. It is ideal for building private clouds or hybrid architectures.
  • Ceph: A unified, highly scalable storage platform that provides object, block, and file storage.

Storage System Comparison

| Feature | HDFS | S3 & Blob Storage | MinIO & Ceph |
|---|---|---|---|
| Environment | On-premises, self-managed | Cloud-based, managed | On-premises, self-managed |
| Scalability | Tied to physical cluster | Massively elastic | Hardware-bound but flexible |
| API | POSIX-like file system | RESTful (Object-based) | RESTful (S3-compatible) |
| Cost | Upfront hardware | Pay-as-you-go | Hardware + Maintenance |
| Best For | Batch (MapReduce) | Cloud-native Data Lakes | Private Cloud / Hybrid |

Key Takeaways

  1. Data Lakes store raw data in any format (flexible but can be messy)
  2. Data Lakehouses add structure and transactions to lakes
  3. Delta Lake enables ACID transactions and time travel on object storage
  4. Databricks is a platform for building lakehouses (not the lakehouse itself)
  5. Medallion architecture (Bronze → Silver → Gold) organizes lakehouse data
  6. Choose based on needs: Lake for exploration, Warehouse for BI, Lakehouse for both