Data Lakes & Lakehouses

Modern data architectures have evolved beyond traditional data warehouses to handle diverse data types at massive scale.

Data Lakes

A data lake is a storage repository that holds vast amounts of raw data (structured, semi-structured, and unstructured) in its native format.

Key Characteristics

Aspect	Data Lake	Data Warehouse
Data format	Raw, any format	Structured, predefined schema
Schema	Schema-on-read	Schema-on-write
Data types	Structured + semi-structured + unstructured	Primarily structured
Processing	Process when needed	Process before loading
Use cases	Data science, ML, exploration	BI, reporting
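
The schema row of the table above can be made concrete with a small sketch. It uses only the Python standard library; the sample records, the SCHEMA mapping, and both function names are hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw events, as they might land in a lake (one JSON document per line).
RAW_LINES = [
    '{"user_id": 1, "amount": "19.99", "country": "DE"}',
    '{"user_id": 2, "amount": "5.00"}',
    '{"user_id": "oops", "amount": "7.50", "country": "FR"}',
]

# Hypothetical warehouse-style schema: column name -> converter.
SCHEMA = {"user_id": int, "amount": float, "country": str}


def write_with_schema(lines, schema):
    """Schema-on-write: validate and convert every record before it is stored."""
    stored, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            stored.append({col: cast(record[col]) for col, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            rejected.append(line)
    return stored, rejected


def read_with_schema(lines, columns):
    """Schema-on-read: lines are stored untouched; structure is applied at query time."""
    for line in lines:
        record = json.loads(line)
        yield {col: record.get(col) for col in columns}


# Warehouse style: only records that fit the schema are loaded.
stored, rejected = write_with_schema(RAW_LINES, SCHEMA)

# Lake style: every record is kept, and this query only interprets "amount".
amounts = [float(r["amount"]) for r in read_with_schema(RAW_LINES, ["amount"])]
```

In this sketch the warehouse-style path keeps one record and rejects two, while the lake-style path can still answer a question about amounts from all three. That is the trade-off in the table: more flexibility, but the cleanup work moves to whoever reads the data.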

Data Lake Storage

Cloud Object Storage:

Amazon S3
Azure Data Lake Storage
Google Cloud Storage

On-Premises:

Hadoop Distributed File System (HDFS)

Pros and Cons

Pros	Cons
Store any data type	Can become “data swamp” without governance
Cost-effective storage	Harder to query without preparation
Flexible for data science	No transactional guarantees
Scales horizontally	Schema management challenges

Data Lakehouse

A data lakehouse combines the best features of data lakes and data warehouses:

Data Lakehouse = Data Lake flexibility + Data Warehouse structure

Key Benefits

Feature	Description
Unified platform	One platform for BI and data science
ACID transactions	Reliability of data warehouses on lake storage
Schema enforcement	Structure when you need it
Open formats	Parquet, Delta, Iceberg
Cost efficiency	Object storage is cheaper than DWH storage

How It Works

Storage Layer: Object storage (S3, ADLS, GCS)
Table Format Layer: Delta Lake, Apache Iceberg, or Apache Hudi
Query Engine: Spark, Trino, Databricks SQL
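
To give a feel for what the table format layer contributes, here is a deliberately simplified sketch of a commit log kept next to the data files. It is loosely inspired by how log-based table formats work but is not the actual Delta Lake, Iceberg, or Hudi protocol. A local temporary directory stands in for object storage, and the "_log" folder, the action fields, and the function names are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def commit(table: Path, version: int, actions: list) -> None:
    """Publish one table version by writing a single JSON commit file."""
    log_dir = table / "_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    # Mode "x" fails if the file already exists, so two writers
    # cannot both claim the same version number.
    with open(log_dir / f"{version:08d}.json", "x") as f:
        json.dump(actions, f)


def snapshot(table: Path, as_of=None) -> set:
    """Replay commits in order to find the data files that make up a version."""
    files = set()
    for path in sorted((table / "_log").glob("*.json")):
        if as_of is not None and int(path.stem) > as_of:
            break
        for action in json.loads(path.read_text()):
            if action["op"] == "add":
                files.add(action["file"])
            else:  # "remove"
                files.discard(action["file"])
    return files


table = Path(tempfile.mkdtemp()) / "events"
commit(table, 0, [{"op": "add", "file": "part-000.parquet"}])
commit(table, 1, [{"op": "add", "file": "part-001.parquet"}])
# A compaction swaps two small files for one larger file in a single commit.
commit(table, 2, [
    {"op": "remove", "file": "part-000.parquet"},
    {"op": "remove", "file": "part-001.parquet"},
    {"op": "add", "file": "part-002.parquet"},
])

latest = snapshot(table)               # files a query engine would read now
version_1 = snapshot(table, as_of=1)   # files that made up an older version
```

The data files themselves are never modified. Readers see either all of a commit or none of it, and older versions stay reachable as long as their files are retained. The query engine's job is then to read whichever files the snapshot lists.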

Delta Lake

Delta Lake is an open-source table format that adds a transactional layer on top of data lake storage:

Features

Feature	Description
ACID Transactions	Reliable read/write operations
Schema Enforcement	Prevent bad data from entering
Schema Evolution	Add columns without rewriting data
Time Travel	Query historical versions of data
Unified Batch + Streaming	Same table for both workloads
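
Schema enforcement and schema evolution are easier to tell apart with a toy model. The ToyTable class below is a hypothetical in-memory stand-in written for this example. It is not the Delta Lake API and only mimics the behavior described in the table above.

```python
class ToyTable:
    """In-memory stand-in for a table with an enforced schema (not the Delta Lake API)."""

    def __init__(self, schema: dict):
        self.schema = dict(schema)  # column name -> Python type
        self.rows = []              # stored rows are never rewritten

    def append(self, row: dict) -> None:
        """Schema enforcement: reject unknown columns and wrong types."""
        for col, value in row.items():
            if col not in self.schema:
                raise ValueError(f"unknown column: {col}")
            if not isinstance(value, self.schema[col]):
                raise TypeError(f"{col} expects {self.schema[col].__name__}")
        self.rows.append(row)

    def add_column(self, name: str, col_type: type) -> None:
        """Schema evolution: a metadata-only change; existing rows are untouched."""
        self.schema[name] = col_type

    def read(self) -> list:
        """Rows written before a column existed surface it as None."""
        return [{col: row.get(col) for col in self.schema} for row in self.rows]


orders = ToyTable({"order_id": int, "total": float})
orders.append({"order_id": 1, "total": 9.5})

rejected_reason = None
try:
    orders.append({"order_id": 2, "total": "free"})  # wrong type, so it is rejected
except TypeError as err:
    rejected_reason = str(err)

orders.add_column("currency", str)
orders.append({"order_id": 3, "total": 4.0, "currency": "EUR"})
result = orders.read()
```

The bad row never enters the table, and adding the currency column changes only the schema. The first order is still stored exactly as it was written and simply reads back with no currency.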

Time Travel Example

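from pyspark.sql import SparkSession

# Assumes the Delta Lake package is available to this Spark session
spark = SparkSession.builder.getOrCreate()

# Read data as it was on a specific date
df = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2025-12-29") \
    .load("/path/to/my/table")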
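
Conceptually, a read by timestamp has to be mapped to a concrete table version: the latest one committed at or before the requested time. The sketch below illustrates that lookup with the standard library only. The commit history and the version_as_of helper are hypothetical, and this is a way to picture the idea rather than Delta Lake's own code.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit history: list index = table version, value = commit time.
COMMIT_TIMES = [
    datetime(2025, 12, 27, 9, 0),
    datetime(2025, 12, 28, 18, 30),
    datetime(2025, 12, 30, 7, 15),
]


def version_as_of(commit_times, timestamp):
    """Return the latest version committed at or before `timestamp`."""
    index = bisect_right(commit_times, timestamp)
    if index == 0:
        raise ValueError("timestamp is earlier than the first commit")
    return index - 1


# Midnight on 2025-12-29 falls between the second and third commits.
version = version_as_of(COMMIT_TIMES, datetime(2025, 12, 29))
```

Here the lookup lands on the second commit (version 1), because the third commit happened after the requested date.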

Compatible Tools

Delta Lake works with Apache Spark, Hive, Presto/Trino, and many BI tools.

Databricks

Databricks is a unified data analytics platform built on Apache Spark:

What It Provides

Capability	Description
Managed Spark	Spark clusters are provisioned and maintained by the platform
Collaborative notebooks	Python, Scala, R, SQL in one place
Delta Lake	Native integration
MLflow	ML lifecycle management
Unity Catalog	Data governance

Use Cases

Big data processing
Machine learning at scale
Real-time analytics
Data engineering pipelines

Databricks vs Lakehouse

Databricks is not the entire lakehouse architecture; it is a platform that can serve as one component of it:

Integrates with S3, Azure Data Lake, GCS for storage
Can connect to Snowflake, BigQuery, etc.
Provides compute and processing layer

Cloud Lakehouse Options

Each cloud provider has services for building lakehouses:

Google Cloud

Service	Role
Cloud Storage	Object storage foundation
BigQuery	Analytics and querying
Dataproc	Managed Spark/Hadoop

Microsoft Azure

Service	Role
Azure Data Lake Storage	Object storage
Azure Synapse Analytics	Analytics and DWH
Azure Databricks	Spark processing

Amazon Web Services

Service	Role
Amazon S3	Object storage
Amazon Redshift	Data warehouse queries
AWS Glue	ETL and cataloging
Amazon EMR	Managed Spark/Hadoop

Data Lake vs Warehouse vs Lakehouse

Feature	Data Lake	Data Warehouse	Data Lakehouse
Data types	Any	Structured	Any
Schema	On-read	On-write	Both
ACID	No	Yes	Yes
Cost	Low	High	Medium
Query performance	Variable	Fast	Good
Best for	Data science, ML	BI, reporting	Unified analytics

Medallion Architecture

A common pattern for organizing data in a lakehouse:

Three Layers

Layer	Name	Purpose	Format
Bronze	Raw	Landing zone, original data	Any (often Avro)
Silver	Cleaned	Cleansed, validated, integrated	Parquet
Gold	Curated	Business-ready, aggregated	Parquet/Delta

Data Flow

Sources → Bronze (Raw) → Silver (Clean) → Gold (Business) → BI/ML
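
The flow above can be sketched end to end in a few lines of plain Python. In practice each layer would be a table on object storage and each step a Spark or SQL job. The sample records, the cleaning rules, and the function names here are assumptions chosen for the example.

```python
from collections import defaultdict

# Bronze: hypothetical raw records, kept exactly as they arrived.
bronze = [
    {"order_id": "1", "country": "de", "amount": "20.00"},
    {"order_id": "2", "country": "FR", "amount": "5.50"},
    {"order_id": "2", "country": "FR", "amount": "5.50"},  # duplicate delivery
    {"order_id": "3", "country": "DE", "amount": "n/a"},   # invalid amount
]


def to_silver(records):
    """Cleanse and validate: cast types, normalize values, drop bad rows and duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        try:
            row = {
                "order_id": int(rec["order_id"]),
                "country": rec["country"].upper(),
                "amount": float(rec["amount"]),
            }
        except (KeyError, ValueError):
            continue  # a real pipeline would usually quarantine the row instead
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            cleaned.append(row)
    return cleaned


def to_gold(rows):
    """Aggregate into a business-ready view: revenue per country."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze still holds all four original records, so the Silver and Gold layers can be rebuilt from it if the cleaning rules change later.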

Format Recommendations

Layer	Format	Reason
Bronze	Avro	Fast writes, schema evolution
Silver/Gold	Parquet	Fast reads, columnar, compression
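
The reasoning behind these recommendations comes down to row-oriented versus column-oriented layout. The sketch below uses plain Python lists rather than real Avro or Parquet files, and the sample data and helper names are hypothetical. It shows why an aggregate touches less data in a columnar layout and why columns tend to compress well.

```python
# Hypothetical records stored row by row (the layout Avro uses).
rows = [
    {"id": 1, "country": "DE", "amount": 20.0},
    {"id": 2, "country": "FR", "amount": 5.5},
    {"id": 3, "country": "DE", "amount": 12.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column (the idea behind Parquet)."""
    return {name: [rec[name] for rec in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate reads a single column instead of every full record.
total = sum(columns["amount"])

# Values of one column sit together, so repeated values collapse into short runs.
countries = run_length_encode(sorted(columns["country"]))
```

Appending a record to the row layout is a single write, which suits a landing zone. The columnar layout pays a reorganization cost up front in exchange for cheaper analytical reads.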

Storage Systems: Blob, S3, and HDFS

When building a data lake, choosing the right storage system is critical. The systems below can all store large amounts of data in any format, but they differ in origin and typical use cases.

Cloud-Based Object Storage

Amazon S3 (Simple Storage Service): The de facto standard for cloud object storage. Objects live in “buckets” and are immutable: an update replaces the whole object rather than editing it in place.
Azure Blob Storage: Microsoft’s equivalent to S3. Data is stored as “blobs” in containers.
Google Cloud Storage (GCS): Google’s scalable object storage for the GCP ecosystem.
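
The object model shared by these services can be pictured with a toy in-memory bucket. ToyBucket and its methods are hypothetical names made up for this sketch. It is not a cloud SDK and it ignores durability, permissions, and networking entirely.

```python
class ToyBucket:
    """In-memory stand-in for a bucket or container (not a real cloud SDK)."""

    def __init__(self):
        self._objects = {}  # flat namespace: key -> bytes

    def put(self, key: str, data: bytes) -> None:
        """Store a whole object; writing to an existing key replaces it entirely."""
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        """Return the full contents stored under a key."""
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> list:
        """'Folders' are only key prefixes; there is no real directory tree."""
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = ToyBucket()
bucket.put("bronze/orders/2025-12-29.json", b'{"order_id": 1}')
bucket.put("bronze/orders/2025-12-30.json", b'{"order_id": 2}')
bucket.put("silver/orders/part-000.parquet", b"...")

bronze_keys = bucket.list_keys("bronze/")

# There is no append or in-place edit: changing an object means putting a new one.
bucket.put("bronze/orders/2025-12-29.json", b'{"order_id": 1, "fixed": true}')
```

This whole-object model is one reason table formats such as Delta Lake write new files and record them in a log instead of modifying existing files.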

On-Premises & Distributed Systems

HDFS (Hadoop Distributed File System): A distributed file system designed to run on commodity hardware. Unlike object storage, it typically runs storage and compute on the same nodes, so jobs can be scheduled where the data lives (“data locality”).
MinIO: A high-performance, open-source object storage system that is S3-compatible. It is ideal for building private clouds or hybrid architectures.
Ceph: A unified, highly scalable storage platform that provides object, block, and file storage.

Storage System Comparison

Feature	HDFS	S3, Blob Storage & GCS	MinIO & Ceph
Environment	On-premises, self-managed	Cloud-based, managed	On-premises, self-managed
Scalability	Tied to physical cluster	Massively elastic	Hardware-bound but flexible
API	POSIX-like file system	RESTful (Object-based)	RESTful (S3-compatible)
Cost	Upfront hardware	Pay-as-you-go	Hardware + Maintenance
Best For	Batch (MapReduce)	Cloud-native Data Lakes	Private Cloud / Hybrid

Key Takeaways

Data Lakes store raw data in any format (flexible but can be messy)
Data Lakehouses add structure and transactions to lakes
Delta Lake enables ACID transactions and time travel on object storage
Databricks is a platform for building lakehouses (not the lakehouse itself)
Medallion architecture (Bronze → Silver → Gold) organizes lakehouse data
Choose based on needs: Lake for exploration, Warehouse for BI, Lakehouse for both