Data Lakes & Lakehouses
Modern data architectures have evolved beyond traditional data warehouses to handle diverse data types at massive scale.
Data Lakes
A data lake is a storage repository that holds vast amounts of raw data in its native format, whether structured, semi-structured, or unstructured.
Key Characteristics
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data format | Raw, any format | Structured, predefined schema |
| Schema | Schema-on-read | Schema-on-write |
| Data types | Structured + semi-structured + unstructured | Primarily structured |
| Processing | Process when needed | Process before loading |
| Use cases | Data science, ML, exploration | BI, reporting |
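
The schema-on-read versus schema-on-write row can be made concrete with a small sketch. It uses only the Python standard library; the `SCHEMA` fields, the sample lines, and the function names are all hypothetical, chosen for illustration rather than taken from any particular product.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"user_id": int, "amount": float}

def conforms(record: dict) -> bool:
    """True when every field the schema requires is present with the right type."""
    return all(isinstance(record.get(name), typ) for name, typ in SCHEMA.items())

def warehouse_load(table: list, record: dict) -> None:
    """Schema-on-write: reject non-conforming records before they are stored."""
    if not conforms(record):
        raise ValueError(f"rejected on write: {record}")
    table.append(record)

def lake_land(files: list, raw_line: str) -> None:
    """Schema-on-read: store the raw line untouched, whatever it contains."""
    files.append(raw_line)

def lake_read(files: list) -> list:
    """Apply the schema only at query time, skipping lines that do not fit."""
    rows = []
    for line in files:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and conforms(record):
            rows.append(record)
    return rows

raw_lines = ['{"user_id": 1, "amount": 9.5}', '{"user_id": "x"}', "not json at all"]

lake_files = []
for line in raw_lines:
    lake_land(lake_files, line)  # the lake accepts all three lines

warehouse_table = []
rejected = 0
for line in raw_lines[:2]:
    try:
        warehouse_load(warehouse_table, json.loads(line))
    except ValueError:
        rejected += 1

print(len(lake_files), len(lake_read(lake_files)), len(warehouse_table), rejected)
```

The lake keeps everything and pays the validation cost on every read, while the warehouse pays it once at load time and never stores the bad record.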
Data Lake Storage
Cloud Object Storage:
- Amazon S3
- Azure Data Lake Storage
- Google Cloud Storage
On-Premises:
- Hadoop Distributed File System (HDFS)
Pros and Cons
| Pros | Cons |
|---|---|
| Store any data type | Can become “data swamp” without governance |
| Cost-effective storage | Harder to query without preparation |
| Flexible for data science | No transactional guarantees |
| Scales horizontally | Schema management challenges |
Data Lakehouse
A data lakehouse combines the best features of data lakes and data warehouses:
Data Lakehouse = Data Lake flexibility + Data Warehouse structure
Key Benefits
| Feature | Description |
|---|---|
| Unified platform | One platform for BI and data science |
| ACID transactions | Reliability of data warehouses on lake storage |
| Schema enforcement | Structure when you need it |
| Open formats | Parquet, Delta, Iceberg |
| Cost efficiency | Object storage is cheaper than DWH storage |
How It Works
- Storage Layer: Object storage (S3, ADLS, GCS)
- Table Format Layer: Delta Lake, Apache Iceberg, or Apache Hudi
- Query Engine: Spark, Trino, Databricks SQL
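
As a rough illustration of how the three layers fit together, the toy below stores data files and a numbered commit log in a plain dictionary. It is loosely inspired by log-based table formats, but the key layout, the `commit` and `scan` functions, and the JSON encoding are assumptions made for this sketch. It does not reproduce the actual on-disk layout of Delta Lake, Iceberg, or Hudi.

```python
import json

# Storage layer: a dict standing in for an object store (key -> bytes).
store = {}

def commit(table: str, rows: list) -> int:
    """Table format layer: write a data file, then publish it via the next log entry."""
    version = sum(1 for key in store if key.startswith(f"{table}/_log/"))
    data_key = f"{table}/data/part-{version:05d}.json"
    store[data_key] = json.dumps(rows).encode()
    # The log entry is written last; until it exists, readers ignore the data file.
    store[f"{table}/_log/{version:05d}.json"] = json.dumps({"add": data_key}).encode()
    return version

def scan(table: str, version=None) -> list:
    """Query engine: replay the log up to `version` and read only the listed files."""
    log_keys = sorted(key for key in store if key.startswith(f"{table}/_log/"))
    if version is not None:
        log_keys = log_keys[: version + 1]
    rows = []
    for log_key in log_keys:
        entry = json.loads(store[log_key])
        rows.extend(json.loads(store[entry["add"]]))
    return rows

v0 = commit("sales", [{"id": 1, "amount": 10}])
v1 = commit("sales", [{"id": 2, "amount": 25}])

# A data file with no log entry (for example, a failed write) is invisible to readers.
store["sales/data/part-orphan.json"] = json.dumps([{"id": 99, "amount": 0}]).encode()

print(len(scan("sales")))
print(len(scan("sales", version=0)))
```

Because readers trust only the log, a write becomes visible all at once or not at all, which is the basic idea behind transactional guarantees on top of plain object storage.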
Delta Lake
Delta Lake adds a structured transactional layer on top of data lakes:
Features
| Feature | Description |
|---|---|
| ACID Transactions | Reliable read/write operations |
| Schema Enforcement | Prevent bad data from entering |
| Schema Evolution | Add columns without rewriting data |
| Time Travel | Query historical versions of data |
| Unified Batch + Streaming | Same table for both workloads |
Time Travel Example
# Read the table as it was at a specific point in time.
# Assumes an active SparkSession named `spark` with Delta Lake configured.
df = (
    spark.read
    .format("delta")
    .option("timestampAsOf", "2025-12-29")
    .load("/path/to/my/table")
)
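
Conceptually, a timestamp-based read has to be mapped to a table version: the latest commit at or before the requested time. The sketch below shows that lookup with the standard library. The commit history and the `version_as_of` helper are hypothetical, and Delta Lake's own resolution rules may differ in detail.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit history: (commit time, version), sorted by time.
commits = [
    (datetime(2025, 12, 27, 8, 0), 0),
    (datetime(2025, 12, 28, 17, 30), 1),
    (datetime(2025, 12, 30, 9, 15), 2),
]

def version_as_of(timestamp: datetime) -> int:
    """Return the latest version committed at or before `timestamp`."""
    times = [committed_at for committed_at, _ in commits]
    index = bisect_right(times, timestamp)
    if index == 0:
        raise ValueError("timestamp is earlier than the first commit")
    return commits[index - 1][1]

print(version_as_of(datetime(2025, 12, 29)))
```

With this history, a read as of 2025-12-29 resolves to version 1, since version 2 was committed a day later.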
Compatible Tools
Delta Lake tables can be read or written from Apache Spark, Hive, Presto, Trino, and many BI tools, typically through connectors.
Databricks
Databricks is a unified data analytics platform built on Apache Spark:
What It Provides
| Capability | Description |
|---|---|
| Managed Spark | Clusters are provisioned and scaled by the platform |
| Collaborative notebooks | Python, Scala, R, SQL in one place |
| Delta Lake | Native integration |
| MLflow | ML lifecycle management |
| Unity Catalog | Data governance |
Use Cases
- Big data processing
- Machine learning at scale
- Real-time analytics
- Data engineering pipelines
Databricks vs Lakehouse
Databricks is not the entire lakehouse architecture; it is a platform that can be one component of it:
- Integrates with S3, Azure Data Lake, GCS for storage
- Can connect to Snowflake, BigQuery, etc.
- Provides compute and processing layer
Cloud Lakehouse Options
Each cloud provider has services for building lakehouses:
Google Cloud
| Service | Role |
|---|---|
| Cloud Storage | Object storage foundation |
| BigQuery | Analytics and querying |
| Dataproc | Managed Spark/Hadoop |
Microsoft Azure
| Service | Role |
|---|---|
| Azure Data Lake Storage | Object storage |
| Azure Synapse Analytics | Analytics and DWH |
| Azure Databricks | Spark processing |
Amazon Web Services
| Service | Role |
|---|---|
| Amazon S3 | Object storage |
| Amazon Redshift | Data warehouse queries |
| AWS Glue | ETL and cataloging |
| Amazon EMR | Managed Spark/Hadoop |
Data Lake vs Warehouse vs Lakehouse
| Feature | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Data types | Any | Structured | Any |
| Schema | On-read | On-write | Both |
| ACID | No | Yes | Yes |
| Cost | Low | High | Medium |
| Query performance | Variable | Fast | Good |
| Best for | Data science, ML | BI, reporting | Unified analytics |
Medallion Architecture
A common pattern for organizing data in a lakehouse:
Three Layers
| Layer | Name | Purpose | Format |
|---|---|---|---|
| Bronze | Raw | Landing zone, original data | Any (often Avro) |
| Silver | Cleaned | Cleansed, validated, integrated | Parquet |
| Gold | Curated | Business-ready, aggregated | Parquet/Delta |
Data Flow
Sources → Bronze (Raw) → Silver (Clean) → Gold (Business) → BI/ML
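
The flow above can be sketched as two small transformations in plain Python. The sample orders, field names, and cleaning rules are hypothetical; a real pipeline would typically run on Spark or a similar engine and persist each layer as its own table.

```python
from collections import defaultdict

# Bronze: raw events kept exactly as received (hypothetical sample data).
bronze = [
    {"order_id": "1", "country": "de", "amount": "19.90"},
    {"order_id": "2", "country": "DE", "amount": "5.10"},
    {"order_id": "2", "country": "DE", "amount": "5.10"},  # duplicate delivery
    {"order_id": "3", "country": "fr", "amount": "n/a"},   # unparseable amount
    {"order_id": "4", "country": "FR", "amount": "12.00"},
]

def to_silver(raw_rows):
    """Clean, validate, and deduplicate bronze rows."""
    seen, silver = set(), []
    for row in raw_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # drop duplicates
        seen.add(order_id)
        silver.append(
            {"order_id": order_id, "country": row["country"].upper(), "amount": amount}
        )
    return silver

def to_gold(silver_rows):
    """Aggregate silver rows into a business-ready revenue-per-country table."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return {country: round(total, 2) for country, total in totals.items()}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping bronze untouched means the silver and gold layers can be rebuilt at any time if the cleaning or aggregation rules change.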
Format Recommendations
| Layer | Format | Reason |
|---|---|---|
| Bronze | Avro | Fast writes, schema evolution |
| Silver/Gold | Parquet | Fast reads, columnar, compression |
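
The reasoning behind these recommendations comes down to row-oriented versus column-oriented layout. The sketch below imitates both layouts with ordinary Python containers and counts how many values an aggregation has to touch. It is only an analogy: the real Avro and Parquet formats add encoding, compression, and metadata that are not modeled here, and the records are hypothetical.

```python
# Hypothetical records with three fields.
records = [
    {"id": 1, "country": "DE", "amount": 20.0},
    {"id": 2, "country": "FR", "amount": 12.0},
    {"id": 3, "country": "DE", "amount": 5.5},
]

# Row-oriented layout (Avro-style): each record's fields are stored together,
# so writing a new record is a single append.
row_layout = [tuple(record.values()) for record in records]

# Column-oriented layout (Parquet-style): each field's values are stored together.
column_layout = {name: [record[name] for record in records] for name in records[0]}

def total_amount_rows(rows):
    """Walks whole records, so every field of every record is read."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)
        total += row[2]  # amount is the third field
    return total, values_read

def total_amount_columns(columns):
    """Reads only the one column the query needs."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)

print(total_amount_rows(row_layout))
print(total_amount_columns(column_layout))
```

Both functions return the same total, but the columnar version touches a third of the values. Similar values sitting next to each other in a column also tend to compress better.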
Object Storage: Blob, S3, and HDFS
When building a data lake, choosing the right storage system is critical. All of the systems below store large volumes of data of any type, but they differ in origin, API, and typical use cases.
Cloud-Based Object Storage
- Amazon S3 (Simple Storage Service): The de facto standard for cloud object storage. Data is stored as objects in “buckets”; an object is replaced as a whole rather than modified in place.
- Azure Blob Storage: Microsoft’s equivalent to S3. Data is stored as “blobs” in containers.
- Google Cloud Storage (GCS): Google’s scalable object storage for the GCP ecosystem.
On-Premises & Distributed Systems
- HDFS (Hadoop Distributed File System): A distributed file system designed to run on commodity hardware. Unlike object storage, it tightly couples storage and compute and is optimized for “data locality.”
- MinIO: A high-performance, open-source object storage system that is S3-compatible. It is ideal for building private clouds or hybrid architectures.
- Ceph: A unified, highly scalable storage platform that provides object, block, and file storage.
Storage System Comparison
| Feature | HDFS | S3 & Blob Storage | MinIO & Ceph |
|---|---|---|---|
| Environment | On-premises, self-managed | Cloud-based, managed | On-premises, self-managed |
| Scalability | Tied to physical cluster | Massively elastic | Hardware-bound but flexible |
| API | POSIX-like file system | RESTful (Object-based) | RESTful (S3-compatible) |
| Cost | Upfront hardware | Pay-as-you-go | Hardware + Maintenance |
| Best For | Batch (MapReduce) | Cloud-native Data Lakes | Private Cloud / Hybrid |
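
To make the object-based API row more tangible, here is an in-memory stand-in for a bucket. The `ObjectStore` class and its method names are hypothetical. Real services are reached over HTTP through their SDKs, and this sketch models only the semantics: flat keys, whole-object writes, and listing by prefix.

```python
class ObjectStore:
    """In-memory stand-in for a bucket: a flat map from key to immutable bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        # Objects are replaced whole; there is no in-place append or partial edit.
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> list:
        # "Directories" are only a naming convention: listing filters keys by prefix.
        return sorted(key for key in self._objects if key.startswith(prefix))

bucket = ObjectStore()
bucket.put("bronze/2025/12/29/events.json", b'{"id": 1}')
bucket.put("bronze/2025/12/30/events.json", b'{"id": 2}')
bucket.put("silver/events.parquet", b"placeholder bytes")

# Changing an object means uploading a full replacement under the same key.
bucket.put("bronze/2025/12/29/events.json", b'{"id": 1}\n{"id": 3}')

print(bucket.list_keys("bronze/2025/12/"))
```

A file system such as HDFS, by contrast, exposes real directories and supports appending to an existing file, which is part of why the two families suit different workloads.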
Key Takeaways
- Data Lakes store raw data in any format (flexible but can be messy)
- Data Lakehouses add structure and transactions to lakes
- Delta Lake enables ACID transactions and time travel on object storage
- Databricks is a platform for building lakehouses (not the lakehouse itself)
- Medallion architecture (Bronze → Silver → Gold) organizes lakehouse data
- Choose based on needs: Lake for exploration, Warehouse for BI, Lakehouse for both