The Modern Data & Analytics Stack
A comprehensive recap of SQL, PySpark, data modeling, and the most popular big data stacks you’ll encounter in your Data & Analytics career.
Core Concepts for Data Analytics
Data Warehouse vs Data Lake vs Data Lakehouse
Data Warehouse (DWH):
- Purpose: Structured data and business intelligence (BI) reporting
- Data Structure: Schema-on-write (data transformed before loading)
- Data Types: Primarily structured data
- Use Cases: Traditional BI reporting, dashboards, analytical queries
- Limitations: Limited flexibility with unstructured data, costly to scale
Data Lake:
- Purpose: Store vast amounts of raw data in native format
- Data Structure: Schema-on-read (structure applied when reading)
- Data Types: Structured, semi-structured, and unstructured
- Use Cases: Big data analytics, machine learning, data exploration
- Challenges: Can become a “data swamp” without proper governance
Data Lakehouse:
- Purpose: Combine best of data lakes and warehouses
- Data Structure: Supports both schema-on-read and schema-on-write
- Data Types: All types (structured, semi-structured, unstructured)
- Key Features:
- ACID transactions
- Data versioning
- Schema enforcement
- Time travel capabilities
- Technologies: Apache Iceberg, Delta Lake, Apache Hudi
- Use Cases: BI, AI/ML, advanced analytics, real-time processing
Medallion Architecture:
The modern data lakehouse typically follows a medallion architecture:
```mermaid
graph LR
    A[Data Sources] --> B[Landing: JSON, raw as-is ingestion]
    B --> C[Bronze: Avro, structured raw with schema enforcement]
    C --> D[Silver: Parquet, optimized for analytics]
    D --> E[Gold: Parquet, business-ready aggregates]
```
- Landing: Raw data as-is (JSON, CSV, XML)
- Bronze: Structured raw with schema enforcement (Avro)
- Silver: Cleaned and enriched data (Parquet)
- Gold: Business-ready aggregates (Parquet with Delta Lake)
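The layer responsibilities above can be illustrated with a deliberately tiny, single-machine sketch (pure Python; in-memory lists stand in for storage, and the field names are invented for the example — in a real lakehouse each layer would be files in object storage processed by Spark):

```python
import json

# Landing: raw JSON payloads, kept exactly as received
landing = [
    '{"city": "Berlin", "amount": "120.5"}',
    '{"city": "Berlin", "amount": "30"}',
    '{"city": "Paris", "amount": "99.9"}',
    '{"city": "Paris", "amount": null}',  # bad record, kept as-is in landing
]

# Bronze: parsed into structured records (schema = known field names)
bronze = [json.loads(line) for line in landing]

# Silver: cleaned and typed -- records failing validation are dropped
silver = [
    {"city": r["city"], "amount": float(r["amount"])}
    for r in bronze
    if r.get("amount") is not None
]

# Gold: business-ready aggregate (revenue per city)
gold = {}
for r in silver:
    gold[r["city"]] = gold.get(r["city"], 0.0) + r["amount"]

print(gold)  # {'Berlin': 150.5, 'Paris': 99.9}
```

Note how the bad record survives in Landing and Bronze (nothing is lost) but is filtered out at the Bronze → Silver validation gate.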
Data Processing Engines
Batch Processing:
- Apache Hadoop MapReduce: Classic batch processing
- Apache Spark (Spark SQL): Modern batch processing with SQL interface
Stream Processing:
- Apache Kafka Streams: Stream processing on Kafka
- Apache Flink: Powerful stream processing framework
- Apache Spark Structured Streaming: Spark’s micro-batch (near-real-time) stream processing
Interactive Query Engines:
- Apache Spark SQL: Interactive queries and batch processing
- Trino (formerly PrestoSQL): Distributed SQL query engine
- Apache Impala: MPP SQL query engine
Data Modeling Fundamentals
Key Concepts:
```mermaid
graph LR
    A[Data Profiling: Understand As-Is] --> B(Data Modeling: Design To-Be)
    B --> C{Prepare Design Documentation}
    C --> D[Data Model]
    C --> E[Data Lineage]
    C --> F[Data Quality Rules]
```
Normalization vs Denormalization:
| Aspect | Normalization | Denormalization |
|---|---|---|
| Purpose | Minimize redundancy | Improve read performance |
| Best For | WRITE operations (OLTP) | READ operations (OLAP) |
| Redundancy | Low | High |
| Integrity | High | Requires maintenance |
| Joins | Many | Few |
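The trade-off is easy to demonstrate with SQLite (the table and column names are invented for the example): the normalized design stores each customer once and answers with a join; the denormalized design repeats customer attributes on every row but answers without one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: each customer stored once (low redundancy, join needed to read)
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Ada', 'London'), (2, 'Grace', 'New York');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")
normalized = cur.execute("""
    SELECT c.city, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.city ORDER BY c.city
""").fetchall()

# Denormalized: city repeated on every order row (redundant, but no join)
cur.executescript("""
    CREATE TABLE orders_wide (id INTEGER, customer_name TEXT, city TEXT, amount REAL);
    INSERT INTO orders_wide VALUES
        (10, 'Ada', 'London', 25.0), (11, 'Ada', 'London', 40.0),
        (12, 'Grace', 'New York', 15.0);
""")
denormalized = cur.execute(
    "SELECT city, SUM(amount) FROM orders_wide GROUP BY city ORDER BY city"
).fetchall()

assert normalized == denormalized  # same answer, different trade-offs
print(normalized)  # [('London', 65.0), ('New York', 15.0)]
```

If Ada moves city, the normalized design needs one UPDATE; the wide table needs every one of her rows touched — exactly the "requires maintenance" cost in the table above.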
Table Structures:
- Long Tables (Normalized):
  - Smaller file size
  - Flexible for updates
  - Complex for visualizations
  - Best for WRITE-heavy workflows
- Wide Tables (Denormalized):
  - Faster queries
  - Pre-aggregated metrics
  - Increases redundancy
  - Best for READ-heavy workflows
Data Warehousing Concepts:
- Fact Tables: Contain measures (numeric data, metrics)
- Dimension Tables: Contain attributes for filtering/grouping
- Star Schema: Denormalized dimensions, fewer joins
- Snowflake Schema: Normalized dimensions, more joins
- Granularity: Level of detail in fact tables
Essential Tools
SQL
SQL is fundamental across all data platforms, though implementations vary:
Common Variants:
- MySQL
- PostgreSQL
- Trino SQL
- BigQuery SQL
- Snowflake SQL
Key Concepts:
Primary Keys, Foreign Keys, Indexes:
- Primary Key: Uniquely identifies each row
- Foreign Key: References primary key in another table
- Indexes: Improve query performance
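A quick SQLite sketch of all three concepts (the `authors`/`books` schema is invented; note SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES authors(id),
    title TEXT
)""")
conn.execute("CREATE INDEX idx_books_author ON books(author_id)")

conn.execute("INSERT INTO authors VALUES (1, 'Kimball')")
conn.execute("INSERT INTO books VALUES (1, 1, 'The Data Warehouse Toolkit')")

# Primary key: duplicate ids are rejected
try:
    conn.execute("INSERT INTO authors VALUES (1, 'Duplicate')")
except sqlite3.IntegrityError as e:
    print("PK violation:", e)

# Foreign key: a book must point at an existing author
try:
    conn.execute("INSERT INTO books VALUES (2, 99, 'Orphan')")
except sqlite3.IntegrityError as e:
    print("FK violation:", e)

# Index: the planner can use idx_books_author instead of a full scan
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM books WHERE author_id = 1"
).fetchall()
print(plan)
```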
Normalization Levels:
- 1NF: Atomic values, no repeating groups
- 2NF: No partial dependencies
- 3NF: No transitive dependencies
Star vs Snowflake Schema:
- Star: Denormalized dimensions (simpler queries)
- Snowflake: Normalized dimensions (less redundancy)
PySpark
PySpark is Python’s interface to Apache Spark for distributed computing.
Core Concepts:
RDD vs DataFrame:
| Feature | RDD | DataFrame |
|---|---|---|
| Structure | Unstructured collection | Structured with schema |
| Optimization | Manual | Catalyst Optimizer |
| Ease of Use | Complex | SQL-like API |
| Type Safety | Compile-time (Scala/Java) | Runtime |
File Formats:
- JSON: Landing/ingestion layer
- Avro: Bronze layer (schema enforcement, write-heavy)
- Parquet: Silver/Gold layers (columnar, read-optimized)
- Delta Lake: Transactional layer on Parquet
Two Methods for Data Manipulation:
1. DataFrame API (Pythonic):
```python
from pyspark.sql import functions as F

df = (
    spark.read.parquet("data.parquet")
    .filter(F.col("age") > 25)
    .groupBy("city")
    .agg(F.count("*").alias("count"))
)
```
2. SQL Queries:
```python
df.createOrReplaceTempView("people")
result = spark.sql("""
    SELECT city, COUNT(*) AS count
    FROM people
    WHERE age > 25
    GROUP BY city
""")
```
Key Features:
- Lazy evaluation (optimizes execution plan)
- Distributed processing
- In-memory computing
- Fault tolerance
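Lazy evaluation means transformations only build up an execution plan; nothing runs until an action asks for results. A rough single-machine analogy using Python generators (an illustration of the idea, not Spark itself):

```python
# "Transformations": build a pipeline, nothing is computed yet
data = range(1_000_000)
filtered = (x for x in data if x % 2 == 0)   # like df.filter(...)
mapped = (x * 10 for x in filtered)          # like df.select(...)

# "Action": only now does the pipeline run, and only as far as needed
first_five = [next(mapped) for _ in range(5)]  # like df.take(5)
print(first_five)  # [0, 20, 40, 60, 80]
```

Spark goes further: because it sees the whole plan before executing, the Catalyst optimizer can reorder filters, prune columns, and choose join strategies before any data moves.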
Databricks
Databricks is a unified analytics platform built on Apache Spark.
Advantages over Local Spark:
- Managed Spark environment
- Scalability and elasticity
- Collaborative notebooks
- Integrated tools (MLflow, Delta Lake)
- Performance optimization
- Auto-termination (cost savings)
Use Cases:
- Large-scale data engineering
- Machine learning workflows
- Real-time analytics
- Collaborative data science
Big Data Tech Stacks
1. Google Cloud Platform (GCP) Stack
Fully managed, serverless, integrated stack.
| Layer | Component | Users | Purpose |
|---|---|---|---|
| Data Lake | Cloud Storage (GCS) | Data Engineers | Raw data storage |
| Transformation | Dataform | Data Engineers | SQLX transformations |
| Orchestration | Cloud Composer | Data Engineers | Workflow management |
| Data Warehouse | BigQuery | All | Serverless analytics |
| BI | Looker | Analysts | LookML semantic layer |
2. On-Premise Stack
Classic open-source stack with maximum control.
| Layer | Component | Users | Purpose |
|---|---|---|---|
| Data Lake | HDFS | Data Engineers | Distributed file system |
| Transformation | PySpark | Data Engineers | Distributed processing |
| Orchestration | Airflow | Data Engineers | Workflow scheduling |
| Monitoring | Graphite + Grafana | DevOps | Metrics and dashboards |
3. Open-Source Cloud/Hybrid Stack
Best-of-both-worlds approach.
| Layer | Component | Users | Purpose |
|---|---|---|---|
| Data Lake | MinIO/S3 | Data Engineers | Object storage |
| Data Warehouse | Snowflake | All | Cloud data warehouse |
| Transformation | dbt | Analytics Engineers | SQL transformations |
| Orchestration | Airflow | Analytics Engineers | Workflow management |
| BI | Superset/Redash | Analysts | Open-source BI |
4. Modern Data Lakehouse (Nessie-Based)
Advanced open-source lakehouse with version control.
| Layer | Component | Users | Purpose |
|---|---|---|---|
| Data Lake | MinIO/S3 | Data Engineers | Scalable storage |
| Table Format | Apache Iceberg | Data Engineers | ACID transactions |
| Data Catalog | Project Nessie | All | Git-like versioning |
| Query Engine | Trino/PySpark | All | Distributed queries |
| BI | Superset/Redash | Analysts | Visualization |
Key Features:
- Decoupled compute and storage
- Version control for data
- No vendor lock-in
- Portable across clouds
Data Engineering Tools
Workflow Orchestration
Apache Airflow:
- DAG-based workflows
- Rich UI for monitoring
- Extensive integrations
- Self-managed or cloud (Cloud Composer)
Jenkins:
- CI/CD automation
- Plugin ecosystem
- Self-managed
- More code-focused than data-focused
Comparison:
| Feature | Airflow | Jenkins | Cloud Composer |
|---|---|---|---|
| Purpose | Data workflows | Software CI/CD | Managed Airflow |
| Management | Self/Cloud | Self | Fully managed |
| Focus | ETL/ML | Build/Deploy | Data pipelines |
Transformation Tools
dbt (data build tool):
- SQL-based transformations
- Modular table logic
- Built-in testing
- Version control integration
- Works with Snowflake, BigQuery, Redshift
Dataform:
- SQLX (SQL + JavaScript)
- Dependency management
- Native GCP integration
- Similar to dbt but GCP-focused
Azure Data Factory (ADF):
- Visual, low-code ETL
- Data movement orchestration
- Azure-native
- Can trigger dbt jobs
BI & Visualization Tools
Self-Hosted Options:
- Apache Superset: Enterprise-ready, 40+ chart types
- Metabase: User-friendly, self-service
- Redash: SQL-focused, 35+ data sources
- Grafana: Monitoring and dashboards
Cloud Options:
- Looker (GCP): LookML semantic layer
- Power BI (Microsoft): Excel-like interface
- Tableau (Salesforce): Advanced visualizations
Data Catalog & Governance
Apache Iceberg:
- Open table format
- ACID transactions
- Time travel
- Schema evolution
Project Nessie:
- Git-like data catalog
- Branch and merge data
- Version control
- Multi-table transactions
Relationship:
- Iceberg: Table format (single-table versioning)
- Nessie: Catalog (multi-table version control)
Best Practices
Data Quality
- Profile your data (understand as-is)
- Model your data (design to-be)
- Document everything (data lineage, quality rules)
- Implement validation gates (Bronze → Silver → Gold)
- Monitor continuously (track metrics, alert on failures)
Performance Optimization
For SQL:
- Filter early in queries
- Use partitioning and clustering
- Avoid SELECT *
- Leverage indexes
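These tips can be seen in miniature with SQLite’s `EXPLAIN QUERY PLAN` (the `events` table is invented for the example; cloud warehouses rely on partitioning and clustering rather than B-tree indexes, but the scan-versus-seek idea is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, user_id INTEGER, kind TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [(i, i % 100, "click", "x" * 50) for i in range(1000)],
)

# Without an index, the filter forces a full table scan
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT kind FROM events WHERE user_id = 7"
).fetchall()
print(before)

conn.execute("CREATE INDEX idx_events_user ON events(user_id)")

# With the index, the same filtered, column-pruned query can seek directly
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT kind FROM events WHERE user_id = 7"
).fetchall()
print(after)
```

Selecting only `kind` (instead of `SELECT *`) is the column-pruning half of the advice: columnar formats like Parquet never even read the skipped columns from disk.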
For PySpark:
- Use DataFrame API for optimization
- Partition data appropriately
- Cache intermediate results
- Broadcast small tables in joins
Cost Management
- Right-size resources (don’t over-provision)
- Use auto-termination (stop idle clusters)
- Leverage spot/preemptible instances
- Monitor query costs (set budgets and alerts)
- Optimize storage (compress, partition, lifecycle policies)
Emerging Trends
AI-Assisted Analytics
Code Assistants:
- GitHub Copilot
- Gemini Code Assist (Google)
- Windsurf (Codeium)
- Tabby (self-hosted)
Python Libraries:
- PandasAI: Natural language queries on DataFrames
- Sketch: AI-powered data exploration
Modern Data Stack Evolution
From ETL to ELT:
- Extract → Load → Transform (in warehouse)
- Leverages warehouse compute power
- Faster time to insights
From Batch to Streaming:
- Real-time data processing
- Event-driven architectures
- Kafka, Flink, Spark Structured Streaming
From Monolithic to Modular:
- Best-of-breed tools
- Open standards (Iceberg, Arrow)
- Avoid vendor lock-in
Conclusion
The modern data & analytics landscape offers many choices:
Choose based on:
- Scale: How much data?
- Speed: Batch or real-time?
- Skills: What does your team know?
- Budget: Cloud or on-premise?
- Governance: How strict are requirements?
Remember:
- Start simple: Don’t over-engineer
- Iterate: Build, measure, improve
- Document: Future you will thank you
- Automate: Reduce manual work
- Monitor: Know what’s happening
Key Takeaways:
- SQL is fundamental across all platforms
- PySpark enables distributed processing
- Data modeling drives good architecture
- Choose tools that fit your use case
- Modern stacks favor modularity and open standards
The data landscape continues to evolve, but these fundamentals remain constant.