JAlcocerTech E-books

The Modern Data & Analytics Stack

A comprehensive recap of SQL, PySpark, data modeling, and the most popular big data stacks you’ll encounter in your Data & Analytics career.

Core Concepts for Data Analytics

Data Warehouse vs Data Lake vs Data Lakehouse

Data Warehouse (DWH):

  • Purpose: Structured data and business intelligence (BI) reporting
  • Data Structure: Schema-on-write (data transformed before loading)
  • Data Types: Primarily structured data
  • Use Cases: Traditional BI reporting, dashboards, analytical queries
  • Limitations: Limited flexibility with unstructured data, costly to scale

Data Lake:

  • Purpose: Store vast amounts of raw data in native format
  • Data Structure: Schema-on-read (structure applied when reading)
  • Data Types: Structured, semi-structured, and unstructured
  • Use Cases: Big data analytics, machine learning, data exploration
  • Challenges: Can become a “data swamp” without proper governance

Data Lakehouse:

  • Purpose: Combine best of data lakes and warehouses
  • Data Structure: Supports both schema-on-read and schema-on-write
  • Data Types: All types (structured, semi-structured, unstructured)
  • Key Features:
    • ACID transactions
    • Data versioning
    • Schema enforcement
    • Time travel capabilities
  • Technologies: Apache Iceberg, Delta Lake, Apache Hudi
  • Use Cases: BI, AI/ML, advanced analytics, real-time processing

Medallion Architecture:

The modern data lakehouse typically follows a medallion architecture:

Data Sources → Landing (raw JSON, as-is ingestion) → Bronze (Avro, structured raw with schema enforcement) → Silver (Parquet, optimized for analytics) → Gold (Parquet, business-ready aggregates)

  • Landing: Raw data as-is (JSON, CSV, XML)
  • Bronze: Structured raw with schema enforcement (Avro)
  • Silver: Cleaned and enriched data (Parquet)
  • Gold: Business-ready aggregates (Parquet with Delta Lake)
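
The layer-to-format conventions above can be sketched as a small mapping. This is only an illustration; the paths, layer names, and the `layer_path` helper are hypothetical conventions, not a standard API:

```python
# Hedged sketch: illustrative layer-to-format mapping for a medallion
# lakehouse. Paths and names are hypothetical conventions, not a standard.
MEDALLION_LAYERS = {
    "landing": {"format": "json",    "path": "lake/landing"},
    "bronze":  {"format": "avro",    "path": "lake/bronze"},
    "silver":  {"format": "parquet", "path": "lake/silver"},
    "gold":    {"format": "parquet", "path": "lake/gold"},
}

def layer_path(layer: str, table: str) -> str:
    """Build the storage path for a table in a given medallion layer."""
    cfg = MEDALLION_LAYERS[layer]
    return f"{cfg['path']}/{table}.{cfg['format']}"
```

Encoding the conventions once keeps every pipeline job reading from and writing to the same locations and formats.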

Data Processing Engines

Batch Processing:

  • Apache Hadoop MapReduce: Classic batch processing
  • Apache Spark (Spark SQL): Modern batch processing with SQL interface

Stream Processing:

  • Apache Kafka Streams: Stream processing on Kafka
  • Apache Flink: Powerful stream processing framework
  • Apache Spark Structured Streaming: Spark’s near-real-time (micro-batch) engine

Interactive Query Engines:

  • Apache Spark SQL: Interactive queries and batch processing
  • Trino (formerly Presto): Distributed SQL query engine
  • Apache Impala: MPP SQL query engine

Data Modeling Fundamentals

Key Concepts:

Data Profiling (understand as-is) → Data Modeling (design to-be) → Design Documentation → Data Model + Data Lineage + Data Quality Rules

Normalization vs Denormalization:

| Aspect     | Normalization            | Denormalization          |
| ---------- | ------------------------ | ------------------------ |
| Purpose    | Minimize redundancy      | Improve read performance |
| Best For   | WRITE operations (OLTP)  | READ operations (OLAP)   |
| Redundancy | Low                      | High                     |
| Integrity  | High                     | Requires maintenance     |
| Joins      | Many                     | Few                      |
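
The trade-off in the table can be shown concretely. A minimal sketch using Python's stdlib `sqlite3` (table and column names are illustrative): the normalized design joins at read time, while the denormalized wide table materializes the join once and repeats `city` on every order row:

```python
# Minimal sketch with sqlite3 (stdlib): normalized tables joined at read
# time vs. a denormalized wide table. Schema and data are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalized: customer attributes live in one place (low redundancy).
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Lima"), (2, "Quito")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 50.0), (11, 1, 30.0), (12, 2, 20.0)])

# Reads require a join (the normalization trade-off).
rows = cur.execute("""
    SELECT c.city, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.city ORDER BY c.city
""").fetchall()

# Denormalized: materialize the join once so reads need no join,
# at the cost of repeating 'city' on every order row.
cur.execute("""
    CREATE TABLE orders_wide AS
    SELECT o.id, c.city, o.amount
    FROM orders o JOIN customers c ON o.customer_id = c.id
""")
wide = cur.execute(
    "SELECT city, SUM(amount) FROM orders_wide GROUP BY city ORDER BY city"
).fetchall()
```

Both queries return the same answer; the denormalized table just pays storage and maintenance for faster reads.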

Table Structures:

  • Long Tables (Normalized):

    • Smaller file size
    • Flexible for updates
    • Complex for visualizations
    • Best for WRITE-heavy workflows
  • Wide Tables (Denormalized):

    • Faster queries
    • Pre-aggregated metrics
    • Increases redundancy
    • Best for READ-heavy workflows

Data Warehousing Concepts:

  • Fact Tables: Contain measures (numeric data, metrics)
  • Dimension Tables: Contain attributes for filtering/grouping
  • Star Schema: Denormalized dimensions, fewer joins
  • Snowflake Schema: Normalized dimensions, more joins
  • Granularity: Level of detail in fact tables
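
These concepts combine in a typical star-schema query: filter and group by dimension attributes, aggregate the fact measures. A minimal sketch in stdlib `sqlite3` with invented table names and data:

```python
# Illustrative star schema in sqlite3 (stdlib): one fact table with
# numeric measures joined to denormalized dimensions. Data is made up.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_id INTEGER, product_id INTEGER, revenue REAL);
    INSERT INTO dim_date VALUES (1, 2024), (2, 2025);
    INSERT INTO dim_product VALUES (100, 'books'), (200, 'games');
    INSERT INTO fact_sales VALUES (1, 100, 10.0), (1, 200, 5.0), (2, 100, 7.5);
""")

# Typical star-schema query: group by dimension attributes,
# aggregate the fact measures.
result = cur.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
""").fetchall()
```

Note the granularity here: one fact row per date/product combination; finer grain (e.g. per transaction) would mean more rows and more flexible analysis.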

Essential Tools

SQL

SQL is fundamental across all data platforms, though implementations vary:

Common Variants:

  • MySQL
  • PostgreSQL
  • Trino SQL
  • BigQuery SQL
  • Snowflake SQL

Key Concepts:

Primary Keys, Foreign Keys, Indexes:

  • Primary Key: Uniquely identifies each row
  • Foreign Key: References primary key in another table
  • Indexes: Improve query performance
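
All three concepts can be seen in a few lines of stdlib `sqlite3` (table names are illustrative; note SQLite enforces foreign keys only when the pragma is enabled):

```python
# Sketch of primary key, foreign key, and index with sqlite3 (stdlib).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,                      -- uniquely identifies each row
        customer_id INTEGER REFERENCES customers(id) -- foreign key
    )
""")
# Index to speed up lookups by customer.
con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

con.execute("INSERT INTO customers VALUES (1, 'Ada')")
con.execute("INSERT INTO orders VALUES (10, 1)")

# The foreign key rejects orders pointing at a non-existent customer.
try:
    con.execute("INSERT INTO orders VALUES (11, 999)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```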

Normalization Levels:

  • 1NF: Atomic values, no repeating groups
  • 2NF: No partial dependencies
  • 3NF: No transitive dependencies
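
A transitive dependency (the 3NF violation) is easiest to see in data. In the plain-Python sketch below (field names and rows are made up), `city` depends on `zip_code`, which depends on the key, so 3NF moves the zip-to-city fact into its own table:

```python
# Plain-Python illustration of removing a transitive dependency (3NF).
# In the flat rows, city depends on zip_code, not directly on order_id.
flat = [
    {"order_id": 1, "zip_code": "10001", "city": "New York"},
    {"order_id": 2, "zip_code": "10001", "city": "New York"},
    {"order_id": 3, "zip_code": "94105", "city": "San Francisco"},
]

# 3NF: store each zip->city fact exactly once, in its own table.
orders = [{"order_id": r["order_id"], "zip_code": r["zip_code"]} for r in flat]
zips = {r["zip_code"]: r["city"] for r in flat}

# Reconstructing the original rows is a simple lookup (a join).
rebuilt = [{**o, "city": zips[o["zip_code"]]} for o in orders]
```

After the split, renaming a city is a one-row update instead of an update to every order in that zip code.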

Star vs Snowflake Schema:

  • Star: Denormalized dimensions (simpler queries)
  • Snowflake: Normalized dimensions (less redundancy)

PySpark

PySpark is Python’s interface to Apache Spark for distributed computing.

Core Concepts:

RDD vs DataFrame:

| Feature     | RDD                      | DataFrame              |
| ----------- | ------------------------ | ---------------------- |
| Structure   | Unstructured collection  | Structured with schema |
| Optimization| Manual                   | Catalyst Optimizer     |
| Ease of Use | Complex                  | SQL-like API           |
| Type Safety | Compile-time (Scala API) | Runtime                |

File Formats:

  • JSON: Landing/ingestion layer
  • Avro: Bronze layer (schema enforcement, write-heavy)
  • Parquet: Silver/Gold layers (columnar, read-optimized)
  • Delta Lake: Transactional layer on Parquet

Two Methods for Data Manipulation:

1. DataFrame API (Pythonic):

from pyspark.sql import functions as F

df = spark.read.parquet("data.parquet") \
    .filter(F.col("age") > 25) \
    .groupBy("city") \
    .agg(F.count("*").alias("count"))

2. SQL Queries:

df.createOrReplaceTempView("people")
result = spark.sql("""
    SELECT city, COUNT(*) as count
    FROM people
    WHERE age > 25
    GROUP BY city
""")

Key Features:

  • Lazy evaluation (optimizes execution plan)
  • Distributed processing
  • In-memory computing
  • Fault tolerance
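
Lazy evaluation is the least intuitive of these. A pure-Python analogy (this is not Spark; generators defer work the way Spark defers transformations until an action such as `collect()` or `count()`):

```python
# Pure-Python ANALOGY for Spark's lazy evaluation, using generators.
# Building the pipeline records no work; consuming it triggers everything.
log = []

def read(rows):
    for r in rows:
        log.append(("read", r))
        yield r

def keep_over_25(rows):
    for r in rows:
        log.append(("filter", r))
        if r > 25:
            yield r

pipeline = keep_over_25(read([10, 30, 40]))  # nothing has executed yet
assert log == []                             # lazy: no work done so far

result = list(pipeline)                      # the "action" triggers execution
```

In Spark, this deferral is what lets the Catalyst Optimizer see the whole plan (e.g. push filters down) before any data is touched.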

Databricks

Databricks is a unified analytics platform built on Apache Spark.

Advantages over Local Spark:

  • Managed Spark environment
  • Scalability and elasticity
  • Collaborative notebooks
  • Integrated tools (MLflow, Delta Lake)
  • Performance optimization
  • Auto-termination (cost savings)

Use Cases:

  • Large-scale data engineering
  • Machine learning workflows
  • Real-time analytics
  • Collaborative data science

Big Data Tech Stacks

1. Google Cloud Platform (GCP) Stack

Fully managed, serverless, integrated stack.

| Layer          | Component           | Users          | Purpose              |
| -------------- | ------------------- | -------------- | -------------------- |
| Data Lake      | Cloud Storage (GCS) | Data Engineers | Raw data storage     |
| Transformation | Dataform            | Data Engineers | SQLX transformations |
| Orchestration  | Cloud Composer      | Data Engineers | Workflow management  |
| Data Warehouse | BigQuery            | All            | Serverless analytics |
| BI             | Looker              | Analysts       | LookML semantic layer|

2. On-Premise Stack

Classic open-source stack with maximum control.

| Layer          | Component          | Users          | Purpose                 |
| -------------- | ------------------ | -------------- | ----------------------- |
| Data Lake      | HDFS               | Data Engineers | Distributed file system |
| Transformation | PySpark            | Data Engineers | Distributed processing  |
| Orchestration  | Airflow            | Data Engineers | Workflow scheduling     |
| Monitoring     | Graphite + Grafana | DevOps         | Metrics and dashboards  |

3. Open-Source Cloud/Hybrid Stack

Best-of-both-worlds approach.

| Layer          | Component       | Users               | Purpose              |
| -------------- | --------------- | ------------------- | -------------------- |
| Data Lake      | MinIO/S3        | Data Engineers      | Object storage       |
| Data Warehouse | Snowflake       | All                 | Cloud data warehouse |
| Transformation | dbt             | Analytics Engineers | SQL transformations  |
| Orchestration  | Airflow         | Analytics Engineers | Workflow management  |
| BI             | Superset/Redash | Analysts            | Open-source BI       |

4. Modern Data Lakehouse (Nessie-Based)

Advanced open-source lakehouse with version control.

| Layer        | Component       | Users          | Purpose             |
| ------------ | --------------- | -------------- | ------------------- |
| Data Lake    | MinIO/S3        | Data Engineers | Scalable storage    |
| Table Format | Apache Iceberg  | Data Engineers | ACID transactions   |
| Data Catalog | Project Nessie  | All            | Git-like versioning |
| Query Engine | Trino/PySpark   | All            | Distributed queries |
| BI           | Superset/Redash | Analysts       | Visualization       |

Key Features:

  • Decoupled compute and storage
  • Version control for data
  • No vendor lock-in
  • Portable across clouds

Data Engineering Tools

Workflow Orchestration

Apache Airflow:

  • DAG-based workflows
  • Rich UI for monitoring
  • Extensive integrations
  • Self-managed or cloud (Cloud Composer)

Jenkins:

  • CI/CD automation
  • Plugin ecosystem
  • Self-managed
  • More code-focused than data-focused

Comparison:

| Feature    | Airflow        | Jenkins       | Cloud Composer  |
| ---------- | -------------- | ------------- | --------------- |
| Purpose    | Data workflows | Software CI/CD| Managed Airflow |
| Management | Self/Cloud     | Self          | Fully managed   |
| Focus      | ETL/ML         | Build/Deploy  | Data pipelines  |

Transformation Tools

dbt (data build tool):

  • SQL-based transformations
  • Modular table logic
  • Built-in testing
  • Version control integration
  • Works with Snowflake, BigQuery, Redshift

Dataform:

  • SQLX (SQL + JavaScript)
  • Dependency management
  • Native GCP integration
  • Similar to dbt but GCP-focused

Azure Data Factory (ADF):

  • Visual, low-code ETL
  • Data movement orchestration
  • Azure-native
  • Can trigger dbt jobs

BI & Visualization Tools

Self-Hosted Options:

  • Apache Superset: Enterprise-ready, 40+ chart types
  • Metabase: User-friendly, self-service
  • Redash: SQL-focused, 35+ data sources
  • Grafana: Monitoring and dashboards

Cloud Options:

  • Looker (GCP): LookML semantic layer
  • Power BI (Microsoft): Excel-like interface
  • Tableau (Salesforce): Advanced visualizations

Data Catalog & Governance

Apache Iceberg:

  • Open table format
  • ACID transactions
  • Time travel
  • Schema evolution

Project Nessie:

  • Git-like data catalog
  • Branch and merge data
  • Version control
  • Multi-table transactions

Relationship:

  • Iceberg: Table format (single-table versioning)
  • Nessie: Catalog (multi-table version control)

Best Practices

Data Quality

  1. Profile your data (understand as-is)
  2. Model your data (design to-be)
  3. Document everything (data lineage, quality rules)
  4. Implement validation gates (Bronze → Silver → Gold)
  5. Monitor continuously (track metrics, alert on failures)
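
Step 4 (validation gates) can be sketched in a few lines. The rules, field names, and the `promote_to_silver` helper below are hypothetical; the point is that rows failing quality rules stay behind for inspection instead of being silently promoted:

```python
# Minimal sketch of a Bronze -> Silver validation gate. Rules and field
# names are hypothetical; failing rows are kept for inspection.
def promote_to_silver(bronze_rows):
    rules = [
        lambda r: r.get("id") is not None,
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ]
    silver, rejected = [], []
    for row in bronze_rows:
        (silver if all(rule(row) for rule in rules) else rejected).append(row)
    return silver, rejected

silver, rejected = promote_to_silver([
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},   # fails: missing id
    {"id": 2, "amount": -3.0},     # fails: negative amount
])
```

Counting the rejected rows per run is exactly the kind of metric step 5 (continuous monitoring) should track and alert on.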

Performance Optimization

For SQL:

  • Filter early in queries
  • Use partitioning and clustering
  • Avoid SELECT *
  • Leverage indexes

For PySpark:

  • Use DataFrame API for optimization
  • Partition data appropriately
  • Cache intermediate results
  • Broadcast small tables in joins

Cost Management

  1. Right-size resources (don’t over-provision)
  2. Use auto-termination (stop idle clusters)
  3. Leverage spot/preemptible instances
  4. Monitor query costs (set budgets and alerts)
  5. Optimize storage (compress, partition, lifecycle policies)

AI-Assisted Analytics

Code Assistants:

  • GitHub Copilot
  • Google Code Assist (Gemini)
  • Windsurf (Codeium)
  • Tabby (self-hosted)

Python Libraries:

  • PandasAI: Natural language queries on DataFrames
  • Sketch: AI-powered data exploration

Modern Data Stack Evolution

From ETL to ELT:

  • Extract → Load → Transform (in warehouse)
  • Leverages warehouse compute power
  • Faster time to insights
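
The ELT pattern can be sketched with stdlib `sqlite3` standing in for the warehouse (table names, the messy sample rows, and the cleaning rule are all illustrative): raw data lands untouched, and cleaning happens afterwards with SQL inside the warehouse:

```python
# Sketch of ELT with sqlite3 (stdlib) standing in for a warehouse:
# load raw rows first, then transform in-warehouse with SQL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")

# Extract + Load: raw data lands as-is, even with messy types/casing.
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", [
    (1, "10.50", "paid"),
    (2, "3.25", "PAID"),
    (3, "bad", "refunded"),
])

# Transform: cleaning runs on the warehouse's own SQL engine.
con.execute("""
    CREATE TABLE orders_clean AS
    SELECT id, CAST(amount AS REAL) AS amount, LOWER(status) AS status
    FROM raw_orders
    WHERE amount GLOB '[0-9]*.[0-9]*'
""")
clean = con.execute(
    "SELECT id, amount, status FROM orders_clean ORDER BY id"
).fetchall()
```

Because the raw table is preserved, the transformation can be re-run or revised without re-extracting from the source, which is the main operational win of ELT over ETL.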

From Batch to Streaming:

  • Real-time data processing
  • Event-driven architectures
  • Kafka, Flink, Spark Streaming

From Monolithic to Modular:

  • Best-of-breed tools
  • Open standards (Iceberg, Arrow)
  • Avoid vendor lock-in

Conclusion

The modern data & analytics landscape offers many choices:

Choose based on:

  1. Scale: How much data?
  2. Speed: Batch or real-time?
  3. Skills: What does your team know?
  4. Budget: Cloud or on-premise?
  5. Governance: How strict are requirements?

Remember:

  • Start simple: Don’t over-engineer
  • Iterate: Build, measure, improve
  • Document: Future you will thank you
  • Automate: Reduce manual work
  • Monitor: Know what’s happening

Key Takeaways:

  • SQL is fundamental across all platforms
  • PySpark enables distributed processing
  • Data modeling drives good architecture
  • Choose tools that fit your use case
  • Modern stacks favor modularity and open standards

The data landscape continues to evolve, but these fundamentals remain constant.