JAlcocerTech E-books

Advanced: Batch, Stream, and Derived Data

Derived data is data computed from other data: indexes, caches, search views, materialized views, analytics tables, recommendations, and denormalized read models.

The source data is the system of record. Derived data can usually be rebuilt, but only if the source history and transformation logic are available.

Batch Processing

Batch jobs process bounded datasets. They are useful for analytics, ETL, backfills, reindexing, and large-scale transformations.

The MapReduce model popularized a simple shape: map records into intermediate key-value pairs, shuffle by key, then reduce each group. Later dataflow engines generalize this shape by chaining many operators in a single job and optimizing the execution plan, instead of writing every intermediate result to storage.
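The map, shuffle, and reduce steps can be sketched in a few lines of single-process Python. This is only an illustration of the shape, not how a distributed engine is implemented; the function names (map_words, shuffle, reduce_count, map_reduce) and the sample input are assumptions made for this example.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit one (word, 1) pair per word in the record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all intermediate values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_count(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: collapse each group into a single result."""
    return key, sum(values)


def map_reduce(records: Iterable[str], mapper: Callable, reducer: Callable) -> dict:
    intermediate = (pair for record in records for pair in mapper(record))
    groups = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical bounded input: two lines of text.
counts = map_reduce(["a b a", "b c"], map_words, reduce_count)
print(counts)
```

In a real cluster the shuffle moves data between machines and each reducer handles a subset of keys; here a dictionary stands in for that step.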

Unix Philosophy

The book uses Unix tools as a model for composable data processing: programs read input, write output, and avoid hidden shared state. This keeps pipelines debuggable and lets intermediate results be inspected.

The same principle applies to data systems: explicit inputs and outputs are easier to reason about than hidden mutation.
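As a rough analogy, the same composition style can be written with Python generators: each stage takes an iterable and returns a new one, with no shared mutable state between stages. The stage names (grep, cut, uniq_count) only borrow the spirit of the Unix tools and are not faithful reimplementations; the log lines are invented for the example.

```python
from typing import Iterable, Iterator


def grep(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Keep only lines that contain the given substring."""
    return (line for line in lines if needle in line)


def cut(lines: Iterable[str], field: int, sep: str = " ") -> Iterator[str]:
    """Extract one separator-delimited field from each line."""
    for line in lines:
        yield line.split(sep)[field]


def uniq_count(items: Iterable[str]) -> list[tuple[str, int]]:
    """Count occurrences, most frequent first (ties broken by name)."""
    counts: dict[str, int] = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical access log: method, path, status.
log = [
    "GET /home 200",
    "GET /about 404",
    "POST /login 200",
    "GET /home 200",
    "GET /missing 404",
]

# Materializing an intermediate result makes it easy to inspect.
ok_lines = list(grep(log, " 200"))
top_paths = uniq_count(cut(ok_lines, 1))
print(ok_lines)
print(top_paths)
```

Because every stage has an explicit input and output, any stage can be swapped out or its output printed without touching the others.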

Stream Processing

Stream processors handle unbounded data as it arrives. A stream is often modeled as a log of events. Unlike batch jobs, which see their whole input before they start, stream processors must handle time, ordering, retries, late events, and partial progress.
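To make the time and late-event concerns concrete, here is a minimal sketch of event-time tumbling windows with a watermark. It assumes integer timestamps in seconds and a simple policy in which the watermark trails the largest event time seen by a fixed allowed lateness; the class name TumblingWindowCounter and its parameters are hypothetical, and real stream processors offer more elaborate policies.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per fixed-size event-time window (illustrative only)."""

    def __init__(self, size: int, allowed_lateness: int) -> None:
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.open_windows: dict[int, int] = defaultdict(int)  # window start -> count
        self.closed: dict[int, int] = {}                      # emitted results
        self.late: list[int] = []                             # events that missed their window
        self.max_event_time = 0

    def watermark(self) -> int:
        # Assumption: no event older than this is expected to arrive.
        return self.max_event_time - self.allowed_lateness

    def process(self, event_time: int) -> None:
        start = event_time - event_time % self.size
        if start + self.size <= self.watermark():
            # The window has already been finalized; set the event aside.
            self.late.append(event_time)
            return
        self.open_windows[start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every window that now ends at or before the watermark.
        for window_start in sorted(self.open_windows):
            if window_start + self.size <= self.watermark():
                self.closed[window_start] = self.open_windows.pop(window_start)


counter = TumblingWindowCounter(size=10, allowed_lateness=5)
for t in [1, 4, 12, 9, 17, 3, 26]:  # out-of-order arrival
    counter.process(t)

print(counter.closed, counter.late, dict(counter.open_windows))
```

In this run the event at time 9 arrives out of order but is still counted, while the event at time 3 arrives after its window was finalized and is recorded as late. The last window stays open: partial progress is the normal state of a stream job.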

Streams are useful for notifications, monitoring, fraud detection, real-time indexes, metrics, and event-driven workflows.

Logs as Integration Backbone

A durable append-only log can decouple producers from consumers. Producers publish facts; consumers build their own derived views. New consumers can replay history if the log is retained.
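The idea can be sketched with an in-memory list standing in for the durable log. Each consumer keeps its own offset and its own derived state, and a consumer added later simply replays from offset 0. The Log and Consumer classes and the account events are hypothetical names for this illustration, not the API of any particular broker.

```python
from typing import Callable


class Log:
    """Append-only sequence of events; entries are never modified."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, event: dict) -> int:
        self._entries.append(event)
        return len(self._entries) - 1  # offset of the new entry

    def read(self, offset: int) -> list[dict]:
        return self._entries[offset:]


class Consumer:
    """Builds a derived view by applying events in log order."""

    def __init__(self, log: Log, apply: Callable[[dict, dict], None]) -> None:
        self.log = log
        self.apply = apply
        self.offset = 0          # next log position to read
        self.state: dict = {}    # this consumer's own derived view

    def poll(self) -> None:
        for event in self.log.read(self.offset):
            self.apply(self.state, event)
            self.offset += 1


def apply_balance(state: dict, event: dict) -> None:
    state[event["account"]] = state.get(event["account"], 0) + event["amount"]


def apply_count(state: dict, event: dict) -> None:
    state[event["account"]] = state.get(event["account"], 0) + 1


log = Log()
balances = Consumer(log, apply_balance)

log.append({"account": "a", "amount": 10})
log.append({"account": "b", "amount": 5})
balances.poll()

log.append({"account": "a", "amount": -3})

# A consumer introduced later replays the retained history from the start.
counts = Consumer(log, apply_count)
balances.poll()
counts.poll()

print(balances.state, counts.state)
```

The producer never learns which consumers exist; adding the second view required no change to the code that appends events.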

This model turns data integration from “call all services synchronously” into “publish events and let consumers maintain state.”

Exactly-Once Processing

Exactly-once semantics are difficult because a failure can happen after a side effect is produced but before progress is recorded, so a retry repeats the side effect. Many practical systems approximate exactly-once behavior by combining idempotent operations, transactional writes, deterministic replay, and offset tracking.

The core question is not the marketing label but whether duplicates, retries, and crashes preserve the application’s invariants.

Practical Takeaway

For derived data pipelines:

  • keep the source of truth clear
  • make transformations deterministic where possible
  • design consumers to be idempotent
  • commit offsets and outputs atomically when correctness requires it
  • plan for replay and backfill
  • treat stream processing as incremental batch processing over an unbounded input