Advanced: Reliability, Scalability, Maintainability

This is a compact, concept-oriented distillation of Martin Kleppmann’s Designing Data-Intensive Applications.

It is meant as a working reference for system design, not a replacement for the book.

The Core Idea

Data-intensive applications are shaped less by CPU cycles and more by the movement, storage, transformation, and coordination of data. The hard questions are usually about correctness, durability, latency, operability, and how the system behaves when parts of it fail.

The book’s central habit is to ask: what guarantees does this system actually provide, what assumptions does it rely on, and what tradeoff is being made?

Three Design Goals

Reliability

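The system should continue to work correctly despite faults. Faults include crashed processes, bad networks, full disks, overloaded dependencies, bugs, operator mistakes, and corrupt input. A fault is one component deviating from its specification; a failure is the system as a whole no longer providing the required service. Reliability is not the absence of faults; it is tolerating faults so that they do not become failures.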
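
One common way to tolerate a transient fault is to retry with bounded attempts and exponential backoff. The sketch below is a minimal illustration, not a production client: TransientFault, call_with_retries, and make_flaky are hypothetical names introduced here, and the delay values are arbitrary assumptions.

```python
import random
import time


class TransientFault(Exception):
    """Hypothetical marker for faults worth retrying (e.g. a timeout)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying transient faults with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == max_attempts:
                raise  # out of attempts: the fault surfaces to the caller
            # Random delay up to base_delay * 2^(attempt - 1), so that many
            # clients retrying at once do not all hit the dependency together.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            sleep(delay)


def make_flaky(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientFault("simulated timeout")
        return "ok"

    return operation, state
```

Retrying is only safe when the operation can be repeated without changing the outcome, and unbounded retries can make an overloaded dependency worse, which is why the attempt count is capped.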

Scalability

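Scalability means preserving acceptable performance as load grows. It requires describing load with concrete parameters (for example requests per second, read/write ratio, or fan-out per user), measuring response-time percentiles such as the median, p95, and p99 rather than averages alone, and choosing architecture based on actual bottlenecks rather than generic “scale” claims.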
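
To make the percentile point concrete, here is a small sketch that computes response-time percentiles with the nearest-rank method. The function names and the sample latencies are made up for illustration; real monitoring systems typically use streaming approximations rather than sorting every sample.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank, rounded up (ceiling division without floats for integer p).
    rank = -(-p * len(ordered) // 100)
    return ordered[int(rank) - 1]


def summarize(latencies_ms):
    """Report the percentiles most often used for response-time targets."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical sample: 98 fast requests and two slow outliers.
latencies_ms = [10] * 98 + [1000, 2000]
summary = summarize(latencies_ms)
mean_ms = sum(latencies_ms) / len(latencies_ms)
```

In this sample the median and p95 are both 10 ms and the mean is 39.8 ms, while p99 is 1000 ms. The tail is invisible in the first two and blurred in the third, which is why tail percentiles are tracked separately.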

Maintainability

Systems live longer than first designs. Maintainable systems are easier to understand, operate, evolve, and debug. Simplicity, observability, and explicit guarantees are engineering features.

The Big Map

Data models define how applications see and query information.
Storage engines decide how bytes are laid out and retrieved.
Encoding formats let systems evolve without lockstep deployments.
Replication and partitioning spread data across machines.
Transactions and consistency models define what concurrent users can rely on.
Distributed systems force explicit thinking about delay, failure, clocks, and consensus.
Batch and stream processing build derived views from source-of-truth data.
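
As a rough illustration of the last item, the sketch below treats an append-only event log as the source of truth and account balances as a derived view. The event shape, account names, and function names are assumptions made for this example only.

```python
from collections import defaultdict

# Append-only event log: the hypothetical source of truth.
EVENTS = [
    {"account": "alice", "type": "deposit", "amount": 100},
    {"account": "bob", "type": "deposit", "amount": 50},
    {"account": "alice", "type": "withdrawal", "amount": 30},
]


def apply_event(balances, event):
    """Fold one event into the derived view (stream style: one event at a time)."""
    if event["type"] == "deposit":
        balances[event["account"]] += event["amount"]
    elif event["type"] == "withdrawal":
        balances[event["account"]] -= event["amount"]
    else:
        raise ValueError(f"unknown event type: {event['type']}")


def rebuild_balances(events):
    """Batch style: discard the old view and recompute it from the full log."""
    balances = defaultdict(int)
    for event in events:
        apply_event(balances, event)
    return dict(balances)


# Stream style: keep a live view updated as each event arrives.
live_view = defaultdict(int)
for event in EVENTS:
    apply_event(live_view, event)
```

Because both paths apply the same function to the same log, the live view can be thrown away and rebuilt at any time, which is the property that distinguishes derived data from the source of truth.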

Recurring Questions

What is the source of truth?
Which data is derived and can be rebuilt?
What happens if this request is retried?
What happens if this node pauses, crashes, or is partitioned?
Is the guarantee local to one node, one partition, or the whole system?
When they conflict, which does the system prioritize: availability, latency, freshness, or strong correctness?
Can old and new versions of code/data coexist safely?
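
The retry question above is often answered with idempotency keys: the client attaches a unique key to each logical request, and the server applies the effect at most once per key. The following is a minimal in-memory sketch under that assumption; PaymentService and its method names are hypothetical, and a real service would need to persist the key and the state change atomically.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> result of the first attempt

    def deposit(self, idempotency_key, amount):
        # A retried request returns the stored result instead of applying
        # the deposit a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance += amount
        result = {"status": "applied", "balance": self.balance}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.deposit("req-1", 100)
retry = service.deposit("req-1", 100)  # same key: client retried after a timeout
other = service.deposit("req-2", 50)   # new key: a genuinely new request
```

Here the retry returns the same result as the first attempt and the balance ends at 150, not 250. Without the key, the server could not tell a retry apart from a second deposit.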