Advanced: Distributed Systems, Failure, and Consensus
Distributed systems are hard because communication is unreliable, time is uncertain, and failures are partial. One node may continue running while another cannot tell whether it is slow, disconnected, crashed, or merely delayed.
Networks
A network can drop, delay, duplicate, or reorder messages. A timeout does not prove that a remote operation failed; it only proves that no response arrived in time.
This uncertainty forces systems to design for retries, duplicate requests, idempotency, and ambiguity.
Clocks
Physical clocks are useful but dangerous. They can drift, jump, or disagree across machines. Clock synchronization improves matters but does not make distributed machines share a perfect notion of time.
Logical clocks and version vectors can capture ordering relationships without pretending wall-clock time is authoritative.
Process Pauses
A node can pause for garbage collection, overload, paging, scheduling, or stop-the-world maintenance. Other nodes may declare it dead even though it later resumes.
This is why leases, locks, and leadership based only on local time can be unsafe unless carefully designed.
Fault Models
Crash-stop systems assume failed nodes stop forever. Crash-recovery systems assume nodes may come back. Byzantine systems assume nodes may lie or behave maliciously.
Most data systems assume crash-recovery and non-Byzantine behavior, but even that model is difficult because delays and partitions are normal.
Consensus
Consensus lets nodes agree on one value despite failures. It underpins leader election, metadata changes, distributed locks, and replicated logs.
Algorithms such as Paxos, Raft, Zab, and Viewstamped Replication are different ways of implementing the same kind of core guarantee: a group chooses a sequence of decisions safely, even when some nodes fail.
Two-Phase Commit
Two-phase commit coordinates atomic commit across participants, but it can block if the coordinator fails at the wrong time. It solves atomicity, not general consensus, and it creates operational coupling between participants.
Practical Takeaway
Distributed correctness requires explicit assumptions:
- timeouts are guesses, not proof
- retries can duplicate side effects
- clocks cannot be blindly trusted for ordering
- locks need fencing tokens or equivalent protection
- consensus is expensive but necessary for some decisions
- avoid distributed transactions unless the invariant truly requires them