Raft — Core Mechanisms
| Item | Description |
Leader Election | One leader handles all client requests. Term numbers increase monotonically. Candidate requests votes; majority wins. |
Log Replication | Leader appends entries to its log, sends AppendEntries RPCs to followers. Entry is committed once replicated to a majority. |
Safety | Leader never overwrites committed entries. At most one leader per term. Election restriction: candidate's log must be at least as up-to-date as voter's. |
Membership Changes | Joint consensus: transition through intermediate configuration that includes both old and new servers. Two-phase approach prevents split-brain. |
Log Compaction | Snapshotting: periodically write entire state to disk, discard old log entries. InstallSnapshot RPC for slow followers. |
Heartbeat Timer | Election timeout randomized (150-300 ms typical). Leader sends heartbeats every `timeout/2` to prevent spurious elections. |
Paxos — The Classic
| Item | Description |
Basic Paxos | Three roles: Proposers, Acceptors, Learners. Two-phase protocol: Prepare/Promise, then Accept/Accepted. Quorum size = majority (2f+1 for f failures). |
Multi-Paxos | Elect distinguished proposer (leader) to skip Prepare phase for most commands. Leader issues Accept directly with monotonically increasing instance IDs. |
Phase 1: Prepare | Proposer picks proposal number `n` higher than any it has used. Sends Prepare(n) to all acceptors. Each acceptor promises to reject proposals < n. |
Phase 2: Accept | If proposer receives promises from majority, sends Accept(n, value). Value is either its own or the highest-numbered value from Promise replies. |
Liveness vs Safety | Paxos guarantees safety (no two values chosen for same instance) even with failures. Liveness requires distinguished leader to prevent dueling proposers. |
Practical Variants | Cheap Paxos (fewer acceptors), Fast Paxos (bypass leader for uncontended writes), EPaxos (leaderless, low-latency across WAN). |
Consensus Algorithm Comparison
| Algorithm | Fault Model | Leader? | Performance | Complexity | Use Cases |
| Raft | Crash-fault | Yes — strong | High throughput | Moderate | etcd, Consul, TiKV, CockroachDB |
| Multi-Paxos | Crash-fault | Yes — weak | High throughput | High | Chubby, Spanner, Google File System |
| Basic Paxos | Crash-fault | No | Low (2 rounds) | High | Theoretical foundation, teaching |
| PBFT | Byzantine (f < n/3) | Rotating | Moderate (3 phases) | Very High | Hyperledger Fabric, Zyzzyva |
| Viewstamped Replication | Crash-fault | Yes | Moderate | Moderate | NuoDB, academic systems |
| EPaxos | Crash-fault | No (leaderless) | Low latency WAN | High | Geo-distributed databases |
Byzantine Fault Tolerance
| Item | Description |
PBFT (Practical BFT) | Tolerates up to f < n/3 Byzantine nodes. Three phases per request: Pre-Prepare, Prepare, Commit. Each requires 2f+1 matching messages. |
View Change | If primary is suspected faulty, replicas initiate view change. New primary = v+1 mod n. Must carry over prepared requests to new view. |
Tendermint / HotStuff | Modern BFT: leaders propose blocks, validators vote. Pipelined consensus for high throughput. Used by Cosmos blockchain and Facebook's Libra/Diem. |
Blockchain Consensus | Nakamoto consensus (PoW): probabilistic finality, open membership. PoS (Ethereum 2.0): validators stake, slashing conditions for safety. |
BFT Limitations | 3x replication minimum. All-to-all communication O(n²) per round. Latency-sensitive — single slow node delays the whole system. |
Asynchronous BFT | FLP impossibility: deterministic consensus impossible with even one faulty node in async model. Practical systems use partial synchrony (eventual message delivery). |
Pro Tip: Raft was designed for understandability — if you're learning consensus from scratch, start there. Paxos is the theoretical foundation, Raft is the practical implementation.