Fault Tolerance Patterns Cheat Sheet

Fault tolerance patterns for highly available systems — redundancy, retry with backoff, bulkheads, timeouts, failover, load shedding, graceful degradation, and ch.

Last Updated: May 1, 2025

Core Resilience Patterns

Pattern	What It Does	Failure Mode It Handles	Trade-off
Retry	Retry failed operation, often with backoff	Transient failures (network blip, database deadlock)	Adds latency; can overwhelm the downstream if unbounded
Circuit Breaker	Stop calling failing service; fast-fail	Persistent downstream failures	Rejects valid requests during OPEN state
Bulkhead	Isolate resources per-service (thread pools, connections)	One slow service saturating all resources	More idle resources; higher infrastructure cost
Timeout	Abort operation after N seconds	Slow/unresponsive downstream	Setting too low = false failures; too high = resource exhaustion
Rate Limiter	Reject requests above threshold (per-user/per-service)	Traffic spikes, abuse, noisy neighbor	Preserves goodput under the limit; rejected requests can mean lost revenue
Load Shedding	Drop low-priority work when system overloaded	System overload, cascading failures	Degraded experience for non-critical features
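
The Circuit Breaker row can be sketched as a small state machine. This is a minimal, single-threaded illustration, not a production implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example, and the HALF_OPEN state here admits any number of trial calls rather than exactly one.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is OPEN."""


class CircuitBreaker:
    # Hypothetical parameters: consecutive failures before opening, and
    # seconds to wait in OPEN before allowing a trial call (HALF_OPEN).
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is CLOSED

    @property
    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            # Fast-fail: do not touch the failing downstream at all.
            raise CircuitOpenError("downstream marked unhealthy; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call (HALF_OPEN) or too many consecutive
            # failures (CLOSED) opens the breaker and restarts the timer.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each downstream client call as `breaker.call(client.get, url)` keeps the trade-off from the table visible: while the breaker is OPEN, requests that might have succeeded are rejected too.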

Retry Strategies

Item	Description
`Exponential Backoff`	Wait t = min(cap, base * 2^attempt). Base=100ms, cap=30s. Wait times: 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s... Always add jitter (±25%).
`Jitter`	Randomize retry delays to prevent thundering herd. Full jitter: sleep(random(0, min(cap, base * 2^attempt))). Decorrelated jitter: sleep(min(cap, random(base, prev*3))).
`Retry Budget`	Percentage of total requests allowed as retries (e.g., 10%). If retry ratio exceeds budget → stop retrying, fail fast. Prevents retry storms from cascading.
`Idempotent Retries Required`	Retry without idempotency = duplicate operations. Every retry-able endpoint MUST be idempotent. Use Idempotency-Key header or natural idempotency (PUT).
`Max Retries`	3 is a common default: the AWS SDKs' standard retry mode allows 3 total attempts. More than 5 rarely helps; if a service is down for 30s+, retries just add load to an already-failing system.
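
As a rough sketch of how the backoff formula, jitter, and a retry cap fit together, the helper below retries a callable with capped exponential backoff and full jitter (a uniform draw between 0 and the capped exponential delay). The function names, the default `retry_on` exception types, and the injectable `sleep` are assumptions for illustration, and `fn` is assumed to be idempotent.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0):
    """Un-jittered delay in seconds: min(cap, base * 2^attempt)."""
    return min(cap, base * (2 ** attempt))


def full_jitter_delay(attempt, base=0.1, cap=30.0, rng=random):
    """Full jitter: uniform draw between 0 and the capped exponential delay."""
    return rng.uniform(0, backoff_delay(attempt, base, cap))


def retry(fn, max_retries=3, base=0.1, cap=30.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(); on a retryable error, back off and try again.

    Makes at most max_retries + 1 attempts in total. Exceptions not listed
    in retry_on propagate immediately, since only transient failures are
    worth retrying.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(full_jitter_delay(attempt, base, cap))
```

With `base=0.1` and `cap=30.0`, `backoff_delay` reproduces the un-jittered sequence from the table (100ms, 200ms, 400ms, ...) and flattens out at 30s. A retry budget would sit one level above this helper, counting retries across all callers.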

Bulkhead & Timeout Patterns

Item	Description
`Thread Pool Bulkhead`	Assign separate thread pools per downstream service. Service A gets 20 threads, Service B gets 10. If Service B hangs, it only exhausts its own pool, not A's.
`Connection Pool Bulkhead`	Separate connection pools per database/shard. Write-heavy DB has 50 connections, read-replica has 100. A slow write query can't starve reads.
`Semaphore Bulkhead`	Limit concurrent calls to N. Simpler than thread pools. Resilience4j: maxConcurrentCalls=10. Fast fail when limit hit — no queuing.
`Timeout Hierarchy`	Total request timeout > upstream timeout > downstream timeout. Example: API gateway (30s) > service A (20s) > database (10s). Prevents hanging resources.
`Deadlines (gRPC)`	Propagate deadline across service calls. Client sets timeout=5s → sent on the wire via grpc-timeout header. Every service in the call chain should pass the remaining time downstream (automatic in some gRPC implementations, manual in others). No zombie requests.
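
The semaphore bulkhead can be illustrated in a few lines. This is a sketch under stated assumptions, not Resilience4j's implementation: the class and exception names are hypothetical, and only the `max_concurrent_calls` idea is borrowed from the table. A non-blocking acquire gives the fast-fail, no-queuing behavior described above.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the concurrency limit is reached and the call is rejected."""


class SemaphoreBulkhead:
    # max_concurrent_calls mirrors the idea of Resilience4j's
    # maxConcurrentCalls; everything else here is illustrative.
    def __init__(self, max_concurrent_calls=10):
        self._sem = threading.BoundedSemaphore(max_concurrent_calls)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queuing.
        if not self._sem.acquire(blocking=False):
            raise BulkheadFullError("concurrency limit reached")
        try:
            return fn(*args, **kwargs)
        finally:
            # Always free the slot, even if fn raised.
            self._sem.release()
```

Creating one `SemaphoreBulkhead` per downstream service gives the isolation described for thread pools: a hung Service B can hold at most its own slots, while calls to Service A go through a separate limiter.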

Failover & Redundancy

Item	Description
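`Active-Passive`	Primary handles all traffic. Standby serves no traffic but is kept in sync (hot). On failure, standby is promoted. Failover takes seconds to minutes. AWS RDS Multi-AZ: synchronous replication to standby, with failover typically completing in one to two minutes.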
`Active-Active`	All instances serve traffic simultaneously. Load balanced. If one fails, others absorb its load. Requires careful design for data consistency (conflict resolution).
`Leader Election`	Use consensus (Raft/Paxos) to elect a single leader for write operations. On leader failure, new election within election-timeout (150-300ms typical).
`Geo-Redundancy`	Deploy across regions. DNS failover (Route53, Cloudflare LB). Active-active cross-region with geo-routed traffic. RTO: minutes. Trade-off: data consistency across regions.
`Health Checks`	Active: ping /health every 5s. Passive: observe error rates on actual requests. Combine both: active for liveness, passive for quality. Unhealthy → remove from load balancer.
`Chaos Engineering`	Inject failures intentionally: terminate instances (Chaos Monkey), add latency (Latency Monkey), take a whole region offline (Chaos Kong), corrupt packets. Find unknown failure modes BEFORE they find you in production.
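
Combining active and passive health checks, as the Health Checks row suggests, might look like the sketch below. The class name, the sliding-window size, and the 50% error-rate threshold are illustrative assumptions. The active probe result stands in for liveness, and the recent outcomes of real requests stand in for quality.

```python
from collections import deque


class BackendHealth:
    # window and max_error_rate are hypothetical tuning knobs.
    def __init__(self, window=20, max_error_rate=0.5):
        self.alive = True                     # last active probe result (liveness)
        self.outcomes = deque(maxlen=window)  # recent real-request results (quality)
        self.max_error_rate = max_error_rate

    def record_probe(self, ok):
        """Active check: result of the latest /health ping."""
        self.alive = ok

    def record_request(self, ok):
        """Passive check: outcome of an actual request served by this backend."""
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def healthy(self):
        # Must pass both: the probe succeeded AND real traffic is mostly succeeding.
        return self.alive and self.error_rate() <= self.max_error_rate


def in_rotation(backends):
    """Names of backends the load balancer should keep sending traffic to."""
    return [name for name, health in backends.items() if health.healthy()]
```

A backend that answers `/health` but fails most real requests drops out of `in_rotation` through the passive signal, which an active ping alone would miss.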

Pro Tip: Everything fails eventually. The question isn't if a component will fail — it's how your system behaves when it does. Design for failure modes, not happy paths.