Fault Tolerance Patterns Cheat Sheet

Fault tolerance patterns for highly available systems: redundancy, retry with backoff, bulkheads, timeouts, failover, load shedding, graceful degradation, and chaos engineering.

Last Updated: May 1, 2025

Core Resilience Patterns

| Pattern | What It Does | Failure Mode It Handles | Trade-off |
| --- | --- | --- | --- |
| Retry | Retry failed operation, often with backoff | Transient failures (network blip, deadlock retry) | Adds latency; can overwhelm if unbounded |
| Circuit Breaker | Stop calling failing service; fast-fail | Persistent downstream failures | Rejects valid requests during OPEN state |
| Bulkhead | Isolate resources per-service (thread pools, connections) | One slow service saturating all resources | More idle resources; higher infrastructure cost |
| Timeout | Abort operation after N seconds | Slow/unresponsive downstream | Setting too low = false failures; too high = resource exhaustion |
| Rate Limiter | Reject requests above threshold (per-user/per-service) | Traffic spikes, abuse, noisy neighbor | Goodput under limit; rejected requests are lost revenue |
| Load Shedding | Drop low-priority work when system overloaded | System overload, cascading failures | Degraded experience for non-critical features |
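
The circuit breaker entry above can be made concrete with a small state machine. What follows is a minimal sketch, not any particular library's implementation: the names `CircuitBreaker` and `CircuitOpenError`, the consecutive-failure threshold, and the injectable `clock` are all illustrative assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Hypothetical breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay OPEN before a trial call
        self.clock = clock                          # injectable so tests need not sleep
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fast-fail: the downstream is never contacted.
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "HALF_OPEN"  # timeout elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.state = "CLOSED"
        return result
```

Counting consecutive failures is the simplest policy; a production breaker would more likely track an error rate over a sliding window and would need locking for concurrent callers.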

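Load shedding from the table above can be sketched as an admission check that reserves part of the capacity for critical work. This is one possible policy under stated assumptions: the `LoadShedder` class, the two priority levels, and the `critical_reserve` parameter are hypothetical, and the counter is not thread-safe.

```python
class LoadShedder:
    """Hypothetical admission controller: sheds non-critical work first."""

    def __init__(self, capacity, critical_reserve):
        self.capacity = capacity                  # max requests in flight overall
        self.critical_reserve = critical_reserve  # slots only critical work may use
        self.in_flight = 0

    def try_admit(self, critical):
        # Non-critical requests see a lower ceiling, so they are dropped
        # before critical ones as the system fills up.
        limit = self.capacity if critical else self.capacity - self.critical_reserve
        if self.in_flight >= limit:
            return False
        self.in_flight += 1
        return True

    def release(self):
        # Call when an admitted request finishes.
        self.in_flight = max(0, self.in_flight - 1)


shedder = LoadShedder(capacity=3, critical_reserve=1)
admitted = [shedder.try_admit(critical=False) for _ in range(3)]
```

With a capacity of 3 and one reserved slot, the third non-critical request is shed while a critical request can still be admitted.
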
Retry Strategies

| Item | Description |
| --- | --- |
| Exponential Backoff | Wait t = min(cap, base * 2^attempt). With base=100ms and cap=30s, wait times are 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, ... up to the cap. Always add jitter (for example ±25%). |
| Jitter | Randomize retry delays to prevent a thundering herd. Full jitter: sleep(random(0, min(cap, base * 2^attempt))). Decorrelated jitter: sleep(min(cap, random(base, prev * 3))). |
| Retry Budget | Percentage of total requests allowed as retries (e.g., 10%). If retry ratio exceeds budget → stop retrying, fail fast. Prevents retry storms from cascading. |
| Idempotent Retries Required | Retry without idempotency = duplicate operations. Every retry-able endpoint MUST be idempotent. Use an Idempotency-Key header or natural idempotency (PUT). |
| Max Retries | 3 is a common default (for example, the AWS SDKs' standard retry mode makes up to 3 attempts in total). More than 5 rarely helps: if a service is down for 30s+, retries just add load to an already-failing system. |
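
The backoff and jitter formulas can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the function names, the `retry_on` exception tuple, and the injectable `sleep` are assumptions. Full jitter here draws from the range [0, min(cap, base * 2^attempt)], and decorrelated jitter draws from [base, prev * 3] before capping.

```python
import random
import time


def exponential_backoff(attempt, base=0.1, cap=30.0):
    """Un-jittered delay in seconds: min(cap, base * 2^attempt)."""
    return min(cap, base * 2 ** attempt)


def full_jitter(attempt, base=0.1, cap=30.0):
    """Uniform delay between 0 and the exponential ceiling."""
    return random.uniform(0.0, exponential_backoff(attempt, base, cap))


def decorrelated_jitter(prev, base=0.1, cap=30.0):
    """Next delay depends on the previous delay, not the attempt number."""
    return min(cap, random.uniform(base, prev * 3))


def retry(fn, max_attempts=3, retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn up to max_attempts times, sleeping with full jitter between tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(full_jitter(attempt))
```

Only errors listed in `retry_on` are retried, which keeps non-transient failures (bad input, authorization errors) from being repeated. The wrapped operation still has to be idempotent.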

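A retry budget can be sketched as two counters and a ratio check. This sketch makes simplifying assumptions: the `RetryBudget` class is hypothetical, it counts over the process lifetime rather than a sliding time window, and it is not thread-safe.

```python
class RetryBudget:
    """Hypothetical budget: retries may be at most `percent` of first attempts."""

    def __init__(self, percent=10):
        self.percent = percent
        self.requests = 0  # first attempts seen
        self.retries = 0   # retries granted

    def record_request(self):
        self.requests += 1

    def try_acquire_retry(self):
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 <= self.percent * self.requests:
            self.retries += 1
            return True
        return False  # budget exhausted: caller should fail fast


budget = RetryBudget(percent=10)
for _ in range(100):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(15))
```

After 100 first attempts with a 10% budget, only 10 of the 15 requested retries are granted; the rest fail fast instead of adding load.
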
Bulkhead & Timeout Patterns

| Item | Description |
| --- | --- |
| Thread Pool Bulkhead | Assign separate thread pools per downstream service. Service A gets 20 threads, Service B gets 10. If Service B hangs, it only exhausts its own pool, not A's. |
| Connection Pool Bulkhead | Separate connection pools per database/shard. Write-heavy DB has 50 connections, read-replica has 100. A slow write query can't starve reads. |
| Semaphore Bulkhead | Limit concurrent calls to N. Simpler than thread pools. Resilience4j: maxConcurrentCalls=10. Fast fail when limit hit; no queuing. |
| Timeout Hierarchy | Total request timeout > upstream timeout > downstream timeout. Example: API gateway (30s) > service A (20s) > database (10s). Prevents hanging resources. |
| Deadlines (gRPC) | Propagate deadline across service calls. Client sets timeout=5s → propagated via grpc-timeout header. Every service in call chain respects it. No zombie requests. |
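
The semaphore bulkhead can be sketched with the standard library. This is a plain-Python analogue under stated assumptions, not Resilience4j's implementation: `SemaphoreBulkhead` and `BulkheadFullError` are hypothetical names, and the `max_concurrent_calls` parameter only mirrors the setting mentioned above.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when the concurrency limit is already reached."""


class SemaphoreBulkhead:
    """Hypothetical bulkhead: at most N calls in flight, no queuing."""

    def __init__(self, max_concurrent_calls=10):
        self._sem = threading.BoundedSemaphore(max_concurrent_calls)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of waiting for a slot.
        if not self._sem.acquire(blocking=False):
            raise BulkheadFullError("bulkhead full; rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()  # always free the slot, even on error
```

One bulkhead instance per downstream service keeps a hung dependency from consuming every worker in the process.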

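Deadline propagation and the timeout hierarchy both come down to one rule: each hop waits for the smaller of its own cap and whatever time the caller has left. The sketch below illustrates that rule in plain Python. It does not use the gRPC API, and the `Deadline` class, `DeadlineExceeded`, and `timeout_for` are hypothetical names.

```python
import time


class DeadlineExceeded(TimeoutError):
    """Raised when no time remains for further downstream calls."""


class Deadline:
    """Hypothetical absolute deadline carried along a call chain."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self):
        """Seconds left, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, local_cap_s):
        """Timeout for the next hop: min(local cap, time remaining)."""
        left = self.remaining()
        if left <= 0:
            # Stop early rather than start work nobody is waiting for.
            raise DeadlineExceeded("deadline already expired")
        return min(local_cap_s, left)
```

A service that has already spent 3s of a 5s budget would pass at most 2s to its database call, even if the database client's own cap is 10s.
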
Failover & Redundancy

| Item | Description |
| --- | --- |
| Active-Passive | Primary handles all traffic. Standby is idle but ready (hot). On failure, the standby is promoted. Failover takes seconds to a couple of minutes. AWS RDS Multi-AZ: synchronous replication to a standby, with failover typically completing in 60-120 seconds. |
| Active-Active | All instances serve traffic simultaneously. Load balanced. If one fails, others absorb its load. Requires careful design for data consistency (conflict resolution). |
| Leader Election | Use consensus (Raft/Paxos) to elect a single leader for write operations. On leader failure, a new election starts once the election timeout expires (150-300ms is the range suggested in the Raft paper; production systems often use larger values). |
| Geo-Redundancy | Deploy across regions. DNS failover (Route53, Cloudflare LB). Active-active cross-region with geo-routed traffic. RTO: minutes. Trade-off: data consistency across regions. |
| Health Checks | Active: ping /health every 5s. Passive: observe error rates on actual requests. Combine both: active for liveness, passive for quality. Unhealthy → remove from load balancer. |
| Chaos Engineering | Inject failures intentionally: terminate instances (Chaos Monkey), add latency (Latency Monkey), simulate a region outage (Chaos Kong), corrupt packets. Find unknown failure modes before they find you in production. |
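
Combining active and passive health checks can be sketched as a per-backend record that a load balancer consults before routing. Everything here is an illustrative assumption: the `BackendHealth` class, the window size, the 50% error-rate threshold, and the `in_rotation` helper.

```python
from collections import deque


class BackendHealth:
    """Hypothetical health record: active probe result plus passive error rate."""

    def __init__(self, window=20, max_error_rate=0.5, min_samples=5):
        self.outcomes = deque(maxlen=window)  # recent real-request results
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.probe_ok = True                  # last active /health result

    def record_probe(self, ok):
        self.probe_ok = ok

    def record_request(self, ok):
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def healthy(self):
        if not self.probe_ok:
            return False  # liveness: the active probe failed
        if len(self.outcomes) < self.min_samples:
            return True   # too little passive data to judge quality
        return self.error_rate() <= self.max_error_rate


def in_rotation(backends):
    """Names of backends that should keep receiving traffic."""
    return [name for name, health in backends.items() if health.healthy()]
```

The `min_samples` guard keeps a single failed request from ejecting a backend that has barely served any traffic.
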
Pro Tip: Everything fails eventually. The question isn't if a component will fail — it's how your system behaves when it does. Design for failure modes, not happy paths.