Core Resilience Patterns
| Pattern | What It Does | Failure Mode It Handles | Trade-off |
|---|---|---|---|
| Retry | Retry a failed operation, usually with backoff | Transient failures (network blip, database deadlock) | Adds latency; can overwhelm the target if unbounded |
| Circuit Breaker | Stop calling a failing service; fail fast | Persistent downstream failures | Rejects valid requests while in the OPEN state |
| Bulkhead | Isolate resources per service (thread pools, connections) | One slow service saturating all resources | More idle resources; higher infrastructure cost |
| Timeout | Abort an operation after N seconds | Slow or unresponsive downstream | Too low = false failures; too high = resource exhaustion |
| Rate Limiter | Reject requests above a threshold (per user or per service) | Traffic spikes, abuse, noisy neighbors | Protects goodput for traffic under the limit; rejected requests can mean lost revenue |
| Load Shedding | Drop low-priority work when the system is overloaded | System overload, cascading failures | Degraded experience for non-critical features |
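
The circuit breaker row can be illustrated with a minimal sketch. The class name `CircuitBreaker`, the exception `CircuitOpenError`, and the parameters `failure_threshold`, `reset_timeout`, and `clock` are hypothetical names chosen for this example, not any library's API. It assumes a single-threaded caller and counts consecutive failures; production breakers usually track failure rates over a window.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is OPEN and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay OPEN before a trial call
        self.clock = clock                          # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None                       # None means the circuit is CLOSED

    @property
    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF_OPEN"  # one trial call is allowed through
        return "OPEN"

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            raise CircuitOpenError("circuit open, failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call in HALF_OPEN, or too many failures while
            # CLOSED, (re)opens the circuit and restarts the reset timer.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a downstream call as `breaker.call(fetch_profile, user_id)` (with `fetch_profile` standing in for any remote call) shows the trade-off from the table: while the breaker is OPEN, even requests that would have succeeded are rejected.
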
Retry Strategies
| Item | Description |
|---|---|
| Exponential Backoff | Wait t = min(cap, base * 2^attempt), with attempt starting at 0. With base=100ms and cap=30s the waits are 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, ... Always add jitter (e.g., ±25%). |
| Jitter | Randomize retry delays to prevent a thundering herd. Full jitter: sleep(random(0, min(cap, base * 2^attempt))). Decorrelated jitter: sleep(min(cap, random(base, prev * 3))). |
| Retry Budget | Percentage of total requests allowed to be retries (e.g., 10%). If the retry ratio exceeds the budget → stop retrying and fail fast. Prevents retry storms from cascading. |
| Idempotent Retries Required | Retrying without idempotency causes duplicate operations. Every retryable endpoint MUST be idempotent. Use an Idempotency-Key header or a naturally idempotent method (PUT). |
| Max Retries | 3 is a common default (AWS SDKs commonly default to 3 attempts). More than 5 rarely helps: if a service is down for 30s+, retries just add load to an already-failing system. |
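
The backoff and jitter formulas above can be sketched as follows. The function names (`backoff_delay`, `full_jitter`, `decorrelated_jitter`, `retry`) and the `rng`, `sleep`, and `retry_on` parameters are hypothetical, chosen for this example. Delays are in seconds, and the set of retryable exceptions is an assumption that a real client would tailor to its transport.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0):
    """Plain exponential backoff: min(cap, base * 2**attempt), attempt starting at 0."""
    return min(cap, base * 2 ** attempt)


def full_jitter(attempt, base=0.1, cap=30.0, rng=random):
    """Full jitter: a uniform delay between 0 and the capped exponential delay."""
    return rng.uniform(0, backoff_delay(attempt, base, cap))


def decorrelated_jitter(prev, base=0.1, cap=30.0, rng=random):
    """Decorrelated jitter: the next delay depends on the previous delay, not the attempt."""
    return min(cap, rng.uniform(base, prev * 3))


def retry(op, max_attempts=3, base=0.1, cap=30.0, sleep=time.sleep,
          retry_on=(ConnectionError, TimeoutError)):
    """Call op() up to max_attempts times, sleeping with full jitter between tries."""
    for attempt in range(max_attempts):
        try:
            return op()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(full_jitter(attempt, base, cap))
```

Only exceptions listed in `retry_on` are retried; anything else propagates immediately, which keeps non-transient errors (bad input, authorization failures) from being retried. The operation passed as `op` still has to be idempotent, as the table requires.
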
Bulkhead & Timeout Patterns
| Item | Description |
|---|---|
| Thread Pool Bulkhead | Assign a separate thread pool to each downstream service. Service A gets 20 threads, Service B gets 10. If Service B hangs, it only exhausts its own pool, not A's. |
| Connection Pool Bulkhead | Use separate connection pools per database or shard. The write-heavy DB has 50 connections, the read replica has 100. A slow write query can't starve reads. |
| Semaphore Bulkhead | Limit concurrent calls to N. Simpler than thread pools. Resilience4j: maxConcurrentCalls=10. Fails fast when the limit is hit, with no queuing. |
| Timeout Hierarchy | Each caller's timeout should exceed that of the call it makes: total request timeout > upstream timeout > downstream timeout. Example: API gateway (30s) > service A (20s) > database (10s). Otherwise a caller gives up while work continues below it, tying up resources. |
| Deadlines (gRPC) | Propagate the deadline across service calls. Client sets timeout=5s → propagated via the grpc-timeout header. Every service in the call chain honors the remaining time, so no zombie requests keep running after the caller has given up. |
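
Two of the rows above, the semaphore bulkhead and deadline propagation, can be sketched with the standard library alone. `SemaphoreBulkhead`, `BulkheadFullError`, and `Deadline` are hypothetical names for this illustration; this is neither Resilience4j's API nor gRPC's, only the underlying idea under the assumption that each hop passes the `Deadline` object (or its remaining time) to the next.

```python
import threading
import time
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when all slots are taken; the call is rejected instead of queued."""


class SemaphoreBulkhead:
    def __init__(self, max_concurrent_calls=10):
        self._sem = threading.BoundedSemaphore(max_concurrent_calls)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: fail fast rather than wait for a free slot.
        if not self._sem.acquire(blocking=False):
            raise BulkheadFullError("bulkhead full")
        try:
            yield
        finally:
            self._sem.release()


class Deadline:
    """An absolute expiry shared by every hop in a call chain."""

    def __init__(self, timeout, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, local_limit):
        """Timeout for the next downstream call: the hop's own limit or the time left, whichever is smaller."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded")
        return min(local_limit, remaining)
```

A handler might wrap a downstream call in `with bulkhead.slot():` and pass `deadline.timeout_for(10.0)` as that call's timeout, so a request that has already used most of its budget never waits the full local limit.
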
Failover & Redundancy
| Item | Description |
|---|---|
| Active-Passive | Primary handles all traffic. Standby stays idle but hot (ready to take over). On failure, the standby is promoted. Failover is fast (seconds to a couple of minutes). AWS RDS Multi-AZ: synchronous replication to the standby. |
| Active-Active | All instances serve traffic simultaneously behind a load balancer. If one fails, the others absorb its load. Requires careful design for data consistency (conflict resolution). |
| Leader Election | Use consensus (Raft/Paxos) to elect a single leader for write operations. On leader failure, a new election starts once the election timeout expires (150-300ms is typical). |
| Geo-Redundancy | Deploy across regions. DNS failover (Route53, Cloudflare LB). Active-active cross-region with geo-routed traffic. RTO: minutes. Trade-off: data consistency across regions. |
| Health Checks | Active: ping /health every 5s. Passive: observe error rates on actual requests. Combine both: active for liveness, passive for quality. Unhealthy → remove from the load balancer. |
| Chaos Engineering | Inject failures intentionally: terminate instances (Chaos Monkey), add latency (Latency Monkey), take out a whole region (Chaos Kong), corrupt packets. Find unknown failure modes BEFORE they find you in production. |
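
The combined health-check rule above (active for liveness, passive for quality) can be sketched as below. `PassiveHealthTracker`, `routable_backends`, and the thresholds are hypothetical choices for this example. The active probe result is assumed to arrive from elsewhere as a plain dict of backend name to boolean.

```python
from collections import deque


class PassiveHealthTracker:
    """Tracks success/failure of real requests over a sliding window."""

    def __init__(self, window=20, max_error_rate=0.5, min_samples=5):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, ok):
        self.results.append(bool(ok))

    def healthy(self):
        # With too few samples, give the backend the benefit of the doubt.
        if len(self.results) < self.min_samples:
            return True
        errors = self.results.count(False)
        return errors / len(self.results) <= self.max_error_rate


def routable_backends(backends, liveness, trackers):
    """Keep a backend only if its active probe passed AND its passive stats look fine."""
    return [
        name for name in backends
        if liveness.get(name, False) and trackers[name].healthy()
    ]
```

A load balancer loop would call `record()` after each proxied request and rebuild the routable list whenever a probe result or tracker state changes. A backend that answers `/health` but fails most real requests is still removed.
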
Pro Tip: Everything fails eventually. The question isn't if a component will fail — it's how your system behaves when it does. Design for failure modes, not happy paths.