Capacity Planning Cheat Sheet

Capacity planning for distributed systems: load forecasting, bottleneck analysis, Little's Law, back-of-the-envelope estimation, scaling strategies, and cost optimization.

Last Updated: May 1, 2025

Back-of-the-Envelope Numbers

Operation | Latency | Throughput/Sec | Notes
L1 cache reference | 1 ns | - | 3 CPU cycles; 64 bytes typical cache line
L2 cache reference | 4 ns | - | ~14 CPU cycles
Main memory (RAM) reference | 100 ns | - | ~200 CPU cycles; DDR5 at 4800 MT/s
SSD random read (NVMe) | 10-100 µs | 500K-1M IOPS | NVMe Gen4: 7 GB/s sequential
HDD disk seek | 2-10 ms | 100-200 IOPS | Rotational latency dominates
Network round trip (same DC) | 0.1-0.5 ms | - | Within availability zone
Network round trip (cross-region) | 50-200 ms | - | NY↔London; propagation delay in fiber ≈ 5 µs/km one way
Compress 1 KB (Snappy) | ~1 µs | ~1M | Negligible for most workloads
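
The latency figures above can be turned into quick estimates. The following is a minimal sketch, not a measurement tool: the function names are hypothetical, and the ~5,570 km New York to London distance is an assumed great-circle figure, so real fiber routes will be longer.

```python
# Back-of-the-envelope helpers built on the table's numbers.
FIBER_DELAY_US_PER_KM = 5.0  # from the table: ~5 µs/km one way in fiber


def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber (ignores routing and queuing)."""
    return 2 * distance_km * FIBER_DELAY_US_PER_KM / 1000.0


def serial_ops_per_sec(latency_s: float) -> float:
    """Operations per second when each one must finish before the next starts."""
    return 1.0 / latency_s


# Assumed great-circle distance NY↔London: ~5,570 km.
print(f"{min_rtt_ms(5570):.1f} ms")          # 55.7 ms
# A 10 ms HDD seek gives about 100 serial seeks per second.
print(f"{serial_ops_per_sec(0.010):.0f}/s")  # 100/s
```

Both results land inside the ranges in the table (50-200 ms cross-region, 100-200 IOPS for HDD), which is the kind of sanity check these numbers are meant for.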

Little's Law & Queuing Theory

Item | Description
Little's Law | L = λ × W. The long-term average number of customers in the system (L) equals the arrival rate (λ) times the average time in the system (W). It applies to any stable system.
Application to Web Servers | If your server handles 100 req/s (λ) and the average response time is 200 ms (W = 0.2 s), then L = 100 × 0.2 = 20 requests are in flight on average. Size your thread or connection pool to at least that, with headroom for bursts.
Queue Depth and Latency | Queuing delay grows non-linearly as utilization (ρ) approaches 100%. In an M/M/1 queue, average time in system = service time ÷ (1 − ρ). At 50% utilization → 2× service time. At 90% → 10×. At 99% → 100×.
Utilization Ceiling | Target 50-70% utilization for good latency. Above 80%, queuing delay dominates. This is why autoscaling typically triggers at 70% CPU, not 90%.
M/M/1 Queue Assumptions | Poisson arrivals, exponential service times, single server. Real systems are more complex, but M/M/1 is a useful first-order model for back-of-the-envelope estimates.
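
The two formulas in this section are small enough to check directly. This is a minimal sketch with hypothetical function names; the latency function assumes the M/M/1 model described above, so treat its output as a rough estimate rather than a prediction for a real multi-threaded server.

```python
def littles_law_concurrency(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """L = λ × W: average number of requests in the system."""
    return arrival_rate_per_s * avg_time_in_system_s


def mm1_time_in_system(service_time_s: float, utilization: float) -> float:
    """Average time in an M/M/1 system: service time / (1 - utilization)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_s / (1.0 - utilization)


# 100 req/s at 200 ms average response time -> 20 requests in flight.
print(f"{littles_law_concurrency(100, 0.2):.0f} in flight")

# Latency multiplier at the utilization levels from the table.
for rho in (0.5, 0.9, 0.99):
    multiplier = mm1_time_in_system(1.0, rho)
    print(f"utilization {rho:.0%}: {multiplier:.0f}x service time")
```

The multipliers come out as 2×, 10×, and 100×, which is why the 50-70% utilization target leaves so much apparent capacity unused.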

Capacity Planning Methodology

Item | Description
1. Define SLOs | What are you promising? For example: p99 latency < 200 ms, availability 99.95% (4.38 h downtime/year), throughput 10K req/s. Everything flows from the SLOs.
2. Measure Current Baseline | Per-request CPU time, memory allocation, I/O operations. Use profiling (pprof, async-profiler, perf). Don't guess: measure at steady state under load.
3. Identify Bottleneck | CPU-bound? Memory-bound? I/O-bound? Lock contention? Fit the USL (Universal Scalability Law) to benchmarking data to model contention and coherency penalties.
4. Forecast Load | Seasonality (Black Friday 3-5×), trend (10% MoM growth), one-time events (Super Bowl ad). Forecast from peak load (95th-99th percentile), not the average. Plan for 2-3× your 95th percentile peak.
5. Calculate Required Capacity | Apply the growth factor to the baseline. Add 30% headroom for variance. Round up to the nearest instance size. Validate with a load test at 2× projected peak.
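
Step 5 can be sketched as a single calculation. The function name and the example inputs (a 12-month horizon and 500 req/s per instance) are hypothetical; the 10% month-over-month growth and 30% headroom come from the table. Per-instance capacity should come from your own baseline measurements in step 2.

```python
import math


def instances_needed(
    baseline_peak_rps: float,
    monthly_growth: float,
    months: int,
    headroom: float,
    per_instance_rps: float,
) -> int:
    """Grow the baseline peak, add headroom, and round up to whole instances."""
    growth_factor = (1.0 + monthly_growth) ** months
    required_rps = baseline_peak_rps * growth_factor * (1.0 + headroom)
    return math.ceil(required_rps / per_instance_rps)


# Assumed inputs: 10K req/s peak today, 10% MoM growth for 12 months,
# 30% headroom, 500 req/s per instance (a made-up measured capacity).
print(instances_needed(10_000, 0.10, 12, 0.30, 500))  # 82
```

Compounded growth dominates the result here: 10% per month for a year is roughly a 3.1× increase, before any headroom is applied.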

Scaling Strategies

Item | Description
Vertical Scaling (Scale Up) | Bigger instances: more CPU, RAM, network. Simple, usually with no code changes. Hits a ceiling at the largest available instance (e.g., AWS x1e.32xlarge: 128 vCPU, 3.9 TB RAM).
Horizontal Scaling (Scale Out) | More instances behind a load balancer. Near-linear scaling for stateless services. Requires stateless design (shared-nothing). Cost-effective but adds operational complexity.
Database Scaling | Read replicas for read-heavy workloads (e.g., a 10:1 read:write ratio; size the replica count from read throughput per replica rather than from the ratio alone). Sharding by tenant/user ID for write scaling. Connection pooling: PgBouncer for PostgreSQL.
Cache Everything | CDN for static assets (99%+ cache hit rate). Redis/Memcached for hot data (can cut DB queries by 90%+). Application-level caching with TTL-based invalidation.
Asynchronous Processing | Move non-critical work to background queues. The user request returns in 200 ms while the order confirmation email is sent asynchronously. Can reduce peak load by 40-60%.
Auto-Scaling | Scale based on metrics: CPU > 70% (add 1 instance), CPU < 30% (remove 1 instance). Cooldown: 300 s on scale-in to prevent flapping. Schedule pre-scaling for known events (daily 9 AM spike).
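
The auto-scaling rule above can be sketched as a pure decision function. This is an illustration of the threshold-plus-cooldown logic only, not any cloud provider's API: `ScalingPolicy`, `desired_instances`, and the min/max instance bounds are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    scale_out_cpu: float = 70.0         # add 1 instance above this CPU %
    scale_in_cpu: float = 30.0          # remove 1 instance below this CPU %
    scale_in_cooldown_s: float = 300.0  # wait this long before scaling in
    min_instances: int = 1              # assumption, not from the table
    max_instances: int = 20             # assumption, not from the table


def desired_instances(
    current: int,
    avg_cpu_percent: float,
    seconds_since_last_change: float,
    policy: ScalingPolicy = ScalingPolicy(),
) -> int:
    """Return the instance count the policy asks for right now."""
    if avg_cpu_percent > policy.scale_out_cpu:
        # Scale out immediately; only the upper bound limits it.
        return min(current + 1, policy.max_instances)
    if (
        avg_cpu_percent < policy.scale_in_cpu
        and seconds_since_last_change >= policy.scale_in_cooldown_s
    ):
        # Scale in only after the cooldown has elapsed, to avoid flapping.
        return max(current - 1, policy.min_instances)
    return current


print(desired_instances(4, 85.0, 10))   # 5: hot, scale out
print(desired_instances(4, 20.0, 100))  # 4: idle, but still in cooldown
print(desired_instances(4, 20.0, 400))  # 3: idle and cooldown elapsed
```

Applying the cooldown only to scale-in is a deliberate asymmetry: adding capacity late hurts users, while removing it late only costs a few minutes of spend.
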
Pro Tip: Every system has a bottleneck. Find yours before your users do. Use back-of-the-envelope math: a 1 Gbps link moves at most 125 MB/s, so it can carry roughly 1,250 requests/sec at 100 KB each. Start there, then measure.
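
The link arithmetic is worth writing out, since bits and bytes are easy to mix up. This is a minimal sketch with a hypothetical function name. It assumes decimal units (1 Gbps = 10^9 bits/s, 1 KB = 1,000 bytes) and ignores protocol overhead, so real throughput will be somewhat lower.

```python
def max_requests_per_sec(link_bits_per_s: float, payload_bytes: float) -> float:
    """Upper bound on requests/sec a link can carry for a given payload size."""
    link_bytes_per_s = link_bits_per_s / 8  # bits to bytes
    return link_bytes_per_s / payload_bytes


# 1 Gbps link, 100 KB per request.
print(max_requests_per_sec(1e9, 100_000))  # 1250.0
# The same link reaches 10K req/s only with payloads of about 12.5 KB.
print(max_requests_per_sec(1e9, 12_500))   # 10000.0
```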