Capacity Planning Cheat Sheet

Capacity planning for distributed systems: load forecasting, bottleneck analysis, Little's Law, back-of-the-envelope estimation, scaling strategies, and cost optimization.

Last Updated: May 1, 2025

Back-of-the-Envelope Numbers

Operation	Latency	Throughput/Sec	Notes
L1 cache reference	1 ns	—	3 CPU cycles; 64 bytes typical cache line
L2 cache reference	4 ns	—	~14 CPU cycles
Main memory (RAM) reference	100 ns	—	~200 CPU cycles; DDR5 at 4800 MT/s
SSD random read (NVMe)	10-100 µs	500K-1M IOPS	NVMe Gen4: 7 GB/s sequential
HDD disk seek	2-10 ms	100-200 IOPS	Rotational latency dominates
Network round trip (same DC)	0.1-0.5 ms	—	Within availability zone
Network round trip (cross-region)	50-200 ms	—	NY↔London; propagation delay in fiber ≈ 5 µs/km one-way (light travels ~200,000 km/s in glass)
Compress 1 KB (Snappy)	~1 µs	~1M	Negligible for most workloads
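
The cross-region row can be sanity-checked from propagation delay alone. A minimal sketch, assuming the ~5 µs/km fiber figure from the table and an approximate 5,600 km NY↔London distance (an assumed round number, not a measured route length). Real paths are longer and add switching and queuing delay, so treat the result as a lower bound only.

```python
FIBER_US_PER_KM = 5.0  # one-way propagation delay in fiber, from the table above


def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone."""
    one_way_us = distance_km * FIBER_US_PER_KM
    return 2 * one_way_us / 1000.0  # round trip, converted from µs to ms


# Assumed distance: roughly 5,600 km between New York and London.
print(min_rtt_ms(5600))  # 56.0
```

The ~56 ms floor sits at the low end of the 50-200 ms range quoted in the table.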

Little's Law & Queuing Theory

Item	Description
`Little's Law`	L = λ × W. Long-term average number of customers in system (L) = arrival rate (λ) × average time in system (W). Universal — applies to any stable system.
`Application to Web Servers`	If your server handles 100 req/s (λ) and average response time is 200 ms (W = 0.2 s), then on average L = 20 requests are in flight. Size thread and connection pools for at least that many, plus headroom for bursts.
`Queue Depth and Latency`	Queue depth and response time grow as 1/(1−ρ) as utilization ρ approaches 100% (M/M/1). Average response time (wait + service) at 50% utilization → 2× service time. At 90% → 10×. At 99% → 100×.
`Utilization Ceiling`	Target 50-70% utilization for good latency. Above 80%, queuing delay dominates. This is why autoscaling policies commonly trigger around 70% CPU, not 90%.
`M/M/1 Queue Assumptions`	Poisson arrivals, exponential service times, single server. Real systems are more complex, but M/M/1 is a useful bounding model for back-of-the-envelope.
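
The two formulas in this table are small enough to compute directly. A minimal sketch, assuming the M/M/1 model described above; the function names are illustrative, not taken from any library.

```python
def concurrency(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W (average number of requests in the system)."""
    return arrival_rate_per_s * time_in_system_s


def mm1_response_multiplier(utilization: float) -> float:
    """M/M/1 mean time in system as a multiple of mean service time: 1 / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return 1.0 / (1.0 - utilization)


print(concurrency(100, 0.2))  # 20.0 requests in flight on average
for rho in (0.5, 0.9, 0.99):
    # Rounded to one decimal: 2.0, 10.0, 100.0
    print(rho, round(mm1_response_multiplier(rho), 1))
```

The multiplier covers wait plus service time. The time spent waiting in the queue alone is one service time less (1×, 9×, and 99× at the same utilizations).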

Capacity Planning Methodology

Item	Description
`1. Define SLOs`	What are you promising? p99 latency < 200ms, availability 99.95% (4.38h downtime/year), throughput 10K req/s. Everything flows from SLOs.
`2. Measure Current Baseline`	Per-request CPU time, memory allocation, I/O operations. Use profiling (pprof, async-profiler, perf). Don't guess — measure at steady state under load.
`3. Identify Bottleneck`	CPU-bound? Memory-bound? I/O-bound? Lock contention? Use USL (Universal Scalability Law) to model contention and coherency penalties from benchmarking data.
`4. Forecast Load`	Seasonality (Black Friday 3-5×), trend (10% MoM growth), one-time events (Super Bowl ad). Use 99th percentile peak, not average. Plan for 2-3× your 95th percentile peak.
`5. Calculate Required Capacity`	Apply growth factor to baseline. Add 30% headroom for variance. Round up to nearest instance size. Validate with load test at 2× projected peak.
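
Steps 1, 4, and 5 reduce to a few lines of arithmetic. A minimal sketch, assuming compound month-over-month growth and a measured per-instance throughput; the example inputs (500 req/s per instance, 12-month horizon) are hypothetical placeholders for your own baseline measurements.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability SLO."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100.0)


def required_instances(
    current_peak_rps: float,
    monthly_growth: float,    # e.g. 0.10 for 10% MoM
    months: int,
    per_instance_rps: float,  # measured under load, not guessed
    headroom: float = 0.30,   # 30% headroom for variance, per step 5
) -> int:
    """Project peak load forward, add headroom, and round up to whole instances."""
    projected_peak = current_peak_rps * (1 + monthly_growth) ** months
    return math.ceil(projected_peak * (1 + headroom) / per_instance_rps)


print(round(downtime_budget_hours(99.95), 2))        # 4.38
print(required_instances(10_000, 0.10, 12, 500.0))   # 82
```

The result is a starting point for the load test in step 5, not a substitute for it.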

Scaling Strategies

Item	Description
`Vertical Scaling (Scale Up)`	Bigger instances: more CPU, RAM, network. Simple — no code changes. Hits ceiling (largest available instance). AWS: x1e.32xlarge = 128 vCPU, 3.9TB RAM.
`Horizontal Scaling (Scale Out)`	More instances behind a load balancer. Near-linear scaling for stateless services. Requires stateless design (shared-nothing). Cost-effective but adds operational complexity.
`Database Scaling`	Read replicas for read-heavy workloads (e.g., 10:1 read:write ratio); size the replica count from peak read throughput ÷ per-replica capacity rather than from the ratio alone. Sharding by tenant/user ID for write scaling. Connection pooling: PgBouncer for PostgreSQL.
`Cache Everything`	CDN for static assets (99%+ cache hit rate). Redis/Memcached for hot data (DB queries down 90%+). Application-level caching with TTL-based invalidation.
`Asynchronous Processing`	Move non-critical work to background queues. User request returns in 200ms → order confirmation email sent asynchronously. Reduces peak load by 40-60%.
`Auto-Scaling`	Scale based on metrics: CPU > 70% (add 1), CPU < 30% (remove 1). Cooldown: 300s scale-in to prevent flapping. Schedule pre-scaling for known events (daily 9 AM spike).
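
The threshold rule in the Auto-Scaling row can be sketched as a pure decision function. This is an illustration of the logic only, assuming step-by-one scaling and a cooldown that gates scale-in; `ScalingPolicy` and `desired_instances` are hypothetical names, not any cloud provider's API.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    scale_out_cpu: float = 70.0         # add 1 instance above this CPU %
    scale_in_cpu: float = 30.0          # remove 1 instance below this CPU %
    scale_in_cooldown_s: float = 300.0  # wait this long before scaling in
    min_instances: int = 1
    max_instances: int = 100


def desired_instances(
    policy: ScalingPolicy,
    current: int,
    cpu_pct: float,
    seconds_since_last_scaling_action: float,
) -> int:
    """Return the target instance count for one evaluation tick."""
    if cpu_pct > policy.scale_out_cpu:
        # Scale out immediately; under-provisioning hurts more than over-provisioning.
        return min(current + 1, policy.max_instances)
    cooled_down = seconds_since_last_scaling_action >= policy.scale_in_cooldown_s
    if cpu_pct < policy.scale_in_cpu and cooled_down:
        return max(current - 1, policy.min_instances)
    return current  # inside the dead band, or still cooling down


policy = ScalingPolicy()
print(desired_instances(policy, 4, 85.0, 0))    # 5
print(desired_instances(policy, 4, 20.0, 100))  # 4 (cooldown not elapsed)
print(desired_instances(policy, 4, 20.0, 300))  # 3
```

The gap between the 30% and 70% thresholds acts as a dead band, which together with the cooldown keeps the fleet from flapping.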

Pro Tip: Every system has a bottleneck. Find yours before your users do. Use back-of-the-envelope math: a 1 Gbps link carries at most 125 MB/s, so it can handle ~1,250 requests/sec at 100 KB each (before protocol overhead). Start there, then measure.
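
The link estimate is a bits-to-bytes conversion followed by a division. A minimal sketch, assuming decimal units (1 Gbps = 10^9 bits/s, 1 KB = 1,000 bytes) and ignoring TCP/TLS/HTTP overhead, so the result is an upper bound.

```python
def max_requests_per_sec(link_gbps: float, payload_kb: float) -> float:
    """Upper bound on requests/sec a link can carry for a given payload size."""
    bytes_per_sec = link_gbps * 1e9 / 8  # bits/s to bytes/s
    return bytes_per_sec / (payload_kb * 1000)


print(max_requests_per_sec(1, 100))  # 1250.0
```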