Back-of-the-Envelope Numbers
| Operation | Latency | Throughput (per second) | Notes |
|---|---|---|---|
| L1 cache reference | 1 ns | — | 3 CPU cycles; 64 bytes typical cache line |
| L2 cache reference | 4 ns | — | ~14 CPU cycles |
| Main memory (RAM) reference | 100 ns | — | ~300 CPU cycles at 3 GHz; DDR5 at 4800 MT/s |
| SSD random read (NVMe) | 10-100 µs | 500K-1M IOPS | NVMe Gen4: 7 GB/s sequential |
| HDD disk seek | 2-10 ms | 100-200 IOPS | Rotational latency dominates |
| Network round trip (same DC) | 0.1-0.5 ms | — | Within availability zone |
| Network round trip (cross-region) | 50-200 ms | — | NY↔London; speed of light in fiber = ~5 µs/km |
| Compress 1 KB (Snappy) | ~1 µs | ~1M ops/s | Negligible for most workloads |
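
The fiber figure in the table gives a quick floor for cross-region latency. The sketch below is a minimal illustration, not a measurement: `min_round_trip_ms` is a hypothetical helper, and the ~5,570 km New York to London distance is an assumed great-circle approximation, so real routes will be longer and slower.

```python
# Propagation delay in optical fiber, taken from the table above (~5 µs per km).
FIBER_DELAY_US_PER_KM = 5.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring routing and queuing."""
    one_way_us = distance_km * FIBER_DELAY_US_PER_KM
    return 2 * one_way_us / 1000.0  # µs -> ms


# Assumption: ~5,570 km great-circle distance between New York and London.
NY_LONDON_KM = 5570
print(f"NY<->London floor: {min_round_trip_ms(NY_LONDON_KM):.1f} ms")
```

The result lands near the low end of the 50-200 ms range in the table, which suggests that the lower bound there is mostly propagation delay and cannot be engineered away.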
Little's Law & Queuing Theory
| Item | Description |
|---|---|
| Little's Law | L = λ × W. Long-term average number of customers in the system (L) = arrival rate (λ) × average time in system (W). Universal: applies to any stable system. |
| Application to Web Servers | If your server handles 100 req/s (λ) with an average response time of 200 ms (W = 0.2 s), then on average L = 20 requests are in flight. Size your thread pool for at least that many concurrent connections. |
| Queue Depth and Latency | Queuing delay grows nonlinearly, in proportion to 1/(1 − ρ), as utilization ρ approaches 100%. In an M/M/1 queue, average time in system is 2× the service time at 50% utilization, 10× at 90%, and 100× at 99%. |
| Utilization Ceiling | Target 50-70% utilization for good latency. Above 80%, queuing delay dominates. This is why autoscaling typically triggers at 70% CPU, not 90%. |
| M/M/1 Queue Assumptions | Poisson arrivals, exponential service times, single server. Real systems are more complex, but M/M/1 is a useful bounding model for back-of-the-envelope estimates. |
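
The two formulas in this table fit in a few lines of code. This is a minimal sketch under the M/M/1 assumptions listed above; the function names are illustrative and not part of any library.

```python
def concurrency_needed(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * time_in_system_s


def mm1_time_in_system(service_time_s: float, utilization: float) -> float:
    """M/M/1 mean time in system: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_s / (1.0 - utilization)


# The web server example: 100 req/s at 200 ms average response time.
print(concurrency_needed(100, 0.2))

# Slowdown relative to the bare service time at rising utilization.
for rho in (0.5, 0.9, 0.99):
    slowdown = mm1_time_in_system(1.0, rho)
    print(f"utilization {rho:.0%}: {slowdown:.0f}x service time")
```

Passing a service time of 1.0 makes the return value a pure multiplier, which is a convenient way to compare utilization targets before any real service time has been measured.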
Capacity Planning Methodology
| Item | Description |
|---|---|
| 1. Define SLOs | What are you promising? p99 latency < 200 ms, availability 99.95% (4.38 h downtime/year), throughput 10K req/s. Everything flows from SLOs. |
| 2. Measure Current Baseline | Per-request CPU time, memory allocation, I/O operations. Use profiling (pprof, async-profiler, perf). Don't guess; measure at steady state under load. |
| 3. Identify Bottleneck | CPU-bound? Memory-bound? I/O-bound? Lock contention? Use USL (Universal Scalability Law) to model contention and coherency penalties from benchmarking data. |
| 4. Forecast Load | Seasonality (Black Friday 3-5×), trend (10% MoM growth), one-time events (Super Bowl ad). Use 99th percentile peak, not average. Plan for 2-3× your 95th percentile peak. |
| 5. Calculate Required Capacity | Apply growth factor to baseline. Add 30% headroom for variance. Round up to nearest instance size. Validate with load test at 2× projected peak. |
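
The arithmetic behind steps 1, 3, and 5 can be sketched as follows. All function names are hypothetical, the per-instance throughput and growth factor in the usage lines are made-up inputs, and the USL coefficients must come from fitting your own benchmark data. "Round up" is interpreted here as rounding up the instance count.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760; ignores leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Step 1: yearly downtime allowed by an availability SLO."""
    return (1 - availability_pct / 100.0) * HOURS_PER_YEAR


def usl_relative_capacity(n: int, contention: float, coherency: float) -> float:
    """Step 3: Universal Scalability Law, C(N) = N / (1 + a(N-1) + b*N(N-1))."""
    return n / (1 + contention * (n - 1) + coherency * n * (n - 1))


def instances_needed(peak_rps: float, growth_factor: float,
                     per_instance_rps: float, headroom: float = 0.30) -> int:
    """Step 5: apply growth, add headroom, round up to whole instances."""
    required_rps = peak_rps * growth_factor * (1 + headroom)
    return math.ceil(required_rps / per_instance_rps)


print(f"{downtime_budget_hours(99.95):.2f} h/year")
# Hypothetical inputs: 10K req/s peak, 1.5x growth, 800 req/s per instance.
print(instances_needed(10_000, 1.5, 800))
```

Keeping the headroom as a parameter makes it easy to rerun the estimate with a different safety margin once load-test results are in.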
Scaling Strategies
| Item | Description |
|---|---|
| Vertical Scaling (Scale Up) | Bigger instances: more CPU, RAM, network. Simple: no code changes. Hits a ceiling at the largest available instance. AWS: x1e.32xlarge = 128 vCPU, 3.9 TB RAM. |
| Horizontal Scaling (Scale Out) | More instances behind a load balancer. Near-linear scaling for stateless services. Requires stateless design (shared-nothing). Cost-effective but adds operational complexity. |
| Database Scaling | Read replicas for read-heavy workloads (e.g., a 10:1 read:write ratio); size the replica count from measured per-replica read throughput rather than from the ratio alone. Sharding by tenant/user ID for write scaling. Connection pooling: PgBouncer for PostgreSQL. |
| Cache Everything | CDN for static assets (99%+ cache hit rate). Redis/Memcached for hot data (DB queries down 90%+). Application-level caching with TTL-based invalidation. |
| Asynchronous Processing | Move non-critical work to background queues. The user request returns in 200 ms while the order confirmation email is sent asynchronously. Reduces peak load by 40-60%. |
| Auto-Scaling | Scale based on metrics: CPU > 70% (add 1), CPU < 30% (remove 1). Use a 300 s scale-in cooldown to prevent flapping. Schedule pre-scaling for known events (daily 9 AM spike). |
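
The auto-scaling rule in the last row can be sketched as a small decision function. This is an illustration of the thresholds and cooldown described above, not any cloud provider's API; `ScalerState`, `step`, and `min_instances` are hypothetical names, and time is passed in as plain seconds so the logic stays testable.

```python
from dataclasses import dataclass
from typing import Optional

SCALE_OUT_CPU = 70.0          # add one instance above this CPU %
SCALE_IN_CPU = 30.0           # remove one instance below this CPU %
SCALE_IN_COOLDOWN_S = 300.0   # minimum gap between scale-in actions


@dataclass
class ScalerState:
    instances: int
    last_scale_in_at: Optional[float] = None  # seconds on the caller's clock


def step(state: ScalerState, cpu_pct: float, now_s: float,
         min_instances: int = 1) -> int:
    """Apply one scaling decision; return the instance delta (+1, -1, or 0)."""
    if cpu_pct > SCALE_OUT_CPU:
        state.instances += 1
        return 1
    in_cooldown = (
        state.last_scale_in_at is not None
        and now_s - state.last_scale_in_at < SCALE_IN_COOLDOWN_S
    )
    if cpu_pct < SCALE_IN_CPU and not in_cooldown and state.instances > min_instances:
        state.instances -= 1
        state.last_scale_in_at = now_s
        return -1
    return 0


state = ScalerState(instances=4)
for cpu, t in [(85, 0), (20, 10), (20, 100), (20, 400)]:
    print(t, step(state, cpu, t), state.instances)
```

Only scale-in is rate-limited here, matching the row above: scaling out late hurts users, while scaling in late only costs a few minutes of extra capacity.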

Pro Tip: Every system has a bottleneck. Find yours before your users do. Use back-of-the-envelope math: a 1 Gbps link carries about 125 MB/s, so it can handle at most ~1,250 requests/sec at 100 KB each. Start there, then measure.
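
The link estimate is a one-line unit conversion. A minimal sketch, assuming decimal units (1 KB = 1,000 bytes) and ignoring protocol overhead; `max_requests_per_second` is a hypothetical helper name.

```python
def max_requests_per_second(link_gbps: float, payload_kb: float) -> float:
    """Upper bound on requests/sec a link can carry at a given payload size."""
    bytes_per_second = link_gbps * 1e9 / 8  # bits -> bytes
    return bytes_per_second / (payload_kb * 1000)


print(max_requests_per_second(1, 100))   # 1 Gbps link, 100 KB payloads
print(max_requests_per_second(10, 100))  # same payloads on a 10 Gbps link
```

Real throughput will be lower once TCP/TLS framing and headers are counted, so treat the result as a ceiling rather than a target.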