The Memory Wall Limits Everything
The dominant bottleneck in modern AI workloads is not computation but memory bandwidth: the speed at which data can be moved to and from the processors that need it.
"A huge chunk of the time in large model training/inference is not spent computing matrix multiplies, but rather waiting for data to get to the compute resources. The obvious question is why don't architects put more memory closer to the compute. The answer is $$$." (Dylan Patel)
Even in 2018, purely compute-bound operations accounted for 99.8% of FLOPs but only 61% of runtime. Normalization and pointwise operations perform 250x and 700x fewer FLOPs than matrix multiplications, yet they consumed nearly 40% of the model's runtime. The reason is memory bandwidth: every operation requires reading data from DRAM, computing, and writing results back. When an operation does very little math per byte moved, you spend all your time shipping data around.
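The usual way to quantify "math per byte moved" is arithmetic intensity: FLOPs divided by bytes of DRAM traffic. A back-of-envelope sketch makes the gap concrete; the shapes, the FP16 element size, and the ~8 FLOPs-per-element cost for a pointwise op are illustrative assumptions, not measurements.

```python
# Arithmetic intensity = FLOPs / bytes moved to and from DRAM.
# All shapes and per-element FLOP counts below are illustrative assumptions.

def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FP16 GEMM: 2*m*k*n FLOPs; reads A (m*k) and B (k*n), writes C (m*n)."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def pointwise_intensity(n_elems, bytes_per_elem=2, flops_per_elem=8):
    """Elementwise op (assume ~8 FLOPs per element, e.g. an activation):
    reads one tensor from DRAM, writes one tensor back."""
    flops = flops_per_elem * n_elems
    bytes_moved = 2 * bytes_per_elem * n_elems  # one read + one write
    return flops / bytes_moved

print(f"4096^3 matmul: {matmul_intensity(4096, 4096, 4096):.0f} FLOPs/byte")
print(f"pointwise op:  {pointwise_intensity(4096 * 4096):.0f} FLOPs/byte")
```

The matmul lands near a thousand FLOPs per byte while the pointwise op sits at 2, which is why the latter is bandwidth-limited no matter how many ALUs the chip has.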
The economics of memory create a brutal hierarchy. SRAM on chip is fast but costs hundreds of dollars per gigabyte. HBM provides massive bandwidth through 3D-stacked DRAM but runs $10-20 per GB including packaging. Standard DRAM is cheap at a few dollars per GB but far too slow. From NVIDIA's P100 to the H100, compute (FP16 FLOP/s) increased 46x, but memory capacity only grew 5x. This widening gap means that even with a $25,000+ GPU, you routinely achieve only 60% FLOPs utilization; the rest of the time, the processor sits idle waiting for data.
The primary weapon against the memory wall is operator fusion: instead of writing intermediate results back to DRAM between each operation, you chain multiple operations together in a single pass. This is why Flash Attention, Triton kernels, and PyTorch 2.0's compiler exist: they are all fundamentally about reducing memory round-trips. Understanding whether you are compute-bound, memory-bound, or overhead-bound is the single most important diagnostic in ML systems engineering.
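The payoff from fusion can be counted without any GPU at all, just by tallying DRAM traffic. The sketch below compares an unfused three-kernel chain `y = relu(x * a + b)` against a single fused pass; the op chain and FP16 element size are hypothetical choices for illustration.

```python
# Bookkeeping sketch: DRAM traffic for y = relu(x * a + b) over n elements,
# unfused (three kernel launches) vs fused (one pass). FP16 assumed; the
# scalar operands a and b are ignored as negligible traffic.

def unfused_traffic(n, elem_bytes=2):
    # kernel 1: t1 = x * a   -> read x,  write t1
    # kernel 2: t2 = t1 + b  -> read t1, write t2
    # kernel 3: y = relu(t2) -> read t2, write y
    return elem_bytes * n * (2 + 2 + 2)

def fused_traffic(n, elem_bytes=2):
    # one kernel: read x once, write y once; t1 and t2 stay in registers
    return elem_bytes * n * 2

n = 1 << 24  # 16M elements
print(f"unfused: {unfused_traffic(n) / 2**20:.0f} MiB")
print(f"fused:   {fused_traffic(n) / 2**20:.0f} MiB")
```

Fusion cuts traffic 3x here, and since each kernel in the chain is memory-bound, runtime shrinks by roughly the same factor. This is the entire mechanism behind Flash Attention and compiler-generated fused kernels, just applied to longer chains.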
Takeaway: Compute is cheap and getting cheaper; moving data is expensive and getting relatively more expensive, so the winning architectures are the ones that minimize data movement.
See also: CUDA Is a Moat Not Just a Library | Dennard Scaling Ended and Everything Changed | Goodput Matters More Than Throughput
Linked from
- CUDA Is a Moat Not Just a Library
- Dennard Scaling Ended and Everything Changed
- Distributed Training Is a Systems Problem Not an ML Problem
- Inference Cost Dominates Training at Scale
- Operator Fusion Is the Most Important Optimization in Deep Learning
- Quantization Trades Precision for Accessibility
- Scaling Laws Open New Dimensions When Old Ones Stall
- Triton Democratizes GPU Programming
- x86 Pays an Architectural Tax That ARM Does Not