A 3B model that runs on a phone won’t beat a 405B in a data center on benchmarks — but it’s only ~5 points behind on most chat use cases, and it runs offline on $50 of silicon. Two skills get you the rest of the way: distillation (how the 3B was trained to punch above its weight) and speculative decoding (how to make the 3B feel 2× faster at inference time without changing the weights).
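The speculative-decoding half is easy to see in miniature. The sketch below is a toy, with made-up deterministic "models" over integer tokens standing in for the big target and small draft, and it uses the greedy acceptance rule (accept draft tokens while they match the target's picks). Real systems verify all draft tokens in one batched forward pass and use rejection sampling over probability distributions, but the control flow is the same: the draft runs many cheap steps, the target runs one expensive verification per round.

```python
def target_model(ctx):
    # Stand-in for the expensive "large" model: deterministic
    # next-token rule (sum of last two tokens, mod 10).
    return (ctx[-1] + ctx[-2]) % 10

def draft_model(ctx):
    # Stand-in for the cheap "small" model: agrees with the target
    # most of the time, with a made-up failure mode when the last
    # token is 7.
    if ctx[-1] == 7:
        return 0
    return (ctx[-1] + ctx[-2]) % 10

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens; return (sequence, number of target calls)."""
    ctx = list(prompt)
    target_calls = 0
    produced = 0
    while produced < n_new:
        # 1) Draft proposes k tokens autoregressively (cheap).
        drafts, tmp = [], list(ctx)
        for _ in range(k):
            t = draft_model(tmp)
            drafts.append(t)
            tmp.append(t)
        # 2) The target checks all k positions. In a real system this
        #    is ONE batched forward pass, so we count it as one call.
        target_calls += 1
        accepted, tmp = [], list(ctx)
        for d in drafts:
            t = target_model(tmp)
            if t != d:
                bonus = t          # target's correction replaces the bad draft
                break
            accepted.append(d)
            tmp.append(d)
        else:
            bonus = target_model(tmp)  # all k accepted: one extra token free
        ctx.extend(accepted + [bonus])
        produced += len(accepted) + 1
    return ctx[:len(prompt) + n_new], target_calls

out, calls = speculative_decode([3, 1], n_new=20)
print(f"{calls} target calls for 20 tokens")  # well under 20 when drafts mostly agree
```

The output is identical to running the target model alone, token for token; the only thing that changed is how many times the expensive model had to be called, which is exactly why the weights don't need to change.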