“A quantized 400 billion parameter model running natively on the iPhone 17 Pro at usable token rates.”
A model that needed an H100 cluster two years ago now runs on a phone in your pocket. The pace of compression and quantization research is the underrated story of the LLM era. The flagship cloud APIs are going to lose a lot of their moat over the next three years, and Apple is the company best positioned to capitalize: it has been quietly building the on-device stack the whole time.
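To make the compression story concrete, here is a back-of-envelope sketch of how raw weight storage shrinks with bit width. The helper function and the specific bit widths are illustrative assumptions, not anything from the quote; the 400B parameter count is taken from it. Real deployments also involve activations, KV cache, and per-group quantization metadata, which this ignores.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB: parameters x bits, converted to bytes.

    Ignores activation memory, KV cache, and quantization scale/zero-point
    overhead -- this is a lower bound on what the weights alone occupy.
    """
    return num_params * bits_per_weight / 8 / 1e9


params = 400e9  # the 400B figure from the quote

for bits in (16, 8, 4, 2):  # illustrative precisions: fp16 down to 2-bit
    print(f"{bits:>2}-bit: {weight_footprint_gb(params, bits):,.0f} GB")
```

Even at aggressive 2-bit quantization the weights alone run to ~100 GB, which is why on-device serving of models this size also leans on techniques like sparsity, mixture-of-experts routing, and streaming weights from flash rather than quantization alone.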