“A 2-million token context window that works natively across text, image, audio, and video, without transcription intermediaries.”

Two million tokens of native multimodal context, and no transcription step for audio or video. Google is throwing raw capacity at the problem while everyone else optimizes for cost. The question is whether anyone has a use case that actually needs two million tokens, or whether this is just an arms-race benchmark that sounds good in a press release.
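To make the capacity concrete, here is a rough back-of-envelope sketch. Every per-modality rate below (tokens per word, per second of audio, per second of video) is an illustrative assumption, not a published figure; real tokenizers vary.

```python
# Back-of-envelope: what might fit in a 2M-token context window?
# All per-modality rates are ASSUMPTIONS for illustration only.

CONTEXT = 2_000_000

TOKENS_PER_WORD = 1.3         # rough average for English text
TOKENS_PER_AUDIO_SEC = 32     # assumed audio tokenization rate
TOKENS_PER_VIDEO_SEC = 258    # assumed ~1 frame/sec at ~258 tokens/frame

words = CONTEXT / TOKENS_PER_WORD
novels = words / 90_000       # ~90k words per typical novel
audio_hours = CONTEXT / TOKENS_PER_AUDIO_SEC / 3600
video_hours = CONTEXT / TOKENS_PER_VIDEO_SEC / 3600

print(f"~{words:,.0f} words (~{novels:.0f} novels)")
print(f"~{audio_hours:.1f} hours of audio")
print(f"~{video_hours:.1f} hours of video")
```

Under these assumptions, 2M tokens is on the order of a dozen-plus novels or a full day's worth of meeting audio, but only a couple of hours of video, which suggests video is where the capacity gets consumed fastest.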