All it took to prove me wrong was a single major company deciding to take a shot at a technically feasible task.
Google/YouTube were working heavily on multimodal models more than 18 months ago, including for music and other non-speech audio. Some of those have been publicly demonstrated over the last year, e.g. text → music. Knowing this, I fully expected Gemini to be highly multimodal (I was actually expecting more music-related capabilities than they’ve mentioned so far in their publicity materials).