I believe image processing used to be done by a separate AI that would generate a text description and pass it to the LLM. Nowadays, most frontier models are “natively multimodal,” meaning the same model is pretrained to understand both text and images. Models like GPT-4o can even do image generation natively now: https://openai.com/index/introducing-4o-image-generation. Even though making 4o “watch in real time” is not currently an option as far as I’m aware, uploading a single image to ChatGPT should do basically the same thing.
It’s true that frontier models are still much worse at understanding images than text, though.
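For concreteness, here's a minimal sketch of what "uploading a single image" looks like through the API rather than the ChatGPT UI, assuming the official openai Python client and its image_url message format; the model name, image URL, and prompt are just placeholders:

```python
# Minimal sketch: sending one image plus a text prompt to a natively
# multimodal model via the openai Python client.
# Assumes `pip install openai` and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this image?"},
                {
                    "type": "image_url",
                    # placeholder URL; a base64 data URL also works here
                    "image_url": {"url": "https://example.com/frame.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The image goes in as part of the same message as the text, which is the sense in which the model is "natively multimodal" from the user's point of view: there's no separate captioning step you have to run yourself.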
My understanding is that there was a separate image model in historical VLMs like Flamingo, but that it passed on a vector representation of the image, not text.
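Something like the following toy sketch is roughly the architecture I have in mind (PyTorch-style, with made-up module names and dimensions, not Flamingo's or any real model's actual code): a separate vision encoder turns the image into a short sequence of vectors, which are projected into the LLM's embedding space and concatenated with the text embeddings.

```python
# Rough conceptual sketch: a separate vision encoder feeds vectors (not a
# text description) into the LLM's input space. All names, dimensions, and
# the llm interface (embed_tokens, inputs_embeds) are illustrative assumptions.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a frozen ViT/CLIP-style encoder
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps image features into LLM space
        self.llm = llm                                   # a standard decoder-only LLM

    def forward(self, image, text_token_ids):
        # (batch, num_patches, vision_dim): a vector representation, not text
        image_features = self.vision_encoder(image)
        image_embeds = self.projector(image_features)        # (batch, num_patches, llm_dim)
        text_embeds = self.llm.embed_tokens(text_token_ids)  # (batch, seq_len, llm_dim)
        # The LLM sees image "tokens" prepended to the text tokens.
        inputs = torch.cat([image_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```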
My understanding, very much secondhand, is that current LLMs still use a separately trained component to map images into the model's input space. I'm very unsure how the model weights integrate the two kinds of input, but I'm by default skeptical that image understanding integrates cleanly into other parts of the model's reasoning.
That said, I'm also skeptical that this is fundamentally a hard part of the problem, as simulation and generated data seem like a very tractable route to improving it, if/once model developers see it as a critical bottleneck for tens of billions of dollars in revenue.