[Later edit: I acknowledge this is largely wrong! :-) ]
Have you researched or thought about how the models are dealing with visual information?
When ChatGPT or Gemini generates an image at a user’s request, it is evidently generating a prompt based on accumulated instructions and then passing it to a specialized visual AI like DALL-E 3 or Imagen 3. When it processes an uploaded image (e.g. provides a description of it), something similar must be occurring.
On the other hand, when they answer a request like “How can I make the object in this picture?”, the reply comes from the more verbal intelligence, the LLM proper, and it will be responding on the basis of a verbal description of the picture supplied by its visual coprocessor. The quality of the response is therefore limited by the quality of the verbal description of the image—which easily leaves out details that may turn out to be important.
I would be surprised if the LLM even has the capacity to tell the visual AI something like “pay special attention to detail”. My impression of the visual AIs in use is that they generate their description of an image, take it or leave it. It would be possible to train a visual AI whose processing of an image is dependent on context, like an instruction to pay attention to detail or to look for extra details, but I haven’t noticed any evidence of this yet.
The one model that I might expect to have a more sophisticated interaction between verbal and visual components is 4o interacting as in the original demo, watching and listening in real time. I haven’t had the opportunity to interact with 4o in that fashion, but there must be some special architecture and training to give it the ability to interact in real time, even if there’s also some core that’s still the same as in 4o when accessed via text chat. (I wonder to what extent Sora, the video AI, has features in common with the video processing that 4o does.)
I believe image processing used to be done by a separate AI that would generate a text description and pass it to the LLM. Nowadays, most frontier models are “natively multimodal,” meaning the same model is pretrained to understand both text and images. Models like GPT-4o can even do image generation natively now: https://openai.com/index/introducing-4o-image-generation. Even though making 4o “watch in real time” is not currently an option as far as I’m aware, uploading a single image to ChatGPT should do basically the same thing.
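For concreteness, here is a minimal sketch of what “uploading a single image” looks like programmatically, assuming the OpenAI Python SDK’s Chat Completions image-input format; the model name, prompt, and image URL below are placeholders, not anything from this thread:

```python
# Minimal sketch: sending an image plus a question to a multimodal model via the
# OpenAI Python SDK. Model name, prompt, and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "How could I make the object in this picture?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/object.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```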
It’s true that frontier models are still much worse at understanding images than text, though.
My understanding is that there was a separate image model in historical VLMs like Flamingo, but that it passed on a vector representation of the image, not text.
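To make the “vector representation, not text” handoff concrete, here is a toy sketch of that pattern (roughly LLaVA/Flamingo-style): a vision encoder turns the image into patch embeddings, a learned projection maps them into the LLM’s embedding space, and those image “tokens” then sit in the same sequence as the text tokens. All module names and dimensions are illustrative assumptions, not any particular model’s real architecture.

```python
# Toy sketch (PyTorch) of the "separate vision encoder, vector handoff" pattern.
# All names, dimensions, and layer choices are illustrative, not any real model's.
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Stand-in for a pretrained image encoder (e.g. a ViT): image -> patch embeddings."""
    def __init__(self, patch_dim=256):
        super().__init__()
        # 224x224 image, 14x14 patches -> 16x16 = 256 patch embeddings
        self.patchify = nn.Conv2d(3, patch_dim, kernel_size=14, stride=14)

    def forward(self, images):                      # (B, 3, 224, 224)
        x = self.patchify(images)                   # (B, patch_dim, 16, 16)
        return x.flatten(2).transpose(1, 2)         # (B, 256, patch_dim)

class ToyVLM(nn.Module):
    """Image patches are projected into the LLM's embedding space and prepended to the
    text token embeddings; the LLM never receives a text caption of the image."""
    def __init__(self, vocab_size=1000, llm_dim=512, patch_dim=256):
        super().__init__()
        self.vision = VisionEncoder(patch_dim)
        self.projector = nn.Linear(patch_dim, llm_dim)   # the "adapter" between modalities
        self.token_emb = nn.Embedding(vocab_size, llm_dim)
        # Stand-in for the LLM's transformer stack (causal masking omitted for brevity)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, images, text_ids):
        img_tokens = self.projector(self.vision(images))   # (B, 256, llm_dim)
        txt_tokens = self.token_emb(text_ids)               # (B, T, llm_dim)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)    # image "tokens" + text tokens
        return self.lm_head(self.llm(seq))                  # next-token logits

model = ToyVLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 16)))
print(logits.shape)  # torch.Size([1, 272, 1000])
```

The point of the sketch is just that the language model attends directly over the projected image vectors; no intermediate caption is produced along the way.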
I understood, very much secondhand, that current LLMs still use a separately trained part of the model’s input space for images. I’m very unsure how the model weights integrate the different types of thinking, but I’m by default skeptical that it integrates cleanly into other parts of reasoning.
That said, I’m also skeptical that this is fundamentally a hard part of the problem, as simulation and generated data seem like a very tractable route to improving this, if/once model developers see it as a critical bottleneck for tens of billions of dollars in revenue.